[Hardware][TPU] Multi-LoRA implementation for the TPU backend #12623
base: main
Conversation
…ter loading a LoRA adapter. Signed-off-by: Oleg Mosalov <[email protected]>
…` to be called with infinities Signed-off-by: Akshat Tripathi <[email protected]>
… the adapter and its weights are loaded. Signed-off-by: Oleg Mosalov <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
vllm/lora/layers.py:

    if current_platform.is_tpu():
        # Because nan_to_num_ doesn't work with actual -inf values on TPU
        neg_inf = torch.finfo(lora_logits.dtype).min
        pos_inf = torch.finfo(lora_logits.dtype).max
    else:
        neg_inf = float("-inf")
        pos_inf = float("inf")
These if-else conditions will make vLLM hard to maintain. Could we file an issue with torch-xla, or abstract this as part of a utility function?
"Abstract this as part of a utility function" sounds good.
Yeah, that sounds good; I can abstract it away. It was only a problem for that nan_to_num() function, though; -inf works properly elsewhere.
Abstracting it away as a short-term solution is fine. It would be better if we could also create an issue in the torch-xla repo as a longer-term solution.
Ok, I've made the issue here: pytorch/xla#8674
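For illustration, a minimal sketch of the kind of utility helper discussed above; it assumes vLLM's current_platform API, and the function name and placement are hypothetical, not taken from the PR:

```python
import torch

from vllm.platforms import current_platform


def safe_inf_values(dtype: torch.dtype) -> tuple[float, float]:
    """Return (neg_inf, pos_inf) sentinels that are safe to pass to
    nan_to_num_ on the current platform.

    On TPU (torch_xla), nan_to_num_ does not handle literal +/-inf
    replacement values correctly, so the finite extremes of the dtype
    are used instead; other platforms keep the usual infinities.
    """
    if current_platform.is_tpu():
        finfo = torch.finfo(dtype)
        return finfo.min, finfo.max
    return float("-inf"), float("inf")
```

The call site in layers.py would then reduce to something like `neg_inf, pos_inf = safe_inf_values(lora_logits.dtype)`.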
vllm/lora/ops/xla_ops/lora_ops.py:

    import torch

    from ..torch_ops import bgmv_expand, bgmv_expand_slice, bgmv_shrink
It seems the TPU ops are still using PyTorch operators; is it necessary to add the ops below?
The sgmv ops are slightly different here because I'm using repeat_interleave with a static size rather than a dynamic tensor, which reduces the compile time quite a bit, since torch_xla can't lower the dynamic version properly.
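As a standalone illustration (not code from the PR): torch.repeat_interleave accepts either a Python int or a tensor of repeats, and only the int form gives torch_xla a statically known output shape.

```python
import torch

indices = torch.tensor([0, 1, 2])

# Static repeats: a plain Python int, so the output length (3 * 4) is
# known at trace time and torch_xla can compile a fixed-shape graph.
static = torch.repeat_interleave(indices, 4)

# Dynamic repeats: a tensor, so the output length depends on runtime
# values, which torch_xla cannot lower to a static graph efficiently.
repeats = torch.tensor([4, 4, 4])
dynamic = torch.repeat_interleave(indices, repeats)

assert torch.equal(static, dynamic)  # same values, different compile behaviour
```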
    # The platforms that are compatible with the PyTorch-native implementation can
    # inherit this class
    class PunicaWrapperTPU(PunicaWrapperBase):
Why not directly inherit from PunicaWrapperCPU?
I thought about it, but this code is going to change very soon as I add in the Pallas kernels.
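For readers following along, a rough sketch of the inheritance question being discussed; the docstrings and class bodies here are illustrative assumptions, not vLLM's actual code:

```python
# Illustrative skeleton of the relationship only; real bodies elided.
class PunicaWrapperBase:
    """Shared bookkeeping that maps tokens to their LoRA adapters."""


class PunicaWrapperCPU(PunicaWrapperBase):
    """Implements the LoRA ops with PyTorch-native kernels."""


class PunicaWrapperTPU(PunicaWrapperBase):
    """Inherits the base directly (rather than PunicaWrapperCPU) so the
    interim PyTorch ops can later be swapped for Pallas kernels without
    carrying over the CPU implementation."""
```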
cc @lsy323 to take a pass
…because xla doesn't allow partial updates Signed-off-by: Akshat Tripathi <[email protected]>
This PR adds a Multi-LoRA implementation that works on the TPU backend, extending the work done in #11100.
Currently this uses PyTorch operations for the Punica kernels, but I am going to put up a PR with Pallas kernels soon.
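For context, a rough sketch of what "PyTorch operations for the Punica kernels" can look like; the tensor layouts and function name below are assumptions for illustration, not the PR's actual implementation:

```python
import torch


def bgmv_shrink_reference(
    x: torch.Tensor,        # (num_tokens, hidden_dim) input activations
    lora_a: torch.Tensor,   # (num_loras, rank, hidden_dim) stacked LoRA A matrices
    indices: torch.Tensor,  # (num_tokens,) adapter index assigned to each token
    scale: float = 1.0,
) -> torch.Tensor:
    """Batched 'shrink' step of a LoRA forward pass in plain PyTorch."""
    selected = lora_a[indices]                     # (num_tokens, rank, hidden_dim)
    out = torch.einsum("td,trd->tr", x, selected)  # (num_tokens, rank)
    return out * scale
```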