multi-gpu: fix tensor device placements for various models #35763

dvrogozh · 2025-01-18T04:54:50Z

Fixes: #35762
CC: @SunMarc @ydshieh @faaany

Fixes the following error in a few models:
```
> hidden_states = inputs_embeds + pos_embeds
E RuntimeError: Expected all tensors to be on the same device, but found at least two devices, xpu:2 and xpu:3!
```
Fixes: huggingface#35762
Signed-off-by: Dmitry Rogozhkin <[email protected]>


dvrogozh · 2025-01-18T04:58:28Z

src/transformers/models/falcon_mamba/modeling_falcon_mamba.py

@@ -309,6 +309,7 @@ def slow_forward(
                )  # [batch, intermediate_size, seq_len]
            else:
                conv_state = cache_params.update_conv_state(self.layer_idx, hidden_states, cache_position)
+                conv_state = conv_state.to(self.conv1d.weight.device)


FYI reviewers, this fix was basically taken from here:

transformers/src/transformers/models/mamba2/modeling_mamba2.py

Line 482 in 5fa3534

conv_states = cache_params.conv_states[self.layer_idx].to(device=self.conv1d.weight.device)
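For readers following along, here is a minimal sketch of the pattern in isolation. It is not the model code: `depthwise_step` is a hypothetical helper, and the way the state is multiplied against the depthwise kernel is an assumption made only to give the moved tensor something to do. The point is the single `.to(conv1d.weight.device)` call before the cached state meets the layer's weights.

```python
import torch
from torch import nn


def depthwise_step(conv1d: nn.Conv1d, conv_state: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: apply a depthwise kernel to a cached state.

    conv_state: [batch, channels, kernel_size], possibly allocated on a
    different device than the layer that consumes it.
    """
    # The fix under discussion: align the cached state with the weights.
    # Tensor.to() returns the tensor unchanged if it is already there.
    conv_state = conv_state.to(conv1d.weight.device)
    # Depthwise weight is [channels, 1, kernel_size]; drop the middle dim.
    kernel = conv1d.weight[:, 0, :]
    # Weighted sum over the kernel window gives [batch, channels].
    return torch.sum(conv_state * kernel, dim=-1)


# Illustrative sizes only; everything runs on CPU here.
conv1d = nn.Conv1d(in_channels=4, out_channels=4, kernel_size=3, groups=4, bias=False)
cached_state = torch.randn(2, 4, 3)
out = depthwise_step(conv1d, cached_state)
```

On a single device the extra `.to()` is a no-op, so the change should only matter when the cache and the layer end up on different devices.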

dvrogozh · 2025-01-18T05:01:26Z

src/transformers/models/gpt2/modeling_gpt2.py

@@ -818,7 +818,7 @@ def forward(
        if inputs_embeds is None:
            inputs_embeds = self.wte(input_ids)
        position_embeds = self.wpe(position_ids)
-        hidden_states = inputs_embeds + position_embeds
+        hidden_states = inputs_embeds + position_embeds.to(inputs_embeds.device)


FYI reviewers, this fix was taken from here:

transformers/src/transformers/models/whisper/modeling_whisper.py

Line 1270 in 5fa3534

hidden_states = inputs_embeds + positions.to(inputs_embeds.device)

See #30836 (comment) for associated discussion.
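A small sketch of the same idea outside the model, in case it helps review. The helper names `add_position_embeddings` and `pick_two_devices` are hypothetical, and the CUDA device names are only an assumption for illustration (the failure in #35762 was reported on XPU). With fewer than two accelerators the sketch falls back to CPU for both tensors.

```python
import torch


def add_position_embeddings(
    inputs_embeds: torch.Tensor, position_embeds: torch.Tensor
) -> torch.Tensor:
    # Move the second operand to the device of the first before adding.
    # When both already share a device, .to() returns the same tensor.
    return inputs_embeds + position_embeds.to(inputs_embeds.device)


def pick_two_devices():
    # Hypothetical helper: use two accelerators if present, else CPU twice.
    if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    cpu = torch.device("cpu")
    return cpu, cpu


dev_a, dev_b = pick_two_devices()
inputs_embeds = torch.ones(1, 3, 4, device=dev_a)
position_embeds = torch.full((1, 3, 4), 0.5, device=dev_b)

# Without the .to() call this addition would raise the RuntimeError quoted
# in the description whenever dev_a and dev_b differ.
hidden_states = add_position_embeddings(inputs_embeds, position_embeds)
```

The result lives on the device of `inputs_embeds`, which matches the choice made in the Whisper line referenced above.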


dvrogozh added 2 commits January 18, 2025 04:46

multi-gpu: fix tensor device placements for various models

0585ffc

Fixes: huggingface#35762
Signed-off-by: Dmitry Rogozhkin <[email protected]>

dvrogozh requested review from zucchini-nlp, ArthurZucker, amyeroberts and qubvel as code owners January 18, 2025 04:54

dvrogozh mentioned this pull request Jan 18, 2025

multi-gpu: test_model_parallel_beam_search tests fail with "RuntimeError: Expected all tensors to be on the same device" #35762

Open


fix copies

8ce6e2f

Signed-off-by: Dmitry Rogozhkin <[email protected]>

dvrogozh requested review from eustlb, Cyrilvallez and Rocketknight1 as code owners January 18, 2025 06:20
