CollectivePermute support #8815

rpsilva-aws · 2025-03-11T17:51:24Z

🐛 Bug

We currently discourage the use of CollectivePermute in #2384. There is no context on why this is the case, including whether the motivation was hardware-specific. It seems that we have enabled it for All-to-All (#2472). Do we have the same guidance/information from the XLA team?

I cannot find any relevant reference to 'all_to_all_emitter'.

cc: @miladm @ManfeiBai @JackCaoG

ManfeiBai · 2025-03-11T20:58:38Z

Hi, @rpsilva-aws, thanks,

IIUC, all_to_all_emitter is defined internally for all_to_all, which helped fix the all_to_all instability issue.

For collective_permute, we would need to sync with the XLA team, cc @ddunl

Btw, what failure are you currently hitting with collective_permute? Any context or repro material?

rpsilva-aws · 2025-03-11T21:33:10Z

Thanks @ManfeiBai. I have not encountered any failures yet, at least on Neuron's TRN1, but the docstring call-out is generally concerning when trying to productionize this collective on XLA. We need to use it instead of P2P send/recv for other hardware-specific reasons.
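For anyone following along, here is a minimal sketch of the semantics being discussed, written as a plain-Python simulation rather than a call into torch_xla or XLA. The helper names `simulate_collective_permute` and `ring_pairs` are hypothetical and exist only for illustration. It assumes the documented XLA behavior: each (source, target) pair copies the source replica's value to the target, no replica may appear twice as a source or twice as a target, and a replica that is not a target of any pair receives zeros.

```python
from typing import List, Sequence, Tuple


def ring_pairs(num_replicas: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: replica i sends to replica (i + 1) % n."""
    return [(i, (i + 1) % num_replicas) for i in range(num_replicas)]


def simulate_collective_permute(
    values: Sequence[float], pairs: Sequence[Tuple[int, int]]
) -> List[float]:
    """Simulate CollectivePermute over one scalar value per replica.

    This is an illustration of the semantics only, not the real lowering.
    """
    sources = [s for s, _ in pairs]
    targets = [t for _, t in pairs]
    # Each replica may send at most once and receive at most once.
    if len(set(sources)) != len(sources):
        raise ValueError("a replica appears as a source more than once")
    if len(set(targets)) != len(targets):
        raise ValueError("a replica appears as a target more than once")

    # Replicas that receive nothing keep the zero-filled default.
    out = [0.0] * len(values)
    for src, dst in pairs:
        out[dst] = values[src]
    return out


if __name__ == "__main__":
    # Ring shift across 4 replicas, the pattern usually built from send/recv.
    print(simulate_collective_permute([10.0, 11.0, 12.0, 13.0], ring_pairs(4)))
```

The ring shift is the case that would otherwise be expressed as paired send/recv calls, which is why the guidance on this collective matters for that use.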

> IIUC, all_to_all_emitter is defined internally for all_to_all, which helped fixed all_to_all unstable issue;

This is what I understood as well from Jack's PR above, but it would be nice to have a reference point that we can use to cross-check other collectives (particularly CollectivePermute). I'll wait for the XLA team's comment.

miladm · 2025-03-13T17:01:39Z

@yaochengji - can you please share the latest technical updates on the support of this op outside of the SPMD path?

rpsilva-aws changed the title ~~CollectivePermute ambiguous support~~ CollectivePermute support Mar 11, 2025

ysiraichi added the enhancement (New feature or request) and SPMD / Distributed labels Mar 12, 2025
