CollectivePermute support #8815

Open
rpsilva-aws opened this issue Mar 11, 2025 · 3 comments
Labels
enhancement (New feature or request), SPMD / Distributed

Comments

@rpsilva-aws
Collaborator

rpsilva-aws commented Mar 11, 2025

🐛 Bug

We currently discourage the use of CollectivePermute (see #2384), but there is no context on why this is the case, including whether the motivation was hardware-specific. It seems that All-to-All has been enabled (#2472). Do we have the same guidance/information from the XLA team for CollectivePermute?

I cannot find any relevant reference to 'all_to_all_emitter'.

cc: @miladm @ManfeiBai @JackCaoG

@rpsilva-aws changed the title from "CollectivePermute ambiguous support" to "CollectivePermute support" Mar 11, 2025
@ManfeiBai
Collaborator

Hi, @rpsilva-aws, thanks,

IIUC, all_to_all_emitter is defined internally for all_to_all, and it helped fix the all_to_all instability issue.

For collective_permute, we would need to sync with the XLA team, cc @ddunl

By the way, what failure are you currently seeing with collective_permute? Any context or repro material?

@rpsilva-aws
Collaborator Author

rpsilva-aws commented Mar 11, 2025

Thanks @ManfeiBai. I have not encountered any failures yet, at least on Neuron's TRN1, but the docstring call-out is concerning when trying to productionize this collective on XLA. We need to use it instead of P2P send/recv for other hardware-specific reasons.
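For reference, the data movement that CollectivePermute performs can be modeled in a few lines of plain Python. This is a minimal sketch based on the XLA operation semantics (each source replica sends its operand to one target, and replicas that are not a target receive zeros); it does not call torch_xla, and the function name `simulate_collective_permute` is hypothetical.

```python
from typing import List, Sequence, Tuple


def simulate_collective_permute(
    replica_values: Sequence[List[float]],
    source_target_pairs: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Model CollectivePermute over len(replica_values) replicas (hypothetical helper)."""
    num_replicas = len(replica_values)
    sources = [source for source, _ in source_target_pairs]
    targets = [target for _, target in source_target_pairs]

    # A replica may appear at most once as a source and at most once as a target.
    if len(set(sources)) != len(sources) or len(set(targets)) != len(targets):
        raise ValueError("duplicate source or target in source_target_pairs")

    # Replicas that are not the target of any pair end up with zeros.
    outputs = [[0.0] * len(value) for value in replica_values]
    for source, target in source_target_pairs:
        if not (0 <= source < num_replicas and 0 <= target < num_replicas):
            raise ValueError("replica id out of range")
        outputs[target] = list(replica_values[source])
    return outputs


values = [[0.0, 0.5], [1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]

# Ring shift: replica i sends to replica (i + 1) % 4.
ring_result = simulate_collective_permute(values, [(0, 1), (1, 2), (2, 3), (3, 0)])

# Partial permutation: replicas 0 and 3 are not targets, so they receive zeros.
partial_result = simulate_collective_permute(values, [(0, 1), (1, 2)])
```

The ring case is the pattern that would otherwise be written as paired send/recv calls between neighbors.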

IIUC, all_to_all_emitter is defined internally for all_to_all, and it helped fix the all_to_all instability issue.

This is what I understood as well from Jack's PR above, but it would be nice to have a reference point that we can use to cross-check other collectives (particularly CollectivePermute). I'll wait on the XLA team's comment.

@ysiraichi added the enhancement (New feature or request) and SPMD / Distributed labels Mar 12, 2025
@miladm
Collaborator

miladm commented Mar 13, 2025

@yaochengji - can you please share the latest technical updates on support for this op outside of the SPMD path?


4 participants