We currently discourage the use of CollectivePermute in #2384. There is no context behind why this is the case, including whether the motivation was hardware specific. It seems that we have enabled it for All-to-All (#2472) - do we have the same guidance/information from the XLA team?
I cannot find any relevant reference to 'all_to_all_emitter'.
Thanks @ManfeiBai. I have not encountered any failure yet - at least with Neuron's TRN1, but this is a generally concerning docstring/call-out when trying to productionize with this collective on XLA. We need to use this instead of P2P send/recv for other HW specific reasons.
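For readers comparing this collective with P2P send/recv, here is a minimal pure-Python sketch of what CollectivePermute computes, as I understand the XLA operation semantics. It does not call torch_xla or any device runtime; `collective_permute` and `ring_pairs` below are hypothetical helper names used only for illustration, and each list entry stands in for one replica's tensor.

```python
from typing import List, Sequence, Tuple


def collective_permute(values: Sequence[float],
                       pairs: Sequence[Tuple[int, int]]) -> List[float]:
    """Simulate CollectivePermute: values[i] is the data held by replica i.

    Each (source, target) pair sends the source replica's value to the
    target replica. Replicas that are not a target of any pair get zeros.
    """
    sources = [s for s, _ in pairs]
    targets = [t for _, t in pairs]
    # No two pairs may share a source, and no two pairs may share a target.
    if len(set(sources)) != len(sources) or len(set(targets)) != len(targets):
        raise ValueError("duplicate source or target replica in pairs")

    out = [0.0] * len(values)
    for src, dst in pairs:
        out[dst] = values[src]
    return out


def ring_pairs(num_replicas: int) -> List[Tuple[int, int]]:
    """Pairs for a ring shift: replica i sends to replica (i + 1) % n."""
    return [(i, (i + 1) % num_replicas) for i in range(num_replicas)]


# A ring shift over 4 replicas, the pattern usually built from send/recv.
shifted = collective_permute([10.0, 11.0, 12.0, 13.0], ring_pairs(4))
print(shifted)

# A single pair: replica 1 receives from replica 0, replica 0 gets zeros.
partial = collective_permute([1.0, 2.0], [(0, 1)])
print(partial)
```

The point of the sketch is only that every replica takes part in one operation described by a list of source-target pairs, instead of issuing matched send and recv calls. It says nothing about the stability concern raised in this issue.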
IIUC, all_to_all_emitter is defined internally for all_to_all, and it helped fix the all_to_all instability issue.
This is what I understood as well from Jack's PR above, but it would be nice if we had a reference point that we can use to cross check with other collectives (particularly CollectivePermute). I'll wait on the XLA team's comment.
🐛 Bug
cc: @miladm @ManfeiBai @JackCaoG