Why does Tensor-parallel Communication Overlap require MPI? #11849

mjkpolo · 2025-01-14T19:50:18Z


Hello,
I was trying to replicate the MLCommons Llama 2 70B training run on H200s, but the cluster I am using doesn't support MPI. I got it working by setting

export TP_COMM_OVERLAP=0

and using a smaller model, because I noticed in the NeMo code:

NeMo/nemo/lightning/_strategy_lib.py, line 92 in dc08edd:

init_mpi_proc_group=getattr(parallel_config, "tp_comm_overlap", False)
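As far as I can tell, that expression only asks for the MPI process group when `tp_comm_overlap` is set to something truthy on the config, and falls back to `False` when the attribute is missing. Here is a minimal sketch of that behavior. The `SimpleNamespace` objects and the `wants_mpi_proc_group` helper are hypothetical stand-ins I made up for illustration, not NeMo's actual classes or functions:

```python
from types import SimpleNamespace


def wants_mpi_proc_group(parallel_config) -> bool:
    """Hypothetical helper mirroring the quoted getattr expression."""
    # getattr returns the third argument (False) when the attribute is absent.
    return bool(getattr(parallel_config, "tp_comm_overlap", False))


# Hypothetical stand-ins for the real parallel config object.
overlap_on = SimpleNamespace(tp_comm_overlap=True)
overlap_off = SimpleNamespace(tp_comm_overlap=False)
no_attribute = SimpleNamespace()

for name, cfg in [("on", overlap_on), ("off", overlap_off), ("missing", no_attribute)]:
    print(name, wants_mpi_proc_group(cfg))
```

So with the overlap disabled (or never set), the MPI process group should not be requested at all, which matches what I saw.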

I am not very familiar with training or HPC applications, but why does this feature require MPI instead of being able to use NCCL? I read this blog post to try to understand what tensor-parallel communication overlap is, but I can't figure out why I need MPI for it.

Thanks!
