checkpointing/PGTransport: add NCCL/Gloo transport for checkpoints #110
Conversation
this is really cool work!
def _prepare_tensor(tensor: torch.Tensor) -> Tuple[torch.Tensor, _TensorMeta]:
    return (
        _cast_tensor(tensor, torch.uint8),
IIUC, we're casting to uint8 to reduce memory pressure / speed up the transfer, but should we be concerned about any precision loss? I see that transport_test.py verifies closeness/correctness through run_multi_recovery_test, but it isn't making sense to me!
This just reinterprets the tensor as a bunch of bytes (hence the uint8) backed by the same UntypedStorage range of bytes. No bytes modified, so no loss of precision:
https://pytorch.org/docs/stable/storage.html#untyped-storage-api
@d4l3k I presume if you don't do this cast and just pass the original tensor object with its original dtype, then you do lose precision? Or is it just inconvenient on the recv side to interpret the tensor with its original dtype right away?
In most cases it doesn't matter and should result in byte-identical output.
There are a few trade-offs:
- Pro: not all tensor dtypes are supported by NCCL. torch.uint16, for instance, can't be sent via NCCL, so casting to uint8 lets us support any dtype.
- Con: arguably it's better to use the non-storage option to avoid sending duplicate/extra bytes for strided/offset tensors. If two tensors share the same underlying storage, or a tensor is strided, this implementation ends up sending twice as much data.
can we just do tensor.view(torch.uint8) instead?
.view doesn't work for strided tensors -- I'm not sure we need to support those but I think I'll leave it as is for now
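To make the trade-off concrete, here is a minimal sketch (the `base`/`strided`/`byte_view` names are illustrative, not from the PR) of why `.view(torch.uint8)` rejects strided tensors while a storage-level byte view always works, at the cost of covering the whole storage:

```python
import torch

# A non-contiguous view: every other column of a 4x4 float32 matrix.
base = torch.arange(16, dtype=torch.float32).reshape(4, 4)
strided = base[:, ::2]

# view(dtype) to a smaller element size requires the last dimension to be
# contiguous, so this raises for the strided view.
try:
    strided.view(torch.uint8)
except RuntimeError as e:
    print("view failed:", e)

# A byte view over the UntypedStorage always works, but it covers the
# *entire* storage (all 16 floats, 64 bytes), not just the 8 selected
# elements -- the duplicate/extra-bytes cost mentioned above.
byte_view = torch.empty(0, dtype=torch.uint8, device=base.device)
byte_view.set_(base.untyped_storage())
print(byte_view.shape)  # torch.Size([64])
```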
def metadata(self) -> str:
    return "<n/a>"

def disallow_checkpoint(self) -> None:
should this be implemented?
We don't support any async/out-of-band calls via PG, so nothing needs to be done here.
For HTTP we need this to avoid serving a checkpoint during the optimizer step, but since send is synchronous in PG we don't need any additional synchronization.
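For context, a no-op body would match that reasoning; a sketch, not necessarily the exact code in this PR:

```python
def disallow_checkpoint(self) -> None:
    # For the PG transport the sends block until the transfer completes,
    # so there is no background server (unlike the HTTP transport) that
    # could serve a checkpoint mid-step; nothing to disallow here.
    pass
```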
work.append(self._pg.send([t], dst_rank, tag=3 + i))

# allow 3 concurrent transfers at a time
while len(work) > (3 * len(dst_ranks)):
Is this used to avoid OOM?
Yes, it's to avoid OOM when transferring between devices
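A rough sketch of that bounded-concurrency pattern, assuming `pg.send` returns a waitable Work object as in c10d (`send_tensors` and `MAX_INFLIGHT_PER_DST` are illustrative names, not from the PR):

```python
from typing import List, Sequence

import torch

MAX_INFLIGHT_PER_DST = 3  # cap on outstanding sends per destination


def send_tensors(pg, tensors: Sequence[torch.Tensor], dst_ranks: List[int]) -> None:
    work = []
    for i, t in enumerate(tensors):
        for dst_rank in dst_ranks:
            # tag offset mirrors the quoted snippet (3 + i)
            work.append(pg.send([t], dst_rank, tag=3 + i))
        # Once too many sends are in flight, wait on the oldest ones so the
        # number of live staging buffers (and GPU memory) stays bounded.
        while len(work) > MAX_INFLIGHT_PER_DST * len(dst_ranks):
            work.pop(0).wait()
    # Drain the remaining transfers before returning.
    for w in work:
        w.wait()
```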
i += 1

# TODO: allow in place receives to avoid having to copy to cpu to
# avoid OOMs
We should be able to avoid transferring TensorMeta and DTensorMeta, and avoid to(cpu), if we can first call state_dict() to get the state_dict structure, then traverse the state_dict and send/recv the tensors directly.
Yeah, we totally can do that in most cases -- I wanted to make that refactor a follow-up PR since I still need to figure out how to do it, and this may be a decent fallback if we have some weird state dict.
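A minimal sketch of that kind of traversal, assuming both ranks build a state_dict with identical structure so tensors are visited in the same deterministic order (`iter_tensors` is a hypothetical helper, not part of this PR):

```python
from typing import Any, Iterator

import torch


def iter_tensors(obj: Any) -> Iterator[torch.Tensor]:
    # Walk a (possibly nested) state_dict in a deterministic order so the
    # sender and receiver enumerate tensors identically and can send/recv
    # them directly, without shipping pickled metadata first.
    if isinstance(obj, torch.Tensor):
        yield obj
    elif isinstance(obj, dict):
        for key in sorted(obj):
            yield from iter_tensors(obj[key])
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from iter_tensors(item)
```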
Overall looks pretty good! I had some comments and questions, mostly around readability and using existing infra, but otherwise it looks solid enough to land.
dtype: torch.dtype
storage_offset: int
stride: Tuple[int, ...]
nbytes: int
Not important now, but I wonder if we need to store quantization information; also wondering how that's handled in DTensor, if you know.
@H-Huang do you know how quantized information is stored? Is it a different tensor subclass or just packed into the storage?
work = []

with _timeit("send pickle"):
So we have send_object_list and recv_object_list (https://github.com/pytorch/pytorch/blob/8b5ee275fb455156a944445fb92c43731369ace3/torch/distributed/distributed_c10d.py#L3181), which is what we use in PP to exchange shape metadata between stages to preallocate the recv buffers.
It is pretty similar since it pickles, then sends object sizes, then the object data. I think yours may be more efficient since there are only 2 additional sends of metadata and the rest are the actual data. But wanted to flag in case we wanted to somehow consolidate some logic!
Thanks! Yeah, I missed that before -- good to know, though for right now I think this implementation is a bit more performant since it sends the same data to multiple receivers.
If we wrap the underlying PG we should also be able to use the broadcast_object_list variant, which should give us the best of both worlds.
I'm planning a follow-up PR, since to make a subworld we need to do some underlying improvements in how we calculate the recovering workers.
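For illustration, a sketch of what the broadcast_object_list variant could look like once the sender and recovering workers share a subgroup (the `broadcast_metadata` helper and its arguments are hypothetical):

```python
import torch.distributed as dist


def broadcast_metadata(meta, group, src_rank: int, is_src: bool):
    # One broadcast replaces N per-receiver sends of the pickled metadata:
    # the source supplies the object, every other rank passes a placeholder
    # and gets it filled in.
    objs = [meta if is_src else None]
    dist.broadcast_object_list(objs, src=src_rank, group=group)
    return objs[0]
```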
@dataclass
class _StateDictMeta:
I was thinking the _StateDictMeta and related dataclasses could probably use some more comments, since they're pretty important in determining how we serialize/deserialize and in being able to update them. I'm kinda curious how DCP handles this metadata when it transfers and whether we have existing structure we can use? @fegin @LucasLLC do you know?
Added comments and updated field names to make it clearer
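For reference, a commented sketch of the per-tensor metadata along the lines of the quoted diff; only dtype/storage_offset/stride/nbytes appear in the diff above, so the `shape` field and the comments are assumptions, not the PR's exact code:

```python
from dataclasses import dataclass
from typing import Tuple

import torch


@dataclass
class _TensorMeta:
    """Everything needed to rebuild a tensor from the raw uint8 bytes that
    are actually sent over the process group."""

    shape: torch.Size          # logical shape of the original tensor (assumed field)
    dtype: torch.dtype         # original dtype, restored on the receive side
    storage_offset: int        # offset of this tensor within its storage
    stride: Tuple[int, ...]    # strides, so strided/offset views round-trip
    nbytes: int                # size of the uint8 payload sent for this tensor
```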
This adds a ProcessGroup based CheckpointTransport to allow for transferring a state_dict via the backend network.
This supports NCCL with CUDA devices and Gloo on CPU devices. DTensor is supported, but other tensor subclasses will likely error.
This is cleaned up code from #104
Additional changes:
- ProcessGroupBaby: pass the timeout parameter to the subprocess so we can catch NCCL timeouts

Core algorithm:
Sender: the sends go out to each of the receiving peers simultaneously.

Receiver: receiving is largely the same. Notably, when receiving each tensor, a buffer is allocated on the device, received into, and then transferred to CPU to prevent CUDA OOMs. Allocating and transferring is significantly slower than an in-place receive and is something we should fix in a follow-up PR.
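A sketch of that receive path under the description above (device staging buffer, then copy to CPU); `recv_tensor` and its signature are illustrative, not the PR's exact code:

```python
import torch


def recv_tensor(pg, meta, src_rank: int, device: torch.device, tag: int) -> torch.Tensor:
    # Stage the bytes in a device buffer so NCCL can receive them, then move
    # them to CPU so only one device-sized buffer is live at a time. An
    # in-place receive into the final destination would skip this extra
    # allocation + copy (the follow-up mentioned above).
    buf = torch.empty(meta.nbytes, dtype=torch.uint8, device=device)
    pg.recv([buf], src_rank, tag=tag).wait()
    return buf.cpu()
```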
Test plan:
Added a new shared multi-rank recovery test and enabled it for PGTransport w/ NCCL+Gloo and the existing HTTPTransport.