:github_url: https://github.com/pytorch-labs/torchft

This repository implements primitives and end-to-end solutions for per-step fault tolerance, so training can continue when errors occur without interrupting the entire training job.
GETTING STARTED? See Install and Usage in the README.

.. toctree::
   :maxdepth: 1
   :caption: Reference

   process_group
   manager
   optim
   ddp
   local_sgd
   data
   checkpointing
   parameter_server
   coordination

torchft is BSD 3-Clause licensed. See LICENSE for more details.
Copyright © Meta Platforms, Inc.