:github_url: https://github.com/pytorch-labs/torchft

torchft
=======

This repository implements primitives and end-to-end (E2E) solutions for per-step fault tolerance, so that training can continue when errors occur without interrupting the entire training job.

GETTING STARTED? See Install and Usage in the README.

.. toctree::
    :maxdepth: 1
    :caption: Reference

    process_group
    manager
    optim
    ddp
    local_sgd
    data
    checkpointing
    parameter_server
    coordination


License
-------

torchft is BSD 3-Clause licensed. See LICENSE for more details.

Copyright © Meta Platforms, Inc