
[REP] AWS accelerators trn1_inf support #39

Open · wants to merge 1 commit into main

Conversation


@chappidim chappidim commented Jul 27, 2023

Details

This enhancement proposal briefly talks about AWS accelerators (Trainium/Inferentia) support on Ray.

Related: #33504

Signed-off-by: maheedhar reddy chappidi <[email protected]>
@scv119
Contributor

scv119 commented Jul 28, 2023

Will take a first round review this weekend!

@scv119
Contributor

scv119 commented Jul 30, 2023

Looks like there are two main questions:

  • Should we use a different resource name other than GPU?

This is mainly because Ray Train code is expected to need no code changes when a GPU is available, and that doesn't seem to be the case for AWS's accelerators (they require code changes). If we want to stick with GPU, we should have a way to differentiate NVIDIA GPUs from other XLA accelerators.

  • Do all XLA devices follow the same PyTorch API?

This is somewhat related to the previous question. I assume that's the case, and we should make sure the API works out of the box for other XLA devices, like TPU.

@chappidim
Author

Thanks for the initial review.

  1. +1 to that. I'm open to adding a new predefined resource type if aws_accelerators doesn't fit as GPU. The only counterpoint is that, per this doc, we treat any other accelerator as GPU.
  2. AWS expects users to install a custom-built torch_xla. This implies torch_xla_neuron may not support some of the APIs, but the package follows all the API standards listed by open-source torch-xla (see the sketch below for why code changes are still needed compared to CUDA).
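
As a minimal sketch of the code-change point (assuming torch-xla, or the Neuron build torch-neuronx, is installed): the training loop has to fetch the device from torch_xla and step the optimizer through it, instead of using `torch.device("cuda")`. The model and tensor shapes here are purely illustrative.

```python
import torch
import torch_xla.core.xla_model as xm  # provided by torch-xla / torch-neuronx

# On CUDA, existing Ray Train code can simply do: device = torch.device("cuda")
# On an XLA device (TPU or Trainium) the device comes from torch_xla instead:
device = xm.xla_device()

model = torch.nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 10).to(device)
y = torch.randn(8, 1).to(device)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# barrier=True forces the lazy XLA graph to execute here; in multi-worker runs
# optimizer_step also performs the cross-replica gradient reduction, which plain
# CUDA training does not need.
xm.optimizer_step(optimizer, barrier=True)
```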

Comment on lines +78 to +91
TorchBackend is the communication backend for TorchTrainer, and it supports a limited set of backends (nccl, gloo) today.
In order to support NeuronCore we would use the PyTorch XLA framework and configure the backend to XLA.
This also requires additional configuration of torch-elastic (now called torchrun) environment variables
so that the XLA devices can be detected.

```python
class _TorchBackend(Backend):
    def on_start(self, worker_group, backend_config):
        # support xla backend
        # Configure master env of xla device related to torchrun/torch-elastic
        ...

    def on_shutdown(self, worker_group, backend_config):
        # cleanup NeuronCore cache if needed
        ...

    def on_training_start(self, worker_group, backend_config):
        # configure rank/world_size/node_rank based on xla device
        ...
```
Contributor

I'm a bit fuzzy on whether XLA is expected to sit at the same level in the stack as nccl and gloo, based on my knowledge of Torch backends.

As an alternative approach, would the following interface make logical sense? Is this the right layer of abstraction?

```python
class TorchXLAConfig(TorchConfig):

    @property
    def backend_cls(self):
        return _TorchXLABackend


class _TorchXLABackend(_TorchBackend):
    # XLA specific logic here
    ...


# User defined code
trainer = TorchTrainer(torch_config=TorchXLAConfig(...))
```

To better understand how to think about this, I'd love to learn more about how Torch XLA environments are typically set up and configured in practice. Do you have any pointers to best practices or other references I could take a look at?

Author


I'm a newbie to this space, but as I learn more it makes sense to have a separate XLAConfig with its own Backend. I'm wondering if we want to be more explicit about the backend, since it can vary per XLA device: e.g. TorchXLAConfig/_TorchAwsNeuronXLABackend, where the basic setup (rank/world size, master port/addr) is already done by the Neuron SDK [1], plus anything related to torchrun [2].

I'm happy to ask some SMEs in this area, but here's the information I've gathered so far.

  1. Configure the TPU library settings [3][4]
  2. Configure pjrt (latest) / xrt [5]
  3. Configure world/local rank and size, and master addr/port (for torchrun) - generic to torch (a rough sketch follows after the references)

[1] https://sagemaker.readthedocs.io/en/v2.167.0/frameworks/pytorch/using_pytorch.html#distributed-training-with-pytorch-neuron-on-trn1-instances
[2] https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html?highlight=torchrun#multi-worker-data-parallel-mlp-training-using-torchrun
[3] https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#troubleshooting
[4] https://lightning.ai/docs/pytorch/stable/accelerators/tpu_faq.html
[5] https://github.com/pytorch/xla/blob/424d8c8eec4e7fd3af83749da2652b2fcd6d207a/configuration.yaml
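
For point 3, here is a rough sketch (not the proposed implementation) of the torchrun-style environment a _TorchAwsNeuronXLABackend would need in place before the process group is created. The concrete values are placeholders, and the Neuron/XLA-runtime variables from points 1 and 2 are intentionally left out since I'm less sure of their exact names.

```python
import os
import torch.distributed as dist
import torch_xla.distributed.xla_backend  # noqa: F401 -- registers the "xla" process-group backend

# Generic torchrun/torch-elastic variables; in Ray Train these would be derived
# from the worker group rather than hard-coded like this.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # address of the rank-0 worker
os.environ.setdefault("MASTER_PORT", "29500")      # any free port on that worker
os.environ.setdefault("RANK", "0")                 # global rank of this worker
os.environ.setdefault("WORLD_SIZE", "1")           # total number of workers
os.environ.setdefault("LOCAL_RANK", "0")           # rank within this node

dist.init_process_group(
    backend="xla",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```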

@chappidim
Author

@scv119 Checking if we have enough quorum on adding num_neuron_cores as a pre-defined resource? Thanks
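
For context, a hedged sketch of what the user-facing side could look like if num_neuron_cores lands as a resource. Today it would behave like any other Ray custom resource; the name is the one this REP proposes, so treat it as illustrative rather than a final API.

```python
import ray

# Until the resource is auto-detected, a node would have to advertise it manually,
# e.g. `ray start --resources='{"num_neuron_cores": 2}'`.
ray.init()

@ray.remote(resources={"num_neuron_cores": 1})
def train_on_neuron_core():
    # Neuron/XLA training code would go here, pinned to one NeuronCore.
    return "scheduled against a NeuronCore"

print(ray.get(train_on_neuron_core.remote()))
```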
