[REP] AWS accelerators trn1_inf support #39
base: main
Conversation
Signed-off-by: maheedhar reddy chappidi <[email protected]>
Will take a first round review this weekend!
It looks like there are two main questions:
- This is mainly because Ray Train code is expected to require no code changes when a GPU is available, and that does not seem to be the case for AWS's accelerators (they require code changes). If we want to stick with GPU, we should have a way to differentiate NVIDIA GPUs from other XLA accelerators.
- This is somewhat related to the previous question. I assume that is the case, and we should make sure the API works out of the box for other XLA devices, like TPU.
Thanks for the initial review.
TorchBackend is the communication layer for TorchTrainer, and it supports a limited set of backends (nccl, gloo) today.
In order to support NeuronCore we would use the PyTorch/XLA framework and configure the backend to xla.
This also requires additional configuration of torch-elastic (now called torchrun) environment variables
so that the XLA devices can be detected.
```python
class _TorchBackend(Backend):
    def on_start(self):
        # Support the xla backend.
        # Configure the master env vars of the xla device related to torchrun/torch-elastic.
        ...

    def on_shutdown(self):
        # Clean up the NeuronCore cache if needed.
        ...

    def on_training_start(self):
        # Configure rank/world_size/node_rank based on the xla device.
        ...
```
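As a rough illustration of the torchrun/torch-elastic configuration mentioned above, here is a minimal sketch. It assumes the standard torch.distributed environment variable names and the xla process-group backend registered by torch_xla; the helper name and default values are hypothetical:

```python
import os

import torch.distributed as dist
# Importing this module registers the "xla" process-group backend (torch_xla / torch-neuronx).
import torch_xla.distributed.xla_backend  # noqa: F401


def _setup_xla_torchrun_env(master_addr, master_port, rank, local_rank, world_size):
    # Hypothetical helper: exports the env vars torchrun would normally set,
    # so that XLA devices (e.g. NeuronCores) can discover the job topology.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)

    # With torch_xla installed, "xla" can be used as the process-group backend.
    dist.init_process_group(backend="xla", rank=rank, world_size=world_size)
```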
Based on my knowledge of Torch backends, I'm a bit fuzzy on whether XLA is expected to sit at the same level in the stack as nccl and gloo.
As an alternative approach, would the following interface make logical sense? Is this the right layer of abstraction?
```python
class TorchXLAConfig(TorchConfig):
    @property
    def backend_cls(self):
        return _TorchXLABackend


class _TorchXLABackend(_TorchBackend):
    # XLA-specific logic here
    ...


# User-defined code
trainer = TorchTrainer(torch_config=TorchXLAConfig(...))
```
To better understand how to think about this, I'd love to learn more about how Torch XLA environments are typically set up and configured in practice - do you have any pointers to best practices or other references I could take a look at?
I'm a newbie to this space, but as I learn more it makes sense to have a separate XLAConfig with its own Backend. I'm wondering if we want to be more explicit about the backend, since it can vary per XLA device, e.g. TorchXLAConfig/_TorchAwsNeuronXLABackend,
where the basic setup (rank/world size, master port/address) is already done by the Neuron SDK [1], and include anything related to torchrun [2].
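A minimal sketch of what that more explicit naming could look like (hypothetical class names, mirroring the interface sketched earlier in this thread):

```python
class TorchXLAConfig(TorchConfig):
    # Generic XLA-facing config; the concrete backend class can vary per XLA device.
    @property
    def backend_cls(self):
        return _TorchAwsNeuronXLABackend


class _TorchAwsNeuronXLABackend(_TorchBackend):
    # AWS Neuron flavour of the XLA backend: rank/world size and master
    # addr/port are assumed to be handled by the Neuron SDK [1], so this class
    # would mainly add the torchrun-related wiring [2].
    ...
```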
I'm happy to ask some SMEs in this area, but here's the information I've gathered so far (a rough sketch follows the references below):
- Configure the TPU library [3][4]
- Configure PJRT (latest) / XRT [5]
- Configure world/local rank and size, master address/port (for torchrun) - generic to Torch
[1] https://sagemaker.readthedocs.io/en/v2.167.0/frameworks/pytorch/using_pytorch.html#distributed-training-with-pytorch-neuron-on-trn1-instances
[2] https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html?highlight=torchrun#multi-worker-data-parallel-mlp-training-using-torchrun
[3] https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#troubleshooting
[4] https://lightning.ai/docs/pytorch/stable/accelerators/tpu_faq.html
[5] https://github.com/pytorch/xla/blob/424d8c8eec4e7fd3af83749da2652b2fcd6d207a/configuration.yaml
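To make those three buckets concrete, here is a hedged sketch of the kind of environment configuration involved. The exact variable names depend on the torch_xla and Neuron SDK versions in use (see [3][4][5]); PJRT_DEVICE, XRT_TPU_CONFIG, and NEURON_RT_NUM_CORES are assumptions taken from those docs, not part of this proposal:

```python
import os

# 1. Device library selection (PJRT is the newer runtime, XRT the legacy one).
os.environ.setdefault("PJRT_DEVICE", "NEURON")     # e.g. "TPU" on Cloud TPU
# os.environ.setdefault("XRT_TPU_CONFIG", "...")   # legacy XRT path for TPUs

# 2. Runtime-specific tuning (Neuron example; value is illustrative only).
os.environ.setdefault("NEURON_RT_NUM_CORES", "2")

# 3. Generic torchrun/torch.distributed variables (same as the earlier sketch).
for key, value in {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
}.items():
    os.environ.setdefault(key, value)
```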
@scv119 Checking if we got enough quorum on adding
Details
This enhancement proposal briefly describes AWS accelerator (Trainium/Inferentia) support on Ray.
Related
#33504