adding aws_sagemaker_scheduler #801
Please find this first version and let me know what you think, @kiukchung
Could you include unittests? Thanks!
Added doc files as requested, @kiukchung
Do we need to add anything else to this PR, @kiukchung?
LGTM thanks for adding this (and making the docs changes)!
@clumsy it looks like there are unittest failures. Can you take a look? (and also rebase onto HEAD)
I fixed the mypy issue, @kiukchung. Looks like we need permission to rerun all the workflows.
Thanks for adding this.
@clumsy looks like there are unittest failures. Do the tests pass locally?
Weird, they were passing locally, @kiukchung. Let me have a look.
The tests should pass now, @kiukchung
Thanks, it looks like lint is still failing. Can you try running the linter locally (essentially running |
@kiukchung looks like black rules might have been changed recently. rerun |
@clumsy looks like pyre (typecheck) is failing on a few sagemaker related files:
Fixed pyre and docs builds, @kiukchung
Please let me know if additional changes are needed, @kiukchung
Fixed the pyre issue by upgrading the local version of the pyre dependencies; the errors I see are not related to this PR:
link failure is not related to this PR:
Unit test failure is not related to this PR:
Docs build fine locally if I disable nbsphinx.
Thanks @clumsy for fixing! There was a merge conflict on |
@kiukchung looks like this PR introduced an issue in the doctest build https://github.com/pytorch/torchx/actions/runs/8456379306/job/23166027094
Providing an implementation of torchx scheduler for AWS SageMaker.
Several considerations that influenced this particular implementation:
torch_distributed entry_point does not allow specifying a custom torchrun command; it constructs the command itself, meaning that the component one should use for the aws_sagemaker scheduler is utils.python. That also means that instance count and instance type have to be passed as scheduler args.
--
Unit test is to be added later; sending this PR ahead to initiate the discussion and gather feedback.
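To illustrate the consideration above, here is a minimal sketch of how a launch command might be assembled when the utils.python component is used and the instance settings travel as scheduler args. The helper `build_torchx_command` is hypothetical and not part of torchx. The cfg keys `instance_count` and `instance_type` are assumed names taken from the description above, and the script name and instance type values are placeholders.

```python
import shlex
from typing import Dict, List


def build_torchx_command(scheduler_args: Dict[str, str], script: str) -> List[str]:
    """Hypothetical helper: assemble a `torchx run` command line as a list of tokens."""
    # Scheduler args are joined into one comma-separated key=value string for -cfg.
    cfg = ",".join(f"{key}={value}" for key, value in scheduler_args.items())
    return [
        "torchx", "run",
        "-s", "aws_sagemaker",   # scheduler added by this PR
        "-cfg", cfg,             # instance settings go here, not in the component
        "utils.python",          # component recommended in the description above
        "--script", script,
    ]


# Placeholder values; the key names are assumptions, not confirmed option names.
command = build_torchx_command(
    {"instance_count": "2", "instance_type": "ml.g5.xlarge"},
    "train.py",
)
print(shlex.join(command))
```

The point of the sketch is only the split of responsibilities: the component describes what to run, while the scheduler args describe where and on how many instances.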
Test plan: