Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding aws_sagemaker_scheduler #801

Merged
merged 2 commits into from
Mar 27, 2024
Merged

Conversation

clumsy
Copy link
Contributor

@clumsy clumsy commented Dec 18, 2023

Providing an implementation of torchx scheduler for AWS SageMaker.

Several considerations that influenced this particular implementation:

  1. SageMaker's torch_distributed entry_point does not allow specifying custom torchrun command, it constructs the command itself meaning the component that one should use for aws_sagemaker scheduler is utils.python. That also means that instance count and instance type have to be passed as scheduler args.
  2. Current SageMaker does not allow using modules, only scripts are allowed. This should go away once these two changes are adopted: feature: python module support to torch_distributed aws/sagemaker-python-sdk#4324 and feature: add python module entrypoint type, add python module support to torch_distributed aws/sagemaker-training-toolkit#205. This PR assumes they get adopted and allows specifying the module.
  3. In order to make torchx one stop for submitting jobs to SageMaker I added all currently supported scheduler parameters with their documentation from AWS docs.
  4. Some parameters had to be controlled by torchx (e.g. from role) - failing if one attempts to set those.
  5. Unfortunately special care had to be taken when processing hyperparameters (script arguments), as SageMakers expect those in a key value format and would prepend them with --.

Unit test is to be added later, sending this PR ahead to initiate the discussion and gather feedback.

Test plan:

  • tested by submitting a job to SageMaker in an AWS account: works for a script, fails for a module at runtime (see the notes above)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 18, 2023
@clumsy
Copy link
Contributor Author

clumsy commented Dec 18, 2023

Please find this first version and let me know what you think, @kiukchung

Copy link
Collaborator

@kiukchung kiukchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you include unittests? Thanks!

@clumsy clumsy force-pushed the feature/sagemaker branch 2 times, most recently from eab55c5 to 741d7a8 Compare January 18, 2024 14:32
@clumsy
Copy link
Contributor Author

clumsy commented Jan 19, 2024

Added doc files as requested, @kiukchung

@clumsy
Copy link
Contributor Author

clumsy commented Jan 25, 2024

Do we need to add anything else to this PR, @kiukchung?

Copy link
Collaborator

@kiukchung kiukchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for adding this (and making the docs changes)!

@kiukchung
Copy link
Collaborator

@clumsy it looks like there are unittest failures. Can you take a look? (and also rebase onto HEAD)

@clumsy clumsy force-pushed the feature/sagemaker branch 3 times, most recently from f36325c to 697a2bd Compare February 6, 2024 13:52
@clumsy
Copy link
Contributor Author

clumsy commented Feb 12, 2024

I fixed the mypy issue, @kiukchung. Looks like we need a permission to rerun all the workflows.

Copy link

@goswamig goswamig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this.

@kiukchung
Copy link
Collaborator

@clumsy looks like there are unittest failures. Do the tests pass locally?

@clumsy
Copy link
Contributor Author

clumsy commented Feb 16, 2024

Weird, they were passing locally, @kiukchung. Let me have a look.

@clumsy
Copy link
Contributor Author

clumsy commented Feb 26, 2024

The test should pass now, @kiukchung

@kiukchung
Copy link
Collaborator

The test should pass now, @kiukchung

Thanks, it looks like lint is still failing. Can you try running the linter locally (essentially running usort, black, and flake8)? This is the lint action: https://github.com/pytorch/torchx/blob/main/scripts/lint.sh#L58-L60

@clumsy
Copy link
Contributor Author

clumsy commented Mar 4, 2024

@kiukchung looks like black rules might have been changed recently. rerun lint.sh and submitted a new revision.

@kiukchung
Copy link
Collaborator

@clumsy looks like pyre (typecheck) is failing on a few sagemaker related files:

 torchx/schedulers/aws_sagemaker_scheduler.py:28:0 Unused ignore [0]: The `pyre-ignore[21]` or `pyre-fixme[21]` comment is not suppressing type errors, please remove it.
torchx/schedulers/test/aws_sagemaker_scheduler_test.py:225:28 Missing parameter annotation [2]: Parameter `mock_pytorch_estimator` has no type specified.
torchx/schedulers/test/aws_sagemaker_scheduler_test.py:259:25 Undefined attribute [16]: `Optional` has no attribute `app_id`.
torchx/schedulers/test/aws_sagemaker_scheduler_test.py:260:25 Undefined attribute [16]: `Optional` has no attribute `state`.
torchx/schedulers/test/aws_sagemaker_scheduler_test.py:262:12 Undefined attribute [16]: `Optional` has no attribute `ui_url`.

@clumsy clumsy force-pushed the feature/sagemaker branch from f99411c to cd63e27 Compare March 5, 2024 16:04
@clumsy
Copy link
Contributor Author

clumsy commented Mar 5, 2024

Fixed pyre and docs builds, @kiukchung

@clumsy
Copy link
Contributor Author

clumsy commented Mar 18, 2024

Please let me know if additional changes are needed, @kiukchung

@clumsy clumsy force-pushed the feature/sagemaker branch from cd63e27 to 02ef23a Compare March 25, 2024 20:22
@clumsy
Copy link
Contributor Author

clumsy commented Mar 25, 2024

Fixed pyre issue by upgrading local version of pyre dependencies, the errors I see are not related to this PR:

torchx/examples/apps/lightning/profiler.py:22:0 Undefined import [21]: Could not find a module corresponding to import `pytorch_lightning.loggers.base`.
torchx/examples/apps/lightning/profiler.py:23:0 Undefined import [21]: Could not find a module corresponding to import `pytorch_lightning.profiler.base`.
torchx/examples/apps/lightning/profiler.py:26:28 Undefined or invalid type [11]: Annotation `BaseProfiler` is not defined as a type.
torchx/examples/apps/lightning/profiler.py:33:31 Undefined or invalid type [11]: Annotation `LightningLoggerBase` is not defined as a type.
torchx/examples/apps/lightning/profiler.py:37:8 Missing attribute annotation [4]: Attribute `logger` of class `SimpleLoggingProfiler` has no type specified.
torchx/examples/apps/lightning/train.py:172:12 Incompatible parameter type [6]: In call `pl.trainer.trainer.Trainer.__init__`, for argument `profiler`, expected `Union[None, Profiler, str]` but got `SimpleLoggingProfiler`.

link failure is not related to this PR:

Running linters ...
Checking torchx/schedulers/__init__.py
torchx/schedulers/__init__.py:33:5: E704 multiple statements on one line (def)
Checking torchx/schedulers/aws_sagemaker_scheduler.py
Checking torchx/schedulers/test/aws_sagemaker_scheduler_test.py
One of the linters returned an error. See above output.

Unit test failure is not related to this PR:


=========================== short test summary info ============================
FAILED torchx/schedulers/test/local_scheduler_test.py::LocalDirectorySchedulerTest::test_no_orphan_process_function - AssertionError: OSError not raised

Docs build fine locally if I disable nbsphinx.

@clumsy clumsy force-pushed the feature/sagemaker branch from 02ef23a to a81bc74 Compare March 27, 2024 13:55
@kiukchung
Copy link
Collaborator

Thanks @clumsy for fixing! There was a merge conflict on dev-requirements.txt which I just resolved. Will let the CI run again and merge once everything is green.

@kiukchung kiukchung merged commit dd4c61f into pytorch:main Mar 27, 2024
20 of 21 checks passed
@d4l3k
Copy link
Member

d4l3k commented Mar 27, 2024

@kiukchung looks like this PR introduced a issue in the doctest build

https://github.com/pytorch/torchx/actions/runs/8456379306/job/23166027094

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants