
SageMaker with local_gpu starts a container without --gpus all flag #4196

Open

celsofranssa opened this issue Oct 16, 2023 · 1 comment

celsofranssa commented Oct 16, 2023

Starting a SageMaker PyTorch Estimator based on a custom Docker image stored in AWS ECR:

from sagemaker.pytorch.estimator import PyTorch

role = "arn:..."
hparams = {}  # hyperparameters forwarded to the entry point

estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    entry_point="main.py",
    role=role,
    region="us-...",
    instance_type="local",  # ml.g4dn.2xlarge
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams,
)
estimator.fit()

creates a container without GPU capabilities.
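For context, --gpus all is the plain docker run flag that exposes the host's GPUs to a container. Assuming the NVIDIA Container Toolkit is installed on the host and the image ships nvidia-smi, the manual equivalent would be:

docker run --rm --gpus all 1...ecr...amazonaws.com/...:prototype nvidia-smi

Without that flag, the training environment printed at container startup reports num_gpus: 0: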

s4qen8aex3-algo-1-maof3  | Training Env:
s4qen8aex3-algo-1-maof3  | 
s4qen8aex3-algo-1-maof3  | {
s4qen8aex3-algo-1-maof3  |     "additional_framework_parameters": {},
s4qen8aex3-algo-1-maof3  |     "channel_input_dirs": {},
s4qen8aex3-algo-1-maof3  |     "current_host": "algo-1-maof3",
s4qen8aex3-algo-1-maof3  |     "current_instance_group": "homogeneousCluster",
s4qen8aex3-algo-1-maof3  |     "current_instance_group_hosts": [],
s4qen8aex3-algo-1-maof3  |     "current_instance_type": "local",
s4qen8aex3-algo-1-maof3  |     "distribution_hosts": [
s4qen8aex3-algo-1-maof3  |         "algo-1-maof3"
s4qen8aex3-algo-1-maof3  |     ],
s4qen8aex3-algo-1-maof3  |     "distribution_instance_groups": [],
s4qen8aex3-algo-1-maof3  |     "framework_module": null,
s4qen8aex3-algo-1-maof3  |     "hosts": [
s4qen8aex3-algo-1-maof3  |         "algo-1-maof3"
s4qen8aex3-algo-1-maof3  |     ],
s4qen8aex3-algo-1-maof3  |     "hyperparameters": {},
s4qen8aex3-algo-1-maof3  |     "input_config_dir": "/opt/ml/input/config",
s4qen8aex3-algo-1-maof3  |     "input_data_config": {},
s4qen8aex3-algo-1-maof3  |     "input_dir": "/opt/ml/input",
s4qen8aex3-algo-1-maof3  |     "instance_groups": [],
s4qen8aex3-algo-1-maof3  |     "instance_groups_dict": {},
s4qen8aex3-algo-1-maof3  |     "is_hetero": false,
s4qen8aex3-algo-1-maof3  |     "is_master": true,
s4qen8aex3-algo-1-maof3  |     "is_modelparallel_enabled": null,
s4qen8aex3-algo-1-maof3  |     "is_smddpmprun_installed": false,
s4qen8aex3-algo-1-maof3  |     "log_level": 20,
s4qen8aex3-algo-1-maof3  |     "master_hostname": "algo-1-maof3",
s4qen8aex3-algo-1-maof3  |     "model_dir": "/opt/ml/model",
s4qen8aex3-algo-1-maof3  |     "module_name": "main",
s4qen8aex3-algo-1-maof3  |     "network_interface_name": "eth0",
s4qen8aex3-algo-1-maof3  |     "num_cpus": 12,
s4qen8aex3-algo-1-maof3  |     "num_gpus": 0,
s4qen8aex3-algo-1-maof3  |     "num_neurons": 0,
s4qen8aex3-algo-1-maof3  |     "output_data_dir": "/opt/ml/output/data",
s4qen8aex3-algo-1-maof3  |     "output_dir": "/opt/ml/output",
s4qen8aex3-algo-1-maof3  |     "output_intermediate_dir": "/opt/ml/output/intermediate",
s4qen8aex3-algo-1-maof3  |     "resource_config": {
s4qen8aex3-algo-1-maof3  |         "current_host": "algo-1-maof3",
s4qen8aex3-algo-1-maof3  |         "hosts": [
s4qen8aex3-algo-1-maof3  |             "algo-1-maof3"
s4qen8aex3-algo-1-maof3  |         ]
s4qen8aex3-algo-1-maof3  |     },
s4qen8aex3-algo-1-maof3  |     "user_entry_point": "main.py"
s4qen8aex3-algo-1-maof3  | }
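The same is visible from PyTorch inside the entry point (an illustrative check, not necessarily part of the linked repo's main.py):

import torch

# In a container started without GPU access, CUDA is unavailable and the
# device count is 0, matching "num_gpus": 0 above, even if the host has a
# working NVIDIA driver.
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())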
vpbhargav (Contributor) commented

@celsofranssa I will try to test this code on my end, but I noticed in the snippet you shared above that you are using instance_type="local" instead of instance_type="local_gpu", which is what emulates GPU local mode. Can you confirm whether that is the issue, or whether the problem also occurs with local_gpu?
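For reference, a minimal corrected configuration changes only the instance type (a sketch assuming the rest of the original snippet is unchanged):

estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    entry_point="main.py",
    role=role,
    instance_type="local_gpu",  # GPU local mode; needs a local NVIDIA driver and the NVIDIA Container Toolkit
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams,
)
estimator.fit()

With local_gpu, local mode requests the host's GPUs for the training container, and the training environment should then report a nonzero num_gpus.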
