
Issue with distributed SyncBatchNorm in MIL pipeline #5198

Open
bhashemian opened this issue Sep 22, 2022 · 11 comments
Labels
bug, Pathology/Microscopy

Comments

@bhashemian
Member

bhashemian commented Sep 22, 2022

A user has reported an issue with the MIL pipeline when it is used with the distributed flag: #5081

I have tested it with MONAI tag 0.9.1, where it works fine, but it fails with the latest version of MONAI. This needs to be investigated.

Log reported by the user:

Versions:
NVIDIA Release 22.08 (build 42105213)
PyTorch Version 1.13.0a0+d321be6
projectmonai/monai:latest
DIGEST:sha256:109d2204811a4a0f9f6bf436eca624c42ed9bb3dbc6552c90b65a2db3130fefd

Error:
Traceback (most recent call last):
  File "MIL.py", line 724, in <module>
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args,))
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/MIL.py", line 565, in main_worker
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, scaler=scaler, epoch=epoch, args=args)
  File "/workspace/MIL.py", line 61, in train_epoch
    logits = model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1009, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 970, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/monai/monai/networks/nets/milmodel.py", line 238, in forward
    x = self.net(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 270, in _forward_impl
    x = self.relu(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 102, in forward
    return F.relu(input, inplace=self.inplace)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1453, in relu
    return handle_torch_function(relu, (input,), input, inplace=inplace)
  File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1528, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/opt/monai/monai/data/meta_tensor.py", line 249, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 1089, in __torch_function__
    ret = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1455, in relu
    result = torch.relu(input)
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
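Since the behavior differs between MONAI 0.9.1 and later releases, a quick, hedged way to confirm which MONAI and PyTorch builds are actually in use inside the container (both calls below are part of the public APIs):

    import torch
    from monai.config import print_config
    import monai

    # Versions relevant to this report: 0.9.1 is reported to work,
    # later releases (MetaTensor as the default data type) fail.
    print("MONAI:", monai.__version__)
    print("PyTorch:", torch.__version__)

    # Full environment summary, useful when filing or confirming the issue.
    print_config()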

@bhashemian added the help wanted label Sep 23, 2022
@bhashemian
Member Author

@myron do you have any insight here?

@bhashemian changed the title from "Issue with distributed in MIL pipeline" to "Issue with distributed SyncBatchNorm in MIL pipeline" Sep 23, 2022
@myron
Collaborator

myron commented Sep 27, 2022

I'm not sure what it means

@myron
Collaborator

myron commented Oct 7, 2022

I think it's somehow related to the new MetaTensor; please see my issue here: #5283

@dyhan316

dyhan316 commented Oct 7, 2022

Hello! A very similar error occurred for me too, though I was running different code (not sure if it's related).

The error occurs when I run DDP with a torch dataset I built using the MONAI ImageDataset function. The odd thing is that it works when only one GPU is allocated, but fails when I try to use multiple GPUs (a minimal sketch of this kind of setup follows the log below).

The MONAI version I was running was 1.0.0, with torch version 1.11.0+cu113.

The error I got was the following:

Have data_path also be read from the config
Have data_path also be read from the config
<class 'monai.transforms.utility.array.AddChannel'>: Class `AddChannel` has been deprecated since version 0.8. please use MetaTensor data type and monai.transforms.EnsureChannelFirst instead.
<class 'monai.transforms.utility.array.AddChannel'>: Class `AddChannel` has been deprecated since version 0.8. please use MetaTensor data type and monai.transforms.EnsureChannelFirst instead.
Traceback (most recent call last):
  File "main_3D.py", line 348, in <module>
    main()
  File "main_3D.py", line 75, in main
    torch.multiprocessing.spawn(main_worker, (args,), args.ngpus_per_node)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/main_3D.py", line 141, in main_worker
    loss = model.forward(y1, y2)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/main_3D.py", line 222, in forward
    z1 = self.projector(self.backbone(y1))           #i.e. z1 : representation of y1 (before normalization)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/functional.py", line 1438, in relu
    return handle_torch_function(relu, (input,), input, inplace=inplace)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/overrides.py", line 1394, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/monai/data/meta_tensor.py", line 249, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/_tensor.py", line 1142, in __torch_function__
    ret = func(*args, **kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/functional.py", line 1440, in relu
    result = torch.relu_(input)
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.

I could share the code if anyone thinks it might help with solving this issue! (However, I should say that since I am a novice at PyTorch, the error I got might be a PyTorch problem rather than a MONAI one!)
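Not taken from the report above, but a minimal, hedged sketch of this kind of ImageDataset setup (synthetic NIfTI file, placeholder transforms), just to illustrate that with MONAI 1.0 defaults the dataset yields MetaTensor images, which is what the DDP + SyncBatchNorm model then receives:

    import os
    import tempfile

    import nibabel as nib
    import numpy as np
    from monai.data import ImageDataset
    from monai.transforms import Compose, EnsureChannelFirst, ScaleIntensity

    # Write a tiny synthetic NIfTI volume so the sketch is self-contained.
    tmpdir = tempfile.mkdtemp()
    img_path = os.path.join(tmpdir, "img0.nii.gz")
    nib.save(nib.Nifti1Image(np.random.rand(16, 16, 16).astype(np.float32), np.eye(4)), img_path)

    ds = ImageDataset(
        image_files=[img_path],
        labels=[0],
        transform=Compose([EnsureChannelFirst(), ScaleIntensity()]),
    )
    img, label = ds[0]
    print(type(img))  # monai.data.meta_tensor.MetaTensor with MONAI 1.0 defaults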

@dyhan316

dyhan316 commented Oct 7, 2022

Update: DDP works when using MONAI 0.9.1, so I think this is the same issue as the OP's.

@bhashemian added the bug, WG: Pathology, and Pathology/Microscopy labels and removed the help wanted, bug, and WG: Pathology labels Oct 7, 2022
@bhashemian
Member Author

bhashemian commented Oct 7, 2022

It seems that this is a PyTorch issue caused by using MetaTensor (a subclass of torch.Tensor). @wyli has created a bug report on PyTorch: pytorch/pytorch#86456
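A small, hedged illustration of why the subclass matters (not taken from the reported scripts): MetaTensor routes every torch op through __torch_function__, so the tensor flowing through the DDP/SyncBatchNorm forward is still a MetaTensor, while as_tensor() returns a plain torch.Tensor without the metadata:

    import torch
    from monai.data import MetaTensor

    x = MetaTensor(torch.randn(2, 3), meta={"note": "demo"})

    y = torch.relu(x)       # dispatched via MetaTensor.__torch_function__
    print(type(y))          # <class 'monai.data.meta_tensor.MetaTensor'>

    plain = x.as_tensor()   # same data, metadata dropped
    print(type(plain))      # <class 'torch.Tensor'>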

@dyhan316

dyhan316 commented Oct 7, 2022

Thank you! Also, thank you for this wonderful package! :)

@ibro45
Contributor

ibro45 commented Jan 13, 2023

Hi, any plans on fixing this?

@wyli
Contributor

wyli commented Jan 14, 2023

This requires an upstream fix, which is being discussed here: pytorch/pytorch#86456

A workaround is to drop the metadata of a MetaTensor x using x.as_tensor().
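A hedged sketch of what that workaround could look like in a training step; model, loader, optimizer and device are placeholders rather than names from the scripts above:

    import torch
    from monai.data import MetaTensor

    def train_step(model, loader, optimizer, device):
        # Workaround sketch: drop MetaTensor metadata before the DDP/SyncBatchNorm
        # forward pass, so the model only ever sees plain torch.Tensor inputs.
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            if isinstance(data, MetaTensor):
                data = data.as_tensor()
            logits = model(data)
            loss = torch.nn.functional.cross_entropy(logits, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()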

@KumoLiu
Contributor

KumoLiu commented Dec 20, 2023

Because the upstream bug has not yet been fixed, this ticket should be kept open. Same as #5283.

@KumoLiu KumoLiu reopened this Dec 20, 2023
@github-project-automation github-project-automation bot moved this from 💯 Complete to 🌋 In Progress in AI in Pathology🔬 Dec 20, 2023
@zhijian-yang

Any idea on how to solve this issue now?
