
Issue with distributed SyncBatchNorm in MIL pipeline #5198

Open
bhashemian opened this issue Sep 22, 2022 · 11 comments
Labels
bug, Pathology/Microscopy

Comments

@bhashemian
Member

bhashemian commented Sep 22, 2022

A user has reported an issue with the MIL pipeline when it is used with the distributed flag: #5081

I have tested it with MONAI tag 0.9.1, where it works fine, but it fails with the latest version of MONAI. This needs to be investigated.

Log reported by the user:

Versions:
NVIDIA Release 22.08 (build 42105213)
PyTorch Version 1.13.0a0+d321be6
projectmonai/monai:latest
DIGEST:sha256:109d2204811a4a0f9f6bf436eca624c42ed9bb3dbc6552c90b65a2db3130fefd

Error:
Traceback (most recent call last):
  File "MIL.py", line 724, in <module>
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args,))
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/MIL.py", line 565, in main_worker
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, scaler=scaler, epoch=epoch, args=args)
  File "/workspace/MIL.py", line 61, in train_epoch
    logits = model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1009, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 970, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/monai/monai/networks/nets/milmodel.py", line 238, in forward
    x = self.net(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 270, in _forward_impl
    x = self.relu(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 102, in forward
    return F.relu(input, inplace=self.inplace)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1453, in relu
    return handle_torch_function(relu, (input,), input, inplace=inplace)
  File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1528, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/opt/monai/monai/data/meta_tensor.py", line 249, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 1089, in __torch_function__
    ret = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1455, in relu
    result = torch.relu(input)
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
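Since the behavior differs between MONAI 0.9.1 and later releases, a quick, hedged way to confirm which MONAI and PyTorch builds are actually in use inside the container (both calls below are part of the public APIs):

    import torch
    from monai.config import print_config
    import monai

    # Versions relevant to this report: 0.9.1 is reported to work,
    # later releases (MetaTensor as the default data type) fail.
    print("MONAI:", monai.__version__)
    print("PyTorch:", torch.__version__)

    # Full environment summary, useful when filing or confirming the issue.
    print_config()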

@bhashemian added the help wanted label Sep 23, 2022
@bhashemian
Member Author

@myron do you have any insight here?

@bhashemian changed the title from "Issue with distributed in MIL pipeline" to "Issue with distributed SyncBatchNorm in MIL pipeline" Sep 23, 2022
@myron
Collaborator

myron commented Sep 27, 2022

I'm not sure what it means

@myron
Collaborator

myron commented Oct 7, 2022

I think it's somehow related to the new MetaTensor; please see my issue here: #5283

@dyhan316

dyhan316 commented Oct 7, 2022

Hello! A very similar error occurred for me too, though I was running different code (not sure if it's related).

The error occurs when I run DDP with a torch dataset I built using the MONAI ImageDataset function. The odd thing is that it works when only one GPU is allocated, but fails when I try to use multiple GPUs (a minimal sketch of this kind of setup follows the log below).

The MONAI version I was running was 1.0.0, with torch version 1.11.0+cu113.

The error I got was the following:

Have data_path also be read from the config
Have data_path also be read from the config
<class 'monai.transforms.utility.array.AddChannel'>: Class `AddChannel` has been deprecated since version 0.8. please use MetaTensor data type and monai.transforms.EnsureChannelFirst instead.
<class 'monai.transforms.utility.array.AddChannel'>: Class `AddChannel` has been deprecated since version 0.8. please use MetaTensor data type and monai.transforms.EnsureChannelFirst instead.
Traceback (most recent call last):
  File "main_3D.py", line 348, in <module>
    main()
  File "main_3D.py", line 75, in main
    torch.multiprocessing.spawn(main_worker, (args,), args.ngpus_per_node)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/main_3D.py", line 141, in main_worker
    loss = model.forward(y1, y2)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/connectome/dyhan316/VAE_ADHD/barlowtwins/main_3D.py", line 222, in forward
    z1 = self.projector(self.backbone(y1))           #i.e. z1 : representation of y1 (before normalization)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/functional.py", line 1438, in relu
    return handle_torch_function(relu, (input,), input, inplace=inplace)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/overrides.py", line 1394, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/monai/data/meta_tensor.py", line 249, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/_tensor.py", line 1142, in __torch_function__
    ret = func(*args, **kwargs)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN/lib/python3.8/site-packages/torch/nn/functional.py", line 1440, in relu
    result = torch.relu_(input)
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.

I could share the code if anyone thinks it might help with solving this issue! (However, I should say that since I am a novice at PyTorch, the error I got might be a PyTorch problem rather than a MONAI one!)
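Not taken from the report above, but a minimal, hedged sketch of this kind of ImageDataset setup (synthetic NIfTI file, placeholder transforms), just to illustrate that with MONAI 1.0 defaults the dataset yields MetaTensor images, which is what the DDP + SyncBatchNorm model then receives:

    import os
    import tempfile

    import nibabel as nib
    import numpy as np
    from monai.data import ImageDataset
    from monai.transforms import Compose, EnsureChannelFirst, ScaleIntensity

    # Write a tiny synthetic NIfTI volume so the sketch is self-contained.
    tmpdir = tempfile.mkdtemp()
    img_path = os.path.join(tmpdir, "img0.nii.gz")
    nib.save(nib.Nifti1Image(np.random.rand(16, 16, 16).astype(np.float32), np.eye(4)), img_path)

    ds = ImageDataset(
        image_files=[img_path],
        labels=[0],
        transform=Compose([EnsureChannelFirst(), ScaleIntensity()]),
    )
    img, label = ds[0]
    print(type(img))  # monai.data.meta_tensor.MetaTensor with MONAI 1.0 defaults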

@dyhan316

dyhan316 commented Oct 7, 2022

Update: DDP works when using MONAI 0.9.1, so I think this is the same issue as the OP's.

@bhashemian added the bug, WG: Pathology, and Pathology/Microscopy labels and removed the help wanted, bug, and WG: Pathology labels Oct 7, 2022
@bhashemian
Member Author

bhashemian commented Oct 7, 2022

It seems that this is a PyTorch issue caused by using MetaTensor (a subclass of torch.Tensor). @wyli has created a bug report on PyTorch: pytorch/pytorch#86456
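A small, hedged illustration of why the subclass matters (not taken from the reported scripts): MetaTensor routes every torch op through __torch_function__, so the tensor flowing through the DDP/SyncBatchNorm forward is still a MetaTensor, while as_tensor() returns a plain torch.Tensor without the metadata:

    import torch
    from monai.data import MetaTensor

    x = MetaTensor(torch.randn(2, 3), meta={"note": "demo"})

    y = torch.relu(x)       # dispatched via MetaTensor.__torch_function__
    print(type(y))          # <class 'monai.data.meta_tensor.MetaTensor'>

    plain = x.as_tensor()   # same data, metadata dropped
    print(type(plain))      # <class 'torch.Tensor'>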

@dyhan316

dyhan316 commented Oct 7, 2022

Thank you! Also, thank you for this wonderful package! :)

@ibro45
Contributor

ibro45 commented Jan 13, 2023

Hi, any plans on fixing this?

@wyli
Contributor

wyli commented Jan 14, 2023

This requires an upstream fix, which is being discussed here: pytorch/pytorch#86456

A workaround is to drop the metadata of a MetaTensor x using x.as_tensor().
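A hedged sketch of what that workaround could look like in a training step; model, loader, optimizer and device are placeholders rather than names from the scripts above:

    import torch
    from monai.data import MetaTensor

    def train_step(model, loader, optimizer, device):
        # Workaround sketch: drop MetaTensor metadata before the DDP/SyncBatchNorm
        # forward pass, so the model only ever sees plain torch.Tensor inputs.
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            if isinstance(data, MetaTensor):
                data = data.as_tensor()
            logits = model(data)
            loss = torch.nn.functional.cross_entropy(logits, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()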

@KumoLiu
Contributor

KumoLiu commented Dec 20, 2023

Because the upstream bug has not yet been fixed, this ticket should be kept open. Same as #5283.

@KumoLiu KumoLiu reopened this Dec 20, 2023
@github-project-automation github-project-automation bot moved this from 💯 Complete to 🌋 In Progress in AI in Pathology🔬 Dec 20, 2023
@zhijian-yang

Any idea on how to solve this issue now?
