Issue with distributed SyncBatchNorm in MIL pipeline #5198
Comments
@myron do you have any insight here?
Regarding "SyncBatchNorm in MIL pipeline": I'm not sure what that means.
I think it's related to the new MetaTensor somehow; please see my issue here: #5283
Hello! A very similar error occurred for me too, though I was running different code (not sure if it's related). The error occurs when I run DDP with a torch dataset I made using MONAI's ImageDataset. The funny thing is that it works when I run with only one GPU allocated, but fails when I try to use multiple GPUs. I was running MONAI 1.0.0 with torch 1.11.0+cu113. The error I got was the following:
I could share the code if anyone thinks it might help with solving this issue! (However, since I am a novice at PyTorch, I should say that the error I got might be a PyTorch problem and not a MONAI one!)
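For reference, a minimal sketch of the kind of setup described in this comment (DDP training with a dataset built from MONAI's ImageDataset). The network, transforms, file paths, and hyperparameters below are placeholders, not the commenter's actual code, and it assumes the model's batch norm layers are converted to SyncBatchNorm as in the original report:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from monai.data import ImageDataset
from monai.networks.nets import DenseNet121
from monai.transforms import Compose, EnsureChannelFirst, ScaleIntensity


def main_worker(rank, world_size, image_files, labels):
    # One process per GPU, as in a typical mp.spawn launch.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # In MONAI 1.0 the loaded images come back as MetaTensor by default,
    # which is what this thread suggests triggers the error under DDP.
    transforms = Compose([EnsureChannelFirst(), ScaleIntensity()])
    dataset = ImageDataset(image_files=image_files, labels=labels, transform=transforms)
    loader = DataLoader(dataset, batch_size=2, sampler=DistributedSampler(dataset))

    model = DenseNet121(spatial_dims=3, in_channels=1, out_channels=2).cuda(rank)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for images, targets in loader:
        optimizer.zero_grad()
        logits = model(images.cuda(rank))  # the failure is reported in the forward pass
        loss = loss_fn(logits, targets.long().cuda(rank))
        loss.backward()
        optimizer.step()
```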
Update: DDP works when using MONAI 0.9.1. Therefore, I think it's the same issue as the OP's.
It seems that this is a PyTorch issue caused by using MetaTensor (a subclass of torch.Tensor).
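A quick illustration of that point (a hedged sketch; the metadata contents here are made up): MetaTensor is a torch.Tensor subclass, so every torch op on it is routed through __torch_function__, and .as_tensor() returns the plain underlying tensor:

```python
import torch
from monai.data import MetaTensor

x = MetaTensor(torch.rand(1, 3, 4, 4), meta={"filename_or_obj": "example.nii.gz"})
print(isinstance(x, torch.Tensor))   # True: MetaTensor subclasses torch.Tensor
print(type(torch.relu(x)).__name__)  # MetaTensor: ops go through __torch_function__
print(type(x.as_tensor()).__name__)  # Tensor: metadata dropped
```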
Thank you! Also, thank you for this wonderful package! :)
Hi, any plans on fixing this?
This requires an upstream fix, which is being discussed in pytorch/pytorch#86456. A workaround would be dropping the metadata of a MetaTensor x using x.as_tensor().
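A minimal sketch of that workaround (the helper name is illustrative, not a MONAI API): convert batches back to plain torch.Tensor before they reach the DDP-wrapped model, so SyncBatchNorm never sees a MetaTensor:

```python
import torch
from monai.data import MetaTensor


def to_plain_tensor(batch: torch.Tensor) -> torch.Tensor:
    """Strip MetaTensor metadata so the DDP/SyncBatchNorm forward sees a plain Tensor."""
    return batch.as_tensor() if isinstance(batch, MetaTensor) else batch


# e.g. inside the training loop:
#   logits = model(to_plain_tensor(data).cuda(rank))
```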
Because the upstream bug has not yet been fixed, this ticket should be kept open. Same as #5283.
Any idea on how to solve this issue now? |
Original issue description:
A user has reported an issue with the MIL pipeline when used with the distributed flag (see #5081). I have tested it with MONAI tag 0.9.1 and it works fine, while it fails with the latest version of MONAI. This needs to be investigated.
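For reference, a hedged sketch of the model setup that produces the traceback below: a MONAI MILModel (which uses a torchvision ResNet50 backbone by default, matching the resnet.py frames in the log), converted to SyncBatchNorm and wrapped in DistributedDataParallel. The exact arguments and flags in the user's MIL.py may differ:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from monai.networks.nets import MILModel


def build_distributed_model(rank: int, world_size: int, num_classes: int = 5) -> torch.nn.Module:
    # One process per GPU, initialized before model construction.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = MILModel(num_classes=num_classes, mil_mode="att").cuda(rank)
    # SyncBatchNormBackward is what appears in the failing backward graph below.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return DDP(model, device_ids=[rank])
```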
Log reported by the user:
Versions:
NVIDIA Release 22.08 (build 42105213)
PyTorch Version 1.13.0a0+d321be6
projectmonai/monai:latest
DIGEST:sha256:109d2204811a4a0f9f6bf436eca624c42ed9bb3dbc6552c90b65a2db3130fefd
Error:
Traceback (most recent call last):
File "MIL.py", line 724, in
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args,))
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/workspace/MIL.py", line 565, in main_worker
train_loss, train_acc = train_epoch(model, train_loader, optimizer, scaler=scaler, epoch=epoch, args=args)
File "/workspace/MIL.py", line 61, in train_epoch
logits = model(data)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1009, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 970, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/monai/monai/networks/nets/milmodel.py", line 238, in forward
x = self.net(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 285, in forward
return self._forward_impl(x)
File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 270, in _forward_impl
x = self.relu(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 102, in forward
return F.relu(input, inplace=self.inplace)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1453, in relu
return handle_torch_function(relu, (input,), input, inplace=inplace)
File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1528, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/opt/monai/monai/data/meta_tensor.py", line 249, in torch_function
ret = super().torch_function(func, types, args, kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 1089, in torch_function
ret = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1455, in relu
result = torch.relu(input)
RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
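To illustrate the general mechanism behind this error (a self-contained sketch, not the MONAI/SyncBatchNorm code path): modifying in place the output of a custom autograd Function that returned its input as-is triggers the same class of RuntimeError:

```python
import torch


class PassThrough(torch.autograd.Function):
    # Returns its input as-is, which autograd treats as a view created
    # inside a custom Function, exactly the situation the message describes.
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        return grad


x = torch.rand(3, requires_grad=True)
y = PassThrough.apply(x)
y.relu_()  # in-place op on the view -> RuntimeError like the one above
```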