Multi-GPU training for object detection #7527

Rajesh-ParaxialTech · 2024-03-10T16:56:32Z

Hello
Thanks a lot for the nice GPU training concept and support. I am currently training an object detection model in single-node multi-GPU mode. I have tried to incorporate the features shown in the BraTS segmentation example into my code (in my case for object detection model training), but I couldn't find a way to incorporate the following:

Please let me know where I can incorporate it in the code. My feeling is that it belongs somewhere in the file ".local/lib/python3.10/site-packages/monai/bundle/reference_resolver.py", which is part of the pipeline used in the object detection tasks. I think this is the point where the config_items are evaluated and my object detection model (i.e. the RetinaNet detector training) is executed with data.

Without the statement shown in the figure above, running my code produces the error shown in the figure below:

My other question is whether the line "model = DistributedDataParallel(model, device_ids=[device])" shown in the first figure above is required for single-node multi-GPU training, or only for multi-node multi-GPU training.

It would be really helpful if I could be guided in this regard.

Thanking you again
Rajesh

KumoLiu · 2024-03-11T03:16:36Z

Maintainer

Hi @Rajesh-ParaxialTech, thanks for your interest here.

> Please let me know where I can incorporate it in the code.

I didn't fully understand your question. What do you want to incorporate? Do you mean how to ensure your pipeline can use DDP?
There are several things you need to ensure:

1. Set up the DDP environment
   You need to set up the DDP environment manually before running your model:
   https://github.com/Project-MONAI/tutorials/blob/570b19e678e06aba26f53df8bc73f848ae2984ba/acceleration/distributed_training/brats_training_ddp.py#L172
2. Wrap the model in DDP
   Once you have initialized the DDP environment, you can wrap your model in the DDP wrapper. This replicates your model on each GPU, so DistributedDataParallel is needed for single-node multi-GPU training as well.
3. Set up the DataLoader for distributed training
   In addition to wrapping your model, you'll need to create a DistributedSampler, which is an essential component of distributed and parallel training. It ensures each process gets a different split of the dataset, which avoids redundant computation.
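
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the linked tutorial: the helper names `setup_ddp`, `wrap_model` and `build_loader` are hypothetical, the `torch.nn.Linear` model and random `TensorDataset` are placeholders for the real detector and detection dataset, and the script is assumed to be started by a launcher that sets `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR` and `MASTER_PORT`.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def setup_ddp() -> torch.device:
    """Step 1: join the process group using the launcher's environment variables."""
    use_cuda = torch.cuda.is_available()
    # NCCL is the usual backend for GPUs; gloo is a CPU fallback for this sketch.
    dist.init_process_group(backend="nccl" if use_cuda else "gloo", init_method="env://")
    if not use_cuda:
        return torch.device("cpu")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    return torch.device("cuda", local_rank)


def wrap_model(model: torch.nn.Module, device: torch.device) -> DistributedDataParallel:
    """Step 2: move the model to this process's device and wrap it in DDP."""
    model = model.to(device)
    device_ids = [device.index] if device.type == "cuda" else None
    return DistributedDataParallel(model, device_ids=device_ids)


def build_loader(dataset, batch_size: int, rank: int, world_size: int):
    """Step 3: give each process its own shard of the dataset."""
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, sampler


def main() -> None:
    device = setup_ddp()
    # Placeholder data and model; substitute the detection dataset and detector network.
    dataset = TensorDataset(torch.randn(16, 4), torch.randn(16, 1))
    loader, sampler = build_loader(dataset, 4, dist.get_rank(), dist.get_world_size())
    model = wrap_model(torch.nn.Linear(4, 1), device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle consistently across processes each epoch
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(inputs.to(device)), targets.to(device))
            loss.backward()  # DDP averages gradients across processes here
            optimizer.step()
    dist.destroy_process_group()


# Run the training loop only when started by a distributed launcher, which sets RANK.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

With two processes, each one sees half of the dataset per epoch, and the gradient averaging in the backward pass keeps the model replicas in sync.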

Hope it helps, thanks.

2 replies

Rajesh-ParaxialTech Mar 13, 2024
Author

Thank you, KumoLiu. Yes, that is what I meant.

I have also tried single-node 2-GPU training with the following setup:

**Step 1 (I assume this accomplishes step 1 in your answer)**
My main program starts like this:

**Step 2 (I assume this accomplishes step 3 in your answer)**
I have added the following code to the module "/.local/lib/python3.10/site-packages/monai/bundle/scripts.py":

GPU 0 runs with train.json and GPU 1 runs with train1.json. train.json points to a dataset_fold0.json that holds the training data for GPU 0, and train1.json points to a dataset_fold0.json that holds the training data for GPU 1.
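
As a rough illustration of the per-GPU selection described above, the mapping from GPU index to config file could look like the sketch below. It is not the code that was added to scripts.py: the helpers `config_for_rank` and `shard_for_rank` are hypothetical, and it assumes the launcher exports `LOCAL_RANK` for each process. The second helper shows an alternative that splits one shared data list by rank instead of keeping one file per GPU.

```python
import os


def config_for_rank(local_rank: int, stem: str = "train", ext: str = ".json") -> str:
    """Map GPU 0 to train.json, GPU 1 to train1.json, GPU n to train<n>.json."""
    suffix = "" if local_rank == 0 else str(local_rank)
    return f"{stem}{suffix}{ext}"


def shard_for_rank(items: list, rank: int, world_size: int) -> list:
    """Alternative: give each rank every world_size-th item of one shared list."""
    return items[rank::world_size]


if __name__ == "__main__":
    # LOCAL_RANK is assumed to be set by the launcher; default to 0 otherwise.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    print(f"local rank {local_rank} uses {config_for_rank(local_rank)}")
```

Doing the selection in the launch script rather than in the installed package would keep the change out of site-packages, at the cost of passing the chosen config file into the pipeline explicitly.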

**Step 3 (I assume this accomplishes step 2 in your answer)**
I have modified the train.json files as shown below:

The command line I am using to run this training pipeline is:
"python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 test_multigpu.py"

Is this OK?

Thank you
Rajesh

KumoLiu Mar 14, 2024
Maintainer

Looks good to me. Thanks.
