Bug? Should data transformations (e.g. scaling) be implemented in 'def preprocess_data' instead of 'def setup'? #471
tiefenthaler asked this question in Q&A
I wonder why data transformations (e.g. scaling) that should be performed on a data split such as train or test are implemented in 'def setup'.
In PyTorch Lightning, 'def setup' is the hook that enables distributed processing on GPUs: it is called on every process, so the data (e.g. the training data) is split across the given GPUs and the transformation (e.g. scaling) is performed independently on each GPU. This would result in multiple chunks of the training data being scaled separately. Transformations that must be guaranteed to see all of the training data should therefore go into 'def preprocess_data'. Of course, the sklearn preprocessing methods (as currently implemented) are not tensor based and have no GPU support, so the current implementation does not cause a bug. But since this could change, and it does not follow the PL conventions, it should be changed.
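To illustrate the suggested pattern, here is a minimal sketch (not this project's actual code) in terms of the standard Lightning hooks: the scaler is fitted once on the full training split in 'prepare_data' and persisted, and 'setup' (which Lightning may call once per process/GPU under DDP) only loads and applies the already fitted transformation. The class and file names are placeholders.

```python
import joblib
import numpy as np
import pytorch_lightning as pl
import torch
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset


class ScaledDataModule(pl.LightningDataModule):
    """Sketch only: fit the scaler once globally, apply it per process."""

    def __init__(self, train_array: np.ndarray, test_array: np.ndarray,
                 scaler_path: str = "scaler.joblib", batch_size: int = 64):
        super().__init__()
        self.train_array = train_array
        self.test_array = test_array
        self.scaler_path = scaler_path
        self.batch_size = batch_size

    def prepare_data(self):
        # Called from a single process only: fit on *all* training rows so
        # the statistics never depend on how data is sharded across GPUs.
        scaler = StandardScaler().fit(self.train_array)
        joblib.dump(scaler, self.scaler_path)

    def setup(self, stage=None):
        # Called on every process under DDP: only apply the already
        # fitted transformation here, never re-fit it per shard.
        scaler = joblib.load(self.scaler_path)
        self.train_x = torch.as_tensor(
            scaler.transform(self.train_array), dtype=torch.float32)
        self.test_x = torch.as_tensor(
            scaler.transform(self.test_array), dtype=torch.float32)

    def train_dataloader(self):
        return DataLoader(TensorDataset(self.train_x),
                          batch_size=self.batch_size, shuffle=True)

    def test_dataloader(self):
        return DataLoader(TensorDataset(self.test_x),
                          batch_size=self.batch_size)
```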
Note that data transformations for "more standard" PyTorch use cases such as Vision or NLP work differently: e.g. resizing an image usually does not depend on the data distribution, whereas scaling or encoding does.
See PL documentation
It would be great to know why data transformations are implemented in 'def setup'.