Bug? Should data transformations (e.g. scaling) be implemented in 'def preprocess_data' instead of 'def setup'? #471
tiefenthaler asked this question in Q&A
I wonder why data transformations (e.g. scaling) that should be performed on a data split such as train or test are implemented in 'def setup'.
In PyTorch Lightning, 'def setup' is the hook that enables distributed processing on GPUs: it is called on every process, so the data (e.g. the training data) is split across the given GPUs and the transformation (e.g. scaling) is performed independently on each GPU. This would result in multiple chunks of the training data being scaled separately. Transformations that must be guaranteed to see all of the training data should therefore go into 'def preprocess_data'. Of course, the sklearn preprocessing methods (as currently implemented) are not tensor based and have no GPU support, so the current implementation does not cause a bug. But since this could change, and it does not follow the PL conventions, it should be changed.
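To illustrate the suggested pattern, here is a minimal sketch (not this project's actual code) in terms of the standard Lightning hooks: the scaler is fitted once on the full training split in 'prepare_data' and persisted, and 'setup' (which Lightning may call once per process/GPU under DDP) only loads and applies the already fitted transformation. The class and file names are placeholders.

```python
import joblib
import numpy as np
import pytorch_lightning as pl
import torch
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset


class ScaledDataModule(pl.LightningDataModule):
    """Sketch only: fit the scaler once globally, apply it per process."""

    def __init__(self, train_array: np.ndarray, test_array: np.ndarray,
                 scaler_path: str = "scaler.joblib", batch_size: int = 64):
        super().__init__()
        self.train_array = train_array
        self.test_array = test_array
        self.scaler_path = scaler_path
        self.batch_size = batch_size

    def prepare_data(self):
        # Called from a single process only: fit on *all* training rows so
        # the statistics never depend on how data is sharded across GPUs.
        scaler = StandardScaler().fit(self.train_array)
        joblib.dump(scaler, self.scaler_path)

    def setup(self, stage=None):
        # Called on every process under DDP: only apply the already
        # fitted transformation here, never re-fit it per shard.
        scaler = joblib.load(self.scaler_path)
        self.train_x = torch.as_tensor(
            scaler.transform(self.train_array), dtype=torch.float32)
        self.test_x = torch.as_tensor(
            scaler.transform(self.test_array), dtype=torch.float32)

    def train_dataloader(self):
        return DataLoader(TensorDataset(self.train_x),
                          batch_size=self.batch_size, shuffle=True)

    def test_dataloader(self):
        return DataLoader(TensorDataset(self.test_x),
                          batch_size=self.batch_size)
```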
Note that data transformations for "more standard" PyTorch use cases such as Vision or NLP work differently: e.g. resizing an image usually does not depend on the data distribution, whereas scaling or encoding does.
See PL documentation
It would be great to know why data transformations are implemented in 'def setup'.