
Commit d327f43

refactor for lightning tutorial
1 parent f2401f9 commit d327f43

33 files changed: +1,191 -821,646 lines

README.md

+1,066 -96
Large diffs are not rendered by default.

README_PyTorchLightning_Tutorial.MD

-1,095
This file was deleted.

README_packaging_classification.md

+125
@@ -0,0 +1,125 @@
# 1. Overview

Find the full code in the repository [pytorch-lightning-tabular-classification](https://github.com/tiefenthaler/pytorch-lightning-tabular-classification) on GitHub.

## 1.1. Table of Contents

**Sections 1 and 2 provide a brief overview of the showcase.**
**Section 3 describes the use case of tabular multi-class classification.**
**Sections 4 and 5 describe the structure of the showcase and of the code.**

- [1. Overview](#1-overview)
  - [1.1. Table of Contents](#11-table-of-contents)
- [2. Packaging Classification: A Tabular Data Use Case using Machine Learning](#2-packaging-classification-a-tabular-data-use-case-using-machine-learning)
- [3. Use case description](#3-use-case-description)
- [4. Structure of the showcase](#4-structure-of-the-showcase)
- [5. Code structure](#5-code-structure)
- [Google Colab X Google Drive (quick start)](#google-colab-x-google-drive-quick-start)
- [Azure ML Service (quick start)](#azure-ml-service-quick-start)
# 2. Packaging Classification: A Tabular Data Use Case using Machine Learning

This repo is meant as a showcase that demonstrates a data science workflow for a multi-class classification use case (see the use case description below) in a business context. Data science use cases differ widely in how the business uses the result or solution. This use case provides one-time insights plus an inference solution that the business reuses manually to obtain classification outputs for continued usage. The repo therefore focuses on analytics and limits operational aspects to analytical reusability. It still follows data engineering and data science best practices for data preparation, pre-processing, modeling, and evaluation. Several modeling frameworks with their specific integrations are used; each framework demonstrates the reusability of the code and the data science workflow and contains a **pre-processing pipeline**, a **modeling pipeline**, and an **evaluation pipeline**. A description of each solution for a given framework is given in the related notebook. For the data science pipeline with the best performance, the best-performing model is used to create a **prediction pipeline** and a **deeper analysis** of the model and its results with respect to performance and the business goal.

The repo focuses on the following aspects:

- Build a simple ETL pipeline to prepare the raw data for analysis and classification.
- Conduct general data analysis for data quality investigation under consideration of the business goal.
- Conduct data analysis to get an understanding of how to handle the data for multi-class classification, including a naive benchmark model using sklearn (DummyClassifier & a custom classifier); see the baseline sketch after this list.
- Build multiple machine learning pipelines to evaluate the best classification performance. The following aspects are considered within those pipelines:
  - Benchmarking pipelines to compare the performance of multiple different types of models:
    - A basic benchmarking pipeline using naive classifiers as a baseline.
    - An AutoML (automated machine learning) pipeline using PyCaret to compare a "large" variety of machine learning algorithms (see the AutoML sketch after this list), considering:
      - including and excluding custom data pre-processing
      - the pre-defined hyper-parameter set for each algorithm by PyCaret
      - using random search for HPO (hyper-parameter optimization) with the pre-defined hyper-parameter search space for each algorithm by PyCaret
    - An AutoML (automated machine learning) pipeline using AutoGluon.Tabular, considering:
      - including and excluding custom data pre-processing
      - including auto pre-processing by AutoGluon.Tabular
      - including auto feature engineering by AutoGluon.Tabular
      - including multiple classifiers by using:
        - multiple ML algorithms
        - "standard" HPO for each algorithm defined by AutoGluon.Tabular
        - ensembles of algorithms (bagging and stacking with possibly multiple layers)
    - A benchmarking pipeline for multiple tree-based algorithms (since AutoML indicates good performance of tree-based algorithms for the given use case, while also showing that no single tree-based algorithm significantly outperforms the others), considering:
      - Tree-based classifiers: DecisionTree, RandomForest, LightGBM.
      - Model hyper-parameter optimization.
      - Class imbalance.
    - A benchmarking pipeline for neural networks using PyTorch/Lightning (AutoML shows relatively low performance of neural networks under the given time constraints for this use case; the AutoML NN results are double-checked with individual constraints), considering:
      - MLP and Embedding-MLP.
      - Custom classes to handle tabular data for PyTorch/Lightning Dataset, DataLoaders, LightningDataModule, LightningModule, Trainer, and Models (see the sketch after the code structure in section 5).
      - Model hyper-parameter optimization.
      - Class imbalance.
  - A custom pipeline for the best performing model based on benchmarking, considering:
    - Model hyper-parameter optimization.
    - Class imbalance.
    - Oversampling.
  - Business decision optimization for the best model:
    - Threshold analysis (since the best model provides probabilistic forecasts); see the profit-curve sketch after this list.
    - Consideration of class values from a business perspective, using profit curves and under consideration of thresholds.
- Build a production pipeline (training & inference), excluding infrastructure aspects, for the best model to provide the final results.
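The naive benchmark mentioned above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, not the repository's actual baseline notebook; the class distribution, metrics, and all numbers are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the packaging data set (the real data is not public).
X, y = make_classification(
    n_samples=2_000, n_features=10, n_informative=6,
    n_classes=4, weights=[0.5, 0.3, 0.15, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# "most_frequent" always predicts the majority class; "stratified" samples
# predictions according to the training class distribution.
for strategy in ("most_frequent", "stratified"):
    clf = DummyClassifier(strategy=strategy, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(strategy,
          "balanced accuracy:", round(balanced_accuracy_score(y_test, y_pred), 3),
          "macro F1:", round(f1_score(y_test, y_pred, average="macro"), 3))
```
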
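The two AutoML pipelines use the standard entry points of PyCaret and AutoGluon.Tabular. A rough sketch of both on the same synthetic data; the label column name is a placeholder, exact keyword arguments can differ between library versions, and in the repository the two pipelines live in separate notebooks/environments:

```python
import pandas as pd
from sklearn.datasets import make_classification

# Small synthetic stand-in data set with a placeholder label column.
X, y = make_classification(n_samples=1_000, n_features=8, n_classes=3,
                           n_informative=5, random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["packaging_category"] = y
train_df, test_df = df.iloc[:800], df.iloc[800:]

# PyCaret: one setup call, then a cross-validated comparison of many classifiers.
from pycaret.classification import setup, compare_models
setup(data=train_df, target="packaging_category", session_id=42)
best_pycaret_model = compare_models()

# AutoGluon.Tabular: fit an ensemble of models and rank them on held-out data.
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label="packaging_category").fit(train_data=train_df)
print(predictor.leaderboard(test_df))
```
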
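For the business decision optimization, the threshold analysis boils down to accepting a prediction only when its top-class probability exceeds a threshold, and then valuing the accepted predictions with a profit for correct and a cost for wrong classifications. A minimal sketch with made-up profit/cost values and toy data (the actual business values are not part of this README):

```python
import numpy as np

def profit_curve(y_true, y_proba, thresholds, profit_correct=1.0, cost_wrong=5.0):
    """Expected profit and coverage per acceptance threshold.

    y_true:  (n,) integer class labels
    y_proba: (n, n_classes) predicted class probabilities
    Profit/cost values are illustrative placeholders, not business figures.
    """
    y_pred = y_proba.argmax(axis=1)
    confidence = y_proba.max(axis=1)
    results = []
    for t in thresholds:
        accepted = confidence >= t      # only classify items above the threshold
        coverage = accepted.mean()      # share of items that get a prediction
        correct = y_pred[accepted] == y_true[accepted]
        profit = profit_correct * correct.sum() - cost_wrong * (~correct).sum()
        results.append((t, coverage, profit))
    return results

# Tiny synthetic example: 4 items, 3 classes.
y_true = np.array([0, 1, 2, 1])
y_proba = np.array([[0.9, 0.05, 0.05],
                    [0.4, 0.35, 0.25],
                    [0.2, 0.2, 0.6],
                    [0.1, 0.8, 0.1]])
for t, cov, profit in profit_curve(y_true, y_proba, thresholds=[0.5, 0.7, 0.9]):
    print(f"threshold={t:.1f}  coverage={cov:.2f}  profit={profit:.1f}")
```
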
**Python:** [sklearn](https://scikit-learn.org/stable/) | [PyCaret](https://pycaret.gitbook.io/docs) | [AutoGluon.Tabular](https://auto.gluon.ai/stable/tutorials/tabular/index.html) | [LightGBM](https://lightgbm.readthedocs.io/en/stable/) | [PyTorch/Lightning](https://lightning.ai/pytorch-lightning) | [MLflow](https://mlflow.org/) | [Optuna](https://optuna.org/) | [Docker](https://www.docker.com/)
# 3. Use case description

To reach sustainability goals for the packaging of products, the company needs to know to which packaging categories the single items belong. Since this information is missing for 45,058 of the total 137,035 items, the goal is to provide the categories for the items with missing ones based on a data-driven approach. The solution should be applicable to comparable data sets from multiple origins.

An initial analysis showed that simple 1:1 relationships and rule-based approaches do not lead to proper results; therefore, a machine learning approach was used. The goal is to build a solution that is capable of highly accurate predictions for as many packaging categories as possible. On the one hand, predictions need to meet a certain accuracy threshold to be useful for the business (a small number of wrong classifications can be tolerated, but low classification accuracy does not help the business). On the other hand, a minimum number of products needs to be covered (it is not mandatory to provide good predictions for all items, but providing good predictions for only a small share of items also does not help the business much). Finally, the machine learning solution should consider business decision optimization (cost optimization) based on the individual packaging categories (classes).

# 4. Structure of the showcase
As the showcase is intended to reflect the data science process used to tackle the use case, its structure follows this process.

# 5. Code structure
```
Directory-tree structure:
|-- environment.yml
|-- README.md
|-- README_ml_packaging_classification.md
|-- notebooks
|   |-- 20_clf_pipeline_pytorch_embeddingMLP_optuna.ipynb   # Embedding MLP with Optuna
|   |-- 20_clf_pipeline_pytorch_embeddingMLP.ipynb          # Embedding MLP
|   |-- 20_clf_pipeline_pytorch_MLP_optuna.ipynb            # MLP with Optuna
|   |-- 20_clf_pipeline_pytorch_MLP.ipynb                   # MLP, including detailed code description
|-- src
|   |-- pytorch_tabular                  # modules for tabular data using PyTorch Lightning
|   |   |-- callbacks.py                 # callbacks for the Lightning Trainer
|   |   |-- encoders.py                  # custom encoders for data preprocessing
|   |   |-- tabular_lightning.py         # Lightning classes for tabular data
|   |   |-- tabular_lightning_utils.py   # shared utility functions
|   |   |-- tabular_models.py            # custom models for PyTorch/Lightning
|   |-- utils.py                         # shared functions
```

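To illustrate how the modules above fit together, here is a minimal, self-contained sketch of the Dataset → LightningDataModule → LightningModule → Trainer flow for tabular data. The class names (`TabularDataModule`, `MLPClassifier`) and all numbers are illustrative only and are not the classes defined in `tabular_lightning.py` or `tabular_models.py`:

```python
# Depending on the installed version, the import may be `pytorch_lightning` instead of `lightning.pytorch`.
import torch
import lightning.pytorch as pl
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class TabularDataModule(pl.LightningDataModule):
    """Wraps tabular features/labels into a train DataLoader (hypothetical class name)."""

    def __init__(self, features: torch.Tensor, labels: torch.Tensor, batch_size: int = 64):
        super().__init__()
        self.dataset = TensorDataset(features, labels)
        self.batch_size = batch_size

    def train_dataloader(self):
        return DataLoader(self.dataset, batch_size=self.batch_size, shuffle=True)


class MLPClassifier(pl.LightningModule):
    """A plain MLP for multi-class classification (hypothetical class name)."""

    def __init__(self, n_features: int, n_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    x = torch.randn(512, 10)          # random stand-in features
    y = torch.randint(0, 4, (512,))   # random stand-in labels (4 classes)
    trainer = pl.Trainer(max_epochs=2, logger=False, enable_checkpointing=False)
    trainer.fit(MLPClassifier(n_features=10, n_classes=4), datamodule=TabularDataModule(x, y))
```

The repository's own modules build on this skeleton with custom encoders, callbacks, and Optuna-based hyper-parameter optimization (see the notebooks listed above).
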
# Google Colab X Google Drive (quick start)
Some notebooks include code to use Google Colab with Google Drive.
Google Colab is a free cloud service for running Python code in the browser and has native integration with Google Drive.
To run the notebooks, you need to mount your Google Drive in the Colab environment (see the code implementation and the sketch below).
Ensure that the Google Drive path is defined in the code (Colab config file).
The additional packages needed in Colab are installed in the code implementation (a kernel restart is required).
It is recommended to run those notebooks on a GPU or on a machine with a high number of CPU cores (only limited availability on the free tier).
It is not recommended to run those notebooks on the free-tier CPU machine (only 2 cores available).

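A typical Colab setup cell looks like the following; the project path is a placeholder and has to match the path defined in the Colab config file:

```python
# Only runs inside Google Colab; mounts Google Drive and switches to the project directory.
import os
from google.colab import drive

drive.mount("/content/drive")
PROJECT_DIR = "/content/drive/MyDrive/pytorch-lightning-tabular-classification"  # placeholder path
os.chdir(PROJECT_DIR)
```
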
# Azure ML Service (quick start)
How to run the code in Azure ML Service:

- Create a new Azure ML Service workspace.
- Create a new Azure ML Service compute instance.
- Clone the repository to the compute instance under "/home/azureuser/cloudfiles/code/Users/<user.name>/"
  to ensure the code is stored in the related storage account (File Share).
- Optional but recommended: install [Miniforge](https://github.com/conda-forge/miniforge) on the compute instance for fast virtual Python environment creation.
- Run the following commands in the terminal:
  ```
  curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
  bash Miniforge3-$(uname)-$(uname -m).sh
  ```
- Restart the terminal to use mamba as the package manager.
- Create a new virtual environment with the following command in the terminal from the root directory of the repository:
  `mamba env create -f environment.yml`
- Create/upload your config file to define your directory paths accordingly.
- Activate the environment, or select it in the Jupyter notebook (the kernel can be selected from the Python kernel dropdown menu), to run the code/notebooks.
