# 1. Overview

Find the full code in the repository [pytorch-lightning-tabular-classification](https://github.com/tiefenthaler/pytorch-lightning-tabular-classification) on GitHub.

## 1.1. Table of Contents

**Sections 1 and 2 provide a brief overview of PyTorch Lightning.**
**Section 3 describes the use case of tabular multi-class classification.**
**Section 4 provides the code implementation of the deep learning pipeline.**

- [1. Overview](#1-overview)
  - [1.1. Table of Contents](#11-table-of-contents)
- [2. Packaging Classification: A Tabular Data Use Case Using Machine Learning](#2-packaging-classification-a-tabular-data-use-case-using-machine-learning)
- [3. Use case description](#3-use-case-description)
- [4. Structure of the showcase](#4-structure-of-the-showcase)
- [5. Code structure](#5-code-structure)
- [Google Colab X Google Drive (quick start)](#google-colab-x-google-drive-quick-start)
- [Azure ML Service (quick start)](#azure-ml-service-quick-start)

# 2. Packaging Classification: A Tabular Data Use Case Using Machine Learning

This repo is meant as a showcase to demonstrate a data science workflow for a multi-class classification use case (see the use case description below) in a business context. Data science use cases can differ greatly in how their results are used by the business. This use case provides one-time insights plus an inference solution that the business reuses manually to obtain classification outputs for continued use. The repo therefore focuses on analytics and limits operational aspects to what analytical reusability requires, while still following data engineering and data science best practices for data preparation, pre-processing, modeling, and evaluation.

Several modeling frameworks, each with its specific integrations, are used to demonstrate the reusability of the code and of the data science workflow; each framework's solution contains a **pre-processing pipeline**, a **modeling pipeline**, and an **evaluation pipeline**, and is described in the related notebook. For the data science pipeline with the best performance, the best-performing model is used to create a **prediction pipeline** and a **deeper analysis** of the model and its results with respect to performance and the business goal.

The repo focuses on the following aspects:

- Build a simple ETL pipeline to prepare the raw data for analysis and classification.
- Conduct general data analysis to investigate data quality with the business goal in mind.
- Conduct data analysis to get an understanding of how to handle the data for multi-class classification, including a naive benchmark model using sklearn (DummyClassifier & a custom classifier).
- Build multiple machine learning pipelines to evaluate which approach achieves the best classification performance. The following aspects are considered within those pipelines:
  - Benchmarking pipelines to compare the performance of multiple different types of models:
    - A basic benchmarking pipeline using naive classifiers as a baseline.
    - An AutoML (automated machine learning) pipeline using PyCaret to compare a "large" variety of machine learning algorithms, considering:
      - including and excluding custom data pre-processing
      - pre-defined hyper-parameter sets for each algorithm provided by PyCaret
      - using random search for HPO (hyper-parameter optimization) with a pre-defined hyper-parameter search space for each algorithm provided by PyCaret
    - An AutoML (automated machine learning) pipeline using AutoGluon.Tabular, considering:
      - including and excluding custom data pre-processing
      - including auto pre-processing by AutoGluon.Tabular
      - including auto feature engineering by AutoGluon.Tabular
      - including multiple classifiers by using:
        - multiple ML algorithms
        - "standard" HPO for each algorithm defined by AutoGluon.Tabular
        - ensembles of algorithms (bagging and stacking with possibly multiple layers)
    - A benchmarking pipeline for multiple tree-based algorithms, considering:
      (since AutoML indicates good performance of tree-based algorithms for the given use case and shows that no single tree-based algorithm significantly outperforms the others)
      - Tree-based classifiers: DecisionTree, RandomForest, LightGBM.
      - Model hyper-parameter optimization.
      - Class imbalance.
    - A benchmarking pipeline for neural networks using PyTorch/Lightning, considering:
      (AutoML shows relatively low performance of neural networks within the defined time constraints for the given use case; the AutoML NN results are double-checked with individual constraints)
      - MLP and Embedding-MLP.
      - Custom classes to handle tabular data for PyTorch/Lightning Dataset, DataLoaders, LightningDataModule, LightningModule, Trainer, and Models (see the sketch after this list).
      - Model hyper-parameter optimization.
      - Class imbalance.
  - A custom pipeline for the best-performing model based on benchmarking, considering:
    - Model hyper-parameter optimization.
    - Class imbalance.
    - Oversampling.
  - Business decision optimization for the best model.
    - Threshold analysis (since the best model provides probabilistic forecasts).
    - Consideration of class values from a business perspective using profit curves, taking thresholds into account.
- Build a production pipeline (training & inference), excluding infrastructure aspects, for the best model to provide the final results.
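
The sketch below is a minimal, hedged illustration of the kind of custom PyTorch Lightning classes for tabular data used in the neural-network pipelines above (a `Dataset` wrapper and an MLP `LightningModule` with optional class weights for imbalance). All class names, layer sizes, and hyper-parameters are illustrative assumptions, not the actual implementation in `src/pytorch_tabular`.

```python
# Minimal sketch of a tabular Dataset and an MLP LightningModule (illustrative only).
# For Lightning < 2.0, replace the import with: import pytorch_lightning as pl
import torch
import torch.nn as nn
import lightning.pytorch as pl
from torch.utils.data import Dataset, DataLoader


class TabularDataset(Dataset):
    """Wraps a numeric feature matrix X and integer class labels y as tensors."""

    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


class MLPClassifier(pl.LightningModule):
    """Plain MLP for multi-class classification; class_weights counteracts imbalance."""

    def __init__(self, n_features, n_classes, hidden_size=64, lr=1e-3, class_weights=None):
        super().__init__()
        self.save_hyperparameters()
        self.model = nn.Sequential(
            nn.Linear(n_features, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_classes),
        )
        self.loss_fn = nn.CrossEntropyLoss(weight=class_weights)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)


# Hypothetical usage: wrap the pre-processed data and train for a few epochs.
# train_loader = DataLoader(TabularDataset(X_train, y_train), batch_size=64, shuffle=True)
# trainer = pl.Trainer(max_epochs=10)
# trainer.fit(MLPClassifier(n_features=X_train.shape[1], n_classes=5), train_loader)
```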

**Python:** [sklearn](https://scikit-learn.org/stable/) | [PyCaret](https://pycaret.gitbook.io/docs) | [AutoGluon.Tabular](https://auto.gluon.ai/stable/tutorials/tabular/index.html) | [LightGBM](https://lightgbm.readthedocs.io/en/stable/) | [PyTorch/Lightning](https://lightning.ai/pytorch-lightning) | [MLflow](https://mlflow.org/) | [Optuna](https://optuna.org/) | [Docker](https://www.docker.com/)

# 3. Use case description

To reach sustainability goals for the packaging of products, the company needs to know which packaging categories the individual items belong to. Since this information is missing for 45,058 of the total 137,035 items, the goal is to provide the categories for those items based on a data-driven approach. The solution should be applicable to comparable data sets from multiple origins.

Initial analysis showed that simple 1:1 relationships and rule-based approaches do not lead to proper results; therefore, a machine learning approach was used. The goal is to build a solution that is capable of highly accurate predictions for as many packaging categories as possible. On the one hand, predictions need to meet a certain accuracy threshold to be useful for the business (a small number of misclassifications can be tolerated, but low classification accuracy does not help the business). On the other hand, a minimum number of products needs to be covered (it is not mandatory to provide good predictions for all items, but providing good predictions for only a small share of items also does not help the business much). Finally, the machine learning solution should support business decision optimization (cost optimization) based on the individual packaging categories (classes).
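
To make the two thresholds concrete, the hedged sketch below (purely illustrative data and names) reports, for a given confidence threshold, the accuracy on the items whose prediction is kept and the share of items covered:

```python
# Sketch of the accuracy-vs-coverage trade-off: keep only predictions whose
# top-class probability exceeds a threshold, then measure accuracy on the kept
# items and the fraction of items covered. The data below is random, for illustration.
import numpy as np


def accuracy_coverage(y_true, y_proba, threshold):
    """y_proba: array of shape (n_samples, n_classes) with predicted probabilities."""
    confidence = y_proba.max(axis=1)
    y_pred = y_proba.argmax(axis=1)
    kept = confidence >= threshold
    coverage = kept.mean()
    accuracy = (y_pred[kept] == y_true[kept]).mean() if kept.any() else float("nan")
    return accuracy, coverage


rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=1000)          # hypothetical labels for 5 classes
y_proba = rng.dirichlet(np.ones(5), size=1000)  # hypothetical predicted probabilities
for t in (0.3, 0.5, 0.7):
    acc, cov = accuracy_coverage(y_true, y_proba, t)
    print(f"threshold={t:.1f}  accuracy={acc:.2f}  coverage={cov:.1%}")
```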

# 4. Structure of the showcase

As the showcase is intended to reflect the data science process used to tackle the use case, its structure follows that process.

# 5. Code structure

```
Directory-tree structure:
|-- environment.yml
|-- README.md
|-- README_ml_packaging_classification.md
|-- notebooks
| |-- 20_clf_pipeline_pytorch_embeddingMLP_optuna.ipynb # Embedding MLP with Optuna
| |-- 20_clf_pipeline_pytorch_embeddingMLP.ipynb # Embedding MLP
| |-- 20_clf_pipeline_pytorch_MLP_optuna.ipynb # MLP with Optuna
| |-- 20_clf_pipeline_pytorch_MLP.ipynb # MLP, including detailed code description
|-- src
| |-- pytorch_tabular # Modules for tabular data using PyTorch Lightning
| | |-- callbacks.py # callbacks for the Lightning Trainer
| | |-- encoders.py # custom encoders for data preprocessing
| | |-- tabular_lightning.py # Lightning classes for tabular data
| | |-- tabular_lightning_utils.py # shared utility functions
| | |-- tabular_models.py # custom models for PyTorch/Lightning
| |-- utils.py # shared functions
```
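
As a hedged illustration of the kind of component `callbacks.py` could contain (a callback plugged into the Lightning `Trainer`), the minimal example below simply prints the finished epoch; it is a placeholder, not the repo's actual code.

```python
# Minimal Lightning callback sketch (placeholder, not the repo's callbacks.py).
import lightning.pytorch as pl


class PrintEpochCallback(pl.Callback):
    """Prints the epoch number after every training epoch."""

    def on_train_epoch_end(self, trainer, pl_module):
        print(f"finished epoch {trainer.current_epoch}")


# Hypothetical usage with the Trainer:
# trainer = pl.Trainer(max_epochs=5, callbacks=[PrintEpochCallback()])
```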

# Google Colab X Google Drive (quick start)

Some notebooks include code to use Google Colab with Google Drive.
Google Colab is a free cloud service for running Python code in the browser and has native integration with Google Drive.
To run the notebooks, you need to mount your Google Drive in the Colab environment (see the code implementation and the sketch below).
Make sure to define the Google Drive path in the code (Colab config file).
The additional packages needed in Colab are installed by the code implementation (kernel restart required).
It is recommended to run those notebooks on a GPU or on a machine with a high number of CPU cores (only available to a limited extent on the free tier).
It is not recommended to run those notebooks on the free-tier CPU machine (only 2 cores available).
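
For reference, mounting Google Drive in Colab typically looks like the snippet below; the project path is an assumption and should match the path defined in the Colab config file.

```python
# Mount Google Drive in the Colab runtime and point to the project directory.
from google.colab import drive

drive.mount("/content/drive")
# Assumed location of the cloned repo on Drive; adjust to your own setup.
project_dir = "/content/drive/MyDrive/pytorch-lightning-tabular-classification"
```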

# Azure ML Service (quick start)

How to run the code in Azure ML Service:

- Create a new Azure ML Service workspace.
- Create a new Azure ML Service compute instance.
- Clone the repository to the compute instance under "/home/azureuser/cloudfiles/code/Users/<user.name>/"
  to ensure the code is stored in the related storage account (File Share).
- Optional but recommended: install [Miniforge](https://github.com/conda-forge/miniforge) on the compute instance for fast virtual Python environment creation.
  - Run the following commands in the terminal:
    ```
    curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
    bash Miniforge3-$(uname)-$(uname -m).sh
    ```
  - Restart the terminal to use mamba as the package manager.
- Create a new virtual environment with the following command in the terminal from the root directory of the repository:
  `mamba env create -f environment.yml`
- Create/upload your config file to define your directory paths accordingly.
- Activate the environment, or select it in the Jupyter notebook (the kernel can be selected from the Python kernel dropdown menu), to run the code/notebooks (see the sketch below).
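
As a sketch (the environment name is a placeholder; use the name defined in `environment.yml`, and note that `ipykernel` must be part of that environment), activating the environment and registering it as a Jupyter kernel could look like this:

```
# Placeholder environment name; use the one defined in environment.yml.
mamba activate <env-name>
python -m ipykernel install --user --name <env-name> --display-name "Python (<env-name>)"
```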