This repository contains the code and an explanation of the approach for the data science JOB-A-THON September 2021 hackathon conducted by Analytics Vidhya.

Libraries used:

1. numpy
2. pandas
3. matplotlib
4. sklearn

Steps in the approach:

1. Data Loading and Processing
2. Model Creation, Training and Evaluation

I used the Python module pandas to load the train and test data and stored the resulting DataFrames in Python variables.
Using pandas methods such as head(), info(), describe(), corr() and nunique(), together with a pair plot, I got a basic understanding of the data.
The attributes in the data are ID, Store_id, Store_Type, Location_Type, Region_code, Date, Holiday, Discount and #Order, and we have to predict Sales. The data has no missing values and a few categorical attributes.
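The exploration calls mentioned above can be sketched roughly as follows. This is only an illustration: the DataFrame is a tiny made-up sample that reuses the column names listed in this README, and in the real workflow `train` would come from `pd.read_csv` on the competition files.

```python
import pandas as pd

# Made-up sample rows (assumption); only the column names come from this README.
train = pd.DataFrame({
    "ID": ["T1", "T2", "T3", "T4"],
    "Store_id": [1, 2, 1, 2],
    "Store_Type": ["S1", "S4", "S1", "S4"],
    "Location_Type": ["L3", "L1", "L3", "L1"],
    "Region_code": ["R1", "R2", "R1", "R2"],
    "Date": ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"],
    "Holiday": [1, 1, 0, 0],
    "Discount": ["Yes", "No", "No", "Yes"],
    "#Order": [9, 60, 42, 23],
    "Sales": [7011.8, 51789.1, 36868.2, 19715.2],
})

print(train.head())        # first rows
train.info()               # column dtypes and non-null counts
print(train.describe())    # summary statistics of numeric columns
print(train.nunique())     # number of distinct values per column

# Correlations are only defined for numeric columns, so select those first.
print(train.select_dtypes("number").corr())

# Count missing values per column.
missing = train.isna().sum()
print(missing)
```

The `missing` Series is a quick way to confirm the "no missing values" observation on the real data.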
The ID attribute carries no useful information, so I dropped it. The #Order attribute is not present in the test data, so I dropped it as well. Because I wanted to keep the model simple, I also removed the Date attribute. After dropping these attributes, I encoded the categorical attributes using LabelEncoder from sklearn. The data was then split into training and validation sets using the train_test_split() method in sklearn, without shuffling, so that the sequence in the data is preserved.
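A minimal sketch of these preprocessing steps is shown below. The sample rows, the choice of which columns count as categorical, and the 25% validation size are assumptions made for illustration; the README does not state them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Made-up sample rows (assumption); only the column names come from this README.
train = pd.DataFrame({
    "ID": [f"T{i}" for i in range(1, 9)],
    "Store_id": [1, 2, 3, 4, 1, 2, 3, 4],
    "Store_Type": ["S1", "S4", "S3", "S2", "S1", "S4", "S3", "S2"],
    "Location_Type": ["L3", "L1", "L2", "L3", "L3", "L1", "L2", "L3"],
    "Region_code": ["R1", "R1", "R2", "R3", "R1", "R1", "R2", "R3"],
    "Date": ["2018-01-01"] * 4 + ["2018-01-02"] * 4,
    "Holiday": [1] * 4 + [0] * 4,
    "Discount": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"],
    "#Order": [9, 60, 42, 23, 30, 55, 40, 20],
    "Sales": [7011.8, 51789.1, 36868.2, 19715.2, 22000.0, 47000.0, 33000.0, 15000.0],
})

# Drop the unused columns and separate the target.
features = train.drop(columns=["ID", "#Order", "Date", "Sales"])
target = train["Sales"]

# Columns treated as categorical here (assumption based on the sample values).
cat_cols = ["Store_Type", "Location_Type", "Region_code", "Discount"]
for col in cat_cols:
    # A fresh encoder per column maps each distinct label to an integer.
    features[col] = LabelEncoder().fit_transform(features[col])

# shuffle=False keeps the original row order: the last rows become validation data.
X_train, X_val, y_train, y_val = train_test_split(
    features, target, test_size=0.25, shuffle=False
)
print(X_train.shape, X_val.shape)
```

With `shuffle=False` the split is a simple cut, so the validation rows are always the most recent ones when the data is ordered by date.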
As this is a regression problem, I tried several regression algorithms: LinearRegression, Ridge, Lasso, RandomForestRegressor and a few other ensemble regressors. Among these, RandomForestRegressor gave a good validation score, so I trained it on the complete training data and used it to predict on the test data. Mean squared log error is used as the evaluation metric.
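The model comparison described above could look roughly like the sketch below. It runs on synthetic data, and the hyperparameters, the sequential split point and the clipping of predictions are assumptions for illustration, not the settings used in the actual submission.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_log_error

# Synthetic stand-in for the encoded features and the Sales target (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = 1000 + 5000 * X[:, 0] * X[:, 1] + rng.normal(0, 50, size=200)

# Sequential split without shuffling: first 150 rows train, last 50 validate.
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # MSLE is undefined for negative values, so clip predictions at zero.
    pred = np.clip(model.predict(X_val), 0, None)
    scores[name] = mean_squared_log_error(y_val, pred)

for name, score in scores.items():
    print(f"{name}: MSLE = {score:.5f}")

# Refit the model with the lowest validation error on all available data.
best_name = min(scores, key=scores.get)
final_model = models[best_name].fit(X, y)
```

Lower MSLE is better, so the model is picked with `min`. After refitting, `final_model.predict` would be applied to the preprocessed test features.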
Using this approach, the error I got on the private leaderboard is 225.79.