The data set is generated by the players of a game. The data is in compressed CSV format, split across multiple files. There are two datasets, data/profiles and data/activity, each in its own folder. The data has no header row. The profiles dataset contains user profiles with the following columns:
- player_id (integer) - unique identifier of the player
- registration_date (yyyy-MM-dd) - date when the player first played the game
- country code (integer) - country of the user
- operating system (integer) - operating system of the user
- device type (integer) - type of device used by the player
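Because the files have no header row, column names must be supplied when loading. Below is a minimal loading sketch with pandas; the .csv.gz extension and the snake_case column names are assumptions, since the exact file naming is not specified above.

```python
from pathlib import Path

import pandas as pd

# Column names are assumptions based on the schema above; the files have no header row.
PROFILE_COLUMNS = [
    "player_id",
    "registration_date",
    "country_code",
    "operating_system",
    "device_type",
]


def load_profiles(folder: str = "data/profiles") -> pd.DataFrame:
    """Read all compressed CSV parts in the folder into one DataFrame.

    Compression is inferred by pandas from the .gz extension.
    """
    parts = sorted(Path(folder).glob("*.csv.gz"))
    frames = [
        pd.read_csv(
            p,
            header=None,
            names=PROFILE_COLUMNS,
            parse_dates=["registration_date"],
        )
        for p in parts
    ]
    return pd.concat(frames, ignore_index=True)
```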
The activity dataset contains information on players' daily visits to the game. For example, if the player with ID 123 plays the game at least once on 2018-09-02, then the data set contains a row with those values. The complete schema of the activity dataset contains the columns:
- event_date (yyyy-MM-dd)
- player_id (integer) - unique identifier of the player
- money_spent (float) - Total money spent during the day
- session_count (integer) - Number of game sessions for the day
- purchase_count (integer) - Number of purchases during the day
- time_spent_seconds (integer) - Total time spent playing during the day
- ads_impressions (integer) - Total number of ads seen during the day
- ads_clicks (integer) - Total number of ads clicked during the day
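The activity files follow the same headerless, compressed-CSV layout, so they can be loaded the same way. A minimal sketch, again assuming a .csv.gz extension for the parts:

```python
from pathlib import Path

import pandas as pd

# Column order follows the schema above; the files have no header row.
ACTIVITY_COLUMNS = [
    "event_date",
    "player_id",
    "money_spent",
    "session_count",
    "purchase_count",
    "time_spent_seconds",
    "ads_impressions",
    "ads_clicks",
]


def load_activity(folder: str = "data/activity") -> pd.DataFrame:
    """Read all compressed CSV parts in the folder into one DataFrame."""
    parts = sorted(Path(folder).glob("*.csv.gz"))
    frames = [
        pd.read_csv(
            p,
            header=None,
            names=ACTIVITY_COLUMNS,
            parse_dates=["event_date"],
        )
        for p in parts
    ]
    return pd.concat(frames, ignore_index=True)
```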
The goal of this task is to build a machine learning model that identifies churned players. A player has churned if they are not seen in the game after the 7th day from their registration.
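Under that definition, the churn label can be derived by joining each player's last observed event date against their registration date. A minimal sketch (assuming datetime-typed date columns and matching player_id keys, as in the schemas above):

```python
import pandas as pd


def label_churn(profiles: pd.DataFrame, activity: pd.DataFrame) -> pd.DataFrame:
    """Label a player as churned (1) if they have no activity event later
    than 7 days after their registration date, else retained (0).
    """
    last_seen = (
        activity.groupby("player_id")["event_date"]
        .max()
        .rename("last_event")
        .reset_index()
    )
    labeled = profiles.merge(last_seen, on="player_id", how="left")
    days_active = (labeled["last_event"] - labeled["registration_date"]).dt.days
    # Players with no activity at all (NaT last_event) count as churned:
    # NaN > 7 evaluates to False, so the negation marks them as churned.
    labeled["churn"] = (~(days_active > 7)).astype(int)
    return labeled[["player_id", "churn"]]
```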
This is a test of the complete, end-to-end life cycle of building a machine learning model. The deliverable is suggested to include the following items:
- data example generation
- label and feature engineering
- splitting into training/validation/test sets
- model selection and parameter tuning
- model training and evaluation
- model deployment and service
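For the splitting step, one pitfall worth guarding against is leakage: all rows belonging to one player should land in a single set. A minimal sketch of a player-level split (the fractions and seed are illustrative choices, not requirements of the task):

```python
import numpy as np


def split_players(player_ids, val_frac=0.15, test_frac=0.15, seed=42):
    """Randomly partition unique player IDs into train/validation/test sets.

    Splitting by player rather than by row keeps each player's data in a
    single set and avoids leakage between the sets.
    """
    ids = np.unique(np.asarray(player_ids))
    rng = np.random.default_rng(seed)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    n_val = int(len(ids) * val_frac)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test
```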
You are expected to submit the following items:
- Jupyter notebooks for data processing, model training, and model evaluation
- performance metrics from model training and evaluation
- a Docker image containing the model files and the model service; the image should be available at https://hub.docker.com/, ready for docker pull
- a document describing the model training process and how to use the model service, and
- a write-up detailing your choice of performance metrics and methods of model evaluation
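For the Docker deliverable, a minimal image definition might look like the sketch below. The file names (serve.py, model.pkl, requirements.txt), the base image, and the port are all placeholder assumptions; the task does not prescribe a serving stack.

```dockerfile
# Minimal sketch; serve.py, model.pkl and requirements.txt are placeholder names.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl serve.py ./
EXPOSE 8080
CMD ["python", "serve.py"]
```

Once built and pushed under your own Docker Hub account, the image can then be fetched with docker pull as required above.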