- Nick Gigliotti
- [email protected]
Banco de Portugal has asked me to create a model to help them predict which customers are likely to invest in term deposit accounts as a result of telemarketing. Telemarketing is, no doubt, very stressful and time-consuming work. Salespersons don't like to waste the time of customers, because it's a waste of their time too. Not only that, but dealing with uninterested customers is surely the ugliest part of the job. How many times a day does a bank telemarketer have to put up with insults and rude remarks? On the other hand, salespersons who are stuck calling low-potential customers are likely to resort to aggressive, desperate sales tactics. It's like trench warfare over the phone, and it needs to be made easier.
That's where machine learning comes into play, and in particular logistic regression. Logistic regression models are widely used because they offer a good combination of simplicity and predictive power. My goal is to create a strong model which can predict investments from data that can realistically be obtained in advance. Banco de Portugal will use my model to increase the efficiency of their telemarketing efforts by discovering the customers with the highest probability of investing.
I train my predictive classifier on a Banco de Portugal telemarketing dataset which is publicly available on the UCI Machine Learning Repository. The data was collected between May 2008 and November 2010. It contains 21 features in total and about 41k observations. Roughly two thirds of the features are categorical and one third are numeric.
Note that I have renamed, added, or removed some features; a loading-and-renaming sketch follows the feature list below.
- 'age' - years
- 'job' - type of job
- 'marital' - marital status
- 'education' - level of education
- 'default' - has defaulted on credit
- 'housing' - has housing loan
- 'loan' - has personal loan
- 'contact_cellular' - call was on a cellular rather than landline
- 'contact_month' - month of last contact
- 'contact_weekday' - weekday of last contact
- 'contact_duration' - duration of last contact in seconds
- 'contact_count' - total number of contacts during this campaign
- 'invested' - invested in a term deposit (target variable)
A term deposit is an investment which pays interest and matures after a fixed term, typically a few months to a few years.
- 'recent_prev_contact' - last contacted less than one week ago during previous campaign
- 'prev_contact' - was contacted during a previous campaign
- 'prev_failure' - previous campaign did not result in a sale
- 'prev_success' - previous campaign resulted in a sale
- 'emp_var_rate' - employment variation rate (quarterly indicator)
- 'cons_price_idx' - consumer price index (monthly indicator)
- 'cons_conf_idx' - consumer confidence index (monthly indicator)
- 'euribor_3m' - euribor 3 month rate (daily indicator)
- 'n_employed' - thousands of people employed (quarterly indicator)
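As a rough illustration of that renaming, the sketch below loads the raw UCI file and maps the original column names onto the names listed above. The file name, separator, original column names, and the derived binary flags are assumptions based on the public 'bank-additional-full.csv' file; the project's actual cleaning code lives in the 'tools' module.

```python
# Sketch only: load the raw UCI file and rename columns to the names used above.
# The path, separator, and original column names are assumptions based on the
# public 'bank-additional-full.csv' file, not the project's exact cleaning code.
import pandas as pd

df = pd.read_csv("data/bank-additional-full.csv", sep=";")
df = df.rename(columns={
    "month": "contact_month",
    "day_of_week": "contact_weekday",
    "duration": "contact_duration",
    "campaign": "contact_count",
    "emp.var.rate": "emp_var_rate",
    "cons.price.idx": "cons_price_idx",
    "cons.conf.idx": "cons_conf_idx",
    "euribor3m": "euribor_3m",
    "nr.employed": "n_employed",
    "y": "invested",
})

# Examples of the derived binary {0, 1} features described in the list above
# (the exact derivations are an assumption about the cleaning step).
df["contact_cellular"] = (df["contact"] == "cellular").astype(int)
df["prev_contact"] = (df["poutcome"] != "nonexistent").astype(int)
df["prev_success"] = (df["poutcome"] == "success").astype(int)
df["prev_failure"] = (df["poutcome"] == "failure").astype(int)
df["recent_prev_contact"] = (df["pdays"] < 7).astype(int)  # pdays == 999 means never contacted
df["invested"] = (df["invested"] == "yes").astype(int)
df = df.drop(columns=["contact", "poutcome", "pdays", "previous"])
```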
- I perform some preliminary data cleaning and reorganization, including converting some categorical and numeric features to binary {0, 1} features.
- I perform a train-test split, dropping 'contact_duration' (see the first sketch after this list).
The 'contact_duration' feature is only known after a call has taken place, so it is not information that Banco de Portugal could realistically plug into my model in advance.
- I set up preprocessing pipelines which feed into a Scikit-Learn classification "estimator" (see the pipeline sketch after this list).
- I create a baseline dummy model and a baseline logistic regression model.
- I make iterative progress on the logistic regression model, adding preprocessors and changing parameters at each step.
- I encode categorical variables using a one-hot scheme, so that unknown or missing categories encode to all-zero indicator columns.
- I fill the few remaining missing values with 0.0, for consistency.
- I filter out highly inter-correlated sets of features, retaining the best feature from each set, to reduce multicollinearity.
- I perform a slight 95% Winsorization on the data before scaling to reduce the influence of outliers.
- I center each feature on its mean and scale it to unit standard deviation.
- I use Scikit-Learn's built-in class weight balancing.
- I use L2 regularization to reduce overfitting.
- I retrain the final model pipeline on the full dataset.
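The split and baseline steps might look roughly like the sketch below, assuming the cleaned dataframe `df` from the loading sketch above; the test fraction and random seed here are illustrative, not the project's actual settings.

```python
# Minimal sketch of the train-test split and the dummy baseline, assuming the
# cleaned dataframe `df` from the loading sketch above.
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X = df.drop(columns=["invested", "contact_duration"])  # duration is not known in advance
y = df["invested"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline dummy model: always predicts the majority class ("did not invest").
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
```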
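The preprocessing steps above can then be assembled into a single pipeline. The sketch below is a minimal version under assumed column groupings, Winsorization limits, and step names; the saved pipeline in the 'models' directory is the authoritative one.

```python
# Sketch of the preprocessing + modeling pipeline. Column groupings, the exact
# Winsorization limits, and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class Winsorizer(BaseEstimator, TransformerMixin):
    """Clip each column to quantiles learned during fit (95% Winsorization is
    read here as clipping to the 2.5th and 97.5th percentiles)."""

    def __init__(self, lower=0.025, upper=0.975):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.lower_bounds_ = np.nanquantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.nanquantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)

    def get_feature_names_out(self, input_features=None):
        return np.asarray(input_features)  # one column in, one column out


# Assumed column groupings; the real groups come from the cleaned dataframe.
categorical = ["job", "marital", "education", "default", "housing", "loan",
               "contact_month", "contact_weekday"]
numeric = ["age", "contact_count", "emp_var_rate", "cons_price_idx",
           "cons_conf_idx", "euribor_3m", "n_employed"]

preprocess = ColumnTransformer([
    # Unknown or missing categories encode to all-zero indicator columns.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    # Winsorize, fill the few remaining missing values with 0.0, then standardize.
    ("num", Pipeline([
        ("winsorize", Winsorizer()),
        ("impute", SimpleImputer(strategy="constant", fill_value=0.0)),
        ("scale", StandardScaler()),
    ]), numeric),
], remainder="passthrough")  # already-binary {0, 1} features pass through untouched

model = Pipeline([
    ("preprocess", preprocess),
    # Class-weight balancing plus L2 regularization, as described above.
    ("clf", LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000)),
])

model.fit(X_train, y_train)  # fit on the training split for evaluation
# model.fit(X, y)            # final step: retrain on the full dataset
```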
Here is a look at the diagnostic plots (confusion matrix, ROC curve, precision-recall curve) of the final model before it was retrained on the full dataset. Notice the strong diagonal on the confusion matrix, with ~0.71 positive recall. The average precision (AP) score is 0.42 and the weighted ROC AUC score is 0.78. Not bad, considering that 'contact_duration', the strongest feature in the dataset, was dropped.
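Plots like these can be regenerated with Scikit-Learn's display helpers; the sketch below assumes the fitted `model` and held-out split from the sketches above, while the actual figures are produced in main_notebook.ipynb.

```python
# Sketch: regenerating the three diagnostic plots on the held-out test split,
# assuming the fitted `model`, `X_test`, and `y_test` from the sketches above.
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, PrecisionRecallDisplay,
                             RocCurveDisplay)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test,
                                      normalize="true", ax=axes[0])
RocCurveDisplay.from_estimator(model, X_test, y_test, ax=axes[1])
PrecisionRecallDisplay.from_estimator(model, X_test, y_test, ax=axes[2])
fig.tight_layout()
plt.show()
```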
The highest magnitude coefficients are 'prev_success', 'contact_cellular', 'n_employed', and 'contact_month_may'. The most interesting novel discovery gleaned from the model is that the Portuguese employment count ('n_employed') has a very strong negative relationship with clients choosing to invest. I don't have an explanation for why this is, but the relationship is unambiguously strong: when employment is low, bank marketing for term deposits is highly effective!
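The coefficients themselves can be read straight out of the fitted pipeline. A small sketch, assuming the step names ('preprocess', 'clf') from the pipeline sketch above:

```python
# Sketch: inspecting the largest-magnitude coefficients of the fitted model,
# assuming the step names ("preprocess", "clf") from the pipeline sketch above.
import pandas as pd

names = model.named_steps["preprocess"].get_feature_names_out()
coefs = pd.Series(model.named_steps["clf"].coef_[0], index=names)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))
```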
The most important future work would be to build different types of models and compare them to my final `LogisticRegression`. `RandomForestClassifier`, `LinearSVC`, and `KNeighborsClassifier` are three obvious choices. Unlike most support vector machines, the `LinearSVC` is able to handle datasets with large numbers of observations. But as it is a linear model, I still have to worry about multicollinearity. Multicollinearity is not a concern, however, with the `RandomForestClassifier` or the `KNeighborsClassifier`. That means no features have to be dropped on that account. This alone is reason to think one of these models could perform better than my regression. Of all of these, I see the most potential in the `RandomForestClassifier`, in part because it has so many hyperparameters to tune.
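As a starting point for such a comparison, the same preprocessing could be reused with a different final estimator. A hedged sketch, assuming the `preprocess` transformer and training split from the earlier sketches (the hyperparameters are illustrative only):

```python
# Sketch: swapping the final estimator while keeping the same preprocessing,
# assuming the `preprocess` ColumnTransformer from the pipeline sketch above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

rf_model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                   random_state=42)),
])
rf_model.fit(X_train, y_train)
```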
- My custom analysis code can be found in the 'tools' module/directory.
- The original data files are located in the 'data' directory.
- The final model pipeline is saved in the 'models' directory.
- The parameter search results are located in the 'sweep_results' directory.
- See main_notebook.ipynb for the analysis notebook.
- See the 'presentation' directory for the presentation and related files.
- See the 'reference' directory for a paper written on a similar dataset.