Bank Telemarketing Efficiency: A Data Driven Approach using Random Forest Classifiers

Bank Marketing: Predicting Telemarketing Success

A random forest classification model is used as a data-driven method for predicting telemarketing success.

Document updated on: 2016-04-25 23:40:15 EST

Data information

The data relate to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. The goal of the model is to predict whether a customer will subscribe to a term deposit. The full article is: S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.

Random Forest Classification Model

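Some of the key advantages of using a random forest model include:

  • lower risk of overfitting than a single decision tree, because predictions are averaged over many trees
  • typically higher predictive accuracy than a single decision tree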

The input dataset has 16 independent variables and a target variable. The target variable y is binary, with about 88% "no" and 12% "yes". Before model fitting, all character variables are coded as factors. The factor levels need additional attention, because by default R assigns factor level numbers in alphabetical order. This model will serve as the benchmark for analyzing additional models. The best model is kept after each iteration of model fitting.
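As a minimal sketch of the factor coding described above, the snippet below converts the character columns of a small made-up data frame to factors and inspects the resulting levels. The data frame `bank` and its values are illustrative assumptions, not the real data.

```r
# Toy stand-in for the bank data (values are made up for illustration)
bank <- data.frame(
  age     = c(30, 45, 52, 38),
  marital = c("married", "single", "divorced", "married"),
  y       = c("no", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor
is_chr <- vapply(bank, is.character, logical(1))
bank[is_chr] <- lapply(bank[is_chr], factor)

# Levels are assigned alphabetically, so "no" is level 1 and "yes" is level 2
levels(bank$marital)
levels(bank$y)
as.integer(bank$y)

# relevel() changes which level comes first without changing the data
y_yes_first <- relevel(bank$y, ref = "yes")
levels(y_yes_first)
```

This is why y shows a minimum of 1 and a maximum of 2 in the summary table: 1 stands for "no" and 2 for "yes".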

Data Description

Summary Statistics

All variables are shown as numeric because that is easy for me to code. In practice the five-number summary is meaningless for categorical variables. However, the n, min, and max columns in this table still provide useful quality information for categorical variables, such as the number of levels and whether any values are missing.

variable  median    mean  st.dev    min    max      sum  NAs  zeros  ones     n
age           39    41.2    10.6     19     87   186130    0      0     0  4521
job            5     5.4     3.3      1     12    24464    0      0   478  4521
marital        2     2.1     0.6      1      3     9710    0      0   528  4521
education      2     2.2     0.7      1      4    10088    0      0   678  4521
default        1     1.0     0.1      1      2     4597    0      0  4445  4521
balance      444  1422.7  3009.6  -3313  71188  6431836    0    357    15  4521
housing        2     1.6     0.5      1      2     7080    0      0  1962  4521
loan           1     1.2     0.4      1      2     5212    0      0  3830  4521
contact        1     1.7     0.9      1      3     7470    0      0  2896  4521
day           16    15.9     8.2      1     31    71953    0      0    27  4521
month          7     6.5     3.0      1     12    29568    0      0   293  4521
duration     185   264.0   259.9      4   3025  1193369    0      0     0  4521
campaign       2     2.8     3.1      1     50    12630    0      0  1734  4521
pdays         -1    39.8   100.1     -1    871   179785    0      0     2  4521
previous       0     0.5     1.7      0     25     2453    0   3705   286  4521
poutcome       4     3.6     1.0      1      4    16091    0      0   490  4521
y              1     1.1     0.3      1      2     5042    0      0  4000  4521

Train and test data

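Split the data sample into development (train) and validation (test) samples. The model is fit using the train set, and its predictions are then compared with the actual values. The confusion matrices further down show 2,731 rows in the train set and 1,790 in the test set, roughly a 60/40 split of the 4,521 rows.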
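A simple random split could be sketched as follows. The 60% share, the seed, and the simulated data frame `bank` are assumptions for illustration; the original post does not show how its split was drawn.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible

# Simulated stand-in with the same number of rows as the bank data
n <- 4521
bank <- data.frame(
  id = seq_len(n),
  y  = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.88, 0.12)),
              levels = c("no", "yes"))
)

# Draw 60% of the row indices for the development sample
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train <- bank[train_idx, ]
test  <- bank[-train_idx, ]

# Check that both samples have a similar target distribution (in percent)
round(100 * prop.table(table(train$y)), 1)
round(100 * prop.table(table(test$y)), 1)
```

With a target this imbalanced, comparing the two distributions is a quick sanity check that the split did not leave one sample with too few "yes" cases.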

Model Class: Classification vs Regression

Both the development and validation samples have a similar target variable distribution. This is just a sanity check on the split.

If the target variable is a factor, classification trees are built; if it is numeric, regression trees are built. The type is known before model fitting, but it is good coding practice to verify the class of the response variable.

The class of the target variable is factor, so a classification Random Forest will be built. The data frame holds all the independent variables, so we can build a formula from its column names and pass it as an argument to randomForest.
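A minimal sketch of both steps, checking the class of the response and building the formula from the column names, might look like this. The tiny data frame and the variable names `target`, `predictors` and `rf_formula` are illustrative assumptions.

```r
# Toy data frame with two predictors and a factor response
bank <- data.frame(
  age      = c(30, 45, 52),
  duration = c(100, 250, 80),
  y        = factor(c("no", "yes", "no"))
)

target <- "y"

# A factor response means classification; a numeric one means regression
class(bank[[target]])
model_type <- if (is.factor(bank[[target]])) "classification" else "regression"
model_type

# Build y ~ age + duration from every column except the target
predictors <- setdiff(names(bank), target)
rf_formula <- reformulate(predictors, response = target)
rf_formula
```

The shorthand formula `y ~ .` gives the same model; building the formula explicitly makes it easy to drop or add predictors in later iterations.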

Random Forest Model Fit

The train sample and the formula are used to build the Random Forest model. The number of trees is set to 500. The error rate plotted against the number of trees indicates that there is no meaningful reduction in error after about 100 trees.
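The fit itself could look roughly like the sketch below. It uses simulated data rather than the bank data, so the object names and the seed are assumptions; only `ntree = 500` is taken from the text.

```r
library(randomForest)

set.seed(1)  # hypothetical seed

# Simulated data: x1 carries signal, x2 is noise
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 1, "yes", "no"))

# Fit a classification forest with 500 trees
rf <- randomForest(y ~ ., data = toy, ntree = 500)

# err.rate has one row per tree: out-of-bag (OOB) error and per-class error
dim(rf$err.rate)
tail(rf$err.rate[, "OOB"], 1)

# Error rate against the number of trees; a flat curve means more trees add little
plot(rf)
```

If the curve flattens early, a smaller `ntree` gives nearly the same error with less computation.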

[Figure: Random Forest error rate versus number of trees]

Variable Importance

A variable importance plot is also a useful tool and can be drawn with the varImpPlot function. The top 5 variables are selected and plotted based on mean decrease in model accuracy and mean decrease in Gini.

[Figure: variable importance plot]

Variable    MeanDecreaseGini
duration               153.8
month                   64.2
balance                 51.2
age                     49.8
day                     45.8
job                     40.7
poutcome                30.9
pdays                   24.2
campaign                19.9
education               13.6
marital                 13.2
previous                13.2
contact                  9.6
housing                  7.1
loan                     3.9
default                  1.7

The table above lists the variables in decreasing order of importance. The importance measure is selected by type: 1 for mean decrease in model accuracy and 2 for mean decrease in node impurity (Gini). Variables ranked highly by the Random Forest can then be used as candidates for other predictive modelling or machine learning techniques.
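As a sketch of how such a table can be produced, the snippet below fits a small forest on simulated data and sorts the Gini importance. The data and the names `imp_gini` and `imp_table` are assumptions for illustration; note that the accuracy-based measure is only computed when the forest is fit with `importance = TRUE`.

```r
library(randomForest)

set.seed(1)  # hypothetical seed

# Simulated data: one informative predictor and two noise predictors
n <- 300
toy <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
toy$y <- factor(ifelse(toy$signal + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# importance = TRUE is needed for the permutation (accuracy) measure
rf <- randomForest(y ~ ., data = toy, ntree = 200, importance = TRUE)

imp_acc  <- importance(rf, type = 1)  # mean decrease in accuracy
imp_gini <- importance(rf, type = 2)  # mean decrease in Gini (node impurity)

# Table sorted by decreasing Gini importance
imp_table <- data.frame(
  Variable         = rownames(imp_gini),
  MeanDecreaseGini = round(imp_gini[, "MeanDecreaseGini"], 1),
  row.names        = NULL
)
imp_table <- imp_table[order(-imp_table$MeanDecreaseGini), ]
imp_table

# Plot both measures for the top variables
varImpPlot(rf, n.var = 3)
```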

Now we want to measure the accuracy of the Random Forest model. Which statistics matter most depends on the application; for example, negative predictive value and specificity are very important for pharmaceutical companies seeking to understand test accuracy. The model can be tuned to increase specificity, but increasing specificity will generally decrease sensitivity. Optimal sensitivity and specificity cutoffs should be identified after the first iteration of model fitting.
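The trade-off between sensitivity and specificity can be illustrated by moving the probability cutoff. The sketch below uses ten made-up predicted probabilities and labels, so the vectors and the helper `sens_spec` are hypothetical.

```r
# Hypothetical predicted probabilities of "yes" and the true labels
prob_yes <- c(0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90)
actual   <- factor(c("no", "no", "no", "no", "yes", "no", "yes", "no", "yes", "yes"),
                   levels = c("no", "yes"))

# Sensitivity and specificity when predicting "yes" at or above a cutoff
sens_spec <- function(cutoff) {
  pred <- factor(ifelse(prob_yes >= cutoff, "yes", "no"), levels = c("no", "yes"))
  tab  <- table(pred, actual)
  c(cutoff      = cutoff,
    sensitivity = tab["yes", "yes"] / sum(tab[, "yes"]),
    specificity = tab["no", "no"] / sum(tab[, "no"]))
}

# A higher cutoff raises specificity and lowers sensitivity
tradeoff <- t(sapply(c(0.3, 0.5, 0.7), sens_spec))
tradeoff
```

In this toy example, raising the cutoff from 0.3 to 0.7 moves sensitivity from 1 down to 0.5 while specificity rises from 0.5 to 1.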

Some of the other model performance statistics are:

  • KS
  • Lift Chart
  • ROC Curve
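These statistics can be computed directly from predicted probabilities. The sketch below uses base R only, on ten made-up scores; the vectors and variable names are hypothetical, and the lift is shown for the top three scored cases only.

```r
# Hypothetical predicted probabilities of "yes" and the true labels
prob_yes <- c(0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90)
actual   <- factor(c("no", "no", "no", "no", "yes", "no", "yes", "no", "yes", "yes"),
                   levels = c("no", "yes"))

# ROC points: true and false positive rate at every observed threshold
thr <- sort(unique(prob_yes), decreasing = TRUE)
tpr <- sapply(thr, function(th) mean(prob_yes[actual == "yes"] >= th))
fpr <- sapply(thr, function(th) mean(prob_yes[actual == "no"] >= th))

# KS statistic: largest gap between the two cumulative rates
ks <- max(tpr - fpr)

# Area under the ROC curve via the rank-sum (Mann-Whitney) identity
n_yes <- sum(actual == "yes")
n_no  <- sum(actual == "no")
auc   <- (sum(rank(prob_yes)[actual == "yes"]) - n_yes * (n_yes + 1) / 2) /
  (n_yes * n_no)

# Lift in the top 3 scored cases: response rate there versus the overall rate
top3      <- order(prob_yes, decreasing = TRUE)[1:3]
lift_top3 <- mean(actual[top3] == "yes") / mean(actual == "yes")

c(KS = ks, AUC = auc, lift_top3 = lift_top3)
```

For a telemarketing campaign the lift is often the most practical of the three, since it says how much better than random the call list becomes when customers are ranked by score.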

Predict Response Variable Value using Random Forest

The generic predict function can be used to predict the response variable from a Random Forest object. In the train set the y-variable distribution is 88.7% "no" and 11.3% "yes":

  no  yes 
88.7 11.3 
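A sketch of the prediction step on simulated data is shown below; the data and object names are assumptions. It also shows one detail worth knowing: calling predict with the training rows as newdata scores each row with trees that saw it during fitting, whereas calling predict without newdata returns the out-of-bag predictions.

```r
library(randomForest)

set.seed(1)  # hypothetical seed

# Simulated training data
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 1, "yes", "no"))
rf <- randomForest(y ~ ., data = toy, ntree = 200)

# Predicted class and class probabilities for the training rows
pred_train <- predict(rf, newdata = toy)
prob_train <- predict(rf, newdata = toy, type = "prob")

# Out-of-bag predictions: each row is scored only by trees that did not see it
pred_oob <- predict(rf)

# Target distribution in percent, then the two accuracy figures
round(100 * prop.table(table(toy$y)), 1)
mean(pred_train == toy$y)
mean(pred_oob == toy$y)
```

The gap between the two accuracy figures is a first hint of how optimistic the training-sample numbers are.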

Confusion Matrix: Actuals vs Predicted Response

The confusionMatrix function from the caret package can be used to create a confusion matrix from the actual response variable and the predicted values. Predictions on the data used to fit the model are typically overly optimistic. The test set predictions provide a more reliable measure of accuracy.

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  2423    0
       yes    0  308

               Accuracy : 1          
                 95% CI : (0.9987, 1)
    No Information Rate : 0.8872     
    P-Value [Acc > NIR] : < 2.2e-16  

                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.1128     
         Detection Rate : 0.1128     
   Detection Prevalence : 0.1128     
      Balanced Accuracy : 1.0000     

       'Positive' Class : yes        
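A report like the one above could be produced with a call along these lines. The two label vectors are made up for illustration. By default caret treats the first factor level as the positive class, so `positive = "yes"` is set explicitly to match the report.

```r
library(caret)

# Hypothetical actual and predicted labels
actual    <- factor(c("no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes"),
                    levels = c("no", "yes"))
predicted <- factor(c("no", "no", "no", "no", "no", "yes", "yes", "yes", "no", "no"),
                    levels = c("no", "yes"))

# data = predictions, reference = actuals
cm <- confusionMatrix(data = predicted, reference = actual, positive = "yes")

cm$table
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity")]
```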

Validation Sample

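On the train sample the model has an accuracy of 100%, which looks fantastic but is the overly optimistic figure to expect when a model is scored on the data it was fit to. Now we can predict the response for the validation sample and calculate model accuracy for that sample.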

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  1543  138
       yes   34   75

               Accuracy : 0.9039          
                 95% CI : (0.8893, 0.9172)
    No Information Rate : 0.881           
    P-Value [Acc > NIR] : 0.001201        

                  Kappa : 0.419           
 Mcnemar's Test P-Value : 4.04e-15        

            Sensitivity : 0.35211         
            Specificity : 0.97844         
         Pos Pred Value : 0.68807         
         Neg Pred Value : 0.91791         
             Prevalence : 0.11899         
         Detection Rate : 0.04190         
   Detection Prevalence : 0.06089         
      Balanced Accuracy : 0.66528         

       'Positive' Class : yes             

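Accuracy has dropped to 90.4% on the validation sample, but it is still significantly higher than the no-information rate of 88.1% (p = 0.0012). Sensitivity is only 35.2%, so the model misses most of the customers who actually subscribed.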
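The statistics in the validation report follow directly from the four cells of the confusion matrix. The short calculation below reproduces them from the counts shown above, with "yes" as the positive class; the variable names are my own.

```r
# Cells of the validation confusion matrix ("yes" is the positive class)
tp <- 75    # predicted yes, actual yes
fp <- 34    # predicted yes, actual no
fn <- 138   # predicted no,  actual yes
tn <- 1543  # predicted no,  actual no
n  <- tp + fp + fn + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
prevalence  <- (tp + fn) / n
nir         <- max(tp + fn, tn + fp) / n   # share of the majority class
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        prevalence = prevalence, nir = nir, balanced = balanced), 4)
```

Always predicting "no" would already score 88.1%, which is why accuracy alone says little about a target this imbalanced.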

Discussion

Additional models should be fit to this exact data so that their error on the validation data set can be compared with this benchmark. The variable importance profile is data driven, so I anticipate it will be very different depending on the model class used. Some of the variability in predictor importance should be analyzed in an iterative workflow:

  • Identify multicollinearity between predictors
  • Identify nonlinear relationships between individual X and Y
  • Remove outliers (carefully) and impute missing values

High dimensional data benefits from dimension reduction before model fitting. Data dimension reduction methods:

  • Principal Component Regression
  • Partial Least Squares
  • Lasso
  • Clustering

Additional models to be evaluated:

  • Naive Bayes
  • Neural Network
  • SVM
