- Bank Marketing: Predicting Telemarketing Success
- Data Description
- Predict Response Variable Value using Random Forest
Bank Marketing: Predicting Telemarketing Success
A random forest classification model is used as a data-driven method for predicting marketing success.
Document updated on: 2016-04-25 23:40:15 EST
The data relate to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls, and the goal of the model is to predict whether a customer will subscribe to an account. The full article is available as: A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.
Random Forest Classification Model
Some of the key advantages to using a random forest model include:
- reduces probability of overfitting
- higher model performance accuracy
The input dataset has 20 independent variables and a target variable. The target variable y is binary, with 88% "no" and 12% "yes". Before model fitting, all character variables are coded as factors, and the factor levels need additional attention: by default, R assigns factor level numbers alphabetically. This model will produce the benchmark for analyzing additional models, and the best model is kept after each iteration of model fitting.
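A minimal sketch of the factor coding described above (the column names here are stand-ins, not the actual dataset columns):

```r
# Stand-in data frame with one character predictor and the character target
df <- data.frame(job = c("admin.", "technician", "admin."),
                 y   = c("no", "yes", "no"),
                 stringsAsFactors = FALSE)

# Convert every character column to a factor before model fitting
df[] <- lapply(df, function(col) if (is.character(col)) factor(col) else col)

# R orders levels alphabetically ("no" before "yes"); relevel() makes the
# baseline level explicit instead of relying on that default rule
df$y <- relevel(df$y, ref = "no")
levels(df$y)
```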
All variables are summarized as numeric because it simplifies the code. In practice the five-number summary is meaningless for categorical variables; however, the N, Min, and Max columns of the table still provide useful data-quality information for them.
Train and test data
Split the data sample into development (training) and validation samples. The model is fit on the training set, and predictions are then quantified against the actuals.
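A sketch of the split (the 70/30 proportion, seed, and the stand-in `bank` data frame are illustrative assumptions, not values from the original analysis):

```r
set.seed(123)                                     # illustrative seed
# Stand-in data frame; replace with the prepared bank-marketing data
bank <- data.frame(x = rnorm(100),
                   y = factor(sample(c("no", "yes"), 100, replace = TRUE)))

idx   <- sample.int(nrow(bank), size = floor(0.7 * nrow(bank)))
train <- bank[idx, ]                              # development sample
test  <- bank[-idx, ]                             # validation sample

# Check that the target distribution is similar in both samples
prop.table(table(train$y))
prop.table(table(test$y))
```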
Model Class: Classification vs Regression
Both the development and validation samples have a similar target-variable distribution; this is a simple validation of the split.
If the target variable is a factor, a classification tree is built. This is known before model fitting, but it is good coding practice to verify the type of the response variable. Here the class of the response variable is factor, so a classification random forest will be built. The data frame already contains the independent variables, so we can construct a formula from their names and pass it as a parameter to randomForest.
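A sketch of the type check and formula construction (the `train` data frame and its columns are stand-ins for the actual data):

```r
# Stand-in training sample
train <- data.frame(age = c(30, 40),
                    job = factor(c("admin.", "technician")),
                    y   = factor(c("no", "yes")))

# A factor response means randomForest will fit a classification forest
stopifnot(is.factor(train$y))

# Build the formula from the remaining column names
predictors <- setdiff(names(train), "y")
rf_formula <- reformulate(predictors, response = "y")
rf_formula   # y ~ age + job
```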
Random Forest Model Fit
The sample data and the formula are used to build the Random Forest model, with the number of trees set to 500. The error rate across trees indicates that beyond roughly 100 trees there is no significant further reduction in error.
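A hedged sketch of the fit (the `rf_formula` and `train` names carry over from the steps above and the seed is illustrative; this assumes the randomForest package is installed):

```r
library(randomForest)

set.seed(42)                                    # illustrative seed
rf_fit <- randomForest(rf_formula, data = train,
                       ntree = 500,             # 500 decision trees
                       importance = TRUE)       # keep importance measures

plot(rf_fit)   # OOB error curves flatten well before 500 trees
```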
A variable importance plot is another useful tool and can be produced with the varImpPlot function; here the top 5 variables are selected and plotted by model accuracy and by Gini value.
We can also obtain a table in decreasing order of importance by either measure (type 1 for mean decrease in model accuracy, type 2 for mean decrease in node impurity). Based on Random Forest variable importance, variables can be selected for other predictive modelling or machine learning techniques.
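As a sketch (assuming an `rf_fit` object trained with `importance = TRUE` as above):

```r
# Top 5 predictors by mean decrease in accuracy and in Gini impurity
varImpPlot(rf_fit, n.var = 5)

# Importance table, sorted in decreasing order; type = 1 is mean decrease
# in accuracy, type = 2 would rank by node impurity (Gini) instead
imp <- importance(rf_fit, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```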
Now we want to measure the accuracy of the Random Forest model. The negative predictive value and specificity are very important for pharmaceutical companies seeking to understand test accuracy, and the model can be optimized to increase specificity. Increasing specificity, however, typically decreases sensitivity. Optimal sensitivity and specificity cutoffs should be identified after the first iteration of model fitting.
Some other model performance statistics are:
- Lift Chart
- ROC Curve
Predict Response Variable Value using Random Forest
The generic predict function can be used to predict the response variable from a Random Forest object. The train-set y-variable distribution is 88% "no" and 12% "yes".
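A minimal sketch of the scoring step (the `rf_fit` and `train` object names are assumptions carried over from the fitting step):

```r
# Score the training sample with the generic predict() method and
# tabulate the predicted class distribution as percentages
pred_train <- predict(rf_fit, newdata = train)
round(100 * prop.table(table(pred_train)), 1)
```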
```
   no   yes
 88.7  11.3
```
Confusion Matrix: Actuals vs Predicted Response
The confusionMatrix function from the caret package can be used to create a confusion matrix from the actual response variable and the predicted values. Predictions on the fitted (training) data are typically overly optimistic; the test-set predictions provide a more reliable measure of accuracy.
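A hedged sketch of the call (the `pred_train` and `train` names are assumptions from the earlier steps; `positive = "yes"` matches the positive class used in the report):

```r
library(caret)   # provides confusionMatrix()

# Compare predicted classes against the actual response
confusionMatrix(data = pred_train, reference = train$y, positive = "yes")
```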
```
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  2423    0
       yes    0  308

               Accuracy : 1
                 95% CI : (0.9987, 1)
    No Information Rate : 0.8872
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 1
 Mcnemar's Test P-Value : NA

            Sensitivity : 1.0000
            Specificity : 1.0000
         Pos Pred Value : 1.0000
         Neg Pred Value : 1.0000
             Prevalence : 0.1128
         Detection Rate : 0.1128
   Detection Prevalence : 0.1128
      Balanced Accuracy : 1.0000

       'Positive' Class : yes
```
The training accuracy is 100% (95% CI lower bound 99.87%), which is unrealistically good: these are predictions on the same data the model was fit to. Now we can predict the response for the validation sample and calculate model accuracy for that sample.
```
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  1543  138
       yes   34   75

               Accuracy : 0.9039
                 95% CI : (0.8893, 0.9172)
    No Information Rate : 0.881
    P-Value [Acc > NIR] : 0.001201

                  Kappa : 0.419
 Mcnemar's Test P-Value : 4.04e-15

            Sensitivity : 0.35211
            Specificity : 0.97844
         Pos Pred Value : 0.68807
         Neg Pred Value : 0.91791
             Prevalence : 0.11899
         Detection Rate : 0.04190
   Detection Prevalence : 0.06089
      Balanced Accuracy : 0.66528

       'Positive' Class : yes
```
Accuracy drops to 90.4% on the validation sample, but it remains significantly higher than the no-information rate of 88.1% (p ≈ 0.0012). Note, however, that sensitivity is only 35.2%, so most subscribers are missed.
Additional models should be fit to this exact dataset to compare error on the validation set. The variable importance profile is a data-driven result; anticipate that it will differ considerably with respect to the model class used. Some of the predictor-importance variability should be analyzed in an iterative workflow:
- Identify multicollinearity between predictors
- Identify nonlinear relationships between individual X and Y
- Remove outliers (carefully) and impute missing values
High-dimensional data benefit from dimension reduction before model fitting. Data dimension reduction methods include:
- Principal Component Regression
- Partial Least Squares
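As a small base-R illustration of the idea behind these methods, PCA extracts the leading components of the predictor space, which a principal component regression would then use in place of the raw predictors (the `mtcars` columns here are stand-ins, not the bank-marketing variables):

```r
# Scale a few numeric predictors, then extract principal components
num_x <- scale(mtcars[, c("disp", "hp", "wt", "qsec")])
pca   <- prcomp(num_x)

# Cumulative variance explained per component; a regression would keep
# only the first few components instead of all original predictors
summary(pca)$importance["Cumulative Proportion", ]
```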
Additional models to be evaluated:
- Naive Bayes
- Neural Network