Ensemble Modeling: A Stacked Model Approach

Document Updated last:

sysSt <- Sys.time()
sysSt
## [1] "2017-08-15 23:03:38 EDT"

This paper illustrates an ensemble model approach for the Ames House Price data. An ensemble model combines multiple models in an effort to increase the predictive accuracy. This specific ensemble approach illustrates the model stacking method.

The diagram below shows the high-level model stacking architecture, which is composed of two stages. The top-level stage, called Level 0, is composed of three modeling algorithms:

  1. Gradient Boosting (gbm)
  2. Extreme Gradient Boosting (xgb)
  3. Random Forest (rngr)

The second stage, Level 1, is composed of a single Neural Network (nnet) model. The predictions from Level 1 are used to create the final data set.

Feature Sets in the diagram represents one or more sets of attributes created for the Level 0 models.

plot of chunk model_workflow

The following chart shows the model performance of the individual Level 0 models versus the overall stacked model.

##   Model PLB_Score   Level
## 1   gbm   0.12852 Level 0
## 2   xgb   0.13696 Level 0
## 3  rngr   0.14703 Level 0
## 4  nnet   0.12678 Level 1

The percent reduction for the Level 1 model compared to the best performing Level 0 model:

## [1] 1.353875

plot of chunk percent_reduction

From the above chart we see that the Level 1 model is a 1.35% improvement over the best performing Level 0 model.

To fit within the constraints of an intuitive ensemble model approach, a simple stacked model was designed. A performance gain for Level 1 predictive power supports the simple design efficacy. The wisdom of Occum’s Razor, i.e. parsimony, cannot be overstated when designing models. A simple model design supports efficient model optimization. The specific simplications are

  • Limit Level 0 to three models
  • Limit Level 1 to one model

Improvements in stacked model performance can be accomplished by

  • Adding models to Level 0 and Level 1 using different algorithms
  • Tuning model Hyper-parameters
  • Adding feature sets by feature engineering
  • Adding levels in the model structure

The remainder of this paper demonstrates the model stacking training pipeline. First we show an approach for creating model feature sets. Next we demonstrate an approach for training Level 0 models.

Then we end with creating features for the Level 1 model and creating the final data set.

Data Preparation

Retrieve Data

Initial Data Profile

Feature selection is based on a wrapper algorithm for feature importance analysis. The full set of variables includes SalePrice (y-variable) and 79 predictors (x-variables). Most variables have informative descriptive names.

##  [1] "MSSubClass"    "MSZoning"      "LotArea"       "LotShape"     
##  [5] "LandContour"   "Neighborhood"  "BldgType"      "HouseStyle"   
##  [9] "OverallQual"   "OverallCond"   "YearBuilt"     "YearRemodAdd" 
## [13] "Exterior1st"   "Exterior2nd"   "MasVnrArea"    "ExterQual"    
## [17] "Foundation"    "BsmtQual"      "BsmtCond"      "BsmtFinType1" 
## [21] "BsmtFinSF1"    "BsmtFinType2"  "BsmtUnfSF"     "TotalBsmtSF"  
## [25] "HeatingQC"     "CentralAir"    "X1stFlrSF"     "X2ndFlrSF"    
## [29] "GrLivArea"     "BsmtFullBath"  "FullBath"      "HalfBath"     
## [33] "BedroomAbvGr"  "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd" 
## [37] "Functional"    "Fireplaces"    "FireplaceQu"   "GarageType"   
## [41] "GarageYrBlt"   "GarageFinish"  "GarageCars"    "GarageArea"   
## [45] "GarageQual"    "GarageCond"    "PavedDrive"    "WoodDeckSF"   
## [49] "OpenPorchSF"   "Fence"         "Alley"         "LandSlope"    
## [53] "Condition1"    "RoofStyle"     "MasVnrType"    "BsmtExposure" 
## [57] "Electrical"    "EnclosedPorch" "SaleCondition" "LotFrontage"  
## [61] "Street"        "Utilities"     "LotConfig"     "Condition2"   
## [65] "RoofMatl"      "ExterCond"     "BsmtFinSF2"    "Heating"      
## [69] "LowQualFinSF"  "BsmtHalfBath"  "X3SsnPorch"    "ScreenPorch"  
## [73] "PoolArea"      "PoolQC"        "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"

Create Level 0 Model Feature Sets

Machine learning algorithms perform better when uninformative predictors are removed. For this work, two feature sets were created. Both of these sets included Boruta (wrapper algorithm) confirmed and tentative attributes.

Each feature set is created by a specific user-defined R function. These functions
convert the raw training data into a feature set.

No extensive feature engineering was performed. Missing values are handled as
follows:

  • Numeric: set to -1
  • Character: set to “*MISSING*”

Character attributes are converted to R factor variables.

Level 0 Model Training

Helper Function For Training

Data is split up into train and test, using an 80/20 split.

gbm Model

Generalized Boosted Regression Model
Score: 0.12852

## [1] 0.1309466
## Average CV rmse: 0.1304117

xgboost Model

Extreme Gradient Boosting

Score: 0.13696

## [1] 11.53097
## Average CV rmse: 11.53097

ranger Model

Random Forest

Score: 0.14703

## [1] 0.138126
## Average CV rmse: 0.1369463

Level 1 Model Training

Create predictions For Level 1 Model

Neural Net Model

Score: 0.12678

## Average CV rmse: 0.1270413

References

Additional Resources

For additional information on model stacking see these references:

Document Utilities

## [1] "2017-08-15 23:01:13 EDT"
## Time difference of 1.164407 mins
## [1] "/media/disc/Megasync/R_wordpress/ensemble_stacked_model"
## [1] "cache"                       "ensemble_stacked_wp_cache"  
## [3] "ensemble_stacked_wp_files"   "ensemble_stacked_wp.html"   
## [5] "ensemble_stacked_wp.nb.html" "ensemble_stacked_wp.Rmd"    
## [7] "figure"
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  splines   stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] e1071_1.6-8      gbm_2.1.3        survival_2.41-3  bindrcpp_0.1    
##  [5] Metrics_0.1.2    nnet_7.3-12      ranger_0.7.0     xgboost_0.6-4   
##  [9] dplyr_0.5.0.9004 plyr_1.8.4       caret_6.0-76     ggplot2_2.2.1   
## [13] lattice_0.20-35  markdown_0.7.7   knitr_1.15.1     RWordPress_0.2-3
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.2     colorspace_1.3-2   stats4_3.3.3      
##  [4] mgcv_1.8-17        XML_3.98-1.9       rlang_0.0.0.9018  
##  [7] ModelMetrics_1.1.0 nloptr_1.0.4       glue_1.0.0        
## [10] foreach_1.4.3      bindr_0.1          stringr_1.2.0     
## [13] MatrixModels_0.4-1 munsell_0.4.3      gtable_0.2.0      
## [16] codetools_0.2-15   evaluate_0.10      labeling_0.3      
## [19] SparseM_1.76       class_7.3-14       quantreg_5.29     
## [22] pbkrtest_0.4-7     XMLRPC_0.3-0       highr_0.6         
## [25] Rcpp_0.12.10       scales_0.4.1       lme4_1.1-12       
## [28] digest_0.6.12      stringi_1.1.3      grid_3.3.3        
## [31] tools_3.3.3        bitops_1.0-6       magrittr_1.5      
## [34] lazyeval_0.2.0     RCurl_1.95-4.8     tibble_1.3.0      
## [37] car_2.1-4          MASS_7.3-45        Matrix_1.2-8      
## [40] data.table_1.10.4  assertthat_0.2.0   minqa_1.2.4       
## [43] iterators_1.0.8    R6_2.2.0           compiler_3.3.3    
## [46] nlme_3.1-131

Leave a comment