Ensemble Methods - Bagging, Random Forests, Boosting

Overview

Ensemble methods combine the predictions of many models. For example, in Bagging (short for bootstrap aggregation), models are fit in parallel to m bootstrapped samples of the training data (e.g., m = 50), and the predictions from the m models are then averaged to obtain the prediction of the ensemble.
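The averaging idea can be sketched in a few lines of base R. This toy example uses the built-in mtcars data and a simple linear model as the base learner, purely for illustration:

```r
#Toy sketch of bagging for a numeric outcome: fit m models on
#bootstrap resamples and average their predictions
set.seed(1)
n <- nrow(mtcars)
m <- 50
preds <- sapply(1:m, function(i) {
  boot <- mtcars[sample(n, n, replace = TRUE), ]  #bootstrap resample
  fit  <- lm(mpg ~ wt + hp, data = boot)          #one base model
  predict(fit, newdata = mtcars)                  #predict on the full data
})
bagged <- rowMeans(preds)  #ensemble prediction = average over the m models
```

For classification, the same idea applies, with the average replaced by a majority vote across the m models.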

In this tutorial we walk through the basics of three ensemble methods: Bagging, Random Forests, and Boosting.

Outline

In this session we cover …

  1. Introduction to Data
  2. Splitting Data into Training and Test sets
  3. Model 0: A Single Classification Tree
  4. Model 1: Bagging of ctrees
  5. Model 2: Random Forest for classification trees
  6. Model 2a: CForest for Conditional Inference Tree
  7. Model 3: Boosting (Gradient Boosting)
  8. Model Stacking (Not included yet)
  9. Model Comparison
  10. Conclusion

Loading Libraries Used In This Script

library(ISLR)          #the Carseat Data
library(psych)         #data descriptives
library(caret)         #training and cross validation, other model libraries
library(rpart)         #trees
library(rattle)        #fancy tree plot 
library(rpart.plot)    #enhanced tree plots
library(RColorBrewer)  #color palettes
library(party)         #alternative decision tree algorithm
library(partykit)      #convert rpart object to BinaryTree
library(randomForest)  #random forest
library(pROC)          #ROC curves
library(gbm)           #gradient boosting
library(ggplot2)       #data visualization
library(dplyr)         #data manipulation

1. Introduction to the Data

Reading in the Carseats Data Set

This is a simulated data set containing sales of child car seats at 400 different stores. Sales can be predicted by 10 other variables.

#loading the data
data("Carseats")

Data Descriptives

Let's have a quick look at the data file and the descriptives.

#data structure
head(Carseats,10)
##    Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1   9.50       138     73          11        276   120       Bad  42        17
## 2  11.22       111     48          16        260    83      Good  65        10
## 3  10.06       113     35          10        269    80    Medium  59        12
## 4   7.40       117    100           4        466    97    Medium  55        14
## 5   4.15       141     64           3        340   128       Bad  38        13
## 6  10.81       124    113          13        501    72       Bad  78        16
## 7   6.63       115    105           0         45   108    Medium  71        15
## 8  11.85       136     81          15        425   120      Good  67        10
## 9   6.54       132    110           0        108   124    Medium  76        10
## 10  4.69       132    113           0        131   124    Medium  76        17
##    Urban  US
## 1    Yes Yes
## 2    Yes Yes
## 3    Yes Yes
## 4    Yes Yes
## 5    Yes  No
## 6     No Yes
## 7    Yes  No
## 8    Yes Yes
## 9     No  No
## 10    No Yes

Our outcome of interest will be a binary version of Sales: Unit sales (in thousands) at each location.

(Note that there is no id variable. This is convenient for some tasks.)

#sample descriptives
describe(Carseats)
##             vars   n   mean     sd median trimmed    mad min    max  range
## Sales          1 400   7.50   2.82   7.49    7.43   2.87   0  16.27  16.27
## CompPrice      2 400 124.97  15.33 125.00  125.04  14.83  77 175.00  98.00
## Income         3 400  68.66  27.99  69.00   68.26  35.58  21 120.00  99.00
## Advertising    4 400   6.64   6.65   5.00    5.89   7.41   0  29.00  29.00
## Population     5 400 264.84 147.38 272.00  265.56 191.26  10 509.00 499.00
## Price          6 400 115.79  23.68 117.00  115.92  22.24  24 191.00 167.00
## ShelveLoc*     7 400   2.31   0.83   3.00    2.38   0.00   1   3.00   2.00
## Age            8 400  53.32  16.20  54.50   53.48  20.02  25  80.00  55.00
## Education      9 400  13.90   2.62  14.00   13.88   2.97  10  18.00   8.00
## Urban*        10 400   1.71   0.46   2.00    1.76   0.00   1   2.00   1.00
## US*           11 400   1.65   0.48   2.00    1.68   0.00   1   2.00   1.00
##              skew kurtosis   se
## Sales        0.18    -0.11 0.14
## CompPrice   -0.04     0.01 0.77
## Income       0.05    -1.10 1.40
## Advertising  0.63    -0.57 0.33
## Population  -0.05    -1.21 7.37
## Price       -0.12     0.41 1.18
## ShelveLoc*  -0.62    -1.28 0.04
## Age         -0.08    -1.14 0.81
## Education    0.04    -1.31 0.13
## Urban*      -0.90    -1.20 0.02
## US*         -0.60    -1.64 0.02
#histogram of outcome
Carseats %>%
  ggplot(aes(x=Sales)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") + 
  geom_vline(xintercept = 8, color="red", linewidth=2) +
  labs(x = "Sales")

For didactic convenience we create a new binary variable HighSales: “No” if Sales <= 8, and “Yes” otherwise.

Carseats <- Carseats %>%
  #creating new binary variable
  mutate(HighSales=ifelse(Sales<=8, "No", "Yes"),
         #remove old variable
         Sales = NULL,
         #convert a factor variable into a numeric variable 
         ShelveLoc = as.numeric(ShelveLoc))

2. Splitting the Data Into Training and Test Sets

We split the data: half for Training, half for Testing.

#set seed so the random split is reproducible
set.seed(1234)

#randomly sample half of the rows
halfsample <- sample(dim(Carseats)[1], dim(Carseats)[1]/2)

#create training and test data sets
Carseats.train = Carseats[halfsample, ]
Carseats.test = Carseats[-halfsample, ]

We will use these to evaluate a variety of different classification algorithms: Random Forests, CForests, etc.
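As an aside, caret's createDataPartition() offers a stratified alternative to the simple random split above: it preserves the class balance of HighSales in both halves. A sketch (the .train2/.test2 names are illustrative; this split is not used below):

```r
#stratified half/half split on the binary outcome
set.seed(1234)
inTrain <- createDataPartition(Carseats$HighSales, p = 0.5, list = FALSE)
Carseats.train2 <- Carseats[inTrain, ]
Carseats.test2  <- Carseats[-inTrain, ]
```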

Setting Up k = 10 Cross-Validation Folds

First, we set up the cross validation control.

#Setting the random seed for replication
set.seed(1234)

#setting up cross-validation
cvcontrol <- trainControl(method="repeatedcv", 
                          number = 10,
                          allowParallel=TRUE)
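
If desired, trainControl() can also repeat the 10 folds several times for more stable estimates, and classProbs = TRUE is needed if probability-based metrics (e.g., ROC) are used during tuning. A sketch, not used in the rest of this script:

```r
#10-fold CV repeated 5 times, retaining class probabilities
cvcontrol.rep <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 5,
                              classProbs = TRUE,
                              allowParallel = TRUE)
```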

3. Model 0: A Single Classification Tree

We first optimize the fit of a single classification tree. Our objective with the cross-validation is to optimize the size of the tree by tuning mincriterion, the significance threshold that a split must meet.

train.tree <- train(as.factor(HighSales) ~ ., 
                    data=Carseats.train,
                    method="ctree",
                    trControl=cvcontrol,
                    tuneLength = 10)
train.tree
## Conditional Inference Tree 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  Accuracy  Kappa    
##   0.0100000     0.690     0.3509902
##   0.1188889     0.690     0.3509902
##   0.2277778     0.690     0.3509902
##   0.3366667     0.685     0.3430716
##   0.4455556     0.685     0.3430716
##   0.5544444     0.665     0.3062027
##   0.6633333     0.690     0.3544755
##   0.7722222     0.675     0.3169755
##   0.8811111     0.680     0.3406330
##   0.9900000     0.690     0.3768388
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.99.
plot(train.tree)

We see that accuracy is maximized at mincriterion = 0.99, which yields a relatively less complex tree.

Look at the final tree:

# plot tree
plot(train.tree$finalModel,
     main="Classification Tree for Carseat High Sales")

To evaluate the accuracy of the tree we can look at the confusion matrix for the Training data.

#obtaining class predictions
tree.classTrain <-  predict(train.tree, type="raw")
head(tree.classTrain)
## [1] No No No No No No
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.train$HighSales), tree.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  103  17
##        Yes  28  52
##                                           
##                Accuracy : 0.775           
##                  95% CI : (0.7108, 0.8309)
##     No Information Rate : 0.655           
##     P-Value [Acc > NIR] : 0.0001539       
##                                           
##                   Kappa : 0.5203          
##                                           
##  Mcnemar's Test P-Value : 0.1360371       
##                                           
##             Sensitivity : 0.7863          
##             Specificity : 0.7536          
##          Pos Pred Value : 0.8583          
##          Neg Pred Value : 0.6500          
##              Prevalence : 0.6550          
##          Detection Rate : 0.5150          
##    Detection Prevalence : 0.6000          
##       Balanced Accuracy : 0.7699          
##                                           
##        'Positive' Class : No              
## 

Some errors, but the model clearly learned structure in the Training data.

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
tree.classTest <-  predict(train.tree, 
                           newdata = Carseats.test,
                           type="raw")
head(tree.classTest)
## [1] Yes No  No  No  Yes No 
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.test$HighSales), tree.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  87  29
##        Yes 35  49
##                                          
##                Accuracy : 0.68           
##                  95% CI : (0.6105, 0.744)
##     No Information Rate : 0.61           
##     P-Value [Acc > NIR] : 0.02412        
##                                          
##                   Kappa : 0.3367         
##                                          
##  Mcnemar's Test P-Value : 0.53197        
##                                          
##             Sensitivity : 0.7131         
##             Specificity : 0.6282         
##          Pos Pred Value : 0.7500         
##          Neg Pred Value : 0.5833         
##              Prevalence : 0.6100         
##          Detection Rate : 0.4350         
##    Detection Prevalence : 0.5800         
##       Balanced Accuracy : 0.6707         
##                                          
##        'Positive' Class : No             
## 

Accuracy of 0.68 on the Test data.

When evaluating classification models, a few other functions may be useful. For example, caret's confusionMatrix() reports the associated measures of sensitivity and specificity, and the pROC package provides convenient functions for obtaining and plotting ROC curves. We can look at the ROC curve by extracting the predicted probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
tree.probs=predict(train.tree,
                   newdata=Carseats.test,
                   type="prob")
head(tree.probs)
##          No       Yes
## 1 0.2156863 0.7843137
## 2 0.8269231 0.1730769
## 3 0.8269231 0.1730769
## 4 0.8269231 0.1730769
## 5 0.2156863 0.7843137
## 6 0.8269231 0.1730769
#Calculate ROC curve
rocCurve.tree <- roc(Carseats.test$HighSales,tree.probs[,"Yes"])
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
#plot the ROC curve
plot(rocCurve.tree,col=c(4))

#calculate the area under curve (bigger is better)
auc(rocCurve.tree)
## Area under the curve: 0.6823

4. Model 1: Bagging of ctrees

Training the model using treebag.

With bagging, many trees are grown on bootstrapped samples of the Training data and their predictions are aggregated by majority vote. The treebag method has no tuning parameters, so the cross-validation here simply provides an honest estimate of out-of-sample accuracy.

#Using treebag 
train.bagg <- train(as.factor(HighSales) ~ .,
                    data=Carseats.train,
                    method="treebag",
                    trControl=cvcontrol,
                    importance=TRUE)

train.bagg
## Bagged CART 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.78      0.5303142
plot(varImp(train.bagg))

Parsing more details from the output, such as the collection of individual trees, requires digging into the finalModel object.
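One possible route, assuming the usual ipred classbagg structure of the finalModel (a sketch; the slot names may differ across package versions):

```r
#number of bagged trees in the ensemble
length(train.bagg$finalModel$mtrees)

#the first individual tree (an rpart object inside the $btree slot)
first.tree <- train.bagg$finalModel$mtrees[[1]]$btree
print(first.tree)
```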

To evaluate the accuracy of the Bagged Trees we can look at the confusion matrix for the Training data.

#obtaining class predictions
bagg.classTrain <-  predict(train.bagg, type="raw")
head(bagg.classTrain)
## [1] No  No  Yes No  No  No 
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.train$HighSales), bagg.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  120   0
##        Yes   0  80
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9817, 1)
##     No Information Rate : 0.6        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.6        
##          Detection Rate : 0.6        
##    Detection Prevalence : 0.6        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : No         
## 

The accuracy on the Training data is perfect! This is expected: an ensemble of bagged trees can reproduce the data it was grown on almost exactly, so the Test data provides the more honest evaluation.

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
bagg.classTest <-  predict(train.bagg, 
                           newdata = Carseats.test,
                           type="raw")
head(bagg.classTest)
## [1] Yes Yes No  No  Yes No 
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.test$HighSales), bagg.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  89  27
##        Yes 31  53
##                                           
##                Accuracy : 0.71            
##                  95% CI : (0.6418, 0.7718)
##     No Information Rate : 0.6             
##     P-Value [Acc > NIR] : 0.0007913       
##                                           
##                   Kappa : 0.4008          
##                                           
##  Mcnemar's Test P-Value : 0.6936406       
##                                           
##             Sensitivity : 0.7417          
##             Specificity : 0.6625          
##          Pos Pred Value : 0.7672          
##          Neg Pred Value : 0.6310          
##              Prevalence : 0.6000          
##          Detection Rate : 0.4450          
##    Detection Prevalence : 0.5800          
##       Balanced Accuracy : 0.7021          
##                                           
##        'Positive' Class : No              
## 

Accuracy of 0.71 on the Test data, an improvement over the single tree (0.68).

We can also look at the ROC curve by extracting probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
bagg.probs=predict(train.bagg,
                   newdata=Carseats.test,
                   type="prob")
head(bagg.probs)
##     No  Yes
## 1 0.20 0.80
## 2 0.48 0.52
## 3 0.76 0.24
## 4 0.88 0.12
## 5 0.00 1.00
## 6 0.92 0.08
#Calculate ROC curve
rocCurve.bagg <- roc(Carseats.test$HighSales,bagg.probs[,"Yes"])
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
#plot the ROC curve
plot(rocCurve.bagg,col=c(6))

#calculate the area under curve (bigger is better)
auc(rocCurve.bagg)
## Area under the curve: 0.7972

5. Model 2: Random Forest for Classification Trees

Training the model using random forest.

train.rf <- train(as.factor(HighSales) ~ ., 
                  data=Carseats.train,
                  method="rf",
                  trControl=cvcontrol,
                  #tuneLength = 3,
                  importance=TRUE)
train.rf
## Random Forest 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa    
##    2    0.810     0.5817081
##    6    0.770     0.5061877
##   10    0.785     0.5381540
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
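
Because importance=TRUE was requested, variable importance for the random forest can be plotted just as it was for bagging:

```r
#variable importance for the random forest
plot(varImp(train.rf))
```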

We can look at the confusion matrix for the Training data.

#obtaining class predictions
rf.classTrain <-  predict(train.rf, type="raw")
head(rf.classTrain)
## [1] No  No  Yes No  No  No 
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.train$HighSales), rf.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  120   0
##        Yes   0  80
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9817, 1)
##     No Information Rate : 0.6        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.6        
##          Detection Rate : 0.6        
##    Detection Prevalence : 0.6        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : No         
## 

No errors on the Training data. As with bagging, a random forest can reproduce its own training data nearly perfectly, so the Test data provides the more informative evaluation.

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
rf.classTest <-  predict(train.rf, 
                         newdata = Carseats.test,
                         type="raw")
head(rf.classTest)
## [1] Yes No  No  No  Yes No 
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.test$HighSales), rf.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  106  10
##        Yes  37  47
##                                        
##                Accuracy : 0.765        
##                  95% CI : (0.7, 0.8219)
##     No Information Rate : 0.715        
##     P-Value [Acc > NIR] : 0.0663581    
##                                        
##                   Kappa : 0.4953       
##                                        
##  Mcnemar's Test P-Value : 0.0001491    
##                                        
##             Sensitivity : 0.7413       
##             Specificity : 0.8246       
##          Pos Pred Value : 0.9138       
##          Neg Pred Value : 0.5595       
##              Prevalence : 0.7150       
##          Detection Rate : 0.5300       
##    Detection Prevalence : 0.5800       
##       Balanced Accuracy : 0.7829       
##                                        
##        'Positive' Class : No           
## 

Accuracy of 0.765 on the Test data. An improvement over bagging (0.71).

We can also look at the ROC curve by extracting probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
rf.probs=predict(train.rf,
                 newdata=Carseats.test,
                 type="prob")
head(rf.probs)
##      No   Yes
## 3 0.334 0.666
## 4 0.672 0.328
## 5 0.730 0.270
## 7 0.918 0.082
## 8 0.250 0.750
## 9 0.814 0.186
#Calculate ROC curve
rocCurve.rf <- roc(Carseats.test$HighSales,rf.probs[,"Yes"])
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
#plot the ROC curve
plot(rocCurve.rf,col=c(1))

#calculate the area under curve (bigger is better)
auc(rocCurve.rf)
## Area under the curve: 0.8449

6. Model 2a: CForest for Conditional Inference Tree

An implementation of the random forest and bagging ensemble algorithms that uses conditional inference trees as base learners (from the party package).

train.cf <- train(HighSales ~ .,   #cforest knows the outcome is binary (unlike rf)
                  data=Carseats.train,
                  method="cforest",
                  trControl=cvcontrol)  #Note that importance not available here 
train.cf
## Conditional Inference Random Forest 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa    
##    2    0.795     0.5468233
##    6    0.775     0.5205785
##   10    0.780     0.5333754
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
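
Although caret's varImp() is not wired up for cforest here, the party package provides varimp() for the underlying forest object. A sketch, assuming train.cf$finalModel is a party RandomForest object (this computation can be slow):

```r
#conditional-inference forest variable importance (may take a while)
party::varimp(train.cf$finalModel)
```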

We can look at the confusion matrix for the Training data.

#obtaining class predictions
cf.classTrain <-  predict(train.cf, 
                          type="raw")
head(cf.classTrain)
## [1] No No No No No No
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.train$HighSales), cf.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  116   4
##        Yes  29  51
##                                           
##                Accuracy : 0.835           
##                  95% CI : (0.7762, 0.8836)
##     No Information Rate : 0.725           
##     P-Value [Acc > NIR] : 0.0001807       
##                                           
##                   Kappa : 0.6374          
##                                           
##  Mcnemar's Test P-Value : 2.943e-05       
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.9273          
##          Pos Pred Value : 0.9667          
##          Neg Pred Value : 0.6375          
##              Prevalence : 0.7250          
##          Detection Rate : 0.5800          
##    Detection Prevalence : 0.6000          
##       Balanced Accuracy : 0.8636          
##                                           
##        'Positive' Class : No              
## 

A few errors on the Training data; the model learned reasonably well.

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
cf.classTest <-  predict(train.cf,
                         newdata = Carseats.test,
                         type="raw")
head(cf.classTest)
## [1] Yes No  No  No  Yes No 
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.test$HighSales), cf.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  106  10
##        Yes  52  32
##                                           
##                Accuracy : 0.69            
##                  95% CI : (0.6209, 0.7533)
##     No Information Rate : 0.79            
##     P-Value [Acc > NIR] : 0.9997          
##                                           
##                   Kappa : 0.3166          
##                                           
##  Mcnemar's Test P-Value : 1.919e-07       
##                                           
##             Sensitivity : 0.6709          
##             Specificity : 0.7619          
##          Pos Pred Value : 0.9138          
##          Neg Pred Value : 0.3810          
##              Prevalence : 0.7900          
##          Detection Rate : 0.5300          
##    Detection Prevalence : 0.5800          
##       Balanced Accuracy : 0.7164          
##                                           
##        'Positive' Class : No              
## 

Accuracy of 0.69 on the Test data, somewhat worse than the standard random forest (0.765) here.

We can also look at the ROC curve by extracting probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
cf.probs=predict(train.cf,
                 newdata=Carseats.test,
                 type="prob")
head(cf.probs)
##          No       Yes
## 1 0.4666175 0.5333825
## 2 0.6406987 0.3593013
## 3 0.7342379 0.2657621
## 4 0.7882546 0.2117454
## 5 0.4409320 0.5590680
## 6 0.7540845 0.2459155
#Calculate ROC curve
rocCurve.cf <- roc(Carseats.test$HighSales,cf.probs[,"Yes"])
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
#plot the ROC curve
plot(rocCurve.cf,col=c(2))

#calculate the area under curve (bigger is better)
auc(rocCurve.cf)
## Area under the curve: 0.7787

7. Model 3: Boosting (Gradient Boosting)

Boosting is possible through a variety of packages, all accessible through caret: gbm, ada, and xgbLinear, among others. We can look up the tuning parameters for each.

modelLookup("ada")
##   model parameter          label forReg forClass probModel
## 1   ada      iter         #Trees  FALSE     TRUE      TRUE
## 2   ada  maxdepth Max Tree Depth  FALSE     TRUE      TRUE
## 3   ada        nu  Learning Rate  FALSE     TRUE      TRUE
modelLookup("gbm")
##   model         parameter                   label forReg forClass probModel
## 1   gbm           n.trees   # Boosting Iterations   TRUE     TRUE      TRUE
## 2   gbm interaction.depth          Max Tree Depth   TRUE     TRUE      TRUE
## 3   gbm         shrinkage               Shrinkage   TRUE     TRUE      TRUE
## 4   gbm    n.minobsinnode Min. Terminal Node Size   TRUE     TRUE      TRUE

Here we use gradient boosting via gbm. Example tuning parameters for gbm are discussed at http://topepo.github.io/caret/training.html
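A custom grid of these parameters can be supplied to train() via tuneGrid. A sketch (the grid values here are illustrative, not run in this script):

```r
#example tuning grid for gbm
gbmGrid <- expand.grid(n.trees = c(100, 300, 500),
                       interaction.depth = c(1, 3, 5),
                       shrinkage = c(0.1, 0.01),
                       n.minobsinnode = 10)
#then: train(..., method="gbm", tuneGrid=gbmGrid, trControl=cvcontrol)
```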

Training with gradient boosting

train.gbm <- train(as.factor(HighSales) ~ ., 
                   data=Carseats.train,
                   method="gbm",
                   verbose=FALSE,
                   trControl=cvcontrol)
train.gbm
## Stochastic Gradient Boosting 
## 
## 200 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy  Kappa    
##   1                   50      0.810     0.5864970
##   1                  100      0.850     0.6781974
##   1                  150      0.840     0.6559583
##   2                   50      0.835     0.6435729
##   2                  100      0.855     0.6870897
##   2                  150      0.860     0.6973974
##   3                   50      0.845     0.6600169
##   3                  100      0.845     0.6640979
##   3                  150      0.865     0.7048597
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.

We can look at the confusion matrix for the Training data.

#obtaining class predictions
gbm.classTrain <-  predict(train.gbm, type="raw")
head(gbm.classTrain)
## [1] No  No  Yes No  No  No 
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.train$HighSales), gbm.classTrain)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  120   0
##        Yes   0  80
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9817, 1)
##     No Information Rate : 0.6        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.6        
##          Detection Rate : 0.6        
##    Detection Prevalence : 0.6        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : No         
## 

No errors on the Training data; once again, the Test data is the more informative check.

More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions
gbm.classTest <-  predict(train.gbm,
                          newdata = Carseats.test,
                          type="raw")
head(gbm.classTest)
## [1] Yes No  No  No  Yes No 
## Levels: No Yes
#computing confusion matrix
confusionMatrix(as.factor(Carseats.test$HighSales), gbm.classTest)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  103  13
##        Yes  28  56
##                                           
##                Accuracy : 0.795           
##                  95% CI : (0.7323, 0.8487)
##     No Information Rate : 0.655           
##     P-Value [Acc > NIR] : 1.035e-05       
##                                           
##                   Kappa : 0.5686          
##                                           
##  Mcnemar's Test P-Value : 0.02878         
##                                           
##             Sensitivity : 0.7863          
##             Specificity : 0.8116          
##          Pos Pred Value : 0.8879          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6550          
##          Detection Rate : 0.5150          
##    Detection Prevalence : 0.5800          
##       Balanced Accuracy : 0.7989          
##                                           
##        'Positive' Class : No              
## 

Accuracy of 0.795 on the Test data, the best of the models so far.

We can also look at the ROC curve by extracting probabilities of “Yes”.

#Obtaining predicted probabilities for Test data
gbm.probs=predict(train.gbm,
                  newdata=Carseats.test,
                  type="prob")
head(gbm.probs)
##            No         Yes
## 1 0.140471649 0.859528351
## 2 0.794429421 0.205570579
## 3 0.887294223 0.112705777
## 4 0.996868156 0.003131844
## 5 0.002862385 0.997137615
## 6 0.967026957 0.032973043
#Calculate ROC curve
rocCurve.gbm <- roc(Carseats.test$HighSales,gbm.probs[,"Yes"])
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
#plot the ROC curve
plot(rocCurve.gbm, col=c(3))

#calculate the area under curve (bigger is better)
auc(rocCurve.gbm)
## Area under the curve: 0.8837

9. Model Comparison

(Section 8, Model Stacking, is not yet included.)

We can examine how the models do by looking at the ROC curves.

plot(rocCurve.tree, col=c(4))
plot(rocCurve.bagg, add=TRUE, col=c(6))  #color magenta is bagg
plot(rocCurve.rf, add=TRUE, col=c(1))    #color black is rf
plot(rocCurve.cf, add=TRUE, col=c(2))    #color red is cforest
plot(rocCurve.gbm, add=TRUE, col=c(3))   #color green is gbm

Tree = blue, Bagg = magenta, RF = black, CForest = red, gradient boosting = green
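
The cross-validated accuracies of the five caret models can also be compared directly with resamples(), since all were trained with the same cvcontrol:

```r
#collect the resampling results from all five models
results <- resamples(list(tree = train.tree,
                          bagging = train.bagg,
                          rf = train.rf,
                          cforest = train.cf,
                          gbm = train.gbm))
summary(results)
bwplot(results)  #box-and-whisker comparison of resampling accuracy
```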

10. Conclusion

For this example, gradient boosting performed best on the Test data (AUC = 0.88), followed by the random forest (AUC = 0.84); both were more stable than the single tree, bagging, and cforest. Comparing the variable importance metrics from the ensembles against the single decision tree is one way to gauge how well that tree is likely to generalize.

Thank you for playing!

Citations

Brownlee, J. (2016, February 7). How to Build an Ensemble Of Machine Learning Algorithms in R. MachineLearningMastery.Com. https://www.machinelearningmastery.com/machine-learning-ensembles-with-r/

Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A., & Van Der Laan, M. J. (2006). Survival ensembles. Biostatistics, 7(3), 355–373. https://doi.org/10.1093/biostatistics/kxj011

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674. https://doi.org/10.1198/106186006X133933

Hothorn, T., & Zeileis, A. (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16(118), 3905–3909.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). ISLR: Data for an Introduction to Statistical Learning with Applications in R (Version 1.4). https://CRAN.R-project.org/package=ISLR

Kaushik, S. (2019, June 25). Ensemble Models in machine learning? (With code in R). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/

Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28, 1–26. https://doi.org/10.18637/jss.v028.i05

Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R News, 2(3), 18–22.

Milborrow, S. (2024). Rpart.plot: Plot “rpart” Models: An Enhanced Version of “plot.rpart” (Version 3.1.2). https://CRAN.R-project.org/package=rpart.plot

Neuwirth, E. (2022). RColorBrewer: ColorBrewer Palettes (Version 1.1-3). https://CRAN.R-project.org/package=RColorBrewer

R Core Team. (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.R-project.org/

Revelle, W. (2024). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University. https://CRAN.R-project.org/package=psych

Ridgeway, G., & GBM Developers. (2024). gbm: Generalized Boosted Regression Models (Version 2.2.2). https://CRAN.R-project.org/package=gbm

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12(1), 77. https://doi.org/10.1186/1471-2105-12-77

Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25. https://doi.org/10.1186/1471-2105-8-25

Therneau, T., & Atkinson, B. (2025). rpart: Recursive Partitioning and Regression Trees (Version 4.1.24) . https://CRAN.R-project.org/package=rpart

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag. https://ggplot2.tidyverse.org/

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation (Version 1.1.4). https://CRAN.R-project.org/package=dplyr

Williams, G. (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Springer. https://rd.springer.com/book/10.1007/978-1-4419-9890-3

Zeileis, A., Hothorn, T., & Hornik, K. (2008). Model-Based Recursive Partitioning. Journal of Computational and Graphical Statistics, 17(2), 492–514. https://doi.org/10.1198/106186008X319331