6 Advanced Topics: Ensemble Methods
Ensemble methods are powerful techniques in machine learning that combine predictions from multiple models to improve accuracy and robustness (Bishop & Nasrabadi, 2006). In this chapter, we’ll explore three main types of ensemble methods: bagging, boosting, and stacking.
6.1 What Are Ensemble Methods?
Ensemble methods work by combining multiple models to produce a single, more reliable prediction. By leveraging the strengths of individual models, ensembles often outperform single models.
6.1.1 Benefits of Ensemble Methods
- Improved Accuracy: Reduce errors by combining predictions.
- Robustness: Less sensitive to noise and outliers.
- Versatility: Applicable to both regression and classification tasks.
6.2 Types of Ensemble Methods
6.2.1 Bagging (Bootstrap Aggregation)
Bagging trains multiple models on different subsets of data and averages their predictions. Random Forest is a popular example of bagging.
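To make the idea concrete, here is a minimal hand-rolled sketch of bagging (the toy data and the use of rpart are illustrative assumptions, not the tidymodels workflow used later): the same regression tree is fit to many bootstrap resamples and the predictions are averaged.
# Toy illustration of bagging: fit the same tree to many bootstrap
# resamples and average the predictions (data simulated for illustration)
library(rpart)

set.seed(123)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

n_bags <- 50
preds <- sapply(seq_len(n_bags), function(b) {
  idx <- sample(nrow(toy), replace = TRUE)    # bootstrap resample
  tree <- rpart(y ~ x, data = toy[idx, ])     # one base learner
  predict(tree, newdata = toy)                # predictions on the full data
})

# The bagged prediction is the average over the base learners
bagged_pred <- rowMeans(preds)
head(bagged_pred)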
6.2.2 Boosting
Boosting trains models iteratively, with each model correcting the errors of the previous one. Examples include AdaBoost and XGBoost.
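As a minimal sketch of the boosting idea (again with simulated toy data and rpart as an assumed weak learner), each small tree is fit to the residuals of the current ensemble and its predictions are added with a learning rate:
# Toy illustration of boosting for squared-error loss: each weak learner
# is fit to the residuals of the current ensemble
library(rpart)

set.seed(123)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

learn_rate <- 0.1
pred <- rep(mean(toy$y), nrow(toy))   # start from the mean prediction

for (m in 1:100) {
  toy$resid <- toy$y - pred                                # current errors
  stump <- rpart(resid ~ x, data = toy,
                 control = rpart.control(maxdepth = 2))    # weak learner
  pred <- pred + learn_rate * predict(stump, newdata = toy)
}

head(pred)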
6.2.3 Stacking
Stacking combines the predictions of multiple models (base learners) using a meta-model, which learns how to best combine the base predictions.
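The sketch below illustrates the mechanism with two base learners and a linear meta-model (the data, the holdout split, and the choice of lm/rpart are illustrative assumptions; the stacks package used later automates this with resampling):
# Toy illustration of stacking: a linear meta-model learns how to weight
# the out-of-sample predictions of two base learners
library(rpart)

set.seed(123)
toy <- data.frame(x1 = runif(300), x2 = runif(300))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(300, sd = 0.2)
train <- toy[1:200, ]
valid <- toy[201:300, ]

base_lm <- lm(y ~ x1 + x2, data = train)       # base learner 1
base_tree <- rpart(y ~ x1 + x2, data = train)  # base learner 2

# Meta-model fit on held-out predictions from the base learners
meta_data <- data.frame(
  y = valid$y,
  p1 = predict(base_lm, newdata = valid),
  p2 = predict(base_tree, newdata = valid)
)
meta_model <- lm(y ~ p1 + p2, data = meta_data)
coef(meta_model)   # learned weights for each base learner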
6.3 Implementing Bagging with Random Forest
Let’s train and evaluate a random forest model as an example of bagging.
library(tidymodels)

# Define a random forest model; mtry is set to 2 because the formula
# below uses only two predictors
rf_model <- rand_forest(
  mtry = 2,
  trees = 500,
  min_n = 5
) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Train the model
rf_fit <- rf_model %>%
  fit(age ~ race + marital, data = train_data)

# Predict on the test set
rf_predictions <- predict(rf_fit, new_data = test_data)

# Evaluate the model
rf_results <- test_data %>%
  bind_cols(rf_predictions) %>%
  metrics(truth = age, estimate = .pred)
rf_results
6.4 Implementing Boosting with XGBoost
Boosting iteratively improves model performance. Here’s how to fit an XGBoost model with tidymodels.
# Define a boosted tree model
xgb_model <- boost_tree(
  trees = 1000,
  learn_rate = 0.01,
  tree_depth = 6
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Train the model
xgb_fit <- xgb_model %>%
  fit(age ~ race + marital, data = train_data)

# Predict on the test set and evaluate
xgb_predictions <- predict(xgb_fit, new_data = test_data)

xgb_results <- test_data %>%
  bind_cols(xgb_predictions) %>%
  metrics(truth = age, estimate = .pred)
xgb_results
6.5 Stacking Models
Stacking combines the predictions from multiple models to produce a final prediction. With the stacks package, each candidate model is first fit with resampling so that its out-of-sample predictions are available for the meta-model to blend.
library(stacks)

set.seed(123)
folds <- vfold_cv(train_data, v = 5)
ctrl <- control_stack_resamples()   # saves predictions for the stack

# Candidate 1: random forest
model_rf <- rand_forest(mtry = 3, trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")
rf_res <- workflow() %>%
  add_model(model_rf) %>%
  add_formula(age ~ race + marital) %>%
  fit_resamples(resamples = folds, control = ctrl)

# Candidate 2: boosted trees
model_xgb <- boost_tree(trees = 1000, learn_rate = 0.01) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
xgb_res <- workflow() %>%
  add_model(model_xgb) %>%
  add_formula(age ~ race + marital) %>%
  fit_resamples(resamples = folds, control = ctrl)

# Blend the candidates and fit the retained member models
model_stack <- stacks() %>%
  add_candidates(rf_res) %>%
  add_candidates(xgb_res) %>%
  blend_predictions() %>%
  fit_members()

# Predict and evaluate on the test set
stack_predictions <- predict(model_stack, new_data = test_data)
stack_results <- test_data %>%
  bind_cols(stack_predictions) %>%
  metrics(truth = age, estimate = .pred)
stack_results
6.6 Summary
In this chapter, we:
1. Explored the three main types of ensemble methods: bagging, boosting, and stacking.
2. Implemented a random forest model as an example of bagging.
3. Trained and evaluated an XGBoost model for boosting.
4. Combined multiple models using stacking.
6.7 Case Study: Comparing Bagging and Boosting for Income Prediction
6.7.1 Introduction
Ensemble methods like bagging and boosting are powerful techniques for improving predictive accuracy. In this case study, we use Random Forest and Gradient Boosting to predict income levels. We compare the two models to understand their strengths and weaknesses.
6.7.2 Objective
This case study demonstrates:
1. Building and evaluating ensemble methods for regression tasks.
2. Comparing bagging (Random Forest) with boosting (Gradient Boosting).
3. Using cross-validation for model evaluation.
6.7.3 Dataset
We simulate a dataset with features like age, education level, and work hours.
# Simulated income dataset
set.seed(123)
income_data <- data.frame(
  age = sample(20:60, 300, replace = TRUE),
  education_level = sample(c("High School", "Bachelor's", "Master's", "PhD"), 300, replace = TRUE),
  work_hours = rpois(300, lambda = 40),
  income = rnorm(300, mean = 50000, sd = 15000)
)
# View dataset
head(income_data)
6.7.4 Step 1: Data Splitting
library(tidymodels)
# Split the data
set.seed(123)
income_split <- initial_split(income_data, prop = 0.8)
train_data <- training(income_split)
test_data <- testing(income_split)
6.7.5 Step 2: Preprocessing
# Define a recipe
income_recipe <- recipe(income ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Prepare the recipe and apply it to both splits
prepared_recipe <- prep(income_recipe)
processed_train <- bake(prepared_recipe, new_data = NULL)
processed_test <- bake(prepared_recipe, new_data = test_data)
6.7.6 Step 3: Random Forest (Bagging)
# Define a Random Forest model
rf_model <- rand_forest(
  mtry = 3,
  trees = 500,
  min_n = 5
) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Train the model
rf_fit <- rf_model %>%
  fit(income ~ ., data = processed_train)

# Predict on test data
rf_predictions <- predict(rf_fit, new_data = processed_test) %>%
  bind_cols(processed_test)

# Evaluate the model
rf_metrics <- metrics(rf_predictions, truth = income, estimate = .pred)
rf_metrics
6.7.7 Step 4: Gradient Boosting
# Define a Gradient Boosting model
xgb_model <- boost_tree(
  trees = 1000,
  learn_rate = 0.01,
  tree_depth = 6
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Train the model
xgb_fit <- xgb_model %>%
  fit(income ~ ., data = processed_train)

# Predict on test data
xgb_predictions <- predict(xgb_fit, new_data = processed_test) %>%
  bind_cols(processed_test)

# Evaluate the model
xgb_metrics <- metrics(xgb_predictions, truth = income, estimate = .pred)
xgb_metrics
6.7.8 Step 5: Model Comparison
# Combine metrics for comparison
model_comparison <- bind_rows(
  rf_metrics %>% mutate(model = "Random Forest"),
  xgb_metrics %>% mutate(model = "Gradient Boosting")
)
model_comparison
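The objectives mention cross-validation; as a sketch, the same two model specifications could also be compared with 5-fold cross-validation on the training data (the fold count and the use of workflows here are assumptions, not part of the steps above):
# Sketch: compare the two models with 5-fold cross-validation
set.seed(123)
folds <- vfold_cv(train_data, v = 5)

rf_wf <- workflow() %>%
  add_recipe(income_recipe) %>%
  add_model(rf_model)

xgb_wf <- workflow() %>%
  add_recipe(income_recipe) %>%
  add_model(xgb_model)

rf_cv <- fit_resamples(rf_wf, resamples = folds)
xgb_cv <- fit_resamples(xgb_wf, resamples = folds)

bind_rows(
  collect_metrics(rf_cv) %>% mutate(model = "Random Forest"),
  collect_metrics(xgb_cv) %>% mutate(model = "Gradient Boosting")
)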