6 Advanced Topics: Ensemble Methods
Ensemble methods are powerful techniques in machine learning that combine predictions from multiple models to improve accuracy and robustness (Bishop & Nasrabadi, 2006). In this chapter, we’ll explore three main types of ensemble methods: bagging, boosting, and stacking.
6.1 What Are Ensemble Methods?
Ensemble methods work by combining multiple models to produce a single, more reliable prediction. By leveraging the strengths of individual models, ensembles often outperform single models.
6.1.1 Benefits of Ensemble Methods
- Improved Accuracy: Reduce errors by combining predictions.
- Robustness: Less sensitive to noise and outliers.
- Versatility: Applicable to both regression and classification tasks.
6.2 Types of Ensemble Methods
6.2.1 Bagging (Bootstrap Aggregation)
Bagging trains multiple models on different subsets of data and averages their predictions. Random Forest is a popular example of bagging.
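To make the idea concrete, here is a minimal hand-rolled sketch of bagging (the toy data and the use of rpart are illustrative assumptions, not the tidymodels workflow used later): the same regression tree is fit to many bootstrap resamples and the predictions are averaged.
# Toy illustration of bagging: fit the same tree to many bootstrap
# resamples and average the predictions (data simulated for illustration)
library(rpart)

set.seed(123)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

n_bags <- 50
preds <- sapply(seq_len(n_bags), function(b) {
  idx <- sample(nrow(toy), replace = TRUE)    # bootstrap resample
  tree <- rpart(y ~ x, data = toy[idx, ])     # one base learner
  predict(tree, newdata = toy)                # predictions on the full data
})

# The bagged prediction is the average over the base learners
bagged_pred <- rowMeans(preds)
head(bagged_pred)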
6.2.2 Boosting
Boosting trains models iteratively, with each model correcting the errors of the previous one. Examples include AdaBoost and XGBoost.
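As a minimal sketch of the boosting idea (again with simulated toy data and rpart as an assumed weak learner), each small tree is fit to the residuals of the current ensemble and its predictions are added with a learning rate:
# Toy illustration of boosting for squared-error loss: each weak learner
# is fit to the residuals of the current ensemble
library(rpart)

set.seed(123)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

learn_rate <- 0.1
pred <- rep(mean(toy$y), nrow(toy))   # start from the mean prediction

for (m in 1:100) {
  toy$resid <- toy$y - pred                                # current errors
  stump <- rpart(resid ~ x, data = toy,
                 control = rpart.control(maxdepth = 2))    # weak learner
  pred <- pred + learn_rate * predict(stump, newdata = toy)
}

head(pred)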
6.2.3 Stacking
Stacking combines the predictions of multiple models (base learners) using a meta-model, which learns how to best combine the base predictions.
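The sketch below illustrates the mechanism with two base learners and a linear meta-model (the data, the holdout split, and the choice of lm/rpart are illustrative assumptions; the stacks package used later automates this with resampling):
# Toy illustration of stacking: a linear meta-model learns how to weight
# the out-of-sample predictions of two base learners
library(rpart)

set.seed(123)
toy <- data.frame(x1 = runif(300), x2 = runif(300))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(300, sd = 0.2)
train <- toy[1:200, ]
valid <- toy[201:300, ]

base_lm <- lm(y ~ x1 + x2, data = train)       # base learner 1
base_tree <- rpart(y ~ x1 + x2, data = train)  # base learner 2

# Meta-model fit on held-out predictions from the base learners
meta_data <- data.frame(
  y = valid$y,
  p1 = predict(base_lm, newdata = valid),
  p2 = predict(base_tree, newdata = valid)
)
meta_model <- lm(y ~ p1 + p2, data = meta_data)
coef(meta_model)   # learned weights for each base learner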
6.3 Implementing Bagging with Random Forest
Let’s train and evaluate a random forest model as an example of bagging.
library(tidymodels)

# Define a random forest model; mtry is set to 2 because the formula
# below uses only two predictors
rf_model <- rand_forest(
  mtry = 2,
  trees = 500,
  min_n = 5
) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Train the model
rf_fit <- rf_model %>%
  fit(age ~ race + marital, data = train_data)

# Predict on the test set
rf_predictions <- predict(rf_fit, new_data = test_data)

# Evaluate the model
rf_results <- test_data %>%
  bind_cols(rf_predictions) %>%
  metrics(truth = age, estimate = .pred)
rf_results
6.4 Implementing Boosting with XGBoost
Boosting iteratively improves model performance. Here’s how to fit an XGBoost model with tidymodels.
# Define a boosted tree model
xgb_model <- boost_tree(
  trees = 1000,
  learn_rate = 0.01,
  tree_depth = 6
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Train the model
xgb_fit <- xgb_model %>%
  fit(age ~ race + marital, data = train_data)

# Predict on the test set and evaluate
xgb_predictions <- predict(xgb_fit, new_data = test_data)

xgb_results <- test_data %>%
  bind_cols(xgb_predictions) %>%
  metrics(truth = age, estimate = .pred)
xgb_results
6.5 Stacking Models
Stacking combines the predictions from multiple models to produce a final prediction. With the stacks package, each candidate model is first fit with resampling so that its out-of-sample predictions are available for the meta-model to blend.
library(stacks)

set.seed(123)
folds <- vfold_cv(train_data, v = 5)
ctrl <- control_stack_resamples()   # saves predictions for the stack

# Candidate 1: random forest
model_rf <- rand_forest(mtry = 3, trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")
rf_res <- workflow() %>%
  add_model(model_rf) %>%
  add_formula(age ~ race + marital) %>%
  fit_resamples(resamples = folds, control = ctrl)

# Candidate 2: boosted trees
model_xgb <- boost_tree(trees = 1000, learn_rate = 0.01) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
xgb_res <- workflow() %>%
  add_model(model_xgb) %>%
  add_formula(age ~ race + marital) %>%
  fit_resamples(resamples = folds, control = ctrl)

# Blend the candidates and fit the retained member models
model_stack <- stacks() %>%
  add_candidates(rf_res) %>%
  add_candidates(xgb_res) %>%
  blend_predictions() %>%
  fit_members()

# Predict and evaluate on the test set
stack_predictions <- predict(model_stack, new_data = test_data)
stack_results <- test_data %>%
  bind_cols(stack_predictions) %>%
  metrics(truth = age, estimate = .pred)
stack_results
6.6 Summary
In this chapter, we:
1. Explored the three main types of ensemble methods: bagging, boosting, and stacking.
2. Implemented a random forest model as an example of bagging.
3. Trained and evaluated an XGBoost model for boosting.
4. Combined multiple models using stacking.
6.7 Case Study: Comparing Bagging and Boosting for Income Prediction
6.7.1 Introduction
Ensemble methods like bagging and boosting are powerful techniques for improving predictive accuracy. In this case study, we use Random Forest and Gradient Boosting to predict income levels. We compare the two models to understand their strengths and weaknesses.
6.7.2 Objective
This case study demonstrates:
1. Building and evaluating ensemble methods for regression tasks.
2. Comparing bagging (Random Forest) with boosting (Gradient Boosting).
3. Using cross-validation for model evaluation.
6.7.3 Dataset
We simulate a dataset with features like age, education level, and work hours.
# Simulated income dataset
set.seed(123)
income_data <- data.frame(
  age = sample(20:60, 300, replace = TRUE),
  education_level = sample(c("High School", "Bachelor's", "Master's", "PhD"), 300, replace = TRUE),
  work_hours = rpois(300, lambda = 40),
  income = rnorm(300, mean = 50000, sd = 15000)
)
# View dataset
head(income_data)
6.7.4 Step 1: Data Splitting
library(tidymodels)
# Split the data
set.seed(123)
income_split <- initial_split(income_data, prop = 0.8)
train_data <- training(income_split)
test_data <- testing(income_split)
6.7.5 Step 2: Preprocessing
# Define a recipe
income_recipe <- recipe(income ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Prepare the recipe and apply it to both splits
prepared_recipe <- prep(income_recipe)
processed_train <- bake(prepared_recipe, new_data = NULL)
processed_test <- bake(prepared_recipe, new_data = test_data)
6.7.6 Step 3: Random Forest (Bagging)
# Define a Random Forest model
rf_model <- rand_forest(
  mtry = 3,
  trees = 500,
  min_n = 5
) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Train the model
rf_fit <- rf_model %>%
  fit(income ~ ., data = processed_train)

# Predict on test data
rf_predictions <- predict(rf_fit, new_data = processed_test) %>%
  bind_cols(processed_test)

# Evaluate the model
rf_metrics <- metrics(rf_predictions, truth = income, estimate = .pred)
rf_metrics
6.7.7 Step 4: Gradient Boosting
# Define a Gradient Boosting model
xgb_model <- boost_tree(
  trees = 1000,
  learn_rate = 0.01,
  tree_depth = 6
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Train the model
xgb_fit <- xgb_model %>%
  fit(income ~ ., data = processed_train)

# Predict on test data
xgb_predictions <- predict(xgb_fit, new_data = processed_test) %>%
  bind_cols(processed_test)

# Evaluate the model
xgb_metrics <- metrics(xgb_predictions, truth = income, estimate = .pred)
xgb_metrics
6.7.8 Step 5: Model Comparison
# Combine metrics for comparison
model_comparison <- bind_rows(
  rf_metrics %>% mutate(model = "Random Forest"),
  xgb_metrics %>% mutate(model = "Gradient Boosting")
)
model_comparison
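The objectives mention cross-validation; as a sketch, the same two model specifications could also be compared with 5-fold cross-validation on the training data (the fold count and the use of workflows here are assumptions, not part of the steps above):
# Sketch: compare the two models with 5-fold cross-validation
set.seed(123)
folds <- vfold_cv(train_data, v = 5)

rf_wf <- workflow() %>%
  add_recipe(income_recipe) %>%
  add_model(rf_model)

xgb_wf <- workflow() %>%
  add_recipe(income_recipe) %>%
  add_model(xgb_model)

rf_cv <- fit_resamples(rf_wf, resamples = folds)
xgb_cv <- fit_resamples(xgb_wf, resamples = folds)

bind_rows(
  collect_metrics(rf_cv) %>% mutate(model = "Random Forest"),
  collect_metrics(xgb_cv) %>% mutate(model = "Gradient Boosting")
)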