5  Hyperparameter Tuning and Model Optimization

Hyperparameter tuning is a crucial step in machine learning that can significantly improve model performance. In this chapter, we’ll explore what hyperparameters are, why tuning them matters, and how to fine-tune models using the tidymodels framework.

5.1 What Are Hyperparameters?

Hyperparameters are settings that control the behavior of a machine learning algorithm (Probst et al., 2019). Unlike parameters, which are learned from the data (e.g., coefficients in linear regression), hyperparameters are set before training; the short sketch after the examples below makes the distinction concrete.

Examples of Hyperparameters

  • Random Forest: Number of trees, maximum depth of each tree.
  • K-Nearest Neighbors (KNN): Number of neighbors (k).
  • Support Vector Machines (SVM): Kernel type, regularization parameter.
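
Here is a minimal sketch of that distinction (the object names lm_fit and knn_spec are illustrative; it assumes tidymodels is loaded and the kknn engine package is installed). The regression coefficients are estimated by lm(), while the number of neighbors k is fixed by the analyst before fitting:

library(tidymodels)

# Parameters are learned from the data: lm() estimates these coefficients.
lm_fit <- lm(mpg ~ wt, data = mtcars)
coef(lm_fit) # intercept and slope, chosen by the fitting procedure

# Hyperparameters are set by the analyst before training: here, k for KNN.
knn_spec <- nearest_neighbor(neighbors = 5) %>% # k is fixed up front, not estimated
    set_engine("kknn") %>%
    set_mode("regression")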

5.2 Why Tune Hyperparameters?

The choice of hyperparameters affects:

  • Model Complexity: Hyperparameters control how flexible the model is.
  • Performance: Poorly chosen hyperparameters can lead to overfitting or underfitting.

Overfitting vs. Underfitting

  • Overfitting: The model performs well on the training data but poorly on unseen data.
  • Underfitting: The model fails to capture the underlying patterns in the data; the sketch below shows both failure modes.
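
This is a minimal sketch on simulated data (the helper knn_test_rmse() is just for this illustration; it assumes the kknn package is installed). A 1-nearest-neighbor model is flexible enough to chase training noise, while a very large k is too rigid to follow the signal:

library(tidymodels)

set.seed(42)
sim <- tibble(x = runif(200, 0, 10),
              y = sin(x) + rnorm(200, sd = 0.3))
sim_split <- initial_split(sim, prop = 0.8)

# Fit KNN with a given k and report test-set RMSE
knn_test_rmse <- function(k) {
    nearest_neighbor(neighbors = k) %>%
        set_engine("kknn") %>%
        set_mode("regression") %>%
        fit(y ~ x, data = training(sim_split)) %>%
        predict(new_data = testing(sim_split)) %>%
        bind_cols(testing(sim_split)) %>%
        rmse(truth = y, estimate = .pred)
}

knn_test_rmse(1)   # very flexible: tends to overfit the training noise
knn_test_rmse(25)  # moderate k: typically generalizes better here
knn_test_rmse(100) # very rigid: underfits and test error rises again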

5.3 Tools for Hyperparameter Tuning

The tidymodels ecosystem provides several tools for hyperparameter tuning (a quick demonstration of the grid helpers follows this list):

  1. tune: Marks hyperparameters for tuning with tune() and searches over them with functions such as tune_grid().
  2. rsample: Creates resampling strategies like cross-validation.
  3. dials: Supplies grid_regular() and grid_random() to generate grids of hyperparameter combinations.
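
Both grid helpers take dials parameter objects and return a tibble of candidate values:

library(tidymodels)

# Regular grid: every combination of the requested levels (3 x 3 = 9 rows)
grid_regular(mtry(range = c(1, 5)), min_n(range = c(2, 10)), levels = 3)

# Random grid: 9 combinations sampled from the same ranges
set.seed(123)
grid_random(mtry(range = c(1, 5)), min_n(range = c(2, 10)), size = 9)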

5.4 Example: Hyperparameter Tuning for Random Forest

Let’s optimize a random forest model using tidymodels.

5.4.1 Define a Random Forest Model

library(tidymodels)

rf_model <- rand_forest(
    mtry = tune(),   # number of predictors sampled at each split (tuned)
    trees = 1000,    # number of trees, fixed in advance
    min_n = tune()   # minimum node size (tuned)
) %>%
    set_engine("ranger") %>%
    set_mode("regression")

5.5 Create a Recipe

# train_data is assumed to contain age, race, and marital
rf_recipe <- recipe(age ~ race + marital, data = train_data) %>%
    step_normalize(all_numeric_predictors()) %>% # center and scale numeric predictors
    step_dummy(all_nominal_predictors())         # convert factors to dummy variables

5.6 Set Up Resampling

# 5-fold cross-validation
cv_folds <- vfold_cv(train_data, v = 5)

5.7 Define a Grid

# Generate a regular grid of hyperparameter combinations (5 x 5 = 25 candidates)
rf_grid <- grid_regular(
    mtry(range = c(1, 5)),
    min_n(range = c(2, 10)),
    levels = 5
)

5.8 Tune the Model

rf_tune_results <- tune_grid(
    rf_model,
    preprocessor = rf_recipe,
    resamples = cv_folds,
    grid = rf_grid,
    metrics = metric_set(rmse, rsq) # Evaluate using RMSE and R-squared
)
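
Tuning results can also be inspected visually; tune provides an autoplot() method for the object returned by tune_grid():

# Plot RMSE and R-squared across the hyperparameter grid
autoplot(rf_tune_results)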

5.9 Evaluate Results

rf_tune_results %>%
    collect_metrics()

best_params <- rf_tune_results %>%
    select_best(metric = "rmse")

best_params
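
With the best hyperparameters selected, they can be injected back into the model specification and the model refit on the full training set. A minimal sketch using finalize_model() and a workflow (the case study below shows the closely related finalize_workflow()):

# Inject the winning hyperparameters into the specification
final_rf <- finalize_model(rf_model, best_params)

# Fit the finalized model together with the recipe
final_fit <- workflow() %>%
    add_model(final_rf) %>%
    add_recipe(rf_recipe) %>%
    fit(data = train_data)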

5.10 Summary

In this chapter, we:

  1. Learned about hyperparameters and their importance.
  2. Explored tools from tidymodels for hyperparameter tuning.
  3. Tuned a random forest model using cross-validation.

5.11 Case Study: Optimizing Random Forest for Educational Attainment Prediction

5.11.1 Introduction

Educational attainment is a key indicator in social science research. Predicting the highest level of education achieved using demographic and socioeconomic features can provide insights into societal trends. In this case study, we demonstrate how to optimize a Random Forest model using hyperparameter tuning.


5.11.2 Objective

This case study demonstrates:

  1. Hyperparameter tuning for Random Forest.
  2. Cross-validation for model evaluation.
  3. Using tidymodels for streamlined optimization.


5.11.3 Dataset

We’ll simulate a dataset for this case study. Because the outcome is drawn independently of the predictors, accuracy will hover near chance; the point is the tuning workflow, not the predictive performance.

# Simulated educational attainment dataset
set.seed(123)
edu_data <- data.frame(
  age = sample(18:60, 300, replace = TRUE),
  income = sample(20000:120000, 300, replace = TRUE),
  parental_education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 300, replace = TRUE),
  study_hours = rpois(300, lambda = 5),
  # The outcome must be a factor for parsnip classification models
  education_level = factor(sample(c("High School", "Bachelor's", "Master's", "PhD"), 300, replace = TRUE))
)

# View dataset
head(edu_data)

5.11.4 Step 1: Data Splitting

library(tidymodels)

# Split the data
set.seed(123)
edu_split <- initial_split(edu_data, prop = 0.8)
train_data <- training(edu_split)
test_data <- testing(edu_split)

5.11.5 Step 2: Preprocessing

# Define a recipe
edu_recipe <- recipe(education_level ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Prepare the recipe and preview the processed data (inspection only;
# tune_grid() and the final workflow apply the recipe automatically)
prepared_recipe <- prep(edu_recipe)
processed_train <- bake(prepared_recipe, new_data = NULL)
processed_test <- bake(prepared_recipe, new_data = test_data)

5.11.6 Step 3: Model Specification

# Define a tunable Random Forest model
rf_model <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("classification")

5.11.7 Step 4: Hyperparameter Tuning

# Define resampling strategy
cv_folds <- vfold_cv(train_data, v = 5)

# Define a grid for hyperparameter tuning
rf_grid <- grid_regular(
  mtry(range = c(1, 5)),
  min_n(range = c(2, 10)),
  levels = 5
)

# Tune the model
rf_tune <- tune_grid(
  rf_model,
  preprocessor = edu_recipe,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = metric_set(accuracy)
)

# View tuning results
rf_tune %>%
  collect_metrics()
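
Beyond the full metrics table, show_best() ranks the top-performing configurations:

# Top 5 hyperparameter combinations by cross-validated accuracy
rf_tune %>%
  show_best(metric = "accuracy", n = 5)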

5.11.8 Step 5: Evaluate the Optimized Model

# Select the best hyperparameters
best_params <- rf_tune %>%
  select_best(metric = "accuracy")

# Finalize the workflow
rf_final <- finalize_workflow(
  workflow() %>%
    add_model(rf_model) %>%
    add_recipe(edu_recipe),
  best_params
)

# Fit the final model
rf_fit <- rf_final %>%
  fit(data = train_data)

# Evaluate on test data
rf_predictions <- predict(rf_fit, new_data = test_data) %>%
  bind_cols(test_data)

# Calculate accuracy
metrics(rf_predictions, truth = education_level, estimate = .pred_class)
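
Accuracy alone can mask class-level errors; a confusion matrix shows which education levels the model mixes up:

# Cross-tabulate predicted vs. observed education levels
rf_predictions %>%
  conf_mat(truth = education_level, estimate = .pred_class)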