# Simulated educational attainment dataset
set.seed(123)
edu_data <- data.frame(
  age = sample(18:60, 300, replace = TRUE),
  income = sample(20000:120000, 300, replace = TRUE),
  parental_education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 300, replace = TRUE),
  study_hours = rpois(300, lambda = 5),
  # Stored as a factor: classification metrics require a factor outcome
  education_level = factor(sample(c("High School", "Bachelor's", "Master's", "PhD"), 300, replace = TRUE))
)
# View dataset
head(edu_data)
5 Hyperparameter Tuning and Model Optimization
Hyperparameter tuning is a crucial step in machine learning that can significantly improve model performance. In this chapter, we’ll explore what hyperparameters are, why tuning them matters, and how to fine-tune models using the tidymodels framework.
5.1 What Are Hyperparameters?
Hyperparameters are settings that control the behavior of a machine learning algorithm (Probst et al., 2019). Unlike parameters, which are learned from the data (e.g., coefficients in linear regression), hyperparameters are set before training.
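A small base R sketch may help make the distinction concrete. It uses the built-in `mtcars` data and a polynomial degree as a stand-in hyperparameter; both are illustrative choices and not part of this chapter's dataset.

```r
# Parameters: the intercept and slope are estimated from the data by lm().
linear_fit <- lm(mpg ~ wt, data = mtcars)
coef(linear_fit)

# Hyperparameter: the polynomial degree is chosen by us before fitting.
degree <- 3
poly_fit <- lm(mpg ~ poly(wt, degree), data = mtcars)

# Changing the hyperparameter changes how many parameters are learned.
length(coef(poly_fit))
```

Here `degree` is never estimated by `lm()`; it only determines which model `lm()` is asked to fit.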
Examples of Hyperparameters
- Random Forest: Number of trees, maximum depth of each tree.
- K-Nearest Neighbors (KNN): Number of neighbors (k).
- Support Vector Machines (SVM): Kernel type, regularization parameter.
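In tidymodels, hyperparameters like these appear as arguments of the parsnip model functions. The sketch below only builds model specifications and fits nothing; the specific values (500 trees, 7 neighbors, cost of 1) are arbitrary placeholders chosen for illustration.

```r
library(parsnip)

# Random forest: number of trees and minimum node size
rf_spec <- rand_forest(mode = "classification", engine = "ranger",
                       trees = 500, min_n = 5)

# K-nearest neighbors: number of neighbors
knn_spec <- nearest_neighbor(mode = "classification", engine = "kknn",
                             neighbors = 7)

# SVM with a radial basis kernel: the cost argument controls regularization
svm_spec <- svm_rbf(mode = "classification", engine = "kernlab",
                    cost = 1)

rf_spec
```

Note that parsnip's `rand_forest()` exposes `mtry`, `trees`, and `min_n` as its main arguments; other settings, such as a maximum tree depth, would have to be passed through to the underlying engine.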
5.2 Why Tune Hyperparameters?
The choice of hyperparameters affects:
- Model Complexity: Hyperparameters control how flexible the model is.
- Performance: Poorly chosen hyperparameters can lead to overfitting or underfitting.
Overfitting vs. Underfitting
- Overfitting: The model performs well on the training data but poorly on unseen data.
- Underfitting: The model fails to capture patterns in the data.
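The following base R sketch illustrates both failure modes on simulated data. The sine-shaped signal, the noise level, and the three polynomial degrees are assumptions made purely for illustration.

```r
set.seed(42)

# Simulate a nonlinear relationship with noise
n <- 80
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
train <- data.frame(x = x[1:50], y = y[1:50])
test <- data.frame(x = x[51:80], y = y[51:80])

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit polynomials of increasing flexibility and compare errors
degrees <- c(1, 4, 15)
results <- do.call(rbind, lapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  data.frame(
    degree = d,
    train_rmse = rmse(train$y, predict(fit, newdata = train)),
    test_rmse = rmse(test$y, predict(fit, newdata = test))
  )
}))

print(results)
```

Training error can only go down as the degree grows, because each model contains the previous one as a special case. A low training error paired with a much higher test error is the signature of overfitting, while a high error on both sets suggests underfitting.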
5.3 Tools for Hyperparameter Tuning
The tidymodels framework provides several tools for hyperparameter tuning:
- tune: Used to define tunable hyperparameters.
- rsample: Creates resampling strategies like cross-validation.
- grid_regular() and grid_random() (from the dials package): Generate grids for searching hyperparameter combinations.
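As a quick illustration of the two grid functions, the sketch below builds a regular and a random grid over the same two parameters. The parameter ranges and grid sizes are arbitrary values chosen for the example.

```r
library(dials)

# Regular grid: every combination of 3 evenly spaced values per parameter
regular_grid <- grid_regular(
  min_n(range = c(2, 10)),
  trees(range = c(100, 500)),
  levels = 3
)

# Random grid: combinations sampled at random from the same ranges
set.seed(1)
random_grid <- grid_random(
  min_n(range = c(2, 10)),
  trees(range = c(100, 500)),
  size = 9
)

regular_grid
random_grid
```

A regular grid grows multiplicatively with the number of parameters, so a random grid is often a cheaper way to cover a space with many hyperparameters.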
5.4 Example: Hyperparameter Tuning for Random Forest
Let’s optimize a random forest model using tidymodels.
5.4.1 Define a Random Forest Model
rf_model <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("regression")
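One detail worth knowing about `mtry` is that its upper bound depends on the number of predictors, so dials leaves it unknown until it sees the data. The sketch below uses the built-in `mtcars` data as a stand-in predictor set (an assumption for illustration) to show how `finalize()` fills in that bound.

```r
library(dials)

# By default the upper end of the mtry range is unknown
mtry()

# Use the predictor columns to determine the upper bound
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
mtry_final <- finalize(mtry(), predictors)

range_get(mtry_final)
```

Supplying an explicit range, such as `mtry(range = c(1, 5))`, is an alternative that avoids the unknown bound altogether.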
5.5 Create a Recipe
rf_recipe <- recipe(age ~ race + marital, data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
5.6 Set Up Resampling
5.6.1 5-fold cross-validation
cv_folds <- vfold_cv(train_data, v = 5)
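To see what a set of folds contains, the sketch below builds 5 folds on the built-in `mtcars` data (used here only as a convenient stand-in) and inspects the first one.

```r
library(rsample)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

# Each row of `folds` holds one split of the data
first_split <- folds$splits[[1]]

# analysis() returns the rows used for fitting,
# assessment() returns the held-out rows used for evaluation
nrow(analysis(first_split))
nrow(assessment(first_split))
```

Every row of the data lands in the assessment set of exactly one fold, so each observation is used for evaluation once.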
5.7 Define a Grid
5.7.1 Generate a grid of hyperparameters
rf_grid <- grid_regular(
  mtry(range = c(1, 5)),
  min_n(range = c(2, 10)),
  levels = 5
)
5.8 Tune the Model
rf_tune_results <- tune_grid(
  rf_model,
  preprocessor = rf_recipe,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = metric_set(rmse, rsq) # Evaluate using RMSE and R-squared
)
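Conceptually, a grid search with cross-validation fits the model once per candidate value and per fold, then averages the held-out error. The base R sketch below mimics that loop by hand; it is not how `tune_grid()` is implemented. The helper `knn_predict()`, the `mtcars` data, and the candidate values of `k` are all hypothetical choices made for illustration.

```r
# A hand-rolled KNN regressor on a single predictor (illustrative helper)
knn_predict <- function(train_x, train_y, new_x, k) {
  sapply(new_x, function(x0) {
    nearest <- order(abs(train_x - x0))[seq_len(k)]
    mean(train_y[nearest])
  })
}

set.seed(123)
v <- 5
fold_id <- sample(rep(seq_len(v), length.out = nrow(mtcars)))
k_values <- c(1, 3, 5, 7, 9)

cv_results <- data.frame(k = k_values, mean_rmse = NA_real_)
for (i in seq_along(k_values)) {
  fold_rmse <- numeric(v)
  for (f in seq_len(v)) {
    train <- mtcars[fold_id != f, ]
    holdout <- mtcars[fold_id == f, ]
    pred <- knn_predict(train$wt, train$mpg, holdout$wt, k = k_values[i])
    fold_rmse[f] <- sqrt(mean((holdout$mpg - pred)^2))
  }
  # Average the held-out error across folds for this candidate
  cv_results$mean_rmse[i] <- mean(fold_rmse)
}

cv_results
best_k <- cv_results$k[which.min(cv_results$mean_rmse)]
best_k
```

`tune_grid()` handles this bookkeeping for any model, recipe, and metric set, and returns the per-fold results in a tidy form.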
5.9 Evaluate Results
rf_tune_results %>%
  collect_metrics()
# Name the metric argument explicitly; recent versions of tune require it
best_params <- rf_tune_results %>%
  select_best(metric = "rmse")
best_params
5.10 Summary
In this chapter, we:
1. Learned about hyperparameters and their importance.
2. Explored tools from tidymodels for hyperparameter tuning.
3. Tuned a random forest model using cross-validation.
5.11 Case Study: Optimizing Random Forest for Educational Attainment Prediction
5.11.1 Introduction
Educational attainment is a key indicator in social science research. Predicting the highest level of education achieved using demographic and socioeconomic features can provide insights into societal trends. In this case study, we demonstrate how to optimize a Random Forest model using hyperparameter tuning.
5.11.2 Objective
This case study demonstrates:
1. Hyperparameter tuning for Random Forest.
2. Cross-validation for model evaluation.
3. Using tidymodels for streamlined optimization.
5.11.3 Dataset
We’ll simulate a dataset for this case study.
5.11.4 Step 1: Data Splitting
library(tidymodels)
# Split the data
set.seed(123)
edu_split <- initial_split(edu_data, prop = 0.8)
train_data <- training(edu_split)
test_data <- testing(edu_split)
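For classification problems, it can be useful to stratify the split so that class proportions are similar in both sets. The sketch below is a variation on the split above, not the approach used in this case study; it uses the `strata` argument of `initial_split()` on a small simulated data frame (`toy`, a hypothetical name) that mirrors two of the case-study columns.

```r
library(rsample)

set.seed(123)
toy <- data.frame(
  study_hours = rpois(300, lambda = 5),
  education_level = factor(sample(
    c("High School", "Bachelor's", "Master's", "PhD"),
    300, replace = TRUE
  ))
)

# Stratify on the outcome so each class is split roughly 80/20
toy_split <- initial_split(toy, prop = 0.8, strata = education_level)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Compare class proportions in the two sets
round(prop.table(table(toy_train$education_level)), 2)
round(prop.table(table(toy_test$education_level)), 2)
```

With roughly balanced simulated classes the difference from a plain random split is small, but stratification tends to matter more when some classes are rare.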
5.11.5 Step 2: Preprocessing
# Define a recipe
edu_recipe <- recipe(education_level ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
# Prepare the recipe
prepared_recipe <- prep(edu_recipe)
processed_train <- bake(prepared_recipe, new_data = NULL)
processed_test <- bake(prepared_recipe, new_data = test_data)
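To see what these two steps do to the data, the sketch below applies the same recipe steps to a tiny hand-made data frame. The `toy` data and its column names are made up for illustration.

```r
library(recipes)

toy <- data.frame(
  y = factor(c("a", "b", "a", "b", "a", "b")),
  hours = c(1, 2, 3, 4, 5, 6),
  parent = factor(c("HS", "BA", "HS", "MA", "BA", "MA"))
)

toy_recipe <- recipe(y ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates means, standard deviations, and factor levels;
# bake(new_data = NULL) returns the processed training data
toy_baked <- bake(prep(toy_recipe), new_data = NULL)

names(toy_baked)
mean(toy_baked$hours)
sd(toy_baked$hours)
```

Because `step_normalize()` comes after `step_dummy()`, the dummy columns are numeric by the time normalization runs and are centered and scaled as well. Swapping the order of the two steps would leave the dummies as 0/1 indicators.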
5.11.6 Step 3: Model Specification
# Define a tunable Random Forest model
rf_model <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("classification")
5.11.7 Step 4: Hyperparameter Tuning
# Define resampling strategy
cv_folds <- vfold_cv(train_data, v = 5)
# Define a grid for hyperparameter tuning
rf_grid <- grid_regular(
  mtry(range = c(1, 5)),
  min_n(range = c(2, 10)),
  levels = 5
)
# Tune the model
rf_tune <- tune_grid(
  rf_model,
  preprocessor = edu_recipe,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = metric_set(accuracy)
)
# View tuning results
rf_tune %>%
  collect_metrics()
5.11.8 Step 5: Evaluate the Optimized Model
# Select the best hyperparameters
best_params <- rf_tune %>%
  select_best(metric = "accuracy")
# Finalize the workflow
rf_final <- finalize_workflow(
  workflow() %>%
    add_model(rf_model) %>%
    add_recipe(edu_recipe),
  best_params
)
# Fit the final model
rf_fit <- rf_final %>%
  fit(data = train_data)
# Evaluate on test data
rf_predictions <- predict(rf_fit, new_data = test_data) %>%
  bind_cols(test_data)
# Calculate accuracy
metrics(rf_predictions, truth = education_level, estimate = .pred_class)
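Accuracy alone does not show which classes are being confused with each other. The sketch below computes accuracy and a confusion matrix with yardstick on a tiny hand-made set of labels; the `toy_results` values are invented purely to illustrate the function calls and are not output from the model above.

```r
library(yardstick)

class_levels <- c("A", "B", "C")
toy_results <- data.frame(
  truth = factor(c("A", "B", "A", "C", "B", "C"), levels = class_levels),
  estimate = factor(c("A", "B", "C", "C", "B", "A"), levels = class_levels)
)

# Share of predictions that match the true class (4 of 6 here)
toy_accuracy <- accuracy(toy_results, truth = truth, estimate = estimate)
toy_accuracy

# Cross-tabulation of predicted versus true classes
conf_mat(toy_results, truth = truth, estimate = estimate)
```

Both columns must be factors with the same levels. Note also that the case-study outcome is simulated independently of the predictors, so test accuracy near chance level (about 0.25 for four balanced classes) would be expected there.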