Building Your First Machine Learning Model

Machine learning follows a structured process to ensure reliable and reproducible results. In this chapter, we’ll walk through building your first model using tidymodels and the General Social Survey (GSS) dataset.

Introduction to the Machine Learning Workflow

A typical machine learning workflow consists of the following steps (Geron, 2019):
1. Data Splitting: Divide the data into training and testing sets.
2. Preprocessing: Prepare the data for modeling (e.g., scaling, encoding).
3. Model Specification and Training: Define the model and train it on the training data.
4. Model Evaluation: Assess the model’s performance on the testing data.

We’ll apply this process step by step using tidymodels.

Step 1: Data Splitting

Before building a model, it’s important to split the data into training and testing sets. This ensures the model is evaluated on data it never saw during training, giving an honest estimate of how it will perform in practice. We’ll use the gss_cat dataset, which ships with the forcats package.

library(tidymodels)
library(forcats)  # provides the gss_cat dataset

set.seed(123)  # for reproducibility
split <- initial_split(gss_cat, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)

Step 2: Preprocessing Data with recipes

Data preprocessing ensures that your data is clean and suitable for modeling. We’ll use a recipe to specify transformations for the data.

gss_recipe <- recipe(age ~ race + marital, data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%  # a no-op here: both predictors are nominal
  step_dummy(all_nominal_predictors())          # one-hot encode race and marital

prepared_recipe <- prep(gss_recipe)

Step 3: Model Specification and Training

Use parsnip to specify a linear regression model and train it on the training data.

lin_reg <- linear_reg() %>%
  set_engine("lm")

# After step_dummy(), race and marital are replaced by indicator columns,
# so fit on all remaining predictors with `age ~ .`. Note that
# bake(prepared_recipe, new_data = NULL) is the current replacement
# for the superseded juice().
lin_reg_model <- lin_reg %>%
  fit(age ~ ., data = bake(prepared_recipe, new_data = NULL))

lin_reg_model

Step 4: Model Evaluation

Evaluate the model’s performance using yardstick. Let’s calculate the Root Mean Squared Error (RMSE) on the testing dataset.

test_predictions <- lin_reg_model %>%
  predict(new_data = bake(prepared_recipe, new_data = test_data))

results <- test_data %>%
  bind_cols(test_predictions) %>%
  rename(predicted_age = .pred)

rmse(results, truth = age, estimate = predicted_age)
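It can help to see that RMSE is nothing mysterious: it is the square root of the mean squared difference between the true and predicted values. The yardstick result can be reproduced by hand — a minimal sketch with made-up vectors (not the GSS data):

```r
# RMSE by hand: sqrt(mean((truth - estimate)^2))
truth    <- c(30, 45, 60, 25)
estimate <- c(32, 40, 58, 30)

rmse_manual <- sqrt(mean((truth - estimate)^2))
rmse_manual  # ≈ 3.81

# yardstick's vector interface gives the same number
library(yardstick)
rmse_vec(truth, estimate)
```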

Summary

In this chapter, we:

1. Split the dataset into training and testing sets.
2. Preprocessed the data using recipes.
3. Specified and trained a linear regression model with parsnip.
4. Evaluated the model’s performance using yardstick.
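The four steps above can also be bundled into a single tidymodels workflow() object, which calls prep() and bake() for you and applies the same preprocessing to new data at prediction time. A minimal sketch, restating the chapter’s recipe and model:

```r
library(tidymodels)
library(forcats)  # provides the gss_cat dataset

set.seed(123)
split <- initial_split(gss_cat, prop = 0.8)
train_data <- training(split)
test_data  <- testing(split)

# Bundle the recipe and model specification together
gss_workflow <- workflow() %>%
  add_recipe(
    recipe(age ~ race + marital, data = train_data) %>%
      step_dummy(all_nominal_predictors())
  ) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# fit() preps the recipe and trains the model in one call
wf_fit <- fit(gss_workflow, data = train_data)

# predict() applies the same preprocessing to the test set automatically
head(predict(wf_fit, new_data = test_data))
```

This is the idiomatic way to keep preprocessing and modeling in sync, since the fitted workflow remembers exactly how the training data were transformed.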

Case Study: Predicting Voting Behavior

Introduction

Voting behavior is a critical area of study in social science research. Predicting whether individuals will vote based on demographic factors helps policymakers design better outreach programs. In this case study, we will build a logistic regression model to predict voting behavior using the tidymodels framework.


Objective

This case study demonstrates:
1. Splitting data into training and testing sets.
2. Preprocessing data for modeling.
3. Building and evaluating a logistic regression model.


Dataset

We’ll simulate a small dataset of individuals, including demographic features and their voting behavior.

# Simulated voting dataset
set.seed(123)
voting_data <- data.frame(
  age = sample(18:80, 200, replace = TRUE),
  income = sample(20000:120000, 200, replace = TRUE),
  education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 200, replace = TRUE),
  voted = factor(sample(c(0, 1), 200, replace = TRUE))  # classification outcomes must be factors
)

# View the dataset
head(voting_data)

Step 1: Data Splitting

library(tidymodels)

# Split the data
set.seed(123)
voting_split <- initial_split(voting_data, prop = 0.8)
train_data <- training(voting_split)
test_data <- testing(voting_split)

Step 2: Data Preprocessing

Preprocess the data using recipes to handle categorical variables and normalize numeric features.

# Define a recipe
voting_recipe <- recipe(voted ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Prepare the recipe
prepared_recipe <- prep(voting_recipe)
processed_train <- bake(prepared_recipe, new_data = NULL)
processed_test <- bake(prepared_recipe, new_data = test_data)

Step 3: Model Training

Train a logistic regression model using the processed training data.

# Define the logistic regression model
log_reg_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Train the model
log_reg_fit <- log_reg_model %>%
  fit(voted ~ ., data = processed_train)

log_reg_fit

Step 4: Model Evaluation

# Predict hard class labels on the test data
test_predictions <- predict(log_reg_fit, new_data = processed_test, type = "class") %>%
  bind_cols(processed_test)

# Calculate accuracy (metrics() also reports Cohen's kappa for classification)
accuracy_metric <- metrics(test_predictions, truth = voted, estimate = .pred_class)

# Generate a confusion matrix of predicted vs. actual votes
confusion_matrix <- conf_mat(test_predictions, truth = voted, estimate = .pred_class)

list(accuracy = accuracy_metric, confusion_matrix = confusion_matrix)
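Accuracy alone can be misleading for binary outcomes, so it is common to also score the model’s predicted probabilities with the area under the ROC curve. A self-contained sketch of the same pipeline, using labeled outcome levels ("No"/"Yes") for readability — since the data are simulated at random, expect an AUC near 0.5:

```r
library(tidymodels)

set.seed(123)
voting_data <- data.frame(
  age    = sample(18:80, 200, replace = TRUE),
  income = sample(20000:120000, 200, replace = TRUE),
  voted  = factor(sample(c("No", "Yes"), 200, replace = TRUE))
)

split      <- initial_split(voting_data, prop = 0.8)
train_data <- training(split)
test_data  <- testing(split)

fit_obj <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(voted ~ ., data = train_data)

# Probability predictions instead of hard class labels
prob_preds <- predict(fit_obj, new_data = test_data, type = "prob") %>%
  bind_cols(test_data)

# Area under the ROC curve; .pred_No is the probability of the
# first factor level, which yardstick treats as the "event" by default
roc_auc(prob_preds, truth = voted, .pred_No)
```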