Building Your First Machine Learning Model
Machine learning follows a structured process to ensure reliable and reproducible results. In this chapter, we’ll walk through building your first model using tidymodels and the General Social Survey (GSS) dataset.
Introduction to the Machine Learning Workflow
A typical machine learning workflow consists of the following steps (Geron, 2019):
1. Data Splitting: Divide the data into training and testing sets.
2. Preprocessing: Prepare the data for modeling (e.g., scaling, encoding).
3. Model Specification and Training: Define the model and train it on the training data.
4. Model Evaluation: Assess the model’s performance on the testing data.
We’ll apply this process step by step using tidymodels.
Step 1: Data Splitting
Before building a model, it’s important to split the data into training and testing sets. This ensures that the model is evaluated on unseen data.
# Load tidymodels; the gss_cat dataset comes from the forcats package
library(tidymodels)
library(forcats)
# Reserve 80% of the data for training and 20% for testing
set.seed(123)
split <- initial_split(gss_cat, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)
Step 2: Preprocessing Data with recipes
Data preprocessing ensures that your data is clean and suitable for modeling. We’ll use a recipe to specify transformations for the data.
gss_recipe <- recipe(age ~ race + marital, data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
prepared_recipe <- prep(gss_recipe)
Step 3: Model Specification and Training
Use parsnip to specify a linear regression model and train it on the training data.
lin_reg <- linear_reg() %>%
  set_engine("lm")
# Fit on the preprocessed training data; after step_dummy() the original
# race and marital columns are replaced by indicator columns, so use age ~ .
lin_reg_model <- lin_reg %>%
  fit(age ~ ., data = juice(prepared_recipe))
lin_reg_model
Step 4: Model Evaluation
Evaluate the model’s performance using yardstick. Let’s calculate the Root Mean Squared Error (RMSE) on the testing dataset.
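For reference, RMSE measures the typical size of a prediction error, in the same units as the outcome (here, years of age):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

where \(y_i\) is the observed age, \(\hat{y}_i\) the predicted age, and \(n\) the number of test observations. Lower values indicate better fit.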
test_predictions <- lin_reg_model %>%
  predict(new_data = bake(prepared_recipe, new_data = test_data))
results <- test_data %>%
  bind_cols(test_predictions) %>%
  rename(predicted_age = .pred)
rmse(results, truth = age, estimate = predicted_age)
Summary
In this chapter, we:
1. Split the dataset into training and testing sets.
2. Preprocessed the data using recipes.
3. Specified and trained a linear regression model with parsnip.
4. Evaluated the model’s performance using yardstick.
Case Study: Predicting Voting Behavior
Introduction
Voting behavior is a critical area of study in social science research. Predicting whether individuals will vote based on demographic factors helps policymakers design better outreach programs. In this case study, we will build a logistic regression model to predict voting behavior using the tidymodels framework.
Objective
This case study demonstrates:
1. Splitting data into training and testing sets.
2. Preprocessing data for modeling.
3. Building and evaluating a logistic regression model.
Dataset
We’ll simulate a small dataset of individuals, including demographic features and their voting behavior.
# Simulated voting dataset
set.seed(123)
voting_data <- data.frame(
  age = sample(18:80, 200, replace = TRUE),
  income = sample(20000:120000, 200, replace = TRUE),
  education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 200, replace = TRUE),
  # The outcome must be a factor for classification models in parsnip
  voted = factor(sample(c(0, 1), 200, replace = TRUE))
)
# View the dataset
head(voting_data)
Step 1: Data Splitting
library(tidymodels)
# Split the data
set.seed(123)
voting_split <- initial_split(voting_data, prop = 0.8)
train_data <- training(voting_split)
test_data <- testing(voting_split)
Step 2: Data Preprocessing
Preprocess the data using recipes to handle categorical variables and normalize numeric features.
# Define a recipe
voting_recipe <- recipe(voted ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
# Prepare the recipe and apply it to both sets
prepared_recipe <- prep(voting_recipe)
processed_train <- bake(prepared_recipe, new_data = NULL)
processed_test <- bake(prepared_recipe, new_data = test_data)
Step 3: Model Training
Train a logistic regression model using the processed training data.
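As a reminder, logistic regression models the probability of the positive class through the logistic (sigmoid) function applied to a linear combination of the predictors:

```latex
\Pr(\text{voted} = 1 \mid x) = \frac{1}{1 + e^{-\left(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\right)}}
```

The glm engine estimates the coefficients \(\beta_0, \ldots, \beta_p\) by maximum likelihood.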
# Define the logistic regression model
log_reg_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
# Train the model
log_reg_fit <- log_reg_model %>%
  fit(voted ~ ., data = processed_train)
log_reg_fit
Step 4: Model Evaluation
# Predict on the test data
test_predictions <- predict(log_reg_fit, new_data = processed_test, type = "class") %>%
  bind_cols(processed_test)
# Calculate accuracy
accuracy_metric <- metrics(test_predictions, truth = voted, estimate = .pred_class)
# Generate a confusion matrix
confusion_matrix <- conf_mat(test_predictions, truth = voted, estimate = .pred_class)
list(accuracy = accuracy_metric, confusion_matrix = confusion_matrix)
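The accuracy reported by metrics() can be read directly off the confusion matrix. Writing TP, TN, FP, and FN for true positives, true negatives, false positives, and false negatives:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```

Because the simulated voted variable is random noise, accuracy here should hover around 0.5; with real survey data you would expect the demographic predictors to lift it above that baseline.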