2  Introduction to Machine Learning in Social Sciences

2.1 What is Machine Learning?

Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on developing algorithms capable of learning patterns from data (Bishop, 2006). Unlike traditional statistical methods, which typically start from a prespecified equation relating the variables, ML models learn the form of that relationship from the data itself, making them effective for analyzing complex and large datasets.

2.1.1 Key Features of Machine Learning

  • Prediction: ML uses data to predict outcomes or behaviors (e.g., job satisfaction or voting preferences).
  • Pattern Recognition: It identifies hidden structures or trends in data (e.g., clustering survey respondents).
  • Automation: ML automates data analysis, enabling quicker and more efficient insights.

2.2 Why is Machine Learning Relevant to Social Sciences?

Social sciences investigate human behavior, societal trends, and interactions. With the growing availability of large datasets (e.g., surveys, social media, census data), ML offers tools to analyze such data at scale and to answer research questions that are difficult to capture in a single prespecified model (Smith et al., 2012).

2.2.1 Applications of ML in Social Sciences

  1. Survey Analysis: Analyze large-scale survey data (e.g., GSS) to uncover relationships and predict outcomes.
  2. Sentiment Analysis: Use Natural Language Processing (NLP) to analyze textual responses or social media discussions.
  3. Behavioral Prediction: Predict outcomes such as voter turnout, educational achievements, or consumer preferences.
  4. Clustering and Segmentation: Group respondents based on shared characteristics or attitudes (see the sketch below).
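
As a quick illustration of the last application, the sketch below clusters simulated attitude scores with base R’s kmeans(). The variables and scales here are invented for the example, not drawn from a real survey.

# Simulate attitude scores on two 1-10 scales
set.seed(42)
attitudes <- data.frame(
  trust_gov   = runif(200, 1, 10),
  trust_media = runif(200, 1, 10)
)

# k-means with 3 clusters; scale() first so both items contribute equally
clusters <- kmeans(scale(attitudes), centers = 3)
table(clusters$cluster)  # cluster sizes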

2.3 Example Use Case: Predicting Job Satisfaction

Imagine using the General Social Survey (GSS) dataset to predict job satisfaction based on demographic and workplace factors.
By applying ML, researchers can uncover patterns that might not be immediately apparent with traditional methods.
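
As a hedged sketch of what that might look like with tidymodels: the data frame gss_df and its variable names below are hypothetical stand-ins for GSS data, invented purely for illustration.

library(tidymodels)

# Hypothetical stand-in for GSS data (invented variables)
set.seed(1)
gss_df <- data.frame(
  job_satisfaction = factor(sample(c("low", "high"), 300, replace = TRUE)),
  age = sample(18:80, 300, replace = TRUE),
  income = sample(20000:120000, 300, replace = TRUE),
  weekly_hours = sample(10:60, 300, replace = TRUE)
)

# Specify and fit a logistic regression classifier
sat_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

sat_fit <- fit(sat_spec, job_satisfaction ~ age + income + weekly_hours, data = gss_df)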


2.4 Overview of Tidymodels

2.4.1 What is Tidymodels?

Tidymodels is a collection of R packages designed for machine learning workflows. It provides a unified, “tidy” interface for building and evaluating models, making it accessible and user-friendly for social scientists (Kuhn & Silge, 2022; Silge & Kuhn, 2020).

2.4.2 Core Components of Tidymodels

  • recipes: Data preprocessing (e.g., normalization, encoding).
  • parsnip: Model specification and training.
  • rsample: Data splitting (e.g., training and testing sets).
  • yardstick: Model evaluation metrics.
  • tune: Hyperparameter tuning.
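
A minimal sketch of how these packages fit together; the toy data frame and column names (y, x1, x2) are generic placeholders, and the full pipeline is worked through in the case study in the next chapter.

library(tidymodels)

# Toy data: y is a binary factor outcome, x1/x2 are generic predictors
set.seed(1)
df <- data.frame(
  y  = factor(sample(c("no", "yes"), 100, replace = TRUE)),
  x1 = rnorm(100),
  x2 = sample(c("a", "b"), 100, replace = TRUE)
)

# rsample: hold out a test set
df_split <- initial_split(df, prop = 0.8)

# recipes: declare the preprocessing plan
df_recipe <- recipe(y ~ ., data = training(df_split)) %>%
  step_dummy(all_nominal_predictors())

# parsnip: specify the model family, engine, and mode
df_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# yardstick (evaluation metrics) and tune (hyperparameter search)
# come into play after fitting; see the case study below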

2.4.3 Why Tidymodels for Social Scientists?

  • Simplicity: Easy-to-learn syntax that integrates with the tidyverse.
  • Reproducibility: Supports transparent and reproducible workflows.
  • Versatility: Covers the entire ML pipeline, from data preparation to model tuning.

2.5 Reproducibility and Ethics in Machine Learning

2.5.1 Reproducibility in Research

Reproducibility ensures that analyses can be independently verified and replicated. Tidymodels promotes reproducibility by:
  • Consistently documenting data preprocessing and modeling steps.
  • Using version-controlled scripts (see the snippet below).
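
For example, two small habits go a long way; a minimal sketch:

# Fix the random number generator so data splits and resamples are repeatable
set.seed(123)

# Record the R and package versions behind a given analysis
sessionInfo()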

2.5.2 Ethical Considerations

  1. Bias in Models: Address potential biases in training data and models.
  2. Data Privacy: Handle sensitive data responsibly.
  3. Transparency: Communicate methods and results clearly to stakeholders.

2.6 Summary

In this chapter, you have:
1. Learned the fundamentals of machine learning.
2. Recognized the relevance of ML in social science research.
3. Gained an overview of the tidymodels framework.


3 Case Study: Predicting Survey Response Rates

3.1 Introduction

Survey response rates are a key challenge in social science research. Predicting which individuals are likely to complete a survey can help researchers target engagement efforts and allocate resources more effectively. In this case study, we will use a simple machine learning model to predict survey responses from demographic and historical data.


3.2 Objective

This case study demonstrates:
1. Framing a problem as a machine learning task.
2. Working through the basic machine learning workflow: data preparation, model training, and evaluation.
3. Applying a first classification model.


3.3 Dataset

We’ll simulate a small dataset of survey participants, including demographic features and a target variable (response), which indicates whether they completed the survey (1 for yes, 0 for no). Because tidymodels expects a factor outcome for classification, response is stored as a factor.

3.3.1 Simulated survey dataset

set.seed(123)
survey_data <- data.frame(
  age = sample(18:70, 100, replace = TRUE),                  # age in years
  income = sample(20000:100000, 100, replace = TRUE),        # annual income
  education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 100, replace = TRUE),
  previous_response = sample(c(0, 1), 100, replace = TRUE),  # responded to a past survey?
  response = factor(sample(c(0, 1), 100, replace = TRUE))    # target: completed this survey?
)

head(survey_data)

3.4 Step 1: Data Preparation

Before training a machine learning model, we need to preprocess the data. This includes encoding categorical variables and splitting the data into training and testing sets.

library(tidymodels)

# Split the data; strata keeps the share of responders similar in both sets
set.seed(123)
survey_split <- initial_split(survey_data, prop = 0.8, strata = response)
train_data <- training(survey_split)
test_data <- testing(survey_split)

# Preprocessing with a recipe
survey_recipe <- recipe(response ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%    # dummy-code education
  step_normalize(all_numeric_predictors())    # center and scale numeric predictors

# Estimate the preprocessing parameters on the training data, then apply them
prepared_recipe <- prep(survey_recipe)
baked_train <- bake(prepared_recipe, new_data = train_data)
baked_test <- bake(prepared_recipe, new_data = test_data)

3.5 Step 2: Model Training

We’ll use logistic regression as our first classification model.

# Define the logistic regression model (the "glm" engine is base R's glm())
log_reg_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Train the model on the preprocessed training data
log_reg_fit <- log_reg_model %>%
  fit(response ~ ., data = baked_train)

log_reg_fit
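
As an aside, an alternative to calling prep() and bake() by hand is to bundle the recipe and model into a workflow(), which applies the preprocessing automatically at fit and predict time. A sketch using the objects defined above:

# Bundle preprocessing and model into one object
survey_workflow <- workflow() %>%
  add_recipe(survey_recipe) %>%
  add_model(log_reg_model)

# fit() preps the recipe on train_data and trains the model in one step
survey_wf_fit <- fit(survey_workflow, data = train_data)

# predict() applies the same preprocessing to the raw test data
predict(survey_wf_fit, new_data = test_data)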

3.6 Step 3: Model Evaluation

Evaluate the model’s performance on the held-out test set using accuracy and a confusion matrix.

# Predict classes for the test data and attach the true labels
test_predictions <- predict(log_reg_fit, new_data = baked_test, type = "class") %>%
  bind_cols(baked_test)

# Accuracy (for class predictions, metrics() also reports Cohen's kappa)
accuracy_metric <- metrics(test_predictions, truth = response, estimate = .pred_class)

# Generate a confusion matrix
confusion_matrix <- conf_mat(test_predictions, truth = response, estimate = .pred_class)

list(accuracy = accuracy_metric, confusion_matrix = confusion_matrix)
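
Accuracy can be misleading when the classes are imbalanced. One common complement is ROC AUC, computed from predicted class probabilities rather than hard class labels. The sketch below assumes response was stored as a factor with levels "0" and "1", as above.

# Predicted probabilities for each class
prob_predictions <- predict(log_reg_fit, new_data = baked_test, type = "prob") %>%
  bind_cols(baked_test)

# ROC AUC; event_level = "second" treats the level "1" as the event of interest
roc_auc(prob_predictions, truth = response, .pred_1, event_level = "second")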