```r
set.seed(123)

# Simulate a small survey dataset
survey_data <- data.frame(
  age = sample(18:70, 100, replace = TRUE),
  income = sample(20000:100000, 100, replace = TRUE),
  education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 100, replace = TRUE),
  previous_response = sample(c(0, 1), 100, replace = TRUE),
  # The outcome must be a factor for classification in tidymodels
  response = factor(sample(c(0, 1), 100, replace = TRUE))
)
head(survey_data)
```
2 Introduction to Machine Learning in Social Sciences
2.1 What is Machine Learning?
Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on developing algorithms capable of learning patterns from data (Bishop & Nasrabadi, 2006). Unlike traditional statistical methods, which often rely on predefined equations, ML models adapt to data, making them effective for analyzing complex and large datasets.
2.1.1 Key Features of Machine Learning
- Prediction: ML uses data to predict outcomes or behaviors (e.g., job satisfaction or voting preferences).
- Pattern Recognition: It identifies hidden structures or trends in data (e.g., clustering survey respondents).
- Automation: ML automates data analysis, enabling quicker and more efficient insights.
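As a quick illustration of pattern recognition, respondents can be clustered on numeric features with base R's `kmeans()`. This is a minimal sketch on a toy data frame of the same shape as the chapter's simulated survey data; the choice of three clusters is arbitrary, purely for demonstration.

```r
# Illustrative sketch: k-means groups respondents by similarity,
# without using any outcome variable (unsupervised learning).
set.seed(123)
demo <- data.frame(
  age = sample(18:70, 100, replace = TRUE),
  income = sample(20000:100000, 100, replace = TRUE)
)
clusters <- kmeans(scale(demo), centers = 3)  # 3 clusters: arbitrary choice
table(clusters$cluster)  # how many respondents fall in each cluster
```

Scaling the features first matters here, because `kmeans()` is distance-based and income would otherwise dominate age.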
2.3 Example Use Case: Predicting Job Satisfaction
Imagine using the General Social Survey (GSS) dataset to predict job satisfaction based on demographic and workplace factors.
By applying ML, researchers can uncover patterns that might not be immediately apparent with traditional methods.
2.4 Overview of Tidymodels
2.4.1 What is Tidymodels?
Tidymodels is a collection of R packages designed for machine learning workflows. It provides a unified, “tidy” interface for building and evaluating models, making it accessible and user-friendly for social scientists (Kuhn & Silge, 2022; Silge & Kuhn, 2020).
2.4.2 Core Components of Tidymodels
- `recipes`: data preprocessing (e.g., normalization, encoding).
- `parsnip`: model specification and training.
- `rsample`: data splitting (e.g., training and testing sets).
- `yardstick`: model evaluation metrics.
- `tune`: hyperparameter tuning.
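These packages are designed to compose. A minimal sketch on toy data (not the chapter's survey example) shows how they chain together; `workflow()` comes from the workflows package, which is also loaded by tidymodels:

```r
library(tidymodels)

# Toy data purely for illustration
set.seed(123)
toy <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y  = factor(sample(c("no", "yes"), 100, replace = TRUE))
)

split <- initial_split(toy, prop = 0.8)           # rsample: data splitting
rec <- recipe(y ~ ., data = training(split)) %>%
  step_normalize(all_numeric_predictors())        # recipes: preprocessing
mod <- logistic_reg() %>% set_engine("glm")       # parsnip: model spec

wf_fit <- workflow() %>%                          # workflows: ties the pieces together
  add_recipe(rec) %>%
  add_model(mod) %>%
  fit(data = training(split))

predict(wf_fit, testing(split)) %>%
  bind_cols(testing(split)) %>%
  metrics(truth = y, estimate = .pred_class)      # yardstick: evaluation
```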
2.5 Reproducibility and Ethics in Machine Learning
2.5.1 Reproducibility in Research
Reproducibility ensures that analyses can be independently verified and replicated. Tidymodels promotes reproducibility by:
- Consistently documenting data preprocessing and modeling steps.
- Using version-controlled scripts.
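A small example of why seeding matters: random operations such as sampling or data splitting give different results on every run unless the random seed is fixed first.

```r
# Fixing the seed makes "random" draws reproducible
set.seed(42)
a <- sample(1:100, 5)
set.seed(42)
b <- sample(1:100, 5)
identical(a, b)  # TRUE: same seed, same draw
```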
2.5.2 Ethical Considerations
- Bias in Models: Address potential biases in training data and models.
- Data Privacy: Handle sensitive data responsibly.
- Transparency: Communicate methods and results clearly to stakeholders.
2.6 Summary
By the end of this chapter, you will:
1. Understand the fundamentals of machine learning.
2. Recognize the relevance of ML in social science research.
3. Have an overview of the tidymodels framework.
3 Case Study: Predicting Survey Response Rates
3.1 Introduction
Survey response rates are a key challenge in social science research. Predicting which individuals are likely to complete a survey can help researchers plan better strategies for engagement and resource allocation. In this case study, we will use a simple machine learning model to predict survey responses based on demographic and historical data.
3.2 Objective
This case study demonstrates:
1. Framing a problem as a machine learning task.
2. Understanding basic machine learning workflow: data preparation, model training, and evaluation.
3. Introducing the concept of classification models.
3.3 Dataset
We’ll simulate a small dataset of survey participants, including demographic features and a target variable (`response`), which indicates whether they completed the survey (`1` for yes, `0` for no).
3.3.1 Simulated survey dataset
We use the `survey_data` data frame simulated at the start of this chapter.
3.4 Step 1: Data Preparation
Before training a machine learning model, we need to preprocess the data. This includes encoding categorical variables and splitting the data into training and testing sets.
```r
library(tidymodels)

# Split the data
set.seed(123)
survey_split <- initial_split(survey_data, prop = 0.8)
train_data <- training(survey_split)
test_data <- testing(survey_split)

# Preprocessing with a recipe
survey_recipe <- recipe(response ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Prepare the recipe and apply it to both splits
prepared_recipe <- prep(survey_recipe)
baked_train <- bake(prepared_recipe, new_data = train_data)
baked_test <- bake(prepared_recipe, new_data = test_data)
```
3.5 Step 2: Model Training
We’ll use logistic regression as our first classification model.
```r
# Define the logistic regression model
log_reg_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Train the model
log_reg_fit <- log_reg_model %>%
  fit(response ~ ., data = baked_train)

log_reg_fit
```
3.6 Step 3: Model Evaluation
Evaluate the model’s performance using accuracy and a confusion matrix.
```r
# Predict on the test data
test_predictions <- predict(log_reg_fit, new_data = baked_test, type = "class") %>%
  bind_cols(baked_test)

# Calculate accuracy and related metrics
accuracy_metric <- metrics(test_predictions, truth = response, estimate = .pred_class)

# Generate a confusion matrix
confusion_matrix <- conf_mat(test_predictions, truth = response, estimate = .pred_class)

list(accuracy = accuracy_metric, confusion_matrix = confusion_matrix)
```
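Accuracy alone can be misleading when one outcome class is much more common than the other. A possible extension, not part of the workflow above, is a class-probability metric such as ROC AUC, which requires `type = "prob"` predictions. This sketch assumes `log_reg_fit` and `baked_test` from the preceding steps, with `response` a factor whose levels are `0` and `1` (so the probability column is named `.pred_0`, and the first level is treated as the event by default):

```r
# Sketch: probability-based evaluation with ROC AUC
prob_predictions <- predict(log_reg_fit, new_data = baked_test, type = "prob") %>%
  bind_cols(baked_test)

# .pred_0 assumes factor levels "0"/"1"; adjust if your levels differ
roc_auc(prob_predictions, truth = response, .pred_0)
```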