# Simulated survey dataset
set.seed(123)
survey_data <- data.frame(
  age = sample(18:70, 100, replace = TRUE),
  income = sample(20000:100000, 100, replace = TRUE),
  education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 100, replace = TRUE),
  marital_status = sample(c("Single", "Married", "Divorced", "Widowed"), 100, replace = TRUE),
  response = sample(c(0, 1), 100, replace = TRUE)
)
# Introduce missing values
survey_data$income[sample(1:100, 10)] <- NA
survey_data$age[sample(1:100, 5)] <- NA
head(survey_data)
3 Getting Started with R and Tidymodels
3.1 Setting Up Your Environment
To begin, make sure you have R and RStudio installed. RStudio is an integrated development environment (IDE) that makes working with R more efficient.
3.1.1 Installing R and RStudio
- Download R: Visit the R Project website and download the latest version of R for your operating system.
- Install RStudio: Visit the RStudio website and download the free desktop version.
3.1.2 Installing Tidymodels
Tidymodels is a collection of R packages for machine learning workflows. To install Tidymodels, run the following commands in your R console:
install.packages("tidymodels")
library(tidymodels)
Verify that tidymodels is loaded by running:
tidymodels_conflicts()
This command lists the functions whose names conflict between tidymodels packages and other loaded packages. If it runs without error, tidymodels is attached and ready to use.
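As an additional check, the installed version and the member packages can be queried directly. This is a minimal sketch that assumes the installation step above completed without errors.

```r
# Report the installed tidymodels version (errors if the package is missing)
packageVersion("tidymodels")

# List the packages that belong to the tidymodels collection
member_packages <- tidymodels::tidymodels_packages()
member_packages
```

If recipes and parsnip appear in the list, the core modeling packages used later in this chapter are available.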
3.3 Data Preparation with dplyr
3.3.1 Why Data Preparation is Important
Data preparation is a critical step in machine learning. It involves cleaning and transforming data to ensure it is ready for analysis.
3.3.2 Common Tasks with dplyr
3.3.2.1 Filtering Rows
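To keep respondents aged 25 to 50 from gss_cat, the General Social Survey sample that ships with the forcats package:
library(forcats)  # provides the gss_cat example dataset
filtered_data <- gss_cat %>%
  filter(age >= 25, age <= 50)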
3.3.2.2 Selecting Variables
To select specific columns, such as age, race, and marital:
selected_data <- filtered_data %>%
  select(age, race, marital)
3.3.2.3 Mutating Columns
Add a new column that categorizes respondents as "Young" or "Middle-aged":
mutated_data <- filtered_data %>%
  mutate(age_group = if_else(age < 35, "Young", "Middle-aged"))
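When more than two categories are needed, or when missing ages have not been filtered out, dplyr's case_when() can take the place of if_else(). The sketch below uses a small hypothetical data frame (toy) and an illustrative third label, "Older", neither of which comes from the original example.

```r
library(dplyr)

# Hypothetical toy data standing in for survey respondents
toy <- data.frame(age = c(22, 30, 41, 58, NA))

toy_grouped <- toy %>%
  mutate(
    age_group = case_when(
      is.na(age) ~ NA_character_,  # keep missing ages missing
      age < 35   ~ "Young",
      age <= 50  ~ "Middle-aged",
      TRUE       ~ "Older"         # illustrative extra category
    )
  )

toy_grouped
```

Conditions are evaluated in order, so each row receives the label of the first condition it satisfies.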
3.3.3 Combining dplyr Functions
You can chain these operations together using the pipe (%>%) operator. For example:
prepared_data <- gss_cat %>%
  filter(age >= 25, age <= 50) %>%
  select(age, race, marital) %>%
  mutate(age_group = if_else(age < 35, "Young", "Middle-aged"))
This creates a cleaned and prepared dataset that is ready for analysis.
3.3.4 Previewing the Prepared Data
To inspect the resulting dataset, use:
head(prepared_data)
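Beyond head(), a count per group and a range check help confirm that the filter and the new column behave as intended. This is a minimal sketch that assumes the forcats package (which provides gss_cat) is installed; it rebuilds prepared_data so that the block runs on its own.

```r
library(dplyr)

prepared_data <- forcats::gss_cat %>%
  filter(age >= 25, age <= 50) %>%
  select(age, race, marital) %>%
  mutate(age_group = if_else(age < 35, "Young", "Middle-aged"))

# Number of respondents in each age group
prepared_data %>% count(age_group)

# The observed ages should fall inside the filter bounds
range(prepared_data$age)
```

Rows with a missing age are dropped by filter(), because a comparison against NA does not evaluate to TRUE.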
3.3.5 Summary of Key dplyr Functions
Here is a quick reference to the main functions used:
- filter(): Select rows based on conditions.
- select(): Choose specific columns.
- mutate(): Add or modify columns.
- %>%: Pipe operator to chain operations.
By mastering these functions, you can efficiently clean and prepare data for machine learning workflows.
3.4 Case Study: Cleaning and Preparing Survey Data
3.4.1 Introduction
In social science research, survey data often require preprocessing before analysis. This includes handling missing values, encoding categorical variables, and scaling numeric features. In this case study, we demonstrate how to use the tidyverse and tidymodels to prepare survey data for machine learning.
3.4.2 Objective
This case study demonstrates:
1. Cleaning and transforming data using tidyverse.
2. Preprocessing data with recipes from tidymodels.
3. Preparing a dataset for machine learning tasks.
3.4.3 Dataset
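The case study uses the simulated survey dataset survey_data, created by the code block at the top of this document. It holds 100 respondents with age, income, education, marital_status, and a binary response, and has 10 income values and 5 age values set to NA.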
3.4.4 Step 1: Handle Missing Values
We'll use the tidyverse to count the missing values in each column and then impute them, using the mean for age and the median for income.
library(tidyverse)
# Check for missing values
survey_data %>%
  summarise(across(everything(), ~sum(is.na(.))))
# Impute missing values
survey_data <- survey_data %>%
  mutate(
    age = ifelse(is.na(age), mean(age, na.rm = TRUE), age),
    income = ifelse(is.na(income), median(income, na.rm = TRUE), income)
  )
# Verify missing values are handled
survey_data %>%
  summarise(across(everything(), ~sum(is.na(.))))
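The same imputation can also be expressed as recipe steps, which keeps it alongside the rest of the preprocessing. The sketch below is an alternative to the mutate() approach above, not the method used in this case study; it relies on step_impute_mean() and step_impute_median() from the recipes package and on a small hypothetical data frame (toy).

```r
library(tidymodels)

# Hypothetical toy data with one missing age and one missing income
toy <- data.frame(
  age = c(25, NA, 40, 31, 58),
  income = c(30000, 52000, NA, 41000, 75000),
  response = c(0, 1, 0, 1, 1)
)

impute_recipe <- recipe(response ~ ., data = toy) %>%
  step_impute_mean(age) %>%       # fill missing age with the column mean
  step_impute_median(income)      # fill missing income with the column median

imputed <- bake(prep(impute_recipe), new_data = NULL)

# Count remaining missing values per column
colSums(is.na(imputed))
```

Because the mean and median are estimated when the recipe is prepped, the same values are reused whenever the recipe is later applied to other data.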
3.4.5 Step 2: Encode Categorical Variables
Convert categorical variables into dummy variables using the recipes package.
library(tidymodels)
# Define a recipe for data preprocessing
survey_recipe <- recipe(response ~ ., data = survey_data) %>%
  step_dummy(all_nominal_predictors())
# Prepare the recipe
prepared_recipe <- prep(survey_recipe)
# Apply the recipe
processed_data <- bake(prepared_recipe, new_data = NULL)
# View processed data
head(processed_data)
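To see what step_dummy() produces, it can help to run it on a very small example. The sketch below uses a hypothetical three-level factor (toy is not part of the case study data) and inspects the resulting columns.

```r
library(tidymodels)

# Hypothetical toy data with a single three-level categorical predictor
toy <- data.frame(
  education = factor(c("High School", "PhD", "Master's", "PhD")),
  response = c(0, 1, 1, 0)
)

dummies <- recipe(response ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

# Three levels yield two indicator columns; the first level is the reference
names(dummies)
ncol(dummies)
```

By default one level is dropped as the reference category. Passing one_hot = TRUE to step_dummy() keeps an indicator column for every level instead.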
3.4.6 Step 3: Normalize Numeric Features
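Normalize numeric features so they share a common scale: step_normalize() centers each numeric predictor to mean 0 and scales it to standard deviation 1. Because step_dummy() was added to the recipe first, the dummy columns count as numeric predictors and are normalized as well.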
# Add normalization step to the recipe
survey_recipe <- survey_recipe %>%
  step_normalize(all_numeric_predictors())
# Prepare and apply the updated recipe
prepared_recipe <- prep(survey_recipe)
processed_data <- bake(prepared_recipe, new_data = NULL)
# View processed data
head(processed_data)
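A prepped recipe stores the statistics it estimated, so it can be applied to data it has not seen. The sketch below illustrates this with two hypothetical data frames (train and new_respondents), which are assumptions for illustration and not part of the case study.

```r
library(tidymodels)

# Hypothetical training data and new observations
train <- data.frame(
  age = c(20, 30, 40, 50),
  response = c(0, 1, 0, 1)
)
new_respondents <- data.frame(
  age = c(35, 60),
  response = c(1, 0)
)

rec <- recipe(response ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# Scaling uses the training mean and standard deviation, not those of the new data
new_processed <- bake(rec, new_data = new_respondents)
new_processed
```

Passing new_data = NULL returns the processed training data, while passing a data frame applies the stored preprocessing to that data.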