3  Getting Started with R and Tidymodels

3.1 Setting Up Your Environment

To begin, make sure you have R and RStudio installed. RStudio is an integrated development environment (IDE) that makes working with R more efficient.

3.1.1 Installing R and RStudio

  1. Download R: Visit the R Project website and download the latest version of R for your operating system.
  2. Install RStudio: Visit the RStudio website and download the free desktop version.

3.1.2 Installing Tidymodels

Tidymodels is a collection of R packages for modeling and machine learning workflows. To install and load it, run the following commands in your R console:

install.packages("tidymodels")
library(tidymodels)

Verify that tidymodels is loaded by running:

tidymodels_conflicts()

This command lists functions from the tidymodels packages that share a name with functions in other loaded packages. If it runs without error, tidymodels is installed and attached.
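Before moving on, it can help to confirm which packages are actually available on your machine. The following is a minimal sketch using only base R; the `needed` vector is an assumption listing the packages this chapter loads, not an official requirement list.

```r
# Packages used in this chapter (list assumed for illustration)
needed <- c("tidymodels", "tidyverse")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Names of any packages that still need installing
missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message(
    "Install with: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
}
```

The check only reports what is missing; it does not install anything, so it is safe to rerun.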

3.2 The General Social Survey (GSS) Dataset

The General Social Survey (GSS) is a rich dataset widely used in social science research. It contains responses on various demographic, behavioral, and attitudinal topics (Smith et al., 2012).

3.2.1 Loading the GSS Dataset

We’ll use the gss_cat dataset, a sample of GSS responses that ships with the forcats package (part of the tidyverse). It is not included in base R, so install and load the package first:

install.packages("forcats")
library(forcats)
data(gss_cat)

Preview the first few rows of the dataset:

head(gss_cat)

Key variables in the dataset include:

age: Age of the respondent.
race: Race of the respondent.
marital: Marital status of the respondent.
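To see how these variables are stored, a short inspection along the following lines may be useful. This is a sketch that assumes gss_cat is available from the forcats package and that dplyr is installed; `n_missing_age` is a name introduced here for illustration.

```r
library(forcats)  # provides gss_cat
library(dplyr)

# Column types for the three variables listed above
gss_cat %>%
  select(age, race, marital) %>%
  glimpse()

# Categories recorded for marital status
levels(gss_cat$marital)

# Number of respondents with no recorded age
n_missing_age <- sum(is.na(gss_cat$age))
n_missing_age
```

Knowing that age contains missing values matters later, because comparisons such as `age >= 25` return NA for those rows.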

3.3 Data Preparation with dplyr

3.3.1 Why Data Preparation is Important

Data preparation is a critical step in machine learning. It involves cleaning and transforming data to ensure it is ready for analysis.

3.3.2 Common Tasks with dplyr

3.3.2.1 Filtering Rows

To keep only respondents aged 25 to 50 (inclusive):

filtered_data <- gss_cat %>%
    filter(age >= 25, age <= 50)
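One detail worth knowing is that filter() drops rows where the condition evaluates to NA, so respondents with a missing age are removed as well. The sketch below illustrates this on a small made-up tibble (`toy` and its values are hypothetical) and uses dplyr's between() as an equivalent way to write the inclusive range.

```r
library(dplyr)

# Hypothetical data: one respondent has no recorded age
toy <- tibble(
  id  = 1:5,
  age = c(22, 25, 50, 61, NA)
)

# between(age, 25, 50) is shorthand for age >= 25 & age <= 50
kept <- toy %>%
  filter(between(age, 25, 50))

kept$id
```

Only the rows with ages 25 and 50 remain; the row with the missing age is dropped along with those outside the range.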

3.3.2.2 Selecting Variables

To select specific columns, such as age, race, and marital:

selected_data <- filtered_data %>%
    select(age, race, marital)
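select() also accepts negative selections and helper functions, which can be convenient when a dataset has many columns. The following sketch uses a small made-up tibble (`toy` is hypothetical) to show three ways of choosing columns.

```r
library(dplyr)

# Hypothetical data with four columns
toy <- tibble(
  age     = c(30, 45),
  race    = c("A", "B"),
  marital = c("Married", "Divorced"),
  tvhours = c(2, 4)
)

# Keep columns by name
by_name <- toy %>% select(age, race, marital)

# Drop a column instead of listing the ones to keep
by_drop <- toy %>% select(-tvhours)

# Keep columns by type
by_type <- toy %>% select(where(is.numeric))

names(by_type)
```

The first two calls return the same columns; which form to use is mostly a matter of which list is shorter.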

3.3.2.3 Mutating Columns

Add a new column that categorizes respondents as “Young” or “Middle-aged”:

mutated_data <- filtered_data %>%
    mutate(age_group = if_else(age < 35, "Young", "Middle-aged"))
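Two follow-up points may be helpful here: if_else() returns NA when the condition itself is NA, and modeling functions often expect categories as factors rather than character strings. The sketch below shows both on a small made-up tibble; `toy` and the choice of level order are assumptions for illustration.

```r
library(dplyr)

# Hypothetical ages, including one missing value
toy <- tibble(age = c(25, 34, 35, 50, NA))

toy <- toy %>%
  mutate(
    # NA ages stay NA rather than being assigned to a group
    age_group = if_else(age < 35, "Young", "Middle-aged"),
    # Store the result as a factor with an explicit level order
    age_group = factor(age_group, levels = c("Young", "Middle-aged"))
  )

toy
```

Setting the levels explicitly keeps "Young" as the first category instead of relying on alphabetical order.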

3.3.3 Combining dplyr Functions

You can chain these operations together using the pipe (%>%) operator. For example:

prepared_data <- gss_cat %>%
    filter(age >= 25, age <= 50) %>%
    select(age, race, marital) %>%
    mutate(age_group = if_else(age < 35, "Young", "Middle-aged"))

This creates a cleaned and prepared dataset that is ready for analysis.

3.3.4 Previewing the Prepared Data

To inspect the resulting dataset, use:

head(prepared_data)
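Beyond looking at the first rows, a quick sanity check on the whole result can catch mistakes in the pipeline. The sketch below rebuilds the prepared dataset so it runs on its own (assuming gss_cat from forcats), then counts rows per age group and confirms the age range; `group_counts` and `age_range` are names introduced here.

```r
library(dplyr)
library(forcats)  # provides gss_cat

prepared_data <- gss_cat %>%
  filter(age >= 25, age <= 50) %>%
  select(age, race, marital) %>%
  mutate(age_group = if_else(age < 35, "Young", "Middle-aged"))

# Number of respondents in each age group
group_counts <- prepared_data %>%
  count(age_group)
group_counts

# Smallest and largest age that survived the filter
age_range <- range(prepared_data$age)
age_range
```

If the filter worked as intended, the reported range should fall within 25 to 50 and the group counts should add up to the number of rows.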

3.3.5 Summary of Key dplyr Functions

Here is a quick reference to the main functions used:

  • filter(): Select rows based on conditions.
  • select(): Choose specific columns.
  • mutate(): Add or modify columns.
  • %>%: Pipe operator to chain operations.

By mastering these functions, you can efficiently clean and prepare data for machine learning workflows.


3.4 Case Study: Cleaning and Preparing Survey Data

3.4.1 Introduction

In social science research, survey data often require preprocessing before analysis. This includes handling missing values, encoding categorical variables, and scaling numeric features. In this case study, we demonstrate how to use the tidyverse and tidymodels to prepare survey data for machine learning.


3.4.2 Objective

This case study demonstrates:
1. Cleaning and transforming data using tidyverse.
2. Preprocessing data with recipes from tidymodels.
3. Preparing a dataset for machine learning tasks.


3.4.3 Dataset

Here’s the simulated survey dataset for this case study:

# Simulated survey dataset
set.seed(123)
survey_data <- data.frame(
  age = sample(18:70, 100, replace = TRUE),
  income = sample(20000:100000, 100, replace = TRUE),
  education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 100, replace = TRUE),
  marital_status = sample(c("Single", "Married", "Divorced", "Widowed"), 100, replace = TRUE),
  response = sample(c(0, 1), 100, replace = TRUE)
)

# Introduce missing values
survey_data$income[sample(1:100, 10)] <- NA
survey_data$age[sample(1:100, 5)] <- NA

head(survey_data)

3.4.4 Step 1: Handle Missing Values

We’ll use the tidyverse to explore and clean the data by imputing missing values.

library(tidyverse)

# Check for missing values
survey_data %>%
  summarise(across(everything(), ~sum(is.na(.))))

# Impute missing age with the mean and missing income with the median
survey_data <- survey_data %>%
  mutate(
    age = ifelse(is.na(age), mean(age, na.rm = TRUE), age),
    income = ifelse(is.na(income), median(income, na.rm = TRUE), income)
  )

# Verify missing values are handled
survey_data %>%
  summarise(across(everything(), ~sum(is.na(.))))
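As an alternative to imputing with mutate(), the same imputation can be expressed as recipe steps, so that the mean and median are estimated when the recipe is prepared. This is a sketch of that swapped-in approach, not the method used in this case study; it uses step_impute_mean() and step_impute_median() from recipes, and `raw_survey` is a reduced, hypothetical version of the survey data that still contains missing values.

```r
library(tidymodels)

# Hypothetical reduced survey data with missing values
set.seed(123)
raw_survey <- data.frame(
  age      = sample(18:70, 100, replace = TRUE),
  income   = sample(20000:100000, 100, replace = TRUE),
  response = sample(c(0, 1), 100, replace = TRUE)
)
raw_survey$income[sample(1:100, 10)] <- NA
raw_survey$age[sample(1:100, 5)] <- NA

# Imputation declared as recipe steps
impute_recipe <- recipe(response ~ ., data = raw_survey) %>%
  step_impute_mean(age) %>%
  step_impute_median(income)

# prep() estimates the mean and median; bake() fills in the gaps
imputed <- impute_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# Count of remaining missing values per column
colSums(is.na(imputed))
```

The practical advantage appears once the data is split into training and test sets: the imputation values are learned from the training data only and then reused on new data.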

3.4.5 Step 2: Encode Categorical Variables

Convert categorical variables into dummy variables using the recipes package.

library(tidymodels)

# Define a recipe for data preprocessing
survey_recipe <- recipe(response ~ ., data = survey_data) %>%
  step_dummy(all_nominal_predictors())

# Prepare the recipe
prepared_recipe <- prep(survey_recipe)

# Apply the recipe
processed_data <- bake(prepared_recipe, new_data = NULL)

# View processed data
head(processed_data)
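To make the effect of step_dummy() easier to see, the following sketch applies it to a tiny made-up data frame (`toy` is hypothetical). With default settings, a variable with three categories becomes two indicator columns, and the first category in alphabetical order serves as the reference.

```r
library(tidymodels)

# Hypothetical data: one outcome and one categorical predictor
toy <- data.frame(
  y      = c(0, 1, 0, 1),
  status = c("Single", "Married", "Divorced", "Single")
)

baked <- recipe(y ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

# "Divorced" is the reference level, so it gets no column of its own
names(baked)
baked
```

A row with 0 in both indicator columns therefore corresponds to the reference category.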

3.4.6 Step 3: Normalize Numeric Features

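Normalize numeric features so that each has mean 0 and standard deviation 1, which helps algorithms that are sensitive to feature scale. Because this step is added after step_dummy(), all_numeric_predictors() also selects the dummy columns, so they are normalized along with age and income.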

# Add normalization step to the recipe
survey_recipe <- survey_recipe %>%
  step_normalize(all_numeric_predictors())

# Prepare and apply the updated recipe
prepared_recipe <- prep(survey_recipe)
processed_data <- bake(prepared_recipe, new_data = NULL)

# View processed data
head(processed_data)
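Recipe steps run in the order they are added, and all_numeric_predictors() matches whatever numeric columns exist at that point, including dummy columns created earlier. If you would rather keep the dummy columns as 0/1 indicators, one option is to normalize before encoding. The sketch below shows this ordering on a small made-up data frame; `toy` and `ordered_recipe` are hypothetical names.

```r
library(tidymodels)

# Hypothetical data: one numeric and one categorical predictor
toy <- data.frame(
  y      = c(0, 1, 0, 1, 1, 0),
  age    = c(22, 35, 47, 51, 64, 29),
  status = c("Single", "Married", "Divorced", "Single", "Married", "Single")
)

# Normalize first, then create dummy variables
ordered_recipe <- recipe(y ~ ., data = toy) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

baked <- ordered_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# age is centered and scaled; the dummy columns remain 0/1
baked
```

Whether dummy columns should be normalized depends on the model, so treat this ordering as one reasonable choice rather than a rule.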