What’s in a Name?? (Part 4 Solutions)

Matthew Beckman & Justin Post June 25, 2021

Here, we join an analysis already in progress…

Part 1 (RStudio IDE Scavenger Hunt) has been omitted from the following analysis.
Part 2 (Import Data) and Part 3 (R Markdown) have been blended for the purposes of our investigation
Part 4 (Data Wrangling) Exercises Follow

Names to be investigated

We’re investigating the popularity of names in the US each year. Matt has chosen to investigate the names of each person in his immediate family: Matthew, Sarah, Eden, Jack, and Hazel. They’re his favorite people, and also his favorite names! He’s feeling torn about how to include his son Jack in the analysis. Jack’s legal name is “Jon” but he is nearly always called “Jack”–the spelling of “Jon” honors Scandinavian heritage on both sides of the family, and the nickname “Jack” specifically honors his great-grandfather.

Some famous persons by each name of the family include:

Sarah Jessica Parker
Eden Hazard
Jack (Jackie) Robinson
Hazel Findlay
Richard Stallman. Seriously, the first hit of an Internet search for “famous Matthew” was Richard Stallman (!) whose given name is… Matthew. “Matthew the Apostle” (Matt’s namesake) was third on that particular list: https://playback.fm/people/first-name/matthew

This document was last modified 2021-06-25 10:28:47.

Part 4. Data wrangling

4.1 Combine our Data Sets with `dplyr::bind_rows( )`

Task 1: use the starter code below to read the 2020 data:

BabyNames2020 <- 
  read_csv("https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt", 
           col_names = FALSE, col_types = cols(X2 = col_character()))

Task 2: the starter code from Task 4.1.1 includes a hint to correct the issue discovered when reading the BabyNamesSupp csv data file. Can you fix the issue with the sex column type?
Task 3: before we combine our three data sources, let’s align them such that all three data sources are organized to include the same variables/columns, with the same names. Namely, the variables in the 2020 data should be renamed, and it needs a new variable to reflect the year (2020) for all rows. Note: the order of the columns are not important, as long as they have identical names in each data set to be combined.
Task 4: use bind_rows() to combine BabyNames & BabyNamesSupp & the 2020 data

Solution

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dcData)

data("BabyNames", package = "dcData")

# Task 4.1.2
BabyNamesSupp <- 
  read_csv("https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv",
           col_types = cols(sex = col_character()))  # fixes `sex`
    
# Tasks 4.1.1 & Task 4.1.3
BabyNames2020 <- 
    read_csv("https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt", 
             col_names = FALSE, col_types = cols(X2 = col_character())) %>%
    rename(name = X1, sex = X2, count = X3) %>%  # rename solution to Task 4.1.3
    mutate(year = 2020)                          # year solution to Task 4.1.3

# Task 4.1.4
BabyNamesFull <- bind_rows(BabyNames, BabyNamesSupp, BabyNames2020)

4.2 Data Wrangling and Summaries

Task 1: filter the data to include only the names you wish to investigate
Task 2: for each name, compute the total frequency across all years of available data (1880 through 2020) and then arrange the results in descending order by total.
Task 3: for each combination of name AND sex, compute the total frequency across all years of available data (1880 through 2020) and then arrange the results by name.
Task 4: filter the year you joined your current institution (or any specific, meaningful year you like) and repeat tasks 2 and 3.

Solution

# vector of names
beckmans <- c("Matthew", "Sarah", "Eden", "Jack", "Hazel")

BabyNamesFull %>%
    filter(name %in% beckmans) %>%                   # Task 4.2.1
    group_by(name) %>%                               # Task 4.2.2
    summarise(total = sum(count, na.rm = TRUE)) %>%  # Task 4.2.2
    arrange(desc(total))

## `summarise()` ungrouping output (override with `.groups` argument)

BabyNamesFull %>%
    filter(name %in% beckmans) %>%
    group_by(name, sex) %>%                          # Task 4.2.3
    summarise(total = sum(count, na.rm = TRUE)) %>%  # Task 4.2.3
    arrange(name)

## `summarise()` regrouping output by 'name' (override with `.groups` argument)

# Task 4.2.4--Matt joined Penn State in 2015
BabyNamesFull %>%
    filter(name %in% beckmans, year == 2015) %>%
    group_by(name) %>%
    summarise(total = sum(count, na.rm = TRUE)) %>% 
    arrange(desc(total))

## `summarise()` ungrouping output (override with `.groups` argument)

BabyNamesFull %>%
    filter(name %in% beckmans, year == 2015) %>%
    group_by(name, sex) %>%
    summarise(total = sum(count, na.rm = TRUE)) %>% 
    arrange(name)

## `summarise()` regrouping output by 'name' (override with `.groups` argument)

Part 5. Graph it

[coming up next…]

What’s in a Name?? (Part 4 Solutions)

Names to be investigated

Part 4. Data wrangling

4.1 Combine our Data Sets with dplyr::bind_rows( )

Solution

4.2 Data Wrangling and Summaries

Solution

Part 5. Graph it

4.1 Combine our Data Sets with `dplyr::bind_rows( )`