Matthew Beckman & Justin Post June 25, 2021
Here, we join an analysis already in progress…
We’re investigating the popularity of names in the US each year. Matt has chosen to investigate the names of each person in his immediate family: Matthew, Sarah, Eden, Jack, and Hazel. They’re his favorite people, and also his favorite names! He’s feeling torn about how to include his son Jack in the analysis. Jack’s legal name is “Jon” but he is nearly always called “Jack”. The spelling of “Jon” honors Scandinavian heritage on both sides of the family, and the nickname “Jack” specifically honors his great-grandfather.
Some famous persons by each name of the family include:
This document was last modified 2021-06-25 10:28:47.
dplyr::bind_rows()
BabyNames2020 <-
  read_csv("https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt",
           col_names = FALSE, col_types = cols(X2 = col_character()))
Task 2: the starter code from Task 4.1.1 includes a hint to correct the issue discovered when reading the BabyNamesSupp csv data file. Can you fix the issue with the sex column type?
Task 3: before we combine our three data sources, let’s align them so that all three include the same variables/columns, with the same names. Namely, the variables in the 2020 data should be renamed, and it needs a new variable to reflect the year (2020) for all rows. Note: the order of the columns is not important, as long as they have identical names in each data set to be combined.
Task 4: use bind_rows() to combine BabyNames & BabyNamesSupp & the 2020 data
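The column-type issue behind Task 2 can be reproduced on a tiny example. This is only a sketch: the CSV text and the object names (toy_csv, guessed, fixed) are made up for illustration and are not taken from the workshop files. It assumes the problem arises because readr's type guessing treats a column containing only "F" as logical (FALSE), which is the kind of problem the col_character() hint addresses.

```r
library(readr)

# Made-up rows for illustration only: every sex value shown is "F"
toy_csv <- "name,sex,count,year\nAda,F,10,1880\nBea,F,20,1880\n"

# Default guessing: readr is expected to read `sex` as logical here
guessed <- read_csv(toy_csv)
class(guessed$sex)

# Declaring the column type keeps "F" as text
fixed <- read_csv(toy_csv, col_types = cols(sex = col_character()))
class(fixed$sex)
```

Only the sex column is declared in cols(); the remaining columns are still guessed, which mirrors the pattern in the starter code.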
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dcData)
data("BabyNames", package = "dcData")
# Task 4.1.2
BabyNamesSupp <-
  read_csv("https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv",
           col_types = cols(sex = col_character())) # fixes `sex`
# Tasks 4.1.1 & 4.1.3
BabyNames2020 <-
  read_csv("https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt",
           col_names = FALSE, col_types = cols(X2 = col_character())) %>%
  rename(name = X1, sex = X2, count = X3) %>% # rename solution to Task 4.1.3
  mutate(year = 2020) # year solution to Task 4.1.3
# Task 4.1.4
BabyNamesFull <- bind_rows(BabyNames, BabyNamesSupp, BabyNames2020)
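Task 3 notes that column order does not matter as long as the names match. A minimal sketch of why, using made-up rows (the objects a, b, unrenamed, combined, and mismatched are hypothetical and not part of the workshop data): bind_rows() matches columns by name rather than by position, and a column that was not renamed becomes an extra column padded with NA instead of raising an error.

```r
library(dplyr)

# Made-up rows for illustration only; same column names, different order
a <- tibble(name = "Hazel", sex = "F", count = 10, year = 2019)
b <- tibble(year = 2020, count = 12, sex = "F", name = "Hazel")

# Columns are matched by name; the result follows the order of `a`
combined <- bind_rows(a, b)
names(combined)

# A column left as X1 is not matched to `name`: bind_rows() adds X1 as a
# new column and fills the gaps in both columns with NA
unrenamed <- tibble(X1 = "Hazel", sex = "F", count = 12, year = 2020)
mismatched <- bind_rows(a, unrenamed)
mismatched
```

Because a naming mismatch produces NA values silently, it may be worth checking names() or running summary() on the combined data after binding.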
Task 1: filter the data to include only the names you wish to investigate
Task 2: for each name, compute the total frequency across all years of available data (1880 through 2020) and then arrange the results in descending order by total.
Task 3: for each combination of name AND sex, compute the total frequency across all years of available data (1880 through 2020) and then arrange the results by name.
Task 4: filter to the year you joined your current institution (or any specific, meaningful year you like) and repeat tasks 2 and 3.
# vector of names
beckmans <- c("Matthew", "Sarah", "Eden", "Jack", "Hazel")
BabyNamesFull %>%
  filter(name %in% beckmans) %>% # Task 4.2.1
  group_by(name) %>% # Task 4.2.2
  summarise(total = sum(count, na.rm = TRUE)) %>% # Task 4.2.2
  arrange(desc(total))
## `summarise()` ungrouping output (override with `.groups` argument)
BabyNamesFull %>%
  filter(name %in% beckmans) %>%
  group_by(name, sex) %>% # Task 4.2.3
  summarise(total = sum(count, na.rm = TRUE)) %>% # Task 4.2.3
  arrange(name)
## `summarise()` regrouping output by 'name' (override with `.groups` argument)
# Task 4.2.4: Matt joined Penn State in 2015
BabyNamesFull %>%
  filter(name %in% beckmans, year == 2015) %>%
  group_by(name) %>%
  summarise(total = sum(count, na.rm = TRUE)) %>%
  arrange(desc(total))
## `summarise()` ungrouping output (override with `.groups` argument)
BabyNamesFull %>%
  filter(name %in% beckmans, year == 2015) %>%
  group_by(name, sex) %>%
  summarise(total = sum(count, na.rm = TRUE)) %>%
  arrange(name)
## `summarise()` regrouping output by 'name' (override with `.groups` argument)
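The summarise() messages in the output above point to the `.groups` argument. As a hedged sketch on made-up counts (the toy tibble and by_name_sex are illustrative names, not workshop objects), setting .groups = "drop" returns an ungrouped result and should silence the message. This assumes dplyr 1.0.0 or later, where the argument was introduced.

```r
library(dplyr)

# Made-up counts for illustration only
toy <- tibble(
  name  = c("Eden", "Eden", "Jack", "Jack"),
  sex   = c("F", "M", "M", "M"),
  year  = c(2015, 2015, 2014, 2015),
  count = c(10, 5, 20, 30)
)

# Same pattern as Task 4.2.3, with the grouping dropped explicitly
by_name_sex <- toy %>%
  group_by(name, sex) %>%
  summarise(total = sum(count, na.rm = TRUE), .groups = "drop") %>%
  arrange(name)

by_name_sex
is_grouped_df(by_name_sex)
```

Without `.groups`, the two-variable summary stays grouped by name, which can affect later steps such as mutate() or slice(); dropping the groups makes the next step start from a plain tibble.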
[coming up next…]