Files corresponding to Short Course: Introduction to Data Science Using R
We’ll continue to work on the same .Rmd file from the previous exercise.
Use summarize() to create a new variable called total
that is the sum of the counts (remove NAs with na.rm = TRUE).
names <- c("Justin", "George", "Alexander", "Jacob", "Anderson")
BabyNamesFull %>%
    filter(name %in% names) %>%
    group_by(name, sex) %>%                          
    summarise(total = sum(count, na.rm = TRUE)) 
## `summarise()` has grouped output by 'name'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups:   name [5]
##    name      sex     total
##    <chr>     <chr>   <dbl>
##  1 Alexander F        4466
##  2 Alexander M      684145
##  3 Anderson  F         950
##  4 Anderson  M       28016
##  5 George    F        9942
##  6 George    M     1470178
##  7 Jacob     F        2259
##  8 Jacob     M      941483
##  9 Justin    F        3797
## 10 Justin    M      778247
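The message printed above the tibble is informational: after summarise() the result is still grouped by name. The following is a minimal sketch of how the `.groups = "drop"` argument returns an ungrouped result and how na.rm = TRUE handles a missing count. The toy_names data frame and its values are hypothetical and are not taken from BabyNamesFull.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only
toy_names <- tibble(
  name  = c("Ann", "Ann", "Ann", "Bob", "Bob"),
  sex   = c("F", "F", "M", "M", "M"),
  count = c(10, NA, 2, 5, 7)
)

toy_totals <- toy_names %>%
  group_by(name, sex) %>%
  # na.rm = TRUE skips the missing count instead of returning NA;
  # .groups = "drop" removes all grouping from the result
  summarise(total = sum(count, na.rm = TRUE), .groups = "drop")

toy_totals
```

Dropping the groups is optional here. It mainly matters when later steps (for example another summarise() or a mutate()) should operate on the whole table rather than within each name.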
Filter the BabyNamesFull data object to only include rows where
the count is more than 50000. Save this as an R object. Then,
create a contingency table to count the number of times each name
appears.
temp <- BabyNamesFull %>%
  filter(count > 50000)
table(temp$name)
## 
##      Ashley Christopher       David     Deborah       Debra       James 
##           1          12          25           3           1          49 
##       Jason    Jennifer     Jessica        John       Linda        Lisa 
##           6          14           3          49          10           5 
##        Mark        Mary     Matthew     Michael    Patricia     Richard 
##           6          46           1          46           4          12 
##      Robert     William 
##          51          26
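table() counts how many rows carry each name. As a possible alternative, the sketch below shows the same tally with dplyr's count(), which returns a tibble and can sort by frequency. The toy_rows data frame is hypothetical and only stands in for the filtered rows.

```r
library(dplyr)

# Hypothetical toy data standing in for the filtered rows
toy_rows <- tibble(
  name = c("Mary", "John", "Mary", "Robert", "John", "Mary"),
  year = c(1950, 1950, 1951, 1951, 1952, 1952)
)

# Base R contingency table: one cell per distinct name
name_table <- table(toy_rows$name)

# dplyr equivalent: a tibble with a column n, most frequent name first
name_counts <- toy_rows %>%
  count(name, sort = TRUE)

name_table
name_counts
```

Either form answers the exercise. The tibble version is easier to pipe into further dplyr steps, while table() prints more compactly.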
Run summary() on the total counts.
temp <- BabyNamesFull %>%
  group_by(year) %>%
  summarize(total = sum(count))
summary(temp$total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  192700 1728436 3092162 2535652 3677142 4199919
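The summary above has no NA's column, so none of the yearly totals is missing. The sketch below illustrates what would happen otherwise: without na.rm = TRUE, a single missing count turns that year's total into NA. The toy_years data frame is hypothetical and its values are made up.

```r
library(dplyr)

# Hypothetical toy data with one missing count in 2001
toy_years <- tibble(
  year  = c(2000, 2000, 2001, 2001, 2002),
  count = c(100, 50, 200, NA, 400)
)

yearly <- toy_years %>%
  group_by(year) %>%
  summarize(
    total_raw   = sum(count),                # NA for any year with a missing count
    total_clean = sum(count, na.rm = TRUE)   # missing counts are skipped
  )

summary(yearly$total_raw)    # reports the NA separately
summary(yearly$total_clean)  # five-number summary plus the mean
```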