Matthew Beckman & Justin Post June 25, 2021
dcData and tidyverse packages
dcData is installed from GitHub, so it requires an extra step. You may have already done this from the instructions prior to the workshop, but it is shown again here if needed: devtools::install_github("mdbeckman/dcData")
## from instructions before workshop, if needed
# devtools::install_github("mdbeckman/dcData")
library(tidyverse) # this actually loads a group of packages all at once
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dcData)
BabyNames
Task 1: load the BabyNames data from dcData into your R environment using the data() function.
Open the BabyNames object in the Environment pane. Use the spreadsheet view to answer the following:
Task 2: how many total rows of data are included?
Task 3: describe what each row in the data actually represents.
Task 4: find the largest count in the BabyNames data. How would you interpret this result?
Task 5: what are the max & min available year in the data?
Task 2.2.1:
data("BabyNames", package = "dcData")
Task 2.2.2: about 1.8 million rows. We can see this from the Environment tab and the spreadsheet view (footer), among other places.
Task 2.2.3: The frequency of a given name, within a year, associated with a sex.
Task 2.2.4: According to US Social Security records, 1947 was a banner year to name a female Linda!
Task 2.2.5: BabyNames data spans 1880 through 2013.
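The same answers can also be checked in code rather than read off the spreadsheet view. The following is a minimal sketch, assuming the data have the columns name, sex, count, and year. Here toy_names is a hypothetical four-row stand-in with made-up counts, so the example runs without dcData installed.

```r
library(dplyr)

# Hypothetical stand-in for BabyNames; the counts are made up for illustration.
toy_names <- tibble(
  name  = c("Mary", "John", "Linda", "James"),
  sex   = c("F", "M", "F", "M"),
  count = c(120L, 95L, 300L, 210L),
  year  = c(1880L, 1880L, 1947L, 2013L)
)

n_rows     <- nrow(toy_names)                     # Task 2: total number of rows
top_row    <- slice_max(toy_names, count, n = 1)  # Task 4: row with the largest count
year_range <- range(toy_names$year)               # Task 5: min and max year

n_rows
top_row
year_range
```

With dcData loaded, replacing toy_names with BabyNames should answer the same questions for the full data.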
BabyNamesSupp
The file “BabyNamesSupp.csv” includes a few years of more recent data to augment the BabyNames data. Run the starter code shown below to read the data and complete the tasks.
Important: The starter code will produce a warning message! Don’t worry, it’s part of the exercise!
# starter code for BabyNamesSupp
BabyNamesSupp <- read_csv("https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## sex = col_logical(),
## count = col_double(),
## year = col_double()
## )
## Warning: 84619 parsing failures.
## row col expected actual file
## 19208 sex 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv'
## 19209 sex 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv'
## 19210 sex 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv'
## 19211 sex 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv'
## 19212 sex 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv'
## ..... ... .................. ...... ....................................................................
## See problems(...) for more details.
Task 1: Read the warning message carefully; what seems to have gone wrong?
Task 2: open the spreadsheet view to investigate BabyNamesSupp … BabyNamesSupp?
Task 3: use the following R functions to investigate BabyNamesSupp further:
head(BabyNamesSupp)
tail(BabyNamesSupp)
str(BabyNamesSupp)
Task 4 (Challenge): Why did read_csv() seem to have this problem with the data intake? Any ideas how we might fix it?
At this point, we aren’t attempting to prepare the BabyNamesSupp data for analysis. We’re just reading it into the R environment and making observations. We’ll be using these data again in later exercises, so we will make the necessary corrections at that point.
Task 2.3.1: The warning indicates a “parsing failure” and shows that R was expecting logical elements (e.g., TRUE/FALSE) but found “M” in many cases.
Task 2.3.2: Yup, R interpreted “F” (meaning female in our data) to mean FALSE and then could not handle “M” appropriately as a result. These data include years 2014 through 2019.
Task 2.3.3:
head(BabyNamesSupp)
tail(BabyNamesSupp)
str(BabyNamesSupp)
## tibble [196,177 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:196177] "Emma" "Olivia" "Sophia" "Isabella" ...
## $ sex : logi [1:196177] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ count: num [1:196177] 20941 19817 18628 17102 15708 ...
## $ year : num [1:196177] 2014 2014 2014 2014 2014 ...
## - attr(*, "problems")= tibble [84,619 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:84619] 19208 19209 19210 19211 19212 19213 19214 19215 19216 19217 ...
## ..$ col : chr [1:84619] "sex" "sex" "sex" "sex" ...
## ..$ expected: chr [1:84619] "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
## ..$ actual : chr [1:84619] "M" "M" "M" "M" ...
## ..$ file : chr [1:84619] "'https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv'" "'https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv'" "'https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv'" "'https://jbpost2.github.io/TeachingWithR/datasets/BabyNamesSupp.csv'" ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. sex = col_logical(),
## .. count = col_double(),
## .. year = col_double()
## .. )
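To see the same failure on a small scale, the sketch below writes a hypothetical four-row file (made-up names and counts) and forces the sex column to be logical, mimicking the guess read_csv() made above. The names toy_file, toy_supp, toy_problems, and n_missing are illustrative and not part of the workshop materials.

```r
library(readr)

# Hypothetical toy file: "F" rows first, then "M" rows; counts are made up.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("name,sex,count,year",
             "Ann,F,10,2014",
             "Bea,F,8,2014",
             "Cal,M,9,2014",
             "Dan,M,7,2014"), toy_file)

# Force `sex` to be logical; the remaining columns are still guessed.
toy_supp <- suppressWarnings(
  read_csv(toy_file, col_types = cols(sex = col_logical()))
)

toy_problems <- problems(toy_supp)         # details of values that failed to parse
n_missing    <- sum(is.na(toy_supp$sex))   # the "M" entries end up as NA

toy_supp$sex
n_missing
```

The "F" values are silently read as FALSE while the "M" values become NA, which is the pattern to look for in the spreadsheet view of BabyNamesSupp.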
Task 2.3.4 (Challenge): The help documentation explains that read_csv() guesses each variable type based on the first 1000 records. The data source had apparently sorted results such that the first 1000 (or more) rows were all “F”, so read_csv() concluded these to be logical (e.g., TRUE or FALSE) data. This is quite a reasonable default under most circumstances, but it shows why we should always carefully inspect our data intake!
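One possible fix, sketched here on a hypothetical toy file with made-up values, is to tell read_csv() the column types explicitly through its col_types argument instead of relying on the guess. The later exercises may take a different route; raising guess_max is another option.

```r
library(readr)

# Hypothetical toy file: "F" rows first, then "M" rows; counts are made up.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("name,sex,count,year",
             "Ann,F,10,2014",
             "Bea,F,8,2014",
             "Cal,M,9,2014",
             "Dan,M,7,2014"), toy_file)

# Declare every column type, so `sex` is read as text rather than logical.
toy_fixed <- read_csv(
  toy_file,
  col_types = cols(
    name  = col_character(),
    sex   = col_character(),
    count = col_double(),
    year  = col_double()
  )
)

table(toy_fixed$sex)  # both "F" and "M" survive the import
```

Declaring the types up front also documents what the file is expected to contain, so a surprise in a later download shows up as a parsing problem rather than a silent change of type.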
Search “RStudio >> Help” to learn about the data…
Task 1: what can you learn about the BabyNames data from RStudio Help?
Task 2: what can you learn about the BabyNamesSupp data from RStudio Help? What happened?
Task 2.4.1: Lots! Since this is loaded from an R package, the package author can make additional information available to you that describes the data, explains the variables, provides a source, etc.
Task 2.4.2: Nothing! The data were read from an external file, so R doesn’t have any information at all to share with us.
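The difference can also be seen from the console with help(). In this rough sketch, mtcars (a dataset that ships with base R) stands in for BabyNames so that the example runs without dcData, and toy_object is a hypothetical object created in the session, playing the role of BabyNamesSupp.

```r
# `mtcars` is documented by the "datasets" package that ships with R.
pkg_help <- help("mtcars")

# A hypothetical object created in the session has no documentation to find.
toy_object   <- data.frame(x = 1:3)
session_help <- help("toy_object")

length(pkg_help)      # at least one help file was found
length(session_help)  # no help file was found
```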
Task 1: Want to include 2020 data too? See if you can locate it, read the data into R, and review the data intake (hint: BabyNames help documentation includes a source to investigate).
Again, we aren’t attempting to process the 2020 data yet. We’re just reading it into the R environment and making observations about that process. We’ll be using this data again later in the exercises, so we will make the necessary corrections at that point.
Task 2.5.1 (Challenge): visit “RStudio >> Help >> BabyNames >> Source” and then download the zip file to find the 2020 data inside. This can be read into R using read_csv() even though the file extension is .txt since the data are comma delimited. You may note that the structure of the data differs slightly from our two other sources, so we will need some new tools to process it for use.
## challenge
# locate zip file including all years of data, identify 2020 and read into R:
BabyNames2020 <-
read_csv("https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt",
col_names = FALSE)
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_logical(),
## X3 = col_double()
## )
## Warning: 13911 parsing failures.
## row col expected actual file
## 17361 X2 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt'
## 17362 X2 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt'
## 17363 X2 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt'
## 17364 X2 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt'
## 17365 X2 1/0/T/F/TRUE/FALSE M 'https://jbpost2.github.io/TeachingWithR/datasets/yob2020.txt'
## ..... ... .................. ...... ..............................................................
## See problems(...) for more details.
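The output above shows two structural differences from the other sources: the file has no header row (hence X1, X2, X3) and no year column. As a hedged preview, here is one way such a file might be tidied, using a hypothetical three-row file with made-up values. The column names and the compact type string "ccd" (character, character, double) are choices made for this sketch, not the workshop's own solution.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for yob2020.txt: no header row, made-up values.
toy_2020_file <- tempfile(fileext = ".txt")
writeLines(c("Ann,F,10",
             "Bea,F,8",
             "Cal,M,9"), toy_2020_file)

toy_2020 <- read_csv(
  toy_2020_file,
  col_names = c("name", "sex", "count"),  # supply the missing header
  col_types = "ccd"                       # keep `sex` as text
) %>%
  mutate(year = 2020)                     # add the year implied by the file name

toy_2020
```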
Note: you might hang onto the RStudio default text provided in the new R Markdown file for the moment… it’s packed with tiny examples that will come in handy!