Now that we know how to get our raw data into R, we are ready to do the fun stuff - investigating our data!
We discussed the main steps of an EDA and covered the most common data validation steps and basic data manipulations. The next few sets of notes dive into how to summarize our data. Recall that how we summarize our data depends on the type of data we have!
Categorical (Qualitative) variable - entries are a label or attribute
Numeric (Quantitative) variable - entries are a numerical value where math can be performed
In either situation, we want to describe each variable’s distribution, perhaps comparing across different subgroups!
Let’s start with summaries of strictly categorical data (or numeric variables with only a few values).
Categorical data is usually stored as a character or factor type
Categorical Data Summaries
To summarize categorical variables numerically, we use contingency tables.
To do so visually, we use bar plots.
First, let’s read in the appendicitis data from the previous lecture.
Contingency tables can be created with the table() function. The help for table() describes its main argument (...) as:
one or more objects which can be interpreted as factors (including numbers or character strings), or a list (such as a data frame) whose components can be so interpreted.
Ok, so we can pass it either the vectors we want or a data frame (which, remember, is just a list of equal-length vectors!).
Let’s create some contingency tables for the SexF, DiagnosisF, and SeverityF variables.
table(app_data$SexF)
Female   Male
   377    403
We can include NA if we want to via the useNA argument:
table(app_data$SexF, useNA = "always")
Female   Male   <NA>
   377    403      2
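To see how the useNA options differ, here is a minimal sketch on a small made-up vector. The `sex` vector below is hypothetical and is not the appendicitis data. The useNA argument accepts "no" (the default), "ifany", and "always".

```r
# Hypothetical toy vector with one missing value (not the appendicitis data)
sex <- factor(c("Female", "Male", "Male", NA, "Female", "Male"))

no_na  <- table(sex)                     # default useNA = "no": the NA is dropped
if_any <- table(sex, useNA = "ifany")    # adds an <NA> entry only when NAs are present
always <- table(sex, useNA = "always")   # adds an <NA> entry even when its count is 0

no_na
if_any
always
```

With "ifany" the NA column appears only when it is needed, which can be handy when looping over many variables.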
We can create a two-way table (two-way for two variables) by adding the second variable in:
table(app_data$SexF, app_data$DiagnosisF)
         appendicitis no appendicitis
  Female          200             176
  Male            262             141
What is returned when we create a table? An array! (An array is a homogeneous data structure: a 1D array is a vector and a 2D array is a matrix.)
That means we can subset them if we want to! Let’s return the conditional one-way table of Sex based on only those that had appendicitis:
         appendicitis no appendicitis
  Female          145             175
  Male            199             141
# or
three_way[, , 2]
         appendicitis no appendicitis
  Female          145             175
  Male            199             141
We can also get a one-way table conditional on two of the variables. Here is the one-way table for sex for only those with an uncomplicated situation and no appendicitis:
three_way[, 2, 2]
Female   Male
   175    141
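Since tables are arrays, the usual bracket subsetting works by position or by name. Here is a minimal sketch of that idea on made-up data. The `toy` data frame and its column names and levels are hypothetical, not the appendicitis data.

```r
# Hypothetical toy data (not the appendicitis data)
toy <- data.frame(
  sex       = c("Female", "Female", "Male", "Male", "Male", "Female"),
  diagnosis = c("yes", "no", "yes", "yes", "no", "yes"),
  severity  = c("mild", "mild", "severe", "mild", "mild", "severe")
)

two_way   <- table(toy$sex, toy$diagnosis)
three_way <- table(toy$sex, toy$diagnosis, toy$severity)

is.array(three_way)   # tables carry a dim attribute, so this is TRUE
dim(three_way)        # one dimension per variable
dimnames(three_way)   # the level names available for name-based subsetting

two_way[, "yes"]            # counts by sex, given diagnosis "yes"
three_way[, , "mild"]       # sex by diagnosis, given severity "mild"
three_way[, , 1]            # the same slice by position ("mild" sorts before "severe")
three_way[, "no", "mild"]   # counts by sex, given diagnosis "no" and severity "mild"
```

Subsetting by name is usually easier to read later than subsetting by position, since it does not depend on the order of the levels.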
Lastly, just note that you can supply a data frame instead of the individual vectors.
table(app_data[, c("SexF", "DiagnosisF")])
        DiagnosisF
SexF     appendicitis no appendicitis
  Female          200             176
  Male            262             141
Via the tidyverse
Ok, great. But we might want to stay in the tidyverse. We can use the dplyr::summarize() function to compute summaries on a tibble. This generally outputs a tibble with fewer rows than the original (as we are summarizing the variables to view them in a more compact form). We often use group_by() to set a grouping variable. Any summary done will respect the groupings!
Any of the common summarization functions you can think of are likely permissible in summarize(). The one for counting values is simply n(). Let’s recreate all of our above tables under the tidyverse method.
One-way table:
app_data |>
  group_by(SexF) |>
  summarize(count = n())
# A tibble: 3 x 2
  SexF   count
  <fct>  <int>
1 Female   377
2 Male     403
3 <NA>       2
Notice that NA values are included by default (probably a good thing). We can remove those with tidyr::drop_na().
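As a minimal sketch of dropping missing values before counting, here is the idea on a small made-up tibble. The `toy` tibble and its `sex` column are hypothetical, not the appendicitis data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tibble with one missing value (not the appendicitis data)
toy <- tibble(
  sex = factor(c("Female", "Male", NA, "Male", "Female", "Male"))
)

counts <- toy |>
  drop_na(sex) |>            # remove rows where sex is missing
  group_by(sex) |>
  summarize(count = n())

counts
```

Naming the column inside drop_na() limits the check to that variable. Calling drop_na() with no columns would drop any row with a missing value in any column.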
Two-way table (grouping by both SexF and DiagnosisF, with missing values dropped):
`summarise()` has grouped output by 'SexF'. You can override using the
`.groups` argument.
# A tibble: 4 x 3
# Groups:   SexF [2]
  SexF   DiagnosisF      count
  <fct>  <fct>           <int>
1 Female appendicitis      200
2 Female no appendicitis   176
3 Male   appendicitis      262
4 Male   no appendicitis   141
Nice. But that isn’t the best format for viewing (i.e., a wider format would be more compact for display). Let’s use tidyr::pivot_wider() to fix that!
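A minimal sketch of that reshaping step on made-up data might look like the following. The `toy` tibble and its columns are hypothetical, not the appendicitis data. Setting .groups = "drop" is one way to get an ungrouped result and silence the grouping message.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tibble (not the appendicitis data)
toy <- tibble(
  sex       = c("Female", "Female", "Male", "Male", "Male", "Female"),
  diagnosis = c("yes", "no", "yes", "yes", "no", "yes")
)

wide <- toy |>
  group_by(sex, diagnosis) |>
  summarize(count = n(), .groups = "drop") |>   # ungrouped output, no message
  pivot_wider(names_from = diagnosis, values_from = count)

wide
```

Each level of the names_from variable becomes its own column, so the result has one row per value of the remaining variable, much like the two-way table() output.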