Now that we know how to get our raw data into R, we are ready to do the fun stuff - investigating our data!
We discussed the main steps of an EDA and covered the most common data validation steps and basic data manipulations. The next few sets of notes dive into how to summarize our data. Recall that how we summarize our data depends on the type of data we have!
Categorical (Qualitative) variable - entries are a label or attribute
Numeric (Quantitative) variable - entries are a numerical value where math can be performed
In either situation, we want to describe each variable’s distribution, perhaps comparing across different subgroups!
Let’s start with summaries of strictly categorical data (or numeric variables with only a few values).
Categorical data is usually stored as a character or factor type
Categorical Data Summaries
To summarize categorical variables numerically, we use contingency tables.
To do so visually, we use bar plots.
First, let’s read in the appendicitis data from the previous lecture.
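The base R table() function creates contingency tables. Its help file describes the main argument (...) as: "one or more objects which can be interpreted as factors (including numbers or character strings), or a list (such as a data frame) whose components can be so interpreted."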
Ok, so we can just pass it the vectors we want or we could pass it a data frame (which remember, is just a list of equal length vectors!).
Let’s create some contingency tables for the SexF, DiagnosisF, and SeverityF variables.
table(app_data$SexF)
Female Male
377 403
We can include NA if we want to via the useNA argument:
table(app_data$SexF, useNA = "always")
Female Male <NA>
377 403 2
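To see how the useNA options differ, here is a minimal sketch on a small made-up factor (the `sex` and `complete` vectors below are hypothetical toy data, not the appendicitis data). It compares the default behavior with "ifany" and "always".

```r
# Toy factor with one missing value (hypothetical data, not app_data)
sex <- factor(c("Female", "Male", "Female", NA, "Male", "Male"))

no_na  <- table(sex)                     # default (useNA = "no"): NA is dropped
if_any <- table(sex, useNA = "ifany")    # NA cell appears only if NAs exist
always <- table(sex, useNA = "always")   # NA cell appears even if its count is 0

print(no_na)
print(if_any)
print(always)

# With no missing values, "always" still reports an NA cell (with count 0)
complete <- table(c("a", "b", "a"), useNA = "always")
print(complete)
```

Using "always" can be handy when you want tables from different variables to have the same layout whether or not they contain missing values.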
We can create a two-way table (two-way for two variables) by adding the second variable in:
table(app_data$SexF, app_data$DiagnosisF)
         appendicitis no appendicitis
  Female          200             176
  Male            262             141
What is returned when we create a table? An array! (An array is a homogeneous data structure: a 1D array is a vector and a 2D array is a matrix.)
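We can check the array structure directly. The sketch below uses small made-up vectors (`sex` and `diagnosis` are hypothetical toy data) to inspect a two-way table and to pull out a single column with ordinary matrix-style indexing.

```r
# Hypothetical toy vectors, not the appendicitis data
sex <- c("Female", "Male", "Female", "Male", "Male")
diagnosis <- c("appendicitis", "appendicitis", "no appendicitis",
               "no appendicitis", "appendicitis")

two_way <- table(sex, diagnosis)

class(two_way)      # "table"
is.matrix(two_way)  # TRUE: a 2D table has a dim attribute of length 2
dim(two_way)        # 2 2
dimnames(two_way)   # the levels of each variable

# Matrix-style indexing: counts by sex for the "appendicitis" column only
cond_sex <- two_way[, "appendicitis"]
print(cond_sex)
```

Indexing by level name, as in the last step, is usually easier to read than indexing by position.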
That means we can subset them if we want to! Let’s return the conditional one-way table of Sex based on only those that had appendicitis:
         appendicitis no appendicitis
  Female          145             175
  Male            199             141
# or
three_way[, , 2]
         appendicitis no appendicitis
  Female          145             175
  Male            199             141
We can also get a one-way table conditional on two of the variables. Here is the one-way table for sex for only those with an uncomplicated situation and no appendicitis:
three_way[, 2, 2]
Female Male
175 141
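Here is a minimal, self-contained sketch of this kind of three-way subsetting. The `toy` data frame is made up for illustration, and the order of the variables passed to table() (sex, then diagnosis, then severity) is an assumption chosen to match the indexing used above.

```r
# Hypothetical toy data, not the appendicitis data
toy <- data.frame(
  sex = c("Female", "Female", "Male", "Male", "Male", "Female", "Male"),
  diagnosis = c("appendicitis", "no appendicitis", "appendicitis",
                "no appendicitis", "appendicitis", "appendicitis",
                "no appendicitis"),
  severity = c("complicated", "uncomplicated", "uncomplicated",
               "uncomplicated", "complicated", "uncomplicated",
               "uncomplicated")
)

# Dimension 1 = sex, dimension 2 = diagnosis, dimension 3 = severity
three_way <- table(toy$sex, toy$diagnosis, toy$severity)
dim(three_way)  # 2 2 2

# Two-way table of sex by diagnosis for the uncomplicated cases,
# selected by position or by level name
by_index <- three_way[, , 2]
by_name  <- three_way[, , "uncomplicated"]
all(by_index == by_name)

# One-way table of sex, conditional on two variables at once
cond_sex <- three_way[, "no appendicitis", "uncomplicated"]
print(cond_sex)
```

Positions follow the order of the factor levels (alphabetical for character vectors), so "uncomplicated" is level 2 of severity here.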
Lastly, just note that you can supply a data frame instead of the individual vectors.
table(app_data[, c("SexF", "DiagnosisF")])
        DiagnosisF
SexF     appendicitis no appendicitis
  Female          200             176
  Male            262             141
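One practical difference between the two approaches is labeling: the data frame version keeps the column names as dimension labels. A small sketch, using a made-up `toy` data frame (hypothetical, not the appendicitis data):

```r
# Hypothetical toy data
toy <- data.frame(
  sex = c("Female", "Male", "Female", "Male", "Male"),
  diagnosis = c("appendicitis", "appendicitis", "no appendicitis",
                "no appendicitis", "appendicitis")
)

from_vectors <- table(toy$sex, toy$diagnosis)      # individual vectors
from_df <- table(toy[, c("sex", "diagnosis")])     # data frame

# Same counts either way
all(from_vectors == from_df)

# The data frame version carries the variable names along
names(dimnames(from_df))
```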
Via the tidyverse
Ok, great. But we might want to stay in the tidyverse. We can use the dplyr::summarize() function to compute summaries on a tibble. This generally outputs a tibble with fewer rows than the original (as we are summarizing the variables to view them in a more compact form). We often use group_by() to set a grouping variable. Any summary done will respect the groupings!
Any of the common summarization functions you can think of are likely permissible in summarize(). The one for counting values is simply n(). Let’s recreate all of our above tables under the tidyverse method.
One-way table:
app_data |>
  group_by(SexF) |>
  summarize(count = n())
# A tibble: 3 × 2
  SexF   count
  <fct>  <int>
1 Female   377
2 Male     403
3 <NA>       2
Notice that NA values are included by default (probably a good thing). We can remove those with tidyr::drop_na().
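As a minimal sketch of that idea, the code below counts a made-up tibble with and without its missing values and then builds a two-way summary. The `toy` tibble and its column names are hypothetical, and `.groups = "drop"` is one option (an assumption about what you want) for returning an ungrouped result without the regrouping message.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing sex value
toy <- tibble(
  sex = factor(c("Female", "Male", "Female", NA, "Male", "Male")),
  diagnosis = factor(c("appendicitis", "appendicitis", "no appendicitis",
                       "appendicitis", "no appendicitis", "appendicitis"))
)

# NA gets its own row by default
with_na <- toy |>
  group_by(sex) |>
  summarize(count = n())

# Drop rows with a missing sex before counting
without_na <- toy |>
  drop_na(sex) |>
  group_by(sex) |>
  summarize(count = n())

# Two-way counts: group by both variables
two_way <- toy |>
  drop_na(sex, diagnosis) |>
  group_by(sex, diagnosis) |>
  summarize(count = n(), .groups = "drop")

print(with_na)
print(without_na)
print(two_way)
```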
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by SexF and DiagnosisF.
ℹ Output is grouped by SexF.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(SexF, DiagnosisF))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
# A tibble: 4 × 3
# Groups:   SexF [2]
  SexF   DiagnosisF      count
  <fct>  <fct>           <int>
1 Female appendicitis      200
2 Female no appendicitis   176
3 Male   appendicitis      262
4 Male   no appendicitis   141
Nice. But that isn’t the best format for viewing (i.e. a wider format would be more compact for displaying). Let’s use tidyr::pivot_wider() to fix that!
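A minimal sketch of this pivot on made-up data: the `toy` tibble below is hypothetical, and the choice to spread the diagnosis levels across the columns is an assumption about the layout we want.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
toy <- tibble(
  sex = c("Female", "Male", "Female", "Male", "Male"),
  diagnosis = c("appendicitis", "appendicitis", "no appendicitis",
                "no appendicitis", "appendicitis")
)

wide <- toy |>
  group_by(sex, diagnosis) |>
  summarize(count = n(), .groups = "drop") |>
  # One column per diagnosis level, filled with the counts
  pivot_wider(names_from = diagnosis, values_from = count)

print(wide)
```

The result has one row per sex and one column per diagnosis level, which mirrors the layout that table() gives for a two-way table.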
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by SexF, DiagnosisF, and SeverityF.
ℹ Output is grouped by SexF and DiagnosisF.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(SexF, DiagnosisF, SeverityF))` for per-operation
grouping (`?dplyr::dplyr_by`) instead.
# A tibble: 7 × 4
# Groups:   SexF, DiagnosisF [4]
  SexF   DiagnosisF      SeverityF     count
  <fct>  <fct>           <fct>         <int>
1 Female appendicitis    complicated      55
2 Female appendicitis    uncomplicated   145
3 Female no appendicitis complicated       1
4 Female no appendicitis uncomplicated   175
5 Male   appendicitis    complicated      63
6 Male   appendicitis    uncomplicated   199
7 Male   no appendicitis uncomplicated   141
We can also pivot this, although there is no great way to get all the info there. We’ll just move the severity variable across the top.
# A tibble: 4 × 4
# Groups:   SexF, DiagnosisF [4]
  SexF   DiagnosisF      complicated uncomplicated
  <fct>  <fct>                 <int>         <int>
1 Female appendicitis             55           145
2 Female no appendicitis           1           175
3 Male   appendicitis             63           199
4 Male   no appendicitis          NA           141
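The NA in the last row shows up because that combination never occurs in the data, so there was no count to pivot into that cell. If a zero is the right reading for an empty cell, one option is the values_fill argument of pivot_wider(). A minimal sketch on made-up data (the `toy` tibble is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: there are no complicated cases among males
toy <- tibble(
  sex = c("Female", "Female", "Female", "Male", "Male"),
  severity = c("complicated", "uncomplicated", "uncomplicated",
               "uncomplicated", "uncomplicated")
)

counts <- toy |>
  group_by(sex, severity) |>
  summarize(count = n(), .groups = "drop")

# Default: the missing combination becomes NA
wide_na <- counts |>
  pivot_wider(names_from = severity, values_from = count)

# values_fill supplies a value for combinations that never occurred
wide_zero <- counts |>
  pivot_wider(names_from = severity, values_from = count, values_fill = 0L)

print(wide_na)
print(wide_zero)
```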
Making it Pretty
When we create these kinds of tables, we often want to include them in some kind of final document. A great way to customize the look of tables is through the gt package!
library(gt)
We can use the gt() and tab_header() functions (among other things) to easily create nicer looking tables!
gt(app_data[1:10, ] |> select(Age, Sex, Height, Severity, Diagnosis)) |>
  tab_header(
    title = "First 10 rows of Data",
    subtitle = "Data describes attributes of hospitalized patients"
  )
First 10 rows of Data
Data describes attributes of hospitalized patients

  Age  Sex     Height  Severity       Diagnosis
12.68  female     148  uncomplicated  appendicitis
14.10  male       147  uncomplicated  no appendicitis
14.14  female     163  uncomplicated  no appendicitis
16.37  female     165  uncomplicated  no appendicitis
11.08  female     163  uncomplicated  appendicitis
11.05  male       121  uncomplicated  no appendicitis
 8.98  female     140  uncomplicated  no appendicitis
 7.06  female      NA  uncomplicated  no appendicitis
 7.90  male       131  uncomplicated  no appendicitis
14.34  male       174  uncomplicated  appendicitis
We can take our contingency table and make it look a bit nicer.
We can use other functions such as sub_missing() to change the NA values and put a better label above the Complicated/Uncomplicated column headers with tab_spanner().
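A minimal sketch of how these pieces might fit together is below. It is not necessarily the code used to build the table in these notes. The `wide` data frame simply re-enters the counts from the pivoted tibble above, and the choices of rowname_col, groupname_col, and "0" as the replacement text are assumptions.

```r
library(gt)

# Counts copied from the pivoted three-way summary above
wide <- data.frame(
  sex = c("Female", "Female", "Male", "Male"),
  diagnosis = c("appendicitis", "no appendicitis",
                "appendicitis", "no appendicitis"),
  complicated = c(55L, 1L, 63L, NA),
  uncomplicated = c(145L, 175L, 199L, 141L)
)

pretty_tbl <- wide |>
  # Use diagnosis as the row label and sex as the row group
  gt(rowname_col = "diagnosis", groupname_col = "sex") |>
  tab_header(
    title = "Patient Diagnosis by Severity",
    subtitle = "Stratified by Biological Sex"
  ) |>
  # Put one shared label above the two severity columns
  tab_spanner(
    label = "Severity Levels",
    columns = c(complicated, uncomplicated)
  ) |>
  cols_label(complicated = "Complicated", uncomplicated = "Uncomplicated") |>
  # Show "0" in place of NA
  sub_missing(missing_text = "0")

pretty_tbl
```

Replacing NA with "0" only makes sense here because we know the NA came from a combination with no observations rather than from a truly unknown count.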
Note: since gt v0.6.0, `fmt_missing()` is deprecated and will soon be removed. Use `sub_missing()` instead.
Patient Diagnosis by Severity
Stratified by Biological Sex

                         Severity Levels
Diagnosis          Complicated  Uncomplicated
Female
  appendicitis              55            145
  no appendicitis            1            175
Male
  appendicitis              63            199
  no appendicitis            0            141
Cool - I’m not expecting you to know a lot about this package but it is useful to know that it exists when you want to make a table look clearer!
Recap!
Contingency tables summarize the distribution of one or more categorical variables. We can create them using
table() - returns an array of counts
group_by() along with summarize() and the n() function