Now that we know how to get our raw data into R, we are ready to do the fun stuff - investigating our data!
We discussed the main steps of an EDA and covered the most common data validation steps and basic data manipulations. The next few sets of notes dive into how to summarize our data. Recall, how we summarize our data depends on the type of data we have!
Categorical (Qualitative) variable - entries are a label or attribute
Numeric (Quantitative) variable - entries are a numerical value where math can be performed
In either situation, we want to describe each variable’s distribution, perhaps comparing across different subgroups!
Let’s start with summaries of strictly categorical data (or numeric variables with only a few values).
Categorical data is usually stored as a character or factor type
Categorical Data Summaries
To summarize categorical variables numerically, we use contingency tables.
To do so visually, we use bar plots.
First, let’s read in the appendicitis data from the previous lecture.
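To create contingency tables in base R, we can use the table() function. Its help file describes the main argument (...) as:
one or more objects which can be interpreted as factors (including numbers or character strings), or a list (such as a data frame) whose components can be so interpreted.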
Ok, so we can just pass it the vectors we want or we could pass it a data frame (which remember, is just a list of equal length vectors!).
Let’s create some contingency tables for the SexF, DiagnosisF, and SeverityF variables.
table(app_data$SexF)
Female Male
377 403
We can include NA if we want to via the useNA argument:
table(app_data$SexF, useNA = "always")
Female Male <NA>
377 403 2
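To see how useNA behaves without needing the appendicitis data, here is a minimal sketch on a made-up factor (the sex vector below is a toy stand-in, not the real app_data$SexF). It also tries "ifany", the other non-default value that useNA accepts.

```r
# Toy vector standing in for app_data$SexF (values are made up)
sex <- factor(c("Female", "Male", "Male", NA, "Female", "Male"))

# Default behavior: NA values are left out of the counts
tab_default <- table(sex)

# "ifany" adds an <NA> cell only when missing values are present
tab_ifany <- table(sex, useNA = "ifany")

# "always" adds the <NA> cell even when its count is zero
tab_always <- table(sex[!is.na(sex)], useNA = "always")

print(tab_default)
print(tab_ifany)
print(tab_always)
```

Using "always" is handy when we want the same table layout across several variables, whether or not each one has missing values.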
We can create a two-way table (two-way for two variables) by adding the second variable in:
table(app_data$SexF, app_data$DiagnosisF)
appendicitis no appendicitis
Female 200 176
Male 262 141
What is returned when we create a table? An array! (An array is a homogeneous data structure: a 1D array is a vector and a 2D array is a matrix.)
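We can check this structure ourselves. The sketch below uses two made-up character vectors (sex and diagnosis are toy stand-ins for the app_data columns) and inspects the object that table() returns.

```r
# Toy vectors standing in for the SexF and DiagnosisF columns (made up)
sex <- c("Female", "Male", "Male", "Female", "Male")
diagnosis <- c("appendicitis", "appendicitis", "no appendicitis",
               "no appendicitis", "appendicitis")

two_way_toy <- table(sex, diagnosis)

class(two_way_toy)      # "table"
is.array(two_way_toy)   # TRUE
is.matrix(two_way_toy)  # TRUE, since it has two dimensions
dim(two_way_toy)        # 2 2
dimnames(two_way_toy)   # the level labels along each dimension
```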
That means we can subset them if we want to! Let’s return the conditional one-way table of Sex based on only those that had appendicitis:
appendicitis no appendicitis
Female 145 175
Male 199 141
#or
three_way[, , 2]
appendicitis no appendicitis
Female 145 175
Male 199 141
We can also get a one-way table conditional on two of the variables. Here is the one-way table for sex for only those with an uncomplicated situation and no appendicitis:
three_way[, 2, 2]
Female Male
175 141
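The indexing above can be tried on a small example. This is only a sketch: the toy data frame and the name three_way_toy are made up, and we assume the three-way table is built with sex, diagnosis, and severity as the first, second, and third dimensions (which matches the subsetting results shown above).

```r
# Made-up data with the same three kinds of variables
toy <- data.frame(
  sex = c("Female", "Female", "Male", "Male", "Male", "Female", "Female"),
  diagnosis = c("appendicitis", "no appendicitis", "appendicitis",
                "appendicitis", "no appendicitis", "appendicitis",
                "no appendicitis"),
  severity = c("uncomplicated", "uncomplicated", "complicated",
               "uncomplicated", "uncomplicated", "complicated",
               "uncomplicated")
)

# Dimension 1 = sex, dimension 2 = diagnosis, dimension 3 = severity
three_way_toy <- table(toy$sex, toy$diagnosis, toy$severity)

# Two-way table of sex by diagnosis for the uncomplicated cases only,
# selected by level name or by position (levels sort alphabetically)
by_name <- three_way_toy[, , "uncomplicated"]
by_pos <- three_way_toy[, , 2]

# One-way table of sex for uncomplicated cases with no appendicitis
one_way <- three_way_toy[, "no appendicitis", "uncomplicated"]

print(by_name)
print(one_way)
```

Indexing by level name is a little longer to type, but it keeps working if the level order ever changes.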
Lastly, just note that you can supply a data frame instead of the individual vectors.
table(app_data[, c("SexF", "DiagnosisF")])
DiagnosisF
SexF appendicitis no appendicitis
Female 200 176
Male 262 141
Via the tidyverse
Ok, great. But we might want to stay in the tidyverse. We can use the dplyr::summarize() function to compute summaries on a tibble. This generally outputs a tibble with fewer rows than the original (as we are summarizing the variables to view them in a more compact form). We often use group_by() to set a grouping variable. Any summary done will respect the groupings!
Any of the common summarization functions you can think of are likely permissible in summarize(). The one for counting values is simply n(). Let’s recreate all of our above tables under the tidyverse method.
One-way table:
app_data |>
  group_by(SexF) |>
  summarize(count = n())
# A tibble: 3 x 2
SexF count
<fct> <int>
1 Female 377
2 Male 403
3 <NA> 2
Notice that NA values are included by default (probably a good thing). We can remove those with tidyr::drop_na().
`summarise()` has grouped output by 'SexF'. You can override using the
`.groups` argument.
# A tibble: 4 x 3
# Groups: SexF [2]
SexF DiagnosisF count
<fct> <fct> <int>
1 Female appendicitis 200
2 Female no appendicitis 176
3 Male appendicitis 262
4 Male no appendicitis 141
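A two-way table like the one above can be built by grouping on both variables. The sketch below is one way to do it on a made-up tibble (toy and two_way_counts are hypothetical names). It drops rows with a missing value in either variable and sets .groups = "drop", which returns an ungrouped tibble and silences the message about grouped output.

```r
library(dplyr)
library(tidyr)

# Made-up data standing in for app_data
toy <- tibble(
  SexF = factor(c("Female", "Male", "Male", NA, "Female", "Male")),
  DiagnosisF = factor(c("appendicitis", "appendicitis", "no appendicitis",
                        "appendicitis", "no appendicitis", "appendicitis"))
)

two_way_counts <- toy |>
  drop_na(SexF, DiagnosisF) |>               # remove rows with NA in either column
  group_by(SexF, DiagnosisF) |>              # one group per combination
  summarize(count = n(), .groups = "drop")   # count rows, then ungroup

print(two_way_counts)
```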
Nice. But that isn’t the best format for viewing (i.e. a wider format would be more compact for displaying). Let’s use tidyr::pivot_wider() to fix that!
`summarise()` has grouped output by 'SexF', 'DiagnosisF'. You can override
using the `.groups` argument.
# A tibble: 4 x 4
# Groups: SexF, DiagnosisF [4]
SexF DiagnosisF complicated uncomplicated
<fct> <fct> <int> <int>
1 Female appendicitis 55 145
2 Female no appendicitis 1 175
3 Male appendicitis 63 199
4 Male no appendicitis NA 141
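A wide table of this shape could come from a pipeline like the following. This is a sketch on made-up data (toy and wide_counts are hypothetical names), assuming the severity levels are spread into columns with names_from and the counts fill the cells with values_from.

```r
library(dplyr)
library(tidyr)

# Made-up data standing in for app_data
toy <- tibble(
  SexF = c("Female", "Female", "Male", "Male", "Male", "Female", "Female"),
  DiagnosisF = c("appendicitis", "no appendicitis", "appendicitis",
                 "appendicitis", "no appendicitis", "appendicitis",
                 "no appendicitis"),
  SeverityF = c("uncomplicated", "uncomplicated", "complicated",
                "uncomplicated", "uncomplicated", "complicated",
                "uncomplicated")
)

wide_counts <- toy |>
  group_by(SexF, DiagnosisF, SeverityF) |>
  summarize(count = n(), .groups = "drop") |>
  # one column per severity level, filled with the counts
  pivot_wider(names_from = SeverityF, values_from = count)

print(wide_counts)
```

An NA in the wide table means that combination never occurred in the data. If we would rather show a zero, pivot_wider() has a values_fill argument for that.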
Making it Pretty
When we create these kinds of tables, we often want to include them in some kind of final document. A great way to customize the look of tables is through the gt package!
library(gt)
We can use the gt() and tab_header() functions (among other things) to easily create nicer looking tables!
app_data[1:10, ] |>
  select(Age, Sex, Height, Severity, Diagnosis) |>
  gt() |>
  tab_header(
    title = "First 10 rows of Data",
    subtitle = "Data describes attributes of hospitalized patients"
  )
First 10 rows of Data
Data describes attributes of hospitalized patients
Age    Sex     Height  Severity       Diagnosis
12.68  female  148     uncomplicated  appendicitis
14.10  male    147     uncomplicated  no appendicitis
14.14  female  163     uncomplicated  no appendicitis
16.37  female  165     uncomplicated  no appendicitis
11.08  female  163     uncomplicated  appendicitis
11.05  male    121     uncomplicated  no appendicitis
8.98   female  140     uncomplicated  no appendicitis
7.06   female  NA      uncomplicated  no appendicitis
7.90   male    131     uncomplicated  no appendicitis
14.34  male    174     uncomplicated  appendicitis
We can take our contingency table and make it look a bit nicer.
We can use other functions such as sub_missing() to change the NA values and put a better label above the Complicated/Uncomplicated column headers with tab_spanner().
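Here is a minimal sketch of how those pieces might fit together. The counts are typed in from the wide table shown earlier. The object names, the title, and the "Severity" spanner label are our own choices for illustration, and replacing NA with "0" assumes a missing cell simply means no patients fell in that combination.

```r
library(gt)

# Counts copied from the wide contingency table shown earlier
wide_counts <- data.frame(
  SexF = c("Female", "Female", "Male", "Male"),
  DiagnosisF = c("appendicitis", "no appendicitis",
                 "appendicitis", "no appendicitis"),
  complicated = c(55L, 1L, 63L, NA),
  uncomplicated = c(145L, 175L, 199L, 141L)
)

pretty_tab <- wide_counts |>
  gt() |>
  # title is an illustrative choice
  tab_header(title = "Counts by sex, diagnosis, and severity") |>
  # put one label above the two severity columns
  tab_spanner(label = "Severity", columns = c(complicated, uncomplicated)) |>
  # show missing cells as 0 rather than NA
  sub_missing(missing_text = "0")

pretty_tab
```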