We now know how to summarize categorical data and we’ve learned the basics of ggplot2. Now we’re ready to investigate how to summarize numeric variables. Recall:
Numeric (Quantitative) variable - entries are numerical values on which math can be performed
As before, our goal is to describe the distribution of the variable. We talked about this briefly:
For a single numeric variable, describe the distribution via
Shape: Histogram, Density plot, …
Measures of center: Mean, Median, …
Measures of spread: Variance, Standard Deviation, Quartiles, IQR, …
For two numeric variables, describe the distribution via
Shape: Scatter plot, …
Measures of linear relationship: Covariance, Correlation
First, let’s read in the appendicitis data from the previous lecture.
We’ll utilize the summarize() function along with group_by() to find most of our numerical summaries.
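As a quick reminder of how these two functions fit together, here is a minimal sketch. The `toy_data` tibble and its values are made up for illustration (they are not the appendicitis data); only the column names mirror the ones we use in this lecture.

```r
library(dplyr)

# Hypothetical toy data standing in for app_data (values are made up)
toy_data <- tibble(
  Diagnosis = c("appendicitis", "appendicitis",
                "no appendicitis", "no appendicitis"),
  Age = c(10, 14, 9, 13)
)

# group_by() defines the groups; summarize() then returns one row per group
# with one column for each summary we ask for
age_summary <- toy_data |>
  group_by(Diagnosis) |>
  summarize(mean_Age = mean(Age), median_Age = median(Age))

age_summary
```

Dropping the group_by() line gives a single row summarizing the whole data set instead.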
As we discussed, we can’t really describe the entire distribution with a single number, so we summarize different aspects of the distribution instead: in particular, its center and its spread.
Measures of Center
We can find the mean and median via the mean() and median() functions.
# A tibble: 1 x 34
mean_Age median_Age mean_BMI median_BMI mean_Height median_Height mean_Weight
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA
# i 27 more variables: median_Weight <dbl>, mean_Length_of_Stay <dbl>,
# median_Length_of_Stay <dbl>, mean_Alvarado_Score <dbl>,
# median_Alvarado_Score <dbl>, mean_Paedriatic_Appendicitis_Score <dbl>,
# median_Paedriatic_Appendicitis_Score <dbl>, mean_Appendix_Diameter <dbl>,
# median_Appendix_Diameter <dbl>, mean_Body_Temperature <dbl>,
# median_Body_Temperature <dbl>, mean_WBC_Count <dbl>,
# median_WBC_Count <dbl>, mean_Neutrophil_Percentage <dbl>, ...
Oh, darn. That’s right, we have missing values. We can remove those one column at a time instead of dropping entire rows (as we did with drop_na()). This is a bit more complicated, but we can pass additional arguments to the mean() and median() functions in our named list.
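One way to do this is sketched below, assuming the summaries are built with across() and a named list of functions (the column names in the output above suggest this, but the exact call may differ). The `toy_data` tibble is hypothetical, with a missing value planted in each column.

```r
library(dplyr)

# Hypothetical toy data (values are made up); each column has one NA
toy_data <- tibble(
  Age = c(10, 14, NA, 12),
  BMI = c(18, NA, 20, 22)
)

# Passing the bare functions: a single NA makes every summary NA
toy_data |>
  summarize(across(where(is.numeric),
                   list("mean" = mean, "median" = median),
                   .names = "{.fn}_{.col}"))

# Wrapping each function lets us supply na.rm = TRUE, so the missing
# values are dropped separately for each column
na_safe <- toy_data |>
  summarize(across(where(is.numeric),
                   list("mean" = function(x) mean(x, na.rm = TRUE),
                        "median" = function(x) median(x, na.rm = TRUE)),
                   .names = "{.fn}_{.col}"))

na_safe
```

Because na.rm = TRUE acts inside each function call, a row with a missing BMI still contributes its Age to the Age summaries, which is not the case if we drop whole rows first.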
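The same kind of grouped summary works for the correlation between two numeric variables. Calling cor() on BMI and Age within each Diagnosis and Sex group, with no extra arguments, returns:
`summarise()` has grouped output by 'Diagnosis'. You can override using the
`.groups` argument.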
# A tibble: 4 x 3
# Groups: Diagnosis [2]
Diagnosis Sex correlation
<chr> <chr> <dbl>
1 appendicitis female NA
2 appendicitis male NA
3 no appendicitis female NA
4 no appendicitis male NA
Oh yeah, missing values. Unfortunately, base R isn’t that consistent about how its functions handle them: cor() has no na.rm argument. To deal with missing values appropriately, we can look at the help for cor().
use
an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings “everything”, “all.obs”, “complete.obs”, “na.or.complete”, or “pairwise.complete.obs”.
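To see what this argument does in isolation, here is a small base R sketch. The vectors `x` and `y` are made up for illustration, each with one missing entry.

```r
# Hypothetical toy vectors (values are made up); each has one missing entry
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, NA, 10)

# Default (use = "everything"): the missing values propagate, so we get NA
cor(x, y)

# "complete.obs" keeps only the observations where both x and y are present
r_complete <- cor(x, y, use = "complete.obs")

# With just two variables, "pairwise.complete.obs" uses those same observations
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

c(complete = r_complete, pairwise = r_pairwise)
```

The two options only differ when cor() is given more than two variables (for example, a data frame), where the pairwise version can use different rows for each pair of columns.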
app_data |>
  group_by(Diagnosis, Sex) |>
  drop_na(Diagnosis, Sex) |>
  summarize(correlation = cor(BMI, Age, use = "pairwise.complete.obs"))
`summarise()` has grouped output by 'Diagnosis'. You can override using the
`.groups` argument.
# A tibble: 4 x 3
# Groups: Diagnosis [2]
Diagnosis Sex correlation
<chr> <chr> <dbl>
1 appendicitis female 0.556
2 appendicitis male 0.462
3 no appendicitis female 0.413
4 no appendicitis male 0.422
Great - we can do all our basic numerical summaries!
Recap!
We tend to describe the center and spread of a numeric variable’s distribution. Often we want to compare across groups and that can be done with group_by().
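As a compact sketch of that idea, the code below computes one measure of center and two measures of spread for each group. The `toy_data` tibble is hypothetical (made-up values, not the appendicitis data), and the choice of sd() and IQR() is just one reasonable pairing.

```r
library(dplyr)

# Hypothetical toy data (values are made up), with one missing Age
toy_data <- tibble(
  Diagnosis = c("appendicitis", "appendicitis", "appendicitis",
                "no appendicitis", "no appendicitis", "no appendicitis"),
  Age = c(10, 12, 14, 8, NA, 12)
)

# Center (mean) and spread (standard deviation, IQR) within each group,
# ignoring missing values in Age
spread_summary <- toy_data |>
  group_by(Diagnosis) |>
  summarize(
    mean_Age = mean(Age, na.rm = TRUE),
    sd_Age   = sd(Age, na.rm = TRUE),
    IQR_Age  = IQR(Age, na.rm = TRUE)
  )

spread_summary
```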