Basic use of R for reading, manipulating, and plotting data!
Understand types of data and their distributions
Numerical summaries (across subgroups)
Graphical summaries (across subgroups)
How to summarize data?
Depends on data type:
Common goal: Describe the distribution of the variable
Distribution = pattern and frequency with which you observe a variable
Categorical variable - entries are a label or attribute
Contingency tables via table(), with data from titanic.csv
titanicData <- read_csv("../datasets/titanic.csv")
titanicData
## # A tibble: 1,310 x 14
##   pclass survived name      sex      age sibsp parch ticket  fare cabin embarked
##    <dbl>    <dbl> <chr>     <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr> <chr>
## 1      1        1 Allen, M~ fema~ 29         0     0 24160   211. B5    S
## 2      1        1 Allison,~ male   0.917     1     2 113781  152. C22 ~ S
## 3      1        0 Allison,~ fema~  2         1     2 113781  152. C22 ~ S
## 4      1        0 Allison,~ male  30         1     2 113781  152. C22 ~ S
## 5      1        0 Allison,~ fema~ 25         1     2 113781  152. C22 ~ S
## # ... with 1,305 more rows, and 3 more variables: boat <chr>, body <dbl>,
## #   home.dest <chr>
Create one-way contingency tables for each of three categorical variables:
table(titanicData$embarked)
##
##   C   Q   S
## 270 123 914
table(titanicData$survived)
##
##   0   1
## 809 500
table(titanicData$sex)
##
## female   male
##    466    843
table(titanicData$survived,
titanicData$sex)
##
##     female male
##   0    127  682
##   1    339  161
table(titanicData$survived,
titanicData$embarked)
##
##       C   Q   S
##   0 120  79 610
##   1 150  44 304
table(titanicData$sex,
titanicData$embarked)
##
##            C   Q   S
##   female 113  60 291
##   male   157  63 623
table(titanicData$sex, titanicData$embarked, titanicData$survived)
## , ,  = 0
##
##
##            C   Q   S
##   female  11  23  93
##   male   109  56 517
##
## , ,  = 1
##
##
##            C   Q   S
##   female 102  37 198
##   male    48   7 106
Create a three-way contingency table for three categorical variables (order matters for output!)
Example of an array! 3 dimensions [ , , ]
tab <- table(titanicData$sex, titanicData$embarked, titanicData$survived)
str(tab)
##  'table' int [1:2, 1:3, 1:2] 11 109 23 56 93 517 102 48 37 7 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "female" "male"
##   ..$ : chr [1:3] "C" "Q" "S"
##   ..$ : chr [1:2] "0" "1"
#returns embarked vs survived table for females
tab[1, , ]
##
##       0   1
##   C  11 102
##   Q  23  37
##   S  93 198
#returns embarked vs survived table for males
tab[2, , ]
##
##       0   1
##   C 109  48
##   Q  56   7
##   S 517 106
#returns survived vs sex table for embarked "C"
tab[, 1, ]
##
##            0   1
##   female  11 102
##   male   109  48
#Survived status for males that embarked at "Q"
tab[2, 2, ]
##  0  1
## 56  7
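The same bracket indexing also works with level names instead of positions. A minimal sketch, assuming the built-in mtcars data as a stand-in for titanicData (so it runs without the CSV file); the dimension labels cyl, gear, and am are chosen here for illustration only:

```r
# Three-way table from the built-in mtcars data (stand-in for titanicData)
tab3 <- table(cyl = mtcars$cyl, gear = mtcars$gear, am = mtcars$am)
dim(tab3)  # one dimension per variable, in the order given to table()

# Slice by position or by level name; both select the 4-cylinder cars
by_position <- tab3[1, , ]
by_name <- tab3["4", , ]

# Collapse the third dimension to recover the two-way cyl-by-gear table
two_way <- margin.table(tab3, margin = c(1, 2))
two_way
```

Indexing by name tends to be easier to read later than remembering which position a level occupies.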
Numeric variable - entries are a numerical value where math can be performed
Single variable: describe the distribution via
Shape: Histogram, Density plot, …
Measures of center: Mean, Median, …
Measures of spread: Variance, Standard Deviation, Quartiles, IQR, …
Two Variables:
Shape: Scatter plot, …
Measures of linear relationship: Covariance, Correlation, …
Look at carbon dioxide (CO2) uptake data set
uptake - CO2 uptake rates in grass plants
Treatment - chilled/nonchilled
conc - ambient CO2 concentration
CO2 <- as_tibble(CO2)
CO2
## # A tibble: 84 x 5
##   Plant Type   Treatment   conc uptake
##   <ord> <fct>  <fct>      <dbl>  <dbl>
## 1 Qn1   Quebec nonchilled    95   16
## 2 Qn1   Quebec nonchilled   175   30.4
## 3 Qn1   Quebec nonchilled   250   34.8
## 4 Qn1   Quebec nonchilled   350   37.2
## 5 Qn1   Quebec nonchilled   500   35.3
## # ... with 79 more rows
Mean & Median
mean(CO2$uptake)
## [1] 27.2131
#note you can easily get a trimmed mean
mean(CO2$uptake, trim = 0.05) #5% trimmed mean
## [1] 27.25263
median(CO2$uptake)
## [1] 28.3
Variance, Standard Deviation, Quartiles, & IQR
#quartiles and mean
summary(CO2$uptake)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    7.70   17.90   28.30   27.21   37.12   45.50
var(CO2$uptake)
## [1] 116.9515
sd(CO2$uptake)
## [1] 10.81441
IQR(CO2$uptake)
## [1] 19.225
quantile(CO2$uptake, probs = c(0.1, 0.2))
##   10%   20%
## 12.36 15.64
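The IQR reported above is the distance between the first and third quartiles. A short check using the built-in CO2 data (the variable names below are illustrative, not from the slides):

```r
uptake <- CO2$uptake

# First and third quartiles
quartiles <- quantile(uptake, probs = c(0.25, 0.75))

# IQR by hand: third quartile minus first quartile
iqr_by_hand <- unname(quartiles[2] - quartiles[1])

all.equal(iqr_by_hand, IQR(uptake))
```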
Covariance & Correlation
cov(CO2$conc, CO2$uptake)
## [1] 1552.687
cor(CO2$conc, CO2$uptake)
## [1] 0.4851774
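Correlation is covariance rescaled by the two standard deviations, and variance is the squared standard deviation. A quick sketch of both relationships on the built-in CO2 data (variable names are illustrative):

```r
conc <- CO2$conc
uptake <- CO2$uptake

# Correlation = covariance / (sd of x * sd of y)
cor_by_hand <- cov(conc, uptake) / (sd(conc) * sd(uptake))
all.equal(cor_by_hand, cor(conc, uptake))

# Variance = standard deviation squared
all.equal(sd(uptake)^2, var(uptake))
```

Because of the rescaling, correlation is unitless and always lies between -1 and 1, while covariance depends on the units of both variables.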
Usually want summaries for different subgroups of data
Ex: Get similar uptake summaries for each Treatment
dplyr is easy to use (although each summary can only return one value per group)
Idea:
Use group_by to create subgroups associated with the data frame
Use summarize to create basic summaries for each subgroup
CO2 %>%
group_by(Treatment) %>%
summarise(avg = mean(uptake), med = median(uptake), var = var(uptake))
## # A tibble: 2 x 4
##   Treatment    avg   med   var
##   <fct>      <dbl> <dbl> <dbl>
## 1 nonchilled  30.6  31.3  94.2
## 2 chilled     23.8  19.7 118.
CO2 %>%
group_by(Treatment, conc) %>%
summarise(avg = mean(uptake), med = median(uptake), var = var(uptake))
## # A tibble: 14 x 5
## # Groups:   Treatment [2]
##    Treatment   conc   avg   med    var
##    <fct>      <dbl> <dbl> <dbl>  <dbl>
##  1 nonchilled    95  13.3  12.8   5.75
##  2 nonchilled   175  25.1  24.6  32.6
##  3 nonchilled   250  32.5  32.7  35.1
##  4 nonchilled   350  35.1  34.5  37.4
##  5 nonchilled   500  35.1  33.8  31.9
##  6 nonchilled   675  36.0  35.8  40.2
##  7 nonchilled  1000  37.4  37.6  49.8
##  8 chilled       95  11.2  10.6   8.18
##  9 chilled      175  19.4  19.5  34.7
## 10 chilled      250  25.3  24.2 112.
## 11 chilled      350  26.2  26.4 117.
## 12 chilled      500  26.6  26   131.
## 13 chilled      675  27.9  28.8 120.
## 14 chilled     1000  29.8  30.3 154.
dplyr has variations on summarise that can be used:
summarise_all() - Apply functions to every column
summarise_at() - Apply functions to specific columns
summarise_if() - Apply functions to all columns of one type
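In more recent dplyr releases, across() inside summarise() covers similar ground to these variants. A sketch assuming dplyr 1.0 or later is installed; the object names co2 and group_summaries are illustrative:

```r
library(dplyr)

co2 <- as_tibble(CO2)

# Apply mean and sd to both numeric columns within each Treatment group;
# result columns are named <column>_<function> by default
group_summaries <- co2 %>%
  group_by(Treatment) %>%
  summarise(across(c(conc, uptake), list(mean = mean, sd = sd)))

group_summaries
```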
Ex: Get similar uptake summaries for each Treatment
Built-in aggregate() function more general
Basic use gives response (x) and a list of variables to group by
aggregate(x = CO2$uptake, by = list(CO2$Treatment), FUN = summary)
##      Group.1   x.Min. x.1st Qu. x.Median   x.Mean x.3rd Qu.   x.Max.
## 1 nonchilled 10.60000  26.47500 31.30000 30.64286  38.70000 45.50000
## 2    chilled  7.70000  14.52500 19.70000 23.78333  34.90000 42.40000
aggregate() is commonly used with formula notation!
uptake ~ Treatment is an example of formula notation
aggregate(uptake ~ Treatment, data = CO2, FUN = summary)
uptake ~ Treatment + conc models uptake by the levels of both Treatment and conc
aggregate(uptake ~ Treatment + conc, data = CO2, FUN = summary)
##     Treatment conc uptake.Min. uptake.1st Qu. uptake.Median uptake.Mean
## 1  nonchilled   95    10.60000       11.47500      12.80000    13.28333
## 2     chilled   95     7.70000        9.60000      10.55000    11.23333
## 3  nonchilled  175    19.20000       20.05000      24.65000    25.11667
## 4     chilled  175    11.40000       15.67500      19.50000    19.45000
## 5  nonchilled  250    25.80000       27.30000      32.70000    32.46667
## 6     chilled  250    12.30000       17.95000      24.20000    25.28333
## 7  nonchilled  350    27.90000       30.45000      34.50000    35.13333
## 8     chilled  350    13.00000       18.15000      26.45000    26.20000
## 9  nonchilled  500    28.50000       31.27500      33.85000    35.10000
## 10    chilled  500    12.50000       18.30000      26.00000    26.65000
## 11 nonchilled  675    28.10000       31.42500      35.80000    36.01667
## 12    chilled  675    13.70000       19.72500      28.80000    27.88333
## 13 nonchilled 1000    27.80000       32.50000      37.60000    37.38333
## 14    chilled 1000    14.40000       20.40000      30.30000    29.78333
##    uptake.3rd Qu. uptake.Max.
## 1        15.40000    16.20000
## 2        13.30000    15.10000
## 3        29.62500    32.40000
## 4        23.32500    27.30000
## 5        36.52500    40.30000
## 6        33.82500    38.10000
## 7        40.65000    42.10000
## 8        34.45000    38.80000
## 9        39.27500    42.90000
## 10       37.07500    38.90000
## 11       40.85000    43.90000
## 12       36.97500    39.60000
## 13       43.15000    45.50000
## 14       40.72500    42.40000
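Part of what makes aggregate() more general is that FUN may return several values at once, as summary does above. A minimal sketch with a user-written function; mean_sd is a hypothetical helper defined here, not something from the slides:

```r
co2 <- as.data.frame(CO2)

# Hypothetical helper returning two named values per group
mean_sd <- function(u) c(mean = mean(u), sd = sd(u))

by_treatment <- aggregate(uptake ~ Treatment, data = co2, FUN = mean_sd)

# The uptake column is a matrix with one column per returned value
colnames(by_treatment$uptake)
by_treatment
```

The matrix-valued column explains the printed names such as uptake.Min. and uptake.Mean in the output above.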
Understand types of data and their distributions
Numerical summaries
table (contingency tables)
mean, median
sd, var, IQR
quantile for more general quantiles
Across subgroups with dplyr::group_by and dplyr::summarize or aggregate
Graphical summaries (across subgroups)
Three major systems for plotting:
Base R (built-in functions)
Lattice
ggplot2 (sort of part of the tidyverse - Cheat Sheet)
Great reference book here!
ggplot2 Plotting
ggplot2 basics (Cheat Sheet)
ggplot(data = data_frame) creates a plot instance
factors: vectors with a levels attribute
Well suited to variables like Day (Monday, Tuesday, …)
Less suited to variables like Name where new values may come up
factors with the titanic.csv data
#convert survival status to a factor
titanicData$survived <- as.factor(titanicData$survived)
levels(titanicData$survived) #R knows it isn't numeric now
## [1] "0" "1"
titanicData$survived[1] <- "5"
## Warning in `[<-.factor`(`*tmp*`, 1, value = structure(c(NA, 2L, 1L, 1L, :
## invalid factor level, NA generated
Renaming factor levels:
levels(titanicData$survived) <- c("Died", "Survived")
levels(titanicData$survived)
## [1] "Died" "Survived"
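The warning above arises because a factor only accepts values that are already among its levels. A small self-contained sketch of that behavior and one way around it; the vector status is made up for illustration:

```r
# Toy survival codes (illustrative data, not from titanic.csv)
status <- factor(c(0, 1, 1, 0))
levels(status) <- c("Died", "Survived")

# A value outside the levels becomes NA (the warning is suppressed here)
rejected <- status
suppressWarnings(rejected[1] <- "Unknown")
is.na(rejected[1])

# Adding the level first makes the same assignment valid
extended <- status
levels(extended) <- c(levels(extended), "Unknown")
extended[1] <- "Unknown"
as.character(extended)
```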
ggplot2 Plotting: Categorical variables
Categorical variable - entries are a label or attribute
Generally, describe distribution using a barplot!
Prepare the data for ggplot + geom_bar:
titanicData <- read_csv("../datasets/titanic.csv")
titanicData$mySurvived <- as.factor(titanicData$survived)
levels(titanicData$mySurvived) <- c("Died", "Survived")
titanicData$myEmbarked <- as.factor(titanicData$embarked)
levels(titanicData$myEmbarked) <- c("Cherbourg", "Queenstown", "Southampton")
titanicData <- titanicData %>% drop_na(mySurvived, sex, myEmbarked)
ggplot2 barplots
Barplots via ggplot + geom_bar
Across x-axis we want our categories - specify with aes(x = ...)
ggplot(data = titanicData, aes(x = mySurvived))
Must add geom (or stat) layer!
ggplot(data = titanicData, aes(x = mySurvived)) + geom_bar()
g <- ggplot(data = titanicData, aes(x = mySurvived))
g + geom_bar()
aes() defines visual properties of objects in the plot
    x = , y = , size = , shape = , color = , alpha = , ...
Common aes values for a geom (from cheat sheet):
    d + geom_bar()
    x, alpha, color, fill, linetype, size, weight
ggplot2 global and local aesthetics
data and aes can be set in two ways:
‘globally’ (for all layers) via the ggplot statement
‘locally’ (for just that layer) via the geom, stat, etc. layer
#global
ggplot(data = titanicData, aes(x = mySurvived)) + geom_bar()
#local
ggplot() + geom_bar(data = titanicData, aes(x = mySurvived))
For values that do not depend on the data (e.g. color = 'blue'), generally place these outside of the aes
ggplot2 barplots
ggplot(data = titanicData, aes(x = mySurvived)) +
geom_bar() +
labs(x = "Survival Status", title = "Bar Plot of Survival for Titanic Passengers")
ggplot2 stacked barplots
Stacked barplot created via the fill aesthetic and the same process
Automatic assignment of colors, creation of legends, etc. for aes elements (except with group)
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + geom_bar()
ggplot2 labeling
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) +
geom_bar() +
labs(x = "Survival Status",
title = "Bar Plot of Survival Status for Titanic Passengers") +
scale_fill_discrete(name = "Sex", labels = c("Female", "Male"))
ggplot2 labeling
Labels are adjusted with the scale_*_discrete function matching each aesthetic, e.g. for aes(x = survived, fill = sex):
scale_x_discrete(labels = c("Person Died", "Person Survived"))
scale_fill_discrete(name = "Sex", labels = c("Female","Male"))
ggplot2 horizontal barplots
Horizontal bars via coord_flip
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) +
geom_bar() +
labs(x = "Survival Status",
title = "Bar Plot of Survival Status for Titanic Passengers") +
scale_x_discrete(labels = c("Person Died", "Person Survived")) +
scale_fill_discrete(name = "Sex", labels = c("Female", "Male")) +
coord_flip()
ggplot2 stat vs geom layers
Note: Most geoms have a corresponding stat that can be used
geom_bar(mapping = NULL, data = NULL, stat = "count", position = "stack", ..., width = NULL, binwidth = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
ggplot(data = titanicData, aes(x = survived, fill = sex)) + geom_bar()
ggplot(data = titanicData, aes(x = survived, fill = sex)) + stat_count()
Use stat = "identity" when the data are already summarized
sumData <- titanicData %>% group_by(survived, sex) %>% summarize(count = n())
ggplot(sumData, aes(x = survived, y = count, fill = sex)) + geom_bar(stat = "identity")
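To see that counting inside the plot and plotting precomputed counts give the same bars, the computed layer data can be compared. A sketch assuming ggplot2 is installed and using the built-in mtcars data in place of titanicData; geom_col() is ggplot2's shorthand for geom_bar(stat = "identity"), and the object names are illustrative:

```r
library(ggplot2)

# Raw data: geom_bar() counts rows per category (stat = "count")
raw_plot <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# Pre-summarized data: geom_col() plots the supplied heights as-is
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))
summary_plot <- ggplot(cyl_counts, aes(x = cyl, y = Freq)) + geom_col()

# Both layers end up with the same bar heights
layer_data(raw_plot)$count
layer_data(summary_plot)$y
```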
ggplot2 side-by-side barplots
Side-by-side barplot created via the position argument
dodge for side-by-side bar plot
jitter for continuous data with many points at same values
fill stacks bars and standardises each stack to have constant height
stack stacks bars on top of each other
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + geom_bar(position = "dodge")
ggplot2 filled barplots
position = "fill" stacks bars and standardises each stack to have constant height (especially useful with equal group sizes)
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + geom_bar(position = "fill")
ggplot2 faceting
How to create same plot for each myEmbarked value? Use faceting!
facet_wrap(~ var) - creates a plot for each setting of var
Can specify nrow and ncol or let R figure it out
facet_grid(var1 ~ var2) - creates a plot for each combination of var1 and var2
var1 values across rows
var2 values across columns
Use . ~ var2 or var1 ~ . to have only one row or column
ggplot(data = titanicData, aes(x = mySurvived)) +
geom_bar(aes(fill = sex), position = "dodge") +
facet_wrap(~ myEmbarked)
ggplot2 Plotting Recap
General ggplot things:
Can set local or global aes
Modify titles/labels by adding more layers
Faceting (multiple plots) via facet_grid or facet_wrap
Only need aes if setting a mapping value that is dependent on the data (or you want to create a custom legend!)
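The last point can be illustrated by comparing a constant fill with a mapped fill. A sketch assuming ggplot2 is installed, with the built-in mtcars data standing in for titanicData; the object names are illustrative:

```r
library(ggplot2)

# Constant: fill is set outside aes(), so every bar is blue and no legend appears
constant_fill <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "blue")

# Mapping: fill depends on the data, so it goes inside aes() and a legend is created
mapped_fill <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(aes(fill = factor(am)))

# Inspect the fill actually used by each layer
unique(layer_data(constant_fill)$fill)
length(unique(layer_data(mapped_fill)$fill))
```

Writing aes(fill = "blue") instead would treat "blue" as a data value with a single category rather than as a color.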
ggplot2 Plotting: Numeric Variables
Numeric variables - generally, describe distribution via a histogram or boxplot!
Same process:
ggplot2 smoothed histogram
Kernel Smoother - Smoothed version of a histogram
Common aes values (from cheat sheet):
    c + geom_density(kernel = "gaussian")
    x, y, alpha, color, fill, group, linetype, size, weight
Only x = is really needed
g <- ggplot(titanicData, aes(x = age))
g + geom_density()
fill a useful aesthetic!
g + geom_density(adjust = 0.5, alpha = 0.5, aes(fill = mySurvived))
recall position choices of dodge, jitter, fill, and stack
g + geom_density(adjust = 0.5, alpha = 0.5, position = "stack", aes(fill = mySurvived))
ggplot2 boxplots
Boxplot - Provides the five number summary in a graph
Common aes values (from cheat sheet):
    f + geom_boxplot()
    x, y, lower, middle, upper, ymax, ymin, alpha, color, fill, group, linetype, shape, size, weight
Only x =, y = are really needed
g <- ggplot(titanicData, aes(x = mySurvived, y = age))
g + geom_boxplot(fill = "grey")
ggplot2 boxplots with points
g + geom_boxplot(fill = "grey") + geom_jitter(width = 0.1, alpha = 0.3)
Layer order matters: here the boxplot is drawn on top of the points
g + geom_jitter(width = 0.1, alpha = 0.3) + geom_boxplot(fill = "grey")
ggplot2 faceting
facet easily!
g + geom_boxplot(fill = "grey") + geom_jitter(width = 0.1, alpha = 0.3) + facet_wrap(~ myEmbarked)
ggplot2 scatter plots
Two numerical variables
Scatter Plot - graphs points corresponding to each observation
Common aes values (from cheat sheet):
    e + geom_point()
    x, y, alpha, color, fill, shape, size, stroke
Only x =, y = are really needed
g <- ggplot(titanicData, aes(x = age, y = fare))
g + geom_point()
ggplot2 scatter plots with trend line
g + geom_point() +
geom_smooth(aes(col = "loess")) +
geom_smooth(method = lm, aes(col = "linear")) +
scale_colour_manual(name = 'Smoother', values =c('linear'='red', 'loess'='purple'),
labels = c('Linear','GAM'), guide = 'legend')
ggplot2 scatter plots with text
May want to add value of correlation to plot
paste() or paste0() handy
paste("Hi", "What", "Is", "Going", "On", "?", sep = " ")
## [1] "Hi What Is Going On ?"
paste("Hi", "What", "Is", "Going", "On", "?", sep = ".")
## [1] "Hi.What.Is.Going.On.?"
paste0("Hi", "What", "Is", "Going", "On", "?")
## [1] "HiWhatIsGoingOn?"
ggplot2 scatter plots with text
correlation <- cor(titanicData$fare, titanicData$age, use = "complete.obs")
g + geom_point() +
geom_smooth(method = lm, col = "Red") +
geom_text(x = 40, y = 400, size = 5,
label = paste0("Correlation = ", round(correlation, 2)))
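The label-building step can be checked on its own before it goes into a plot. A minimal sketch on the built-in CO2 data, plus a made-up pair of vectors (x and y, illustrative only) showing why use = "complete.obs" is needed when values are missing:

```r
# Build the annotation text from a rounded correlation
r <- cor(CO2$conc, CO2$uptake)
label <- paste0("Correlation = ", round(r, 2))
label

# With a missing value, cor() returns NA unless incomplete rows are dropped
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 9, 10)
cor(x, y)                        # NA
cor(x, y, use = "complete.obs")  # uses only the first four pairs
```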
ggplot2 scatter plots with text points
Text for points via geom_text
g + geom_text(aes(label = survived, color = mySurvived))
ggplot2 faceting
g + geom_point(aes(color = sex), size = 2.5) + facet_wrap(~ myEmbarked)
ggpairs
Many extension packages that do nice things!
library(GGally) #install GGally if needed
ggpairs(iris, aes(colour = Species, alpha = 0.4))
ggplot2 Plotting: Numeric variables
Numeric variable - entries are a numerical value where math can be performed
Most common plots:
Histogram (geom_histogram), Density (geom_density)
Boxplot (geom_boxplot), Violin plot (geom_violin)
Scatter plot (geom_point), Smoothers (geom_smooth)
Jittered points (geom_jitter)
Text on plot (geom_text)
ggplot2 Plotting Recap
General ggplot things:
Can set local or global aes
Modify titles/labels by adding more layers
Use either stat or geom layer
Faceting (multiple plots) via facet_grid or facet_wrap
Only need aes if setting a mapping value that is dependent on the data (or you want to create a custom legend!)
esquisse is a great package for exploring ggplot2!