Basic use of R for reading, manipulating, and plotting data!
Understand types of data and their distributions
Numerical summaries (across subgroups)
Graphical summaries (across subgroups)
How to summarize data?
Depends on data type (categorical or numeric):
Common goal: Describe the distribution of the variable
Distribution = pattern and frequency with which you observe a variable
Categorical variable - entries are a label or attribute
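Summarize with contingency tables via table
Example data: titanic.csv
library(tidyverse) #provides read_csv, as_tibble, %>%, and ggplot2
titanicData <- read_csv("../datasets/titanic.csv")
titanicData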
## # A tibble: 1,310 x 14
##   pclass survived name      sex      age sibsp parch ticket  fare cabin embarked
##    <dbl>    <dbl> <chr>     <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr> <chr>
## 1      1        1 Allen, M~ fema~ 29         0     0 24160   211. B5    S
## 2      1        1 Allison,~ male   0.917     1     2 113781  152. C22 ~ S
## 3      1        0 Allison,~ fema~  2         1     2 113781  152. C22 ~ S
## 4      1        0 Allison,~ male  30         1     2 113781  152. C22 ~ S
## 5      1        0 Allison,~ fema~ 25         1     2 113781  152. C22 ~ S
## # ... with 1,305 more rows, and 3 more variables: boat <chr>, body <dbl>,
## #   home.dest <chr>
Create one-way contingency tables for each of three categorical variables:
table(titanicData$embarked)
##
##   C   Q   S
## 270 123 914
table(titanicData$survived)
##
##   0   1
## 809 500
table(titanicData$sex)
##
## female   male
##    466    843
Create two-way contingency tables for pairs of categorical variables:
table(titanicData$survived, titanicData$sex)
##
##     female male
##   0    127  682
##   1    339  161
table(titanicData$survived, titanicData$embarked)
##
##       C   Q   S
##   0 120  79 610
##   1 150  44 304
table(titanicData$sex, titanicData$embarked)
##
##            C   Q   S
##   female 113  60 291
##   male   157  63 623
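Counts are often easier to compare as proportions. The following is a minimal sketch using base R's prop.table() and addmargins(); the `toy` data frame and its values are made up for illustration and are not the Titanic file.

```r
# Hypothetical toy data standing in for two categorical variables
toy <- data.frame(
  survived = c(0, 0, 1, 1, 1, 0, 1, 0),
  sex      = c("male", "male", "female", "female",
               "male", "male", "female", "female")
)

counts <- table(toy$survived, toy$sex)

# Add row and column totals to the table of counts
addmargins(counts)

# Proportions within each row (margin = 1) and within each column (margin = 2)
rowProps <- prop.table(counts, margin = 1)
colProps <- prop.table(counts, margin = 2)
rowProps
colProps
```

With margin = 1 each row sums to 1, which answers "of those who died, what share were male?"; margin = 2 answers the question the other way around.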
table(titanicData$sex, titanicData$embarked, titanicData$survived)
## , ,  = 0
##
##
##            C   Q   S
##   female  11  23  93
##   male   109  56 517
##
## , ,  = 1
##
##
##            C   Q   S
##   female 102  37 198
##   male    48   7 106
Create a three-way contingency table for three categorical variables (order matters for output!)
Example of an array! 3 dimensions [ , , ]
tab <- table(titanicData$sex, titanicData$embarked, titanicData$survived)
str(tab)
##  'table' int [1:2, 1:3, 1:2] 11 109 23 56 93 517 102 48 37 7 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "female" "male"
##   ..$ : chr [1:3] "C" "Q" "S"
##   ..$ : chr [1:2] "0" "1"
#returns embarked vs survived table for females
tab[1, , ]
##
##       0   1
##   C  11 102
##   Q  23  37
##   S  93 198
#returns embarked vs survived table for males
tab[2, , ]
##
##       0   1
##   C 109  48
##   Q  56   7
##   S 517 106
#returns sex vs survived table for embarked "C"
tab[, 1, ]
##
##            0   1
##   female  11 102
##   male   109  48
#Survived status for males that embarked at "Q"
tab[2, 2, ]
##  0  1
## 56  7
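Arrays can also be indexed by dimension name, and collapsed over dimensions with apply(). The sketch below uses R's built-in `Titanic` table, a four-dimensional array (Class, Sex, Age, Survived). Note that this is a different data set from titanic.csv, so its counts will not match the tables above.

```r
# Built-in 4-dimensional table: Class x Sex x Age x Survived
dim(Titanic)
dimnames(Titanic)

# Index by dimension name instead of by position
firstClassWomen <- Titanic["1st", "Female", "Adult", ]
firstClassWomen

# Collapse to a Sex x Survived table by summing over Class and Age
# (dimensions 2 and 4 are kept, the others are summed out)
sexBySurvived <- apply(Titanic, c(2, 4), sum)
sexBySurvived
```

Indexing by name tends to be easier to read than `tab[2, 2, ]` and does not break if the order of the levels changes.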
Numeric variable - entries are a numerical value where math can be performed
Single variable: describe the distribution via
Shape: Histogram, Density plot, …
Measures of center: Mean, Median, …
Measures of spread: Variance, Standard Deviation, Quartiles, IQR, …
Two Variables:
Shape: Scatter plot, …
Measures of linear relationship: Covariance, Correlation, …
Look at carbon dioxide (CO2) uptake data set
uptake - CO2 uptake rates in grass plants
Treatment - chilled/nonchilled
conc - ambient CO2 concentration
CO2 <- as_tibble(CO2)
CO2
## # A tibble: 84 x 5
##   Plant Type   Treatment   conc uptake
##   <ord> <fct>  <fct>      <dbl>  <dbl>
## 1 Qn1   Quebec nonchilled    95   16
## 2 Qn1   Quebec nonchilled   175   30.4
## 3 Qn1   Quebec nonchilled   250   34.8
## 4 Qn1   Quebec nonchilled   350   37.2
## 5 Qn1   Quebec nonchilled   500   35.3
## # ... with 79 more rows
Mean & Median
mean(CO2$uptake)
## [1] 27.2131
#note you can easily get a trimmed mean
mean(CO2$uptake, trim = 0.05) #5% trimmed mean
## [1] 27.25263
median(CO2$uptake)
## [1] 28.3
Variance, Standard Deviation, Quartiles, & IQR
#quartiles and mean
summary(CO2$uptake)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    7.70   17.90   28.30   27.21   37.12   45.50
var(CO2$uptake)
## [1] 116.9515
sd(CO2$uptake)
## [1] 10.81441
IQR(CO2$uptake)
## [1] 19.225
quantile(CO2$uptake, probs = c(0.1, 0.2))
##   10%   20%
## 12.36 15.64
Covariance & Correlation
cov(CO2$conc, CO2$uptake)
## [1] 1552.687
cor(CO2$conc, CO2$uptake)
## [1] 0.4851774
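The covariance above is large mainly because of the units of conc. A short sketch, using only base R and the built-in CO2 data, of how correlation rescales covariance by the two standard deviations and is therefore unit-free (the variable names `x`, `y`, and `manualCor` are just for illustration):

```r
x <- CO2$conc
y <- CO2$uptake

# Correlation is the covariance divided by the product of the standard deviations
manualCor <- cov(x, y) / (sd(x) * sd(y))
manualCor
cor(x, y)

# Changing the units of x changes the covariance but not the correlation
cov(x / 1000, y)
cor(x / 1000, y)
```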
Usually want summaries for different subgroups of data
Ex: Get similar uptake summaries for each Treatment
dplyr: easy to use (although each summary it computes can only return one value per group)
Idea:
Use group_by to create subgroups associated with the data frame
Use summarize to create basic summaries for each subgroup
CO2 %>%
  group_by(Treatment) %>%
  summarise(avg = mean(uptake), med = median(uptake), var = var(uptake))
## # A tibble: 2 x 4
##   Treatment    avg   med   var
##   <fct>      <dbl> <dbl> <dbl>
## 1 nonchilled  30.6  31.3  94.2
## 2 chilled     23.8  19.7 118.
CO2 %>%
  group_by(Treatment, conc) %>%
  summarise(avg = mean(uptake), med = median(uptake), var = var(uptake))
## # A tibble: 14 x 5
## # Groups:   Treatment [2]
##    Treatment   conc   avg   med    var
##    <fct>      <dbl> <dbl> <dbl>  <dbl>
##  1 nonchilled    95  13.3  12.8   5.75
##  2 nonchilled   175  25.1  24.6  32.6
##  3 nonchilled   250  32.5  32.7  35.1
##  4 nonchilled   350  35.1  34.5  37.4
##  5 nonchilled   500  35.1  33.8  31.9
##  6 nonchilled   675  36.0  35.8  40.2
##  7 nonchilled  1000  37.4  37.6  49.8
##  8 chilled       95  11.2  10.6   8.18
##  9 chilled      175  19.4  19.5  34.7
## 10 chilled      250  25.3  24.2 112.
## 11 chilled      350  26.2  26.4 117.
## 12 chilled      500  26.6  26   131.
## 13 chilled      675  27.9  28.8 120.
## 14 chilled     1000  29.8  30.3 154.
dplyr has variations on summarise that can be used:
summarise_all() - Apply functions to every column
summarise_at() - Apply functions to specific columns
summarise_if() - Apply functions to all columns of one type
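As a rough sketch of how one of these variants is called, the example below applies two functions to two columns of the built-in CO2 data with summarise_at(). In dplyr 1.0.0 and later these scoped variants are superseded by across(), so the equivalent across() call is shown as well. The names `avg`, `med`, and `byTreatment` are arbitrary choices for illustration.

```r
library(dplyr)

# Mean and median of the two numeric columns, within each Treatment
CO2 %>%
  as_tibble() %>%
  group_by(Treatment) %>%
  summarise_at(vars(conc, uptake), list(avg = mean, med = median))

# Equivalent call with across(), available in dplyr 1.0.0 and later
byTreatment <- CO2 %>%
  as_tibble() %>%
  group_by(Treatment) %>%
  summarise(across(c(conc, uptake), list(avg = mean, med = median)))
byTreatment
```

Both calls name the output columns by joining the column and function names, for example uptake_avg.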
Ex: Get similar uptake summaries for each Treatment
Built-in aggregate() function more general
Basic use gives response (x) and a list of variables to group by
aggregate(x = CO2$uptake, by = list(CO2$Treatment), FUN = summary)
##      Group.1   x.Min. x.1st Qu. x.Median   x.Mean x.3rd Qu.   x.Max.
## 1 nonchilled 10.60000  26.47500 31.30000 30.64286  38.70000 45.50000
## 2    chilled  7.70000  14.52500 19.70000 23.78333  34.90000 42.40000
aggregate() is commonly used with formula notation!
uptake ~ Treatment - is an example of formula notation
aggregate(uptake ~ Treatment, data = CO2, FUN = summary)
uptake ~ Treatment + conc - model uptake by levels of Treatment and conc
aggregate(uptake ~ Treatment + conc, data = CO2, FUN = summary)
##     Treatment conc uptake.Min. uptake.1st Qu. uptake.Median uptake.Mean
## 1  nonchilled   95    10.60000       11.47500      12.80000    13.28333
## 2     chilled   95     7.70000        9.60000      10.55000    11.23333
## 3  nonchilled  175    19.20000       20.05000      24.65000    25.11667
## 4     chilled  175    11.40000       15.67500      19.50000    19.45000
## 5  nonchilled  250    25.80000       27.30000      32.70000    32.46667
## 6     chilled  250    12.30000       17.95000      24.20000    25.28333
## 7  nonchilled  350    27.90000       30.45000      34.50000    35.13333
## 8     chilled  350    13.00000       18.15000      26.45000    26.20000
## 9  nonchilled  500    28.50000       31.27500      33.85000    35.10000
## 10    chilled  500    12.50000       18.30000      26.00000    26.65000
## 11 nonchilled  675    28.10000       31.42500      35.80000    36.01667
## 12    chilled  675    13.70000       19.72500      28.80000    27.88333
## 13 nonchilled 1000    27.80000       32.50000      37.60000    37.38333
## 14    chilled 1000    14.40000       20.40000      30.30000    29.78333
##    uptake.3rd Qu. uptake.Max.
## 1        15.40000    16.20000
## 2        13.30000    15.10000
## 3        29.62500    32.40000
## 4        23.32500    27.30000
## 5        36.52500    40.30000
## 6        33.82500    38.10000
## 7        40.65000    42.10000
## 8        34.45000    38.80000
## 9        39.27500    42.90000
## 10       37.07500    38.90000
## 11       40.85000    43.90000
## 12       36.97500    39.60000
## 13       43.15000    45.50000
## 14       40.72500    42.40000
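The formula interface accepts any summary function, not just summary. A small sketch on the built-in CO2 data, comparing aggregate() with base R's tapply() and showing several responses at once via cbind(); the object names `aggMeans` and `tapMeans` are illustrative only.

```r
# Mean uptake per Treatment via the formula interface
aggMeans <- aggregate(uptake ~ Treatment, data = CO2, FUN = mean)
aggMeans

# Same numbers from tapply(), which returns a named vector instead of a data frame
tapMeans <- tapply(CO2$uptake, CO2$Treatment, mean)
tapMeans

# Several responses at once: bind them with cbind() on the left of the formula
aggregate(cbind(uptake, conc) ~ Treatment + Type, data = CO2, FUN = mean)
```

aggregate() returns a data frame, which is usually more convenient to pass on to plotting or further manipulation than the vector from tapply().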
Understand types of data and their distributions
Numerical summaries
table for contingency tables
mean, median
sd, var, IQR
quantile for more general quantiles
Across subgroups with dplyr::group_by and dplyr::summarize or aggregate
Three major systems for plotting:
Base R (built-in functions)
Lattice
ggplot2 (sort of part of the tidyverse - Cheat Sheet)
ggplot2 Plotting
ggplot2 basics (Cheat Sheet)
ggplot(data = data_frame) creates a plot instance
factors
A factor stores its allowed values in a levels attribute
Good fit: a variable such as Day (Monday, Tuesday, …)
Poor fit: a variable such as Name, where new values may come up
factors
Example with titanic.csv:
#convert survival status to a factor
titanicData$survived <- as.factor(titanicData$survived)
levels(titanicData$survived) #R knows it isn't numeric now
## [1] "0" "1"
#assigning a value that is not an existing level produces NA
titanicData$survived[1] <- "5"
## Warning in `[<-.factor`(`*tmp*`, 1, value = structure(c(NA, 2L, 1L, 1L, :
## invalid factor level, NA generated
factor levels
levels(titanicData$survived) <- c("Died", "Survived")
levels(titanicData$survived)
## [1] "Died"     "Survived"
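Assigning to levels() renames the levels by position, so it relies on knowing their current order. One alternative, sketched here on a made-up vector of 0/1 codes (the `status` values are hypothetical), is to state the code-to-label mapping explicitly with the levels and labels arguments of factor().

```r
# Hypothetical 0/1 survival codes
status <- c(0, 1, 1, 0, 1)

# Set the allowed values (levels) and their display names (labels) together
statusFactor <- factor(status, levels = c(0, 1), labels = c("Died", "Survived"))
statusFactor
levels(statusFactor)

# table() reports every level, including any with a count of zero
table(statusFactor[statusFactor == "Survived"])

# The integer codes stored behind the labels
as.integer(statusFactor)
```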
ggplot2 Plotting: Categorical variables
Categorical variable - entries are a label or attribute
Generally, describe distribution using a barplot!
ggplot + geom_bar
#re-read the data so survived is back to its original 0/1 coding
titanicData <- read_csv("../datasets/titanic.csv")
titanicData$mySurvived <- as.factor(titanicData$survived)
levels(titanicData$mySurvived) <- c("Died", "Survived")
titanicData$myEmbarked <- as.factor(titanicData$embarked)
levels(titanicData$myEmbarked) <- c("Cherbourg", "Queenstown", "Southampton")
titanicData <- titanicData %>% drop_na(mySurvived, sex, myEmbarked)
ggplot2 barplots
Barplots via ggplot + geom_bar
Across x-axis we want our categories - specify with aes(x = ...)
ggplot(data = titanicData, aes(x = mySurvived))
Must add geom (or stat) layer!
ggplot(data = titanicData, aes(x = mySurvived)) + geom_bar()
Can save the base plot object and add layers to it:
g <- ggplot(data = titanicData, aes(x = mySurvived))
g + geom_bar()
aes() defines visual properties of objects in the plot
    x = , y = , size = , shape = , color = , alpha = , ...
Common aes values for geom_bar (from cheat sheet):
    d + geom_bar()
    x, alpha, color, fill, linetype, size, weight
ggplot2 global and local aesthetics
data and aes can be set in two ways:
‘globally’ (for all layers) via the ggplot statement
‘locally’ (for just that layer) via the geom, stat, etc. layer
#global
ggplot(data = titanicData, aes(x = mySurvived)) + geom_bar()
#local
ggplot() + geom_bar(data = titanicData, aes(x = mySurvived))
For settings that do not depend on the data (e.g. color = 'blue'), generally place these outside of the aes
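To make the distinction concrete, here is a small sketch on the built-in CO2 data (chosen only so the example runs without the Titanic file) that combines a global mapping, a local mapping, and a constant setting placed outside aes().

```r
library(ggplot2)

# Global mapping: x and y are shared by every layer
p <- ggplot(data = CO2, aes(x = conc, y = uptake)) +
  # Local mapping: color depends on the data, so it goes inside aes()
  geom_point(aes(color = Treatment)) +
  # Constant setting: one value for the whole layer, so it stays outside aes()
  geom_smooth(method = "lm", color = "blue", se = FALSE)
p
```

Because the color mapping is local to geom_point(), the smoother is fit to all points together rather than once per Treatment.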
ggplot2 barplots
ggplot(data = titanicData, aes(x = mySurvived)) +
  geom_bar() +
  labs(x = "Survival Status", title = "Bar Plot of Survival for Titanic Passengers")
ggplot2 stacked barplots
Stacked barplot created via the fill aesthetic and the same process
Automatic assignment of colors, creation of legends, etc. for aes elements (except with group)
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + geom_bar()
ggplot2 labeling
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) +
  geom_bar() +
  labs(x = "Survival Status", title = "Bar Plot of Survival Status for Titanic Passengers") +
  scale_fill_discrete(name = "Sex", labels = c("Female", "Male"))
Relabel discrete aesthetics with scale_*_discrete, where * is the aesthetic
With aes(x = mySurvived, fill = sex):
scale_x_discrete(labels = c("Person Died", "Person Survived"))
scale_fill_discrete(name = "Sex", labels = c("Female", "Male"))
ggplot2 horizontal barplots
Rotate the plot with coord_flip
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) +
  geom_bar() +
  labs(x = "Survival Status", title = "Bar Plot of Survival Status for Titanic Passengers") +
  scale_x_discrete(labels = c("Person Died", "Person Survived")) +
  scale_fill_discrete(name = "Sex", labels = c("Female", "Male")) +
  coord_flip()
ggplot2 stat vs geom layers
Note: Most geoms have a corresponding stat that can be used
geom_bar(mapping = NULL, data = NULL, stat = "count", position = "stack", ..., width = NULL, binwidth = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
#these two calls draw the same plot
ggplot(data = titanicData, aes(x = survived, fill = sex)) + geom_bar()
ggplot(data = titanicData, aes(x = survived, fill = sex)) + stat_count()
ggplot2 stat vs geom layers
Use stat = "identity" when the data already contain the summarized counts
sumData <- titanicData %>%
  group_by(survived, sex) %>%
  summarize(count = n())
ggplot(sumData, aes(x = survived, y = count, fill = sex)) +
  geom_bar(stat = "identity")
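ggplot2 also provides geom_col(), which plots the y values as given and so behaves like geom_bar(stat = "identity"). A minimal sketch on the built-in mtcars data (used here only so the example is self-contained; `sumCars` and `count` are illustrative names):

```r
library(dplyr)
library(ggplot2)

# Pre-computed counts, one row per combination of cyl and am
sumCars <- mtcars %>%
  count(cyl, am, name = "count")
sumCars

# geom_col() uses the y values as given, like geom_bar(stat = "identity")
p <- ggplot(sumCars, aes(x = factor(cyl), y = count, fill = factor(am))) +
  geom_col()
p
```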
ggplot2 side-by-side barplots
Side-by-side barplot created via the position argument
dodge for side-by-side bar plot
jitter for continuous data with many points at same values
fill stacks bars and standardises each stack to have constant height
stack stacks bars on top of each other
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + geom_bar(position = "dodge")
ggplot2 filled barplots
position = "fill" stacks bars and standardises each stack to have constant height (especially useful with equal group sizes)
ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + geom_bar(position = "fill")
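A filled barplot shows within-group proportions instead of counts. The sketch below, on the built-in mtcars data (an assumption made so the example runs on its own), computes those proportions with prop.table() and then draws the corresponding plot; `cars` and `props` are illustrative names.

```r
library(ggplot2)

# Built-in mtcars, with the grouping variables converted to factors
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Proportion of each am value within each cyl group
# (rows of the table are cyl, so margin = 1 makes each row sum to 1)
props <- prop.table(table(cars$cyl, cars$am), margin = 1)
props

# The filled barplot draws these same proportions as bar segments
p <- ggplot(cars, aes(x = cyl, fill = am)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")
p
```

Relabeling the y-axis is worthwhile here, since the default label still reads "count" even though the axis runs from 0 to 1.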
ggplot2 faceting
How to create same plot for each myEmbarked value? Use faceting!
facet_wrap(~ var) - creates a plot for each setting of var
Can specify nrow and ncol or let R figure it out
facet_grid(var1 ~ var2) - creates a plot for each combination of var1 and var2
var1 values across rows
var2 values across columns
Use . ~ var2 or var1 ~ . to have only one row or column
ggplot(data = titanicData, aes(x = mySurvived)) +
  geom_bar(aes(fill = sex), position = "dodge") +
  facet_wrap(~ myEmbarked)
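The example above uses facet_wrap. For comparison, a short facet_grid sketch on the built-in CO2 data, which has two factors (Type and Treatment) that suit a row-by-column layout:

```r
library(ggplot2)

# Rows are the levels of Type, columns are the levels of Treatment
p <- ggplot(CO2, aes(x = conc, y = uptake)) +
  geom_point() +
  facet_grid(Type ~ Treatment)
p

# A single row of panels: put "." on the row side of the formula
pRow <- ggplot(CO2, aes(x = conc, y = uptake)) +
  geom_point() +
  facet_grid(. ~ Treatment)
pRow
```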
ggplot2 Plotting Recap
General ggplot things:
Can set local or global aes
Modify titles/labels by adding more layers
Faceting (multiple plots) via facet_grid or facet_wrap
Only need aes if setting a mapping value that is dependent on the data (or you want to create a custom legend!)
ggplot2 Plotting: Numeric Variables
Numeric variables - generally, describe distribution via a histogram or boxplot!
Same process as before.
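As a starting point, a plain histogram can be drawn with geom_histogram. This is a minimal sketch on the built-in CO2 data; the bin settings are arbitrary choices for illustration and would normally be tuned to the data.

```r
library(ggplot2)

# Histogram of uptake; bins = 15 is an arbitrary choice for illustration
p <- ggplot(CO2, aes(x = uptake)) +
  geom_histogram(bins = 15, color = "black", fill = "grey")
p

# binwidth sets the width of each bin directly instead of the number of bins
pWidth <- ggplot(CO2, aes(x = uptake)) +
  geom_histogram(binwidth = 5, color = "black", fill = "grey")
pWidth
```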
ggplot2 smoothed histogram
Kernel Smoother - Smoothed version of a histogram
Common aes values (from cheat sheet):
    c + geom_density(kernel = "gaussian")
    x, y, alpha, color, fill, group, linetype, size, weight
Only x = is really needed
g <- ggplot(titanicData, aes(x = age))
g + geom_density()
ggplot2 smoothed histogram
fill a useful aesthetic!
g + geom_density(adjust = 0.5, alpha = 0.5, aes(fill = mySurvived))
ggplot2 smoothed histogram
recall position choices of dodge, jitter, fill, and stack
g + geom_density(adjust = 0.5, alpha = 0.5, position = "stack", aes(fill = mySurvived))
ggplot2 boxplots
Boxplot - Provides the five number summary in a graph
Common aes values (from cheat sheet):
    f + geom_boxplot()
    x, y, lower, middle, upper, ymax, ymin, alpha, color, fill, group, linetype, shape, size, weight
Only x =, y = are really needed
g <- ggplot(titanicData, aes(x = mySurvived, y = age))
g + geom_boxplot(fill = "grey")
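The numbers behind a boxplot can be computed directly. The sketch below uses the built-in CO2 uptake values and base R only. It follows the usual convention, which geom_boxplot also uses by default, that whiskers extend at most 1.5 * IQR beyond the quartiles and that anything further out is drawn as a separate point. The variable names are illustrative.

```r
x <- CO2$uptake

# Five number summary: minimum, quartiles, and maximum
fiveNum <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
fiveNum

# Fences at 1.5 * IQR beyond the quartiles
lowerFence <- fiveNum[["25%"]] - 1.5 * IQR(x)
upperFence <- fiveNum[["75%"]] + 1.5 * IQR(x)

# Values outside the fences would be plotted as individual points
outliers <- x[x < lowerFence | x > upperFence]
outliers
```

For this variable no value falls outside the fences, so a boxplot of uptake would show whiskers reaching the minimum and maximum.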
ggplot2 boxplots with points
g + geom_boxplot(fill = "grey") + geom_jitter(width = 0.1, alpha = 0.3)
#layers are drawn in the order they are added, so here the boxes cover the points
g + geom_jitter(width = 0.1, alpha = 0.3) + geom_boxplot(fill = "grey")
ggplot2 faceting
Facet easily!
g + geom_boxplot(fill = "grey") +
  geom_jitter(width = 0.1, alpha = 0.3) +
  facet_wrap(~ myEmbarked)
ggplot2 scatter plots
Two numerical variables
Scatter Plot - graphs points corresponding to each observation
Common aes values (from cheat sheet):
    e + geom_point()
    x, y, alpha, color, fill, shape, size, stroke
Only x =, y = are really needed
g <- ggplot(titanicData, aes(x = age, y = fare))
g + geom_point()
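ggplot2 scatter plots with trend line
#the first geom_smooth() uses the default method, which ggplot2 picks from the data size
#(a GAM once there are 1,000 or more observations), hence the 'GAM' label for the "loess" key
g + geom_point() +
  geom_smooth(aes(col = "loess")) +
  geom_smooth(method = lm, aes(col = "linear")) +
  scale_colour_manual(name = 'Smoother',
                      values = c('linear' = 'red', 'loess' = 'purple'),
                      labels = c('Linear', 'GAM'), guide = 'legend')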
ggplot2 scatter plots with text
May want to add value of correlation to plot
paste() or paste0() handy
paste("Hi", "What", "Is", "Going", "On", "?", sep = " ")
## [1] "Hi What Is Going On ?"
paste("Hi", "What", "Is", "Going", "On", "?", sep = ".")
## [1] "Hi.What.Is.Going.On.?"
paste0("Hi", "What", "Is", "Going", "On", "?")
## [1] "HiWhatIsGoingOn?"
ggplot2 scatter plots with text
correlation <- cor(titanicData$fare, titanicData$age, use = "complete.obs")
g + geom_point() +
  geom_smooth(method = lm, col = "Red") +
  geom_text(x = 40, y = 400, size = 5,
            label = paste0("Correlation = ", round(correlation, 2)))
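When geom_text() is given a fixed position and label while still inheriting the plot data, it draws one copy of the label per row, which can make the text look heavy. A possible alternative is annotate(), which adds a single label. This sketch uses the built-in CO2 data, and the label position (x = 800, y = 10) is an arbitrary choice.

```r
library(ggplot2)

correlation <- cor(CO2$conc, CO2$uptake)

# annotate() adds a single label that is not tied to the rows of the data
p <- ggplot(CO2, aes(x = conc, y = uptake)) +
  geom_point() +
  annotate("text", x = 800, y = 10, size = 5,
           label = paste0("Correlation = ", round(correlation, 2)))
p
```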
ggplot2 scatter plots with text points
Use geom_text to plot a label in place of each point
g + geom_text(aes(label = survived, color = mySurvived))
ggplot2 faceting
g + geom_point(aes(color = sex), size = 2.5) +
  facet_wrap(~ myEmbarked)
ggpairs
Many extension packages that do nice things!
library(GGally) #install GGally if needed
ggpairs(iris, aes(colour = Species, alpha = 0.4))
ggplot2 Plotting: Numeric variables
Numeric variable - entries are a numerical value where math can be performed
Most common plots:
Histogram (geom_histogram), Density (geom_density)
Boxplot (geom_boxplot), Violin plot (geom_violin)
Scatter plot (geom_point), Smoothers (geom_smooth)
Jittered points (geom_jitter)
Text on plot (geom_text)
ggplot2 Plotting Recap
General ggplot things:
Can set local or global aes
Modify titles/labels by adding more layers
Use either stat or geom layer
Faceting (multiple plots) via facet_grid or facet_wrap
Only need aes if setting a mapping value that is dependent on the data (or you want to create a custom legend!)
esquisse is a great package for exploring ggplot2!