Summarizing Data

What is this course about?

Basic use of R for reading, manipulating, and plotting data!

Where do we start?

Understand types of data and their distributions
Numerical summaries

Where do we start?

Understand types of data and their distributions
Numerical summaries (across subgroups)

Where do we start?

Understand types of data and their distributions
Numerical summaries (across subgroups)
- Contingency Tables
- Mean/Median
- Standard Deviation/Variance/IQR
- Quantiles/Percentiles

Where do we start?

Understand types of data and their distributions
Numerical summaries (across subgroups)
- Contingency Tables
- Mean/Median
- Standard Deviation/Variance/IQR
- Quantiles/Percentiles
Graphical summaries (across subgroups)
- Bar plots
- Histograms
- Box plots
- Scatter plots

Understanding Data

How to summarize data?
Depends on data type:
- Categorical (Qualitative) variable - entries are a label or attribute
- Numeric (Quantitative) variable - entries are a numerical value where math can be performed

Categorical Data

Common goal: Describe the distribution of the variable

Distribution = pattern and frequency with which you observe a variable
Categorical variable - entries are a label or attribute

Try to describe relative frequency (or count) in each category

Create contingency tables via table

Contingency tables

Consider data on titanic passengers in titanic.csv

titanicData <- read_csv("../datasets/titanic.csv")
titanicData

## # A tibble: 1,310 x 14
##   pclass survived name      sex      age sibsp parch ticket  fare cabin embarked
##    <dbl>    <dbl> <chr>     <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr> <chr>   
## 1      1        1 Allen, M~ fema~ 29         0     0 24160   211. B5    S       
## 2      1        1 Allison,~ male   0.917     1     2 113781  152. C22 ~ S       
## 3      1        0 Allison,~ fema~  2         1     2 113781  152. C22 ~ S       
## 4      1        0 Allison,~ male  30         1     2 113781  152. C22 ~ S       
## 5      1        0 Allison,~ fema~ 25         1     2 113781  152. C22 ~ S       
## # ... with 1,305 more rows, and 3 more variables: boat <chr>, body <dbl>,
## #   home.dest <chr>

Contingency tables

Create one-way contingency tables for each of three categorical variables:
- embarked (where journey started)
- survived (survive or not)
- sex (Male or Female)

table(titanicData$embarked)

## 
##   C   Q   S 
## 270 123 914

table(titanicData$survived)

## 
##   0   1 
## 809 500

table(titanicData$sex)

## 
## female   male 
##    466    843

Two-way contingency tables

Create two-way contingency tables for pairs of categorical variables

table(titanicData$survived,
      titanicData$sex)

##    
##     female male
##   0    127  682
##   1    339  161

table(titanicData$survived,
      titanicData$embarked)

##    
##       C   Q   S
##   0 120  79 610
##   1 150  44 304

table(titanicData$sex,
      titanicData$embarked)

##         
##            C   Q   S
##   female 113  60 291
##   male   157  63 623

Three-way contingency tables

Create a three-way contingency table for three categorical variables

table(titanicData$sex, titanicData$embarked, titanicData$survived)

## , ,  = 0
## 
##         
##            C   Q   S
##   female  11  23  93
##   male   109  56 517
## 
## , ,  = 1
## 
##         
##            C   Q   S
##   female 102  37 198
##   male    48   7 106

Three-way contingency tables

Create a three-way contingency table for three categorical variables (order matters for output!)
Example of an array! 3 dimensions [ , , ]

tab <- table(titanicData$sex, titanicData$embarked, titanicData$survived)

str(tab)

##  'table' int [1:2, 1:3, 1:2] 11 109 23 56 93 517 102 48 37 7 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "female" "male"
##   ..$ : chr [1:3] "C" "Q" "S"
##   ..$ : chr [1:2] "0" "1"

Conditional contingency tables

Can obtain conditional bivariate info!

##  'table' int [1:2, 1:3, 1:2] 11 109 23 56 93 517 102 48 37 7 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "female" "male"
##   ..$ : chr [1:3] "C" "Q" "S"
##   ..$ : chr [1:2] "0" "1"

#returns embarked vs survived table for females
tab[1, , ]

##    
##       0   1
##   C  11 102
##   Q  23  37
##   S  93 198

Conditional contingency tables

Can obtain conditional bivariate info!

##  'table' int [1:2, 1:3, 1:2] 11 109 23 56 93 517 102 48 37 7 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "female" "male"
##   ..$ : chr [1:3] "C" "Q" "S"
##   ..$ : chr [1:2] "0" "1"

#returns embarked vs survived table for males
tab[2, , ]

##    
##       0   1
##   C 109  48
##   Q  56   7
##   S 517 106

Conditional contingency tables

Can obtain conditional bivariate info!

##  'table' int [1:2, 1:3, 1:2] 11 109 23 56 93 517 102 48 37 7 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "female" "male"
##   ..$ : chr [1:3] "C" "Q" "S"
##   ..$ : chr [1:2] "0" "1"

#returns survived vs sex table for embarked "C"
tab[, 1, ]

##         
##            0   1
##   female  11 102
##   male   109  48

Conditional contingency tables

Can obtain conditional univariate info too!

##  'table' int [1:2, 1:3, 1:2] 11 109 23 56 93 517 102 48 37 7 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "female" "male"
##   ..$ : chr [1:3] "C" "Q" "S"
##   ..$ : chr [1:2] "0" "1"

#Survived status for males that embarked at "Q"
tab[2, 2, ]

##  0  1 
## 56  7

Numerical summaries: Numeric variables

Numeric variable - entries are a numerical value where math can be performed

Single variable: describe the distribution via

Shape: Histogram, Density plot, …

Measures of center: Mean, Median, …

Measures of spread: Variance, Standard Deviation, Quartiles, IQR, …

Numerical summaries: Numeric variables

Numeric variable - entries are a numerical value where math can be performed

Single variable: describe the distribution via

Shape: Histogram, Density plot, …
Measures of center: Mean, Median, …
Measures of spread: Variance, Standard Deviation, Quartiles, IQR, …

Two Variables:

Shape: Scatter plot, …
Measures of linear relationship: Covariance, Correlation, …

Numerical summaries: Numeric variables

Look at carbon dioxide (CO2) uptake data set
- Response recorded: uptake CO2 uptake rates in grass plants
- Environment manipulated: Treatment - chilled/nonchilled
- Ambient CO2 specified and measured: conc

CO2 <- as_tibble(CO2)
CO2

## # A tibble: 84 x 5
##   Plant Type   Treatment   conc uptake
##   <ord> <fct>  <fct>      <dbl>  <dbl>
## 1 Qn1   Quebec nonchilled    95   16  
## 2 Qn1   Quebec nonchilled   175   30.4
## 3 Qn1   Quebec nonchilled   250   34.8
## 4 Qn1   Quebec nonchilled   350   37.2
## 5 Qn1   Quebec nonchilled   500   35.3
## # ... with 79 more rows

Measures of center

Mean & Median

mean(CO2$uptake)

## [1] 27.2131

#note you can easily get a trimmed mean
mean(CO2$uptake, trim = 0.05) #5% trimmed mean

## [1] 27.25263

median(CO2$uptake)

## [1] 28.3

Measures of spread

Variance, Standard Deviation, Quartiles, & IQR

#quartiles and mean
summary(CO2$uptake)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.70   17.90   28.30   27.21   37.12   45.50

var(CO2$uptake)

## [1] 116.9515

sd(CO2$uptake)

## [1] 10.81441

IQR(CO2$uptake)

## [1] 19.225

quantile(CO2$uptake, probs = c(0.1, 0.2))

##   10%   20% 
## 12.36 15.64

Measures of linear relationship

Covariance & Correlation

cov(CO2$conc, CO2$uptake)

## [1] 1552.687

cor(CO2$conc, CO2$uptake)

## [1] 0.4851774

Numerical summaries: Numeric variables

Usually want summaries for different subgroups of data

Ex: Get similar uptake summaries for each Treatment

Numerical summaries: Numeric variables

Usually want summaries for different subgroups of data

Ex: Get similar uptake summaries for each Treatment
dplyr easy to use but can only return one value

Numerical summaries: Numeric variables

Usually want summaries for different subgroups of data

Ex: Get similar uptake summaries for each Treatment
dplyr easy to use (although it can only return one value)

Idea:

Use group_by to create subgroups associated with the data frame
Use summarize to create basic summaries for each subgroup

Summarizing across groups

Ex: Get similar uptake summaries for each Treatment

CO2 %>% 
  group_by(Treatment) %>% 
    summarise(avg = mean(uptake), med = median(uptake), var = var(uptake))

## # A tibble: 2 x 4
##   Treatment    avg   med   var
##   <fct>      <dbl> <dbl> <dbl>
## 1 nonchilled  30.6  31.3  94.2
## 2 chilled     23.8  19.7 118.

Summarizing across groups

Ex: Get similar uptake summaries for each Treatment and Concentration

CO2 %>% 
  group_by(Treatment, conc) %>% 
        summarise(avg = mean(uptake), med = median(uptake), var = var(uptake))

## # A tibble: 14 x 5
## # Groups:   Treatment [2]
##    Treatment   conc   avg   med    var
##    <fct>      <dbl> <dbl> <dbl>  <dbl>
##  1 nonchilled    95  13.3  12.8   5.75
##  2 nonchilled   175  25.1  24.6  32.6 
##  3 nonchilled   250  32.5  32.7  35.1 
##  4 nonchilled   350  35.1  34.5  37.4 
##  5 nonchilled   500  35.1  33.8  31.9 
##  6 nonchilled   675  36.0  35.8  40.2 
##  7 nonchilled  1000  37.4  37.6  49.8 
##  8 chilled       95  11.2  10.6   8.18
##  9 chilled      175  19.4  19.5  34.7 
## 10 chilled      250  25.3  24.2 112.  
## 11 chilled      350  26.2  26.4 117.  
## 12 chilled      500  26.6  26   131.  
## 13 chilled      675  27.9  28.8 120.  
## 14 chilled     1000  29.8  30.3 154.

Summarizing across groups

dplyr has variations on summarise that can be used:

summarise_all() - Apply functions to every column
summarise_at() - Apply functions to specific columns
summarise_if() - Apply functions to all columns of one type

Summarizing across groups

Ex: Get similar uptake summaries for each Treatment
Built-in aggregate() function more general

Summarizing across groups

Ex: Get similar uptake summaries for each Treatment
Built-in aggregate() function more general
Basic use gives response (x) and a list of variables to group by

aggregate(x = CO2$uptake, by = list(CO2$Treatment), FUN = summary)

##      Group.1   x.Min. x.1st Qu. x.Median   x.Mean x.3rd Qu.   x.Max.
## 1 nonchilled 10.60000  26.47500 31.30000 30.64286  38.70000 45.50000
## 2    chilled  7.70000  14.52500 19.70000 23.78333  34.90000 42.40000

Summarizing across groups

aggregate() is commonly used with formula notation!

uptake ~ Treatment - is an example of formula notation

Idea: uptake (LHS) modeled by Treatment levels (RHS)

Summarizing across groups

aggregate() is commonly used with formula notation!

uptake ~ Treatment - is an example of formula notation

Idea: uptake (LHS) modeled by Treatment levels (RHS)

aggregate(uptake ~ Treatment, data = CO2, FUN = summary)

Summarizing across groups

aggregate() is commonly used with formula notation!

uptake ~ Treatment + conc model uptake by levels of Treatment and conc

aggregate(uptake ~ Treatment + conc, data = CO2, FUN = summary)

##     Treatment conc uptake.Min. uptake.1st Qu. uptake.Median uptake.Mean
## 1  nonchilled   95    10.60000       11.47500      12.80000    13.28333
## 2     chilled   95     7.70000        9.60000      10.55000    11.23333
## 3  nonchilled  175    19.20000       20.05000      24.65000    25.11667
## 4     chilled  175    11.40000       15.67500      19.50000    19.45000
## 5  nonchilled  250    25.80000       27.30000      32.70000    32.46667
## 6     chilled  250    12.30000       17.95000      24.20000    25.28333
## 7  nonchilled  350    27.90000       30.45000      34.50000    35.13333
## 8     chilled  350    13.00000       18.15000      26.45000    26.20000
## 9  nonchilled  500    28.50000       31.27500      33.85000    35.10000
## 10    chilled  500    12.50000       18.30000      26.00000    26.65000
## 11 nonchilled  675    28.10000       31.42500      35.80000    36.01667
## 12    chilled  675    13.70000       19.72500      28.80000    27.88333
## 13 nonchilled 1000    27.80000       32.50000      37.60000    37.38333
## 14    chilled 1000    14.40000       20.40000      30.30000    29.78333
##    uptake.3rd Qu. uptake.Max.
## 1        15.40000    16.20000
## 2        13.30000    15.10000
## 3        29.62500    32.40000
## 4        23.32500    27.30000
## 5        36.52500    40.30000
## 6        33.82500    38.10000
## 7        40.65000    42.10000
## 8        34.45000    38.80000
## 9        39.27500    42.90000
## 10       37.07500    38.90000
## 11       40.85000    43.90000
## 12       36.97500    39.60000
## 13       43.15000    45.50000
## 14       40.72500    42.40000

Recap

Understand types of data and their distributions
Numerical summaries
- Contingency Tables: table
- Mean/Median: mean, median
- Standard Deviation/Variance/IQR: sd, var, IQR
- Quantiles/Percentiles: quantile for more general quantiles
Across subgroups with dplyr::group_by and dplyr::summarize or aggregate

Quick Examples

Go to the course files page and try Exercise 9 - Numeric Summaries

Where are we at?

Understand types of data and their distributions
Numerical summaries (across subgroups)
- Contingency Tables
- Mean/Median
- Standard Deviation/Variance/IQR
- Quantiles/Percentiles
Graphical summaries (across subgroups)
- Bar plots (categorical data)
- Histograms
- Box plots
- Scatter plots

Graphical Summaries

Three major systems for plotting:

Base R (built-in functions)
Lattice
ggplot2 (sort of part of the tidyverse - Cheat Sheet)
- ggplot(data = data_frame) creates a plot instance
- Add “layers” to the system (geoms or stats)

Great reference book here!

`ggplot2` Plotting

ggplot2 basics (Cheat Sheet)

ggplot(data = data_frame) creates a plot instance
Add “layers” to the system (geoms or stats)
- Creates a visualization of the data

`ggplot2` Plotting

ggplot2 basics (Cheat Sheet)

ggplot(data = data_frame) creates a plot instance
Add “layers” to the system (geoms or stats)
- Creates a visualization of the data
Modify layer “mapping” args (aes)
- Ex: size, color, and x, y location(s)
Coordinate system (mostly use Cartesian plane)
Optional: Titles, etc.

`factors`

factor - special class of vector with a levels attribute

Levels define all possible values for that variable
- Great for variable like Day (Monday, Tuesday, …)
- Not great for variable like Name where new values may come up

Quite useful with plotting
- Allows for easy labeling of subgroups

`factors`

Consider data on titanic passengers in titanic.csv

#convert survival status to a factor
titanicData$survived <- as.factor(titanicData$survived)
levels(titanicData$survived) #R knows it isn't numeric now

## [1] "0" "1"

Can’t add value unless it is a level

titanicData$survived[1] <- "5"

## Warning in `[<-.factor`(`*tmp*`, 1, value = structure(c(NA, 2L, 1L, 1L, :
## invalid factor level, NA generated

`factor` levels

Useful if you want to create better labels (or change the ordering)

levels(titanicData$survived) <- c("Died", "Survived")
levels(titanicData$survived)

## [1] "Died"     "Survived"

`ggplot2` Plotting: Categorical variables

Categorical variable - entries are a label or attribute

Generally, describe distriubtion using a barplot!

Barplots via ggplot + geom_bar

titanicData <- read_csv("../datasets/titanic.csv")
titanicData$mySurvived <- as.factor(titanicData$survived)
levels(titanicData$mySurvived) <- c("Died", "Survived")
titanicData$myEmbarked <- as.factor(titanicData$embarked)
levels(titanicData$myEmbarked) <- c("Cherbourg", "Queenstown", "Southampton")
titanicData <- titanicData %>% drop_na(mySurvived, sex, myEmbarked)

`ggplot2` barplots

Barplots via ggplot + geom_bar
Across x-axis we want our categories - specify with aes(x = ...)

ggplot(data = titanicData, aes(x = mySurvived))

`ggplot2` barplots

Barplots via ggplot + geom_bar
Must add geom (or stat) layer!

ggplot(data = titanicData, aes(x = mySurvived)) + geom_bar()

`ggplot2` barplots

Generally: Save base object, then “add layers”

g <- ggplot(data = titanicData, aes(x = mySurvived))
g + geom_bar()

`ggplot2` barplots

Generally: Save base object, then “add layers”

g <- ggplot(data = titanicData, aes(x = mySurvived))
g + geom_bar()

aes() defines visual properties of objects in the plot

x = , y = , size = , shape = , color = , alpha = , ...

Cheat Sheet gives most common properties for a given geom

`ggplot2` barplots

Generally: Save base object, then “add layers”

g <- ggplot(data = titanicData, aes(x = mySurvived))
g + geom_bar()

aes() defines visual properties of objects in the plot

x = , y = , size = , shape = , color = , alpha = , ...

Cheat Sheet gives most common properties for a given geom

d + geom_bar()

x, alpha, color, fill, linetype, size, weight

`ggplot2` global and local aesthetics

data and aes can be set in two ways;

‘globally’ (for all layers) via the ggplot statement
‘locally’ (for just that layer) via the geom, stat, etc. layer

`ggplot2` global and local aesthetics

data and aes can be set in two ways;

‘globally’ (for all layers) via the ggplot statement
‘locally’ (for just that layer) via the geom, stat, etc. layer

#global 
ggplot(data = titanicData, aes(x = mySurvived)) + geom_bar()
#local 
ggplot() + geom_bar(data = titanicData, aes(x = mySurvived))

`ggplot2` global and local aesthetics

data and aes can be set in two ways;

‘globally’ (for all layers) via the ggplot statement
‘locally’ (for just that layer) via the geom, stat, etc. layer

#global 
ggplot(data = titanicData, aes(x = mySurvived)) + geom_bar()
#local 
ggplot() + geom_bar(data = titanicData, aes(x = mySurvived))

To set an attribute that doesn’t depend on the data (i.e. color = 'blue'), generally place these outside of the aes

`ggplot2` barplots

Add better labels and a title (new layers, see cheat sheet!)

ggplot(data = titanicData, aes(x = mySurvived)) + 
  geom_bar() + 
  labs(x = "Survival Status", title = "Bar Plot of Survival for Titanic Passengers")

`ggplot2` stacked barplots

Stacked barplot created by via fill aesthetic and same process
- Create base object
- Add geoms
- Use aes to specify aspects of the plot

`ggplot2` stacked barplots

Stacked barplot created by via fill aesthetic
Automatic assignment of colors, creation of legends, etc. for aes elements (except with group)

ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + geom_bar()

`ggplot2` labeling

Add custom labels by adding more layers

ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + 
  geom_bar() +
  labs(x = "Survival Status", 
       title = "Bar Plot of Survival Status for Titanic Passengers") +
  scale_fill_discrete(name = "Sex", labels = c("Female", "Male"))

`ggplot2` labeling

Adjusting appropriate labeling via scale_*_discrete

aes(x = survived, fill = sex)

scale_x_discrete(labels = c("Person Died", "Person Survived"))

scale_fill_discrete(name = "Sex", labels = c("Female","Male"))

`ggplot2` horizontal barplots

Easy to rotate a plot with coord_flip

ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + geom_bar() +
  labs(x = "Survival Status", 
       title = "Bar Plot of Survival Status for Titanic Passengers") + 
  scale_x_discrete(labels = c("Person Died", "Person Survived")) + 
  scale_fill_discrete(name = "Sex", labels = c("Female", "Male")) + 
  coord_flip()

`ggplot2` stat vs geom layers

Note: Most geoms have a corresponding stat that can be used

geom_bar(mapping = NULL, data = NULL, stat = "count", position = "stack", ..., width = NULL, binwidth = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)

Equivalent plots via:

ggplot(data = titanicData, aes(x = survived, fill = sex)) + geom_bar()
ggplot(data = titanicData, aes(x = survived, fill = sex)) + stat_count()

`ggplot2` stat vs geom layers

Can modify the stat: if you have summary data, use identity

sumData <- titanicData %>% 
  group_by(survived, sex) %>% 
  summarize(count = n())
ggplot(sumData, aes(x = survived, y = count, fill = sex)) + 
  geom_bar(stat = "identity")

`ggplot2` side-by-side barplots

Side-by-side barplot created by via position aesthetic
- dodge for side-by-side bar plot
- jitter for continuous data with many points at same values
- fill stacks bars and standardises each stack to have constant height
- stack stacks bars on top of each other

`ggplot2` side-by-side barplots

Side-by-side barplot created by via position aesthetic

ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + 
  geom_bar(position = "dodge")

`ggplot2` filled barplots

position = fill stacks bars and standardises each stack to have constant height (especially useful with equal group sizes)

ggplot(data = titanicData, aes(x = mySurvived, fill = sex)) + 
  geom_bar(position = "fill")

`ggplot2` faceting

How to create same plot for each myEmbarked value? Use faceting!

`ggplot2` faceting

How to create same plot for each myEmbarked value? Use faceting!

facet_wrap(~ var) - creates a plot for each setting of var

Can specify nrow and ncol or let R figure it out

`ggplot2` faceting

How to create same plot for each myEmbarked value? Use faceting!

facet_wrap(~ var) - creates a plot for each setting of var

Can specify nrow and ncol or let R figure it out

facet_grid(var1 ~ var2) - creats a plot for each combination of var1 and var2

var1 values across rows
var2 values across columns
Use . ~ var2 or var1 ~ . to have only one row or column

`ggplot2` faceting

How to create same plot for each myEmbarked value? Use faceting!

facet_wrap(~ var) - creates a plot for each setting of var

ggplot(data = titanicData, aes(x = mySurvived)) + 
  geom_bar(aes(fill =sex), position = "dodge") +
  facet_wrap(~ myEmbarked)

`ggplot2` Plotting Recap

General ggplot things:

Can set local or global aes
Modify titles/labels by adding more layers
Faceting (multiple plots) via facet_grid or facet_wrap
Only need aes if setting a mapping value that is dependent on the data (or you want to create a custom legend!)

Quick Examples

Go to the course files page and try Exercise 10 - Bar Plots

`ggplot2` Plotting: Numeric Variables

Numeric variables - generally, describe distribution via a histogram or boxplot!

Same process:

Create base object
Add geoms
Use aes to specify aspects of the plot

`ggplot2` smoothed histogram

Kernel Smoother - Smoothed version of a histogram
Common aes values (from cheat sheet):

c + geom_density(kernel = "gaussian")

x, y, alpha, color, fill, group, linetype, size, weight

Only x = is really needed

`ggplot2` smoothed histogram

Kernel Smoother - Smoothed version of a histogram

g <- ggplot(titanicData, aes(x = age))
g + geom_density()

`ggplot2` smoothed histogram

Kernel Smoother - Smoothed version of a histogram
fill a useful aesthetic!

g + geom_density(adjust = 0.5, alpha = 0.5, aes(fill = mySurvived))

`ggplot2` smoothed histogram

Kernel Smoother - Smoothed version of a histogram
recall position choices of dodge, jitter, fill, and stack

g + geom_density(adjust = 0.5, alpha = 0.5, position = "stack", aes(fill = mySurvived))

`ggplot2` boxplots

Boxplot - Provides the five number summary in a graph
Common aes values (from cheat sheet):

f + geom_boxplot()

x, y, lower, middle, upper, ymax, ymin, alpha, color, fill, group, linetype, shape, size, weight

Only x =, y = are really needed

`ggplot2` boxplots

g <- ggplot(titanicData, aes(x = mySurvived, y = age))
g + geom_boxplot(fill = "grey")

`ggplot2` boxplots with points

Can add data points (jittered) to see shape of data better (or use violin plot)

g + 
  geom_boxplot(fill = "grey") +
  geom_jitter(width = 0.1, alpha = 0.3)

`ggplot2` boxplots with points

Order of layers important!

g + 
  geom_jitter(width = 0.1, alpha = 0.3) + 
  geom_boxplot(fill = "grey")

`ggplot2` faceting

Can facet easily!

g + geom_boxplot(fill = "grey") +
  geom_jitter(width = 0.1, alpha = 0.3) + 
  facet_wrap(~ myEmbarked)

`ggplot2` scatter plots

Two numerical variables

Scatter Plot - graphs points corresponding to each observation
Common aes values (from cheat sheet):

e + geom_point()

x, y, alpha, color, fill, shape, size, stroke

Only x =, y = are really needed

`ggplot2` scatter plots

Scatter Plot - graphs points corresponding to each observation

g <- ggplot(titanicData, aes(x = age, y = fare))
g + geom_point()

`ggplot2` scatter plots with trend line

Add trend lines easily (linear and loess - a smoother)

g + geom_point() +  
  geom_smooth(aes(col = "loess")) +
  geom_smooth(method = lm, aes(col = "linear")) + 
  scale_colour_manual(name = 'Smoother', values =c('linear'='red', 'loess'='purple'), 
                      labels = c('Linear','GAM'), guide = 'legend')

`ggplot2` scatter plots with text

May want to add value of correlation to plot
paste() or paste0() handy

paste("Hi", "What", "Is", "Going", "On", "?", sep = " ")

## [1] "Hi What Is Going On ?"

paste("Hi", "What", "Is", "Going", "On", "?", sep = ".")

## [1] "Hi.What.Is.Going.On.?"

paste0("Hi", "What", "Is", "Going", "On", "?")

## [1] "HiWhatIsGoingOn?"

`ggplot2` scatter plots with text

correlation <- cor(titanicData$fare, titanicData$age, use = "complete.obs")
g + geom_point() +
  geom_smooth(method = lm, col = "Red") + 
  geom_text(x = 40, y = 400, size = 5, 
            label = paste0("Correlation = ", round(correlation, 2)))

`ggplot2` scatter plots with text points

Text for points with geom_text

g + geom_text(aes(label = survived, color = mySurvived))

`ggplot2` faceting

g + geom_point(aes(color = sex), size = 2.5) +
  facet_wrap(~ myEmbarked)

`ggpairs`

Many extension packages that do nice things!

library(GGally) #install GGally if needed
ggpairs(iris, aes(colour = Species, alpha = 0.4))

`ggplot2` Plotting: Numeric variables

Numeric variable - entries are a numerical value where math can be performed

Most common plots:

Histogram (geom_hist), Density (geom_density)
Boxplot (geom_boxplot), Violin plot (geom_violin)
Scatter plot (geom_point), Smoothers (geom_smooth)
Jittered points (geom_jitter)
Text on plot (geom_text)

`ggplot2` Plotting Recap

General ggplot things:

Can set local or global aes
Modify titles/labels by adding more layers
Use either stat or geom layer
Faceting (multiple plots) via facet_grid or facet_wrap
Only need aes if setting a mapping value that is dependent on the data (or you want to create a custom legend!)

esquisse is a great package for exploring ggplot2!

Quick Examples

Go to the course files page and try Exercise 11 - ggplot

What is this course about?

Where do we start?

Where do we start?

Where do we start?

Where do we start?

Understanding Data

Categorical Data

Contingency tables

Contingency tables

Two-way contingency tables

Three-way contingency tables

Three-way contingency tables

Conditional contingency tables

Conditional contingency tables

Conditional contingency tables

Conditional contingency tables

Numerical summaries: Numeric variables

Numerical summaries: Numeric variables

Numerical summaries: Numeric variables

Measures of center

Measures of spread

Measures of linear relationship

Numerical summaries: Numeric variables

Numerical summaries: Numeric variables

Numerical summaries: Numeric variables

Summarizing across groups

Summarizing across groups

Summarizing across groups

Summarizing across groups

Summarizing across groups

Summarizing across groups

Summarizing across groups

Summarizing across groups

Recap

Quick Examples

Where are we at?

Graphical Summaries

ggplot2 Plotting

ggplot2 Plotting

factors

factors

factor levels

ggplot2 Plotting: Categorical variables

ggplot2 barplots

ggplot2 barplots

ggplot2 barplots

ggplot2 barplots

ggplot2 barplots

ggplot2 global and local aesthetics

ggplot2 global and local aesthetics

ggplot2 global and local aesthetics

ggplot2 barplots

ggplot2 stacked barplots

ggplot2 stacked barplots

ggplot2 labeling

ggplot2 labeling

ggplot2 horizontal barplots

ggplot2 stat vs geom layers

ggplot2 stat vs geom layers

ggplot2 side-by-side barplots

ggplot2 side-by-side barplots

ggplot2 filled barplots

ggplot2 faceting

ggplot2 faceting

ggplot2 faceting

ggplot2 faceting

ggplot2 Plotting Recap

Quick Examples

ggplot2 Plotting: Numeric Variables

ggplot2 smoothed histogram

ggplot2 smoothed histogram

ggplot2 smoothed histogram

ggplot2 smoothed histogram

ggplot2 boxplots

ggplot2 boxplots

ggplot2 boxplots with points

ggplot2 boxplots with points

ggplot2 faceting

ggplot2 scatter plots

ggplot2 scatter plots

`ggplot2` Plotting

`ggplot2` Plotting

`factors`

`factors`

`factor` levels

`ggplot2` Plotting: Categorical variables

`ggplot2` barplots

`ggplot2` barplots

`ggplot2` barplots

`ggplot2` barplots

`ggplot2` barplots

`ggplot2` global and local aesthetics

`ggplot2` global and local aesthetics

`ggplot2` global and local aesthetics

`ggplot2` barplots

`ggplot2` stacked barplots

`ggplot2` stacked barplots

`ggplot2` labeling

`ggplot2` labeling

`ggplot2` horizontal barplots

`ggplot2` stat vs geom layers

`ggplot2` stat vs geom layers

`ggplot2` side-by-side barplots

`ggplot2` side-by-side barplots

`ggplot2` filled barplots

`ggplot2` faceting

`ggplot2` faceting

`ggplot2` faceting

`ggplot2` faceting

`ggplot2` Plotting Recap

`ggplot2` Plotting: Numeric Variables

`ggplot2` smoothed histogram

`ggplot2` smoothed histogram

`ggplot2` smoothed histogram

`ggplot2` smoothed histogram

`ggplot2` boxplots

`ggplot2` boxplots

`ggplot2` boxplots with points

`ggplot2` boxplots with points

`ggplot2` faceting

`ggplot2` scatter plots

`ggplot2` scatter plots

`ggplot2` scatter plots with trend line

`ggplot2` scatter plots with text

`ggplot2` scatter plots with text

`ggplot2` scatter plots with text points

`ggplot2` faceting

`ggpairs`

`ggplot2` Plotting: Numeric variables

`ggplot2` Plotting Recap