What is this course about?

Basic use of R for reading, manipulating, and plotting data!

Where do we start?

  • R & RStudio installed

  • Explore the RStudio IDE (Integrated Development Environment)

  • Investigate common R objects and classes

  • Read in raw data

RStudio IDE

In RStudio, four main ‘areas’

  • Console (& Terminal)

  • Scripting and Viewing Window

  • Plots/Help (& Files/Packages)

  • Environment (& Connections/Git)

Console

  • Type code directly into the console for evaluation
#simple math operations
# <-- is a comment - code not evaluated
3 + 7
## [1] 10
10 * exp(3) #exp is exponential function
## [1] 200.8554
log(pi^2) #log is natural log by default
## [1] 2.28946
mean(cars$speed)
## [1] 15.4
hist(cars$speed)

Scripting and Viewing Window

  • Usually want to keep code for later use!
  • Write code in a ‘script’ and save script (or use markdown - covered later)
  • From script can send code to console via:
    • “Run” button (runs current line)
    • Ctrl+Enter (PC) or Command+Enter (Mac)
    • Highlight section and do above

Plots/Help

  • Created plots stored in Plots tab

    • Cycle through past plots
    • Easily save
  • Type help(...) into the console for documentation

    • help(seq)
    • help(data.frame)

Environment

  • Store data/info/function/etc. in R objects

  • Create an R object via <- (recommended) or =

#save for later
avg <- (5 + 7 + 6) / 3
#call avg object
avg
## [1] 6
#strings (text) can be saved as well
words <- c("Hello there!", "How are you?")
words
## [1] "Hello there!" "How are you?"

Environment

  • Look at all current objects with ls()
ls()
## [1] "avg"   "words"
  • rm() to remove
rm(avg)
ls()
## [1] "words"
  • rm(list=ls()) to remove all stored objects

Environment

  • Built-in objects exist like letters and cars
letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
head(cars, n = 3)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
  • data() shows available built-in datasets

Quick Example

R Objects and Classes

  • R has strong Object Oriented Programming (OOP) tools

  • Object: data structure with attributes (class)

  • Method: procedures (functions) act on object based on attributes

  • R functions like plot() act differently depending on object class

class(cars)
## [1] "data.frame"

           

class(exp)
## [1] "function"

R Objects and Classes

  • R has strong Object Oriented Programming (OOP) tools

  • Object: data structure with attributes (often a ‘class’)

  • Method: procedures (often ‘functions’) act on object based on attributes

  • R functions like plot() act differently depending on object class

plot(cars)

                      

plot(exp)
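
A minimal sketch of how a function can act differently depending on object class. The generic describe() and its methods are hypothetical names made up for illustration; they are not part of R, and plot() itself is more involved than this.

```r
# Hypothetical generic: R picks a method based on class(x)
describe <- function(x, ...) UseMethod("describe")

# Method used when x is a data frame
describe.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows and", ncol(x), "columns")
}

# Method used when x is a function
describe.function <- function(x, ...) {
  "a function"
}

# Fallback method for any other class
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

describe(cars)  # uses describe.data.frame
describe(exp)   # uses describe.function
describe(1:3)   # uses describe.default
```

The same call, describe(...), runs different code for cars and exp, which is the idea behind plot(cars) and plot(exp) producing different plots.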

R Objects and Classes

  • Create an R object via <- (recommended) or =

    • allocates memory to object
vec <- c(1, 4, 10)
vec
## [1]  1  4 10

R Objects and Classes

  • Create an R object via <- (recommended) or =

    • allocates memory to object
fit <- lm(dist ~ speed, data = cars)
fit
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

R Objects and Classes

  • The function used to create objects determines the type of object
class(vec)
## [1] "numeric"
summary(vec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.5     4.0     5.0     7.0    10.0

R Objects and Classes

  • The function used to create objects determines the type of object
class(fit)
## [1] "lm"
summary(fit)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Investigating Objects

Many functions to help understand an R Object

  • class()

  • describes the class attribute of an R object

class(cars)
## [1] "data.frame"

Investigating Objects

Many functions to help understand an R Object

  • typeof()

  • determines the (R internal) type or storage mode of any object

typeof(cars)
## [1] "list"
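
To see how class() and typeof() can differ, a small sketch comparing the two on a few objects. The list name objs and its element names are assumptions chosen for illustration.

```r
# A few objects of different kinds (names are illustrative only)
objs <- list(
  dataframe = cars,
  double    = c(1.5, 2),
  integer   = 1:3,
  text      = "a",
  fun       = mean
)

# class() reports how R functions treat the object,
# typeof() reports how the object is stored internally
comparison <- data.frame(
  class  = sapply(objs, function(o) class(o)[1]),
  typeof = sapply(objs, typeof)
)
comparison
```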

Investigating Objects

Many functions to help understand an R Object

  • str()

  • compactly displays the internal structure of an R object

str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

Where we are

  • RStudio provides a nice environment for coding

  • R has functions that can be used to create objects

  • Create an R Object with <-

  • Objects have attributes that determine how functions act

  • class(), typeof(), and str() help understand your object

Quick Example

Data Objects

  • Understand data structures first: Five major types

    1. Atomic Vector (1d)
    2. Matrix (2d)
    3. Array (nd)
    4. Data Frame (2d)
    5. List (1d)
Dimension   Homogeneous     Heterogeneous
1d          Atomic Vector   List
2d          Matrix          Data Frame

Vector

  1. Atomic Vector (1D group of elements with an ordering)

  • Elements must be same ‘type’

    • numeric (integer or double), character, or logical
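
A short sketch of what happens when types are mixed in one atomic vector: R silently converts every element to a common type. The object names are illustrative only.

```r
# Mixing numbers, text, and logicals: everything becomes character
mixed_num_chr <- c(1, "a", TRUE)
typeof(mixed_num_chr)
mixed_num_chr

# Mixing numbers and logicals: TRUE/FALSE become 1/0
mixed_num_lgl <- c(1, TRUE, FALSE)
typeof(mixed_num_lgl)
mixed_num_lgl
```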

Vector

  1. Atomic Vector (1D group of elements with an ordering)
  • Create with c() function (‘combine’)
#vectors (1 dimensional) objects
x <- c(17, 22, 1, 3, -3)
y <- c("cat", "dog", "bird", "frog")
x
## [1] 17 22  1  3 -3
y
## [1] "cat"  "dog"  "bird" "frog"

Vector

  • Many ‘functions’ output a numeric vector

  • Ex: seq()

    • Inputs = from, to, by (among others)

    • Output = a sequence of numbers

From help(seq)

seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), length.out = NULL, along.with = NULL, ...)

v <- seq(from = 1, to = 5, by = 1)
v
## [1] 1 2 3 4 5
str(v)
##  num [1:5] 1 2 3 4 5
  • num says it is numeric

  • [1:5] implies one dimensional with length 5
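
As a small illustration of the other seq() inputs, the sketch below builds the same sequence twice, once by step size (by) and once by number of values (length.out). The object names are illustrative only.

```r
# Specify the step size
by_step <- seq(from = 0, to = 1, by = 0.25)

# Specify how many values are wanted instead
by_length <- seq(from = 0, to = 1, length.out = 5)

by_step
by_length
all.equal(by_step, by_length)
```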

: to Create a Sequence

1:20 
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
  • R generally does elementwise math
1:20/20
##  [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75
## [16] 0.80 0.85 0.90 0.95 1.00
1:20 + 1
##  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
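
Two details behind the examples above, sketched with made-up object names: the : operator is evaluated before arithmetic, and a shorter vector is reused ('recycled') to match a longer one.

```r
# ':' is evaluated before '+', so this is (1:3) + 1
seq_then_add <- 1:3 + 1
seq_then_add

# Parentheses change the end point of the sequence instead
add_then_seq <- 1:(3 + 1)
add_then_seq

# The shorter vector is recycled: 1*10, 2*100, 3*10, 4*100
a <- c(1, 2, 3, 4)
recycled <- a * c(10, 100)
recycled
```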

Vector

  1. Atomic Vector (1D group of elements with an ordering)
  • Vectors useful to know about

  • Not usually useful for a dataset

  • Often consider as ‘building blocks’ for other data types

Matrix

  2. Matrix (2D data structure)
  • (think) columns are vectors of the same type and length
#populate vectors
x <- c(17, 3, 13, 11)
y <- rep(-3, times = 4)
z <- 1:4

         

#check 'type'
is.numeric(x)
## [1] TRUE
is.numeric(y)
## [1] TRUE
is.numeric(z)
## [1] TRUE

         

#check 'length'
length(x)
## [1] 4
length(y)
## [1] 4
length(z)
## [1] 4

Matrix

  2. Matrix (2D data structure)
  • (think) columns are vectors of the same type and length

  • Create with matrix() function (see help)

#populate vectors
x <- c(17, 3, 13, 11)
y <- rep(-3, times = 4)
z <- 1:4
#combine in a matrix
matrix(c(x, y, z), ncol = 3)
##      [,1] [,2] [,3]
## [1,]   17   -3    1
## [2,]    3   -3    2
## [3,]   13   -3    3
## [4,]   11   -3    4

Matrix

  2. Matrix (2D data structure)
  • (think) columns are vectors of the same type and length

  • Create with matrix() function

x <- c("Hi", "There", "Friend", "!")
y <- c("a", "b", "c", "d")
z <- c("One", "Two", "Three", "Four")
is.character(x)
## [1] TRUE

         

matrix(c(x, y, z), nrow = 6)
##      [,1]     [,2]   
## [1,] "Hi"     "c"    
## [2,] "There"  "d"    
## [3,] "Friend" "One"  
## [4,] "!"      "Two"  
## [5,] "a"      "Three"
## [6,] "b"      "Four"
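
The output above shows that matrix() fills values down the columns. A minimal sketch of the difference between the default column-wise fill and the byrow = TRUE argument of matrix(); the object names are illustrative only.

```r
vals <- 1:6

# Default: fill column by column
by_col <- matrix(vals, nrow = 2)
by_col

# byrow = TRUE: fill row by row instead
by_row <- matrix(vals, nrow = 2, byrow = TRUE)
by_row
```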

Matrix

  2. Matrix (2D data structure)
  • (think) columns are vectors of the same type and length

  • Useful for some data, but datasets often mix numeric and character variables, which a matrix cannot store together
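
A quick sketch of why a matrix is a poor fit for mixed data: combining a numeric and a character column converts the numbers to text. The object names are illustrative only.

```r
nums  <- c(1, 2)
chars <- c("a", "b")

# cbind() builds a matrix, so both columns must share one type
mixed <- cbind(nums, chars)
mixed
typeof(mixed)

# The 'numeric' column is now text
mixed[, "nums"]
```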

Data Frame

  4. Data Frame (2D data structure)
  • collection (list) of vectors of the same length

  • Create with data.frame() function

x <- c("a", "b", "c", "d", "e", "f")
y <- c(1, 3, 4, -1, 5, 6)
z <- 10:15
data.frame(x, y, z)
##   x  y  z
## 1 a  1 10
## 2 b  3 11
## 3 c  4 12
## 4 d -1 13
## 5 e  5 14
## 6 f  6 15

Data Frame

  4. Data Frame (2D data structure)
  • collection (list) of vectors of the same length

  • Create with data.frame() function

data.frame(char = x, data1 = y, data2 = z)
##   char data1 data2
## 1    a     1    10
## 2    b     3    11
## 3    c     4    12
## 4    d    -1    13
## 5    e     5    14
## 6    f     6    15
  • char, data1, and data2 become the variable names for the data frame

Data Frame

  4. Data Frame (2D data structure)
  • collection (list) of vectors of the same length

  • Create with data.frame() function

  • Perfect for most data sets!

  • Most functions that read 2D data store it as a data frame
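
A small sketch showing that, unlike a matrix, each column of a data frame keeps its own type. The object name df is an assumption for illustration.

```r
df <- data.frame(
  char  = c("a", "b", "c"),
  data1 = c(1, 3, 4),
  data2 = 10:12
)

# One class per column
# (char is "character" in R >= 4.0; older versions made it a factor by default)
sapply(df, class)

# Rows and columns
dim(df)
```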

List

  5. List (1D group of objects with ordering)
  • a vector that can have differing elements

  • Create with list()

list(1:3, rnorm(2), c("!", "?"))
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] 0.6233416 0.2943405
## 
## [[3]]
## [1] "!" "?"

List

  5. List (1D group of objects with ordering)
  • Add names to the list elements
list(seq = 1:3, normVals = rnorm(2), punctuation = c("!", "?"))
## $seq
## [1] 1 2 3
## 
## $normVals
## [1] -1.207736 -2.413757
## 
## $punctuation
## [1] "!" "?"

List

  5. List (1D group of objects with ordering)
  • a vector that can have differing elements

  • Create with list()

  • More flexible than a Data Frame!

  • Useful for more complex types of data
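
As one illustration of 'more complex types of data', the sketch below stores a data frame, a fitted model, and a text note together in one list. The list and element names are made up for this example.

```r
results <- list(
  data  = head(cars, 3),                   # a data frame
  model = lm(dist ~ speed, data = cars),   # a fitted model object
  note  = "stopping distances"             # a character string
)

# Each element keeps its own class
sapply(results, function(el) class(el)[1])
```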

Recap!

Dimension   Homogeneous     Heterogeneous
1d          Atomic Vector   List
2d          Matrix          Data Frame

 

  • For most data analysis you’ll use data frames!

  • Next up: How do we access/change parts of our objects?

Accessing Parts of a Data Object

  • For data, we may want to access
    • One element
    • Certain columns
    • Certain rows

Accessing Parts of an Atomic Vector (1D)

  • Return elements using square brackets []
letters #built-in vector
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
letters[1] #R starts counting at 1!
## [1] "a"

       

letters[26]
## [1] "z"

Accessing Parts of an Atomic Vector (1D)

  • Return elements using square brackets []

  • Can ‘feed’ in a vector of indices to []

letters[1:4]
## [1] "a" "b" "c" "d"
letters[c(5, 10, 15, 20, 25)]
## [1] "e" "j" "o" "t" "y"
x <- c(1, 2, 5); letters[x]
## [1] "a" "b" "e"

Accessing Parts of an Atomic Vector (1D)

  • Return elements using square brackets []

  • Can ‘feed’ in a vector of indices to []

  • Use negative indices to return everything except those elements

letters[-(1:4)]
##  [1] "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w"
## [20] "x" "y" "z"
x <- c(1, 2, 5); letters[-x]
##  [1] "c" "d" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
## [20] "w" "x" "y" "z"

Accessing Parts of a Matrix (2D)

  • Use square brackets with a comma [ , ]

  • Notice default row and column names!

mat <- matrix(c(1:4, 20:17), ncol = 2)
mat
##      [,1] [,2]
## [1,]    1   20
## [2,]    2   19
## [3,]    3   18
## [4,]    4   17

Accessing Parts of a Matrix (2D)

  • Use square brackets with a comma [ , ]
mat
##      [,1] [,2]
## [1,]    1   20
## [2,]    2   19
## [3,]    3   18
## [4,]    4   17
mat[c(2, 4), ]
##      [,1] [,2]
## [1,]    2   19
## [2,]    4   17
mat[, 1]
## [1] 1 2 3 4
mat[2, ]
## [1]  2 19
mat[2, 1]
## [1] 2
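
Note that mat[, 1] above came back as a plain vector rather than a one-column matrix. A brief sketch of the drop = FALSE argument to [ , ], which keeps the result two-dimensional; the object names are illustrative only.

```r
mat <- matrix(c(1:4, 20:17), ncol = 2)

# Default: a single column is simplified to a vector
col_vec <- mat[, 1]
dim(col_vec)

# drop = FALSE keeps a 4 x 1 matrix
col_mat <- mat[, 1, drop = FALSE]
dim(col_mat)
```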

Accessing Parts of a Data Frame (2D)

  • Consider ‘built-in’ iris data frame
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Accessing Parts of a Data Frame (2D)

  • Data Frame is 2D similar to a matrix - access similarly!

  • Use square brackets with a comma [ , ]

iris[1:4, 2:4]
##   Sepal.Width Petal.Length Petal.Width
## 1         3.5          1.4         0.2
## 2         3.0          1.4         0.2
## 3         3.2          1.3         0.2
## 4         3.1          1.5         0.2

Accessing Parts of a Data Frame (2D)

  • Data Frame is 2D similar to a matrix - access similarly!

  • Use square brackets with a comma [ , ]

iris[1, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa

Accessing Parts of a Data Frame (2D)

  • Can use column names to subset
iris[ , c("Sepal.Length", "Species")]
##     Sepal.Length    Species
## 1            5.1     setosa
## 2            4.9     setosa
## 3            4.7     setosa
## 4            4.6     setosa
## 5            5.0     setosa
## 6            5.4     setosa
## 7            4.6     setosa
## 8            5.0     setosa
## 9            4.4     setosa
## 10           4.9     setosa
## 11           5.4     setosa
## 12           4.8     setosa
## 13           4.8     setosa
## 14           4.3     setosa
## 15           5.8     setosa
## 16           5.7     setosa
## 17           5.4     setosa
## 18           5.1     setosa
## 19           5.7     setosa
## 20           5.1     setosa
## 21           5.4     setosa
## 22           5.1     setosa
## 23           4.6     setosa
## 24           5.1     setosa
## 25           4.8     setosa
## 26           5.0     setosa
## 27           5.0     setosa
## 28           5.2     setosa
## 29           5.2     setosa
## 30           4.7     setosa
## 31           4.8     setosa
## 32           5.4     setosa
## 33           5.2     setosa
## 34           5.5     setosa
## 35           4.9     setosa
## 36           5.0     setosa
## 37           5.5     setosa
## 38           4.9     setosa
## 39           4.4     setosa
## 40           5.1     setosa
## 41           5.0     setosa
## 42           4.5     setosa
## 43           4.4     setosa
## 44           5.0     setosa
## 45           5.1     setosa
## 46           4.8     setosa
## 47           5.1     setosa
## 48           4.6     setosa
## 49           5.3     setosa
## 50           5.0     setosa
## 51           7.0 versicolor
## 52           6.4 versicolor
## 53           6.9 versicolor
## 54           5.5 versicolor
## 55           6.5 versicolor
## 56           5.7 versicolor
## 57           6.3 versicolor
## 58           4.9 versicolor
## 59           6.6 versicolor
## 60           5.2 versicolor
## 61           5.0 versicolor
## 62           5.9 versicolor
## 63           6.0 versicolor
## 64           6.1 versicolor
## 65           5.6 versicolor
## 66           6.7 versicolor
## 67           5.6 versicolor
## 68           5.8 versicolor
## 69           6.2 versicolor
## 70           5.6 versicolor
## 71           5.9 versicolor
## 72           6.1 versicolor
## 73           6.3 versicolor
## 74           6.1 versicolor
## 75           6.4 versicolor
## 76           6.6 versicolor
## 77           6.8 versicolor
## 78           6.7 versicolor
## 79           6.0 versicolor
## 80           5.7 versicolor
## 81           5.5 versicolor
## 82           5.5 versicolor
## 83           5.8 versicolor
## 84           6.0 versicolor
## 85           5.4 versicolor
## 86           6.0 versicolor
## 87           6.7 versicolor
## 88           6.3 versicolor
## 89           5.6 versicolor
## 90           5.5 versicolor
## 91           5.5 versicolor
## 92           6.1 versicolor
## 93           5.8 versicolor
## 94           5.0 versicolor
## 95           5.6 versicolor
## 96           5.7 versicolor
## 97           5.7 versicolor
## 98           6.2 versicolor
## 99           5.1 versicolor
## 100          5.7 versicolor
## 101          6.3  virginica
## 102          5.8  virginica
## 103          7.1  virginica
## 104          6.3  virginica
## 105          6.5  virginica
## 106          7.6  virginica
## 107          4.9  virginica
## 108          7.3  virginica
## 109          6.7  virginica
## 110          7.2  virginica
## 111          6.5  virginica
## 112          6.4  virginica
## 113          6.8  virginica
## 114          5.7  virginica
## 115          5.8  virginica
## 116          6.4  virginica
## 117          6.5  virginica
## 118          7.7  virginica
## 119          7.7  virginica
## 120          6.0  virginica
## 121          6.9  virginica
## 122          5.6  virginica
## 123          7.7  virginica
## 124          6.3  virginica
## 125          6.7  virginica
## 126          7.2  virginica
## 127          6.2  virginica
## 128          6.1  virginica
## 129          6.4  virginica
## 130          7.2  virginica
## 131          7.4  virginica
## 132          7.9  virginica
## 133          6.4  virginica
## 134          6.3  virginica
## 135          6.1  virginica
## 136          7.7  virginica
## 137          6.3  virginica
## 138          6.4  virginica
## 139          6.0  virginica
## 140          6.9  virginica
## 141          6.7  virginica
## 142          6.9  virginica
## 143          5.8  virginica
## 144          6.8  virginica
## 145          6.7  virginica
## 146          6.7  virginica
## 147          6.3  virginica
## 148          6.5  virginica
## 149          6.2  virginica
## 150          5.9  virginica

Accessing Parts of a Data Frame (2D)

  • Dollar sign allows easy access to a single column!
iris$Sepal.Length
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
##  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
##  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
##  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
##  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
##  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9

Accessing Parts of a Data Frame (2D)

  • Dollar sign allows easy access to a single column!

  • Most used method for accessing a single variable

  • RStudio fills in options.

    • Type iris$
    • If no choices - hit tab
    • Hit tab again to choose

Accessing Parts of a List (1D)

  • Use single square brackets [ ] for multiple list elements
x <- list("HI", c(10:20), 1)
x
## [[1]]
## [1] "HI"
## 
## [[2]]
##  [1] 10 11 12 13 14 15 16 17 18 19 20
## 
## [[3]]
## [1] 1

Accessing Parts of a List (1D)

  • Use single square brackets [ ] for multiple list elements
x <- list("HI", c(10:20), 1)
x[2:3]
## [[1]]
##  [1] 10 11 12 13 14 15 16 17 18 19 20
## 
## [[2]]
## [1] 1

Accessing Parts of a List (1D)

  • Use double square brackets [[ ]] to extract a single list element ([ ] returns a list of length one instead)
x <- list("HI", c(10:20), 1)
x[1]
## [[1]]
## [1] "HI"
x[[1]]
## [1] "HI"
x[[2]]
##  [1] 10 11 12 13 14 15 16 17 18 19 20
x[[2]][4:5]
## [1] 13 14
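
A short check of the difference between the two bracket types, using the same list x as above: single brackets return a smaller list, double brackets return the element itself.

```r
x <- list("HI", c(10:20), 1)

# Single brackets: still a list, holding one element
class(x[2])
length(x[2])

# Double brackets: the integer vector stored in that element
class(x[[2]])
length(x[[2]])
```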

Accessing Parts of a List (1D)

  • If the list elements are named, can use $
x <- list("HI", c(10:20), 1)
str(x)
## List of 3
##  $ : chr "HI"
##  $ : int [1:11] 10 11 12 13 14 15 16 17 18 19 ...
##  $ : num 1
x <- list(First = "Hi", Second = c(10:20), Third = 1)
x$Second
##  [1] 10 11 12 13 14 15 16 17 18 19 20

Lists & Data Frames

  • Connection: Data Frame = List of equal length vectors
str(x)
## List of 3
##  $ First : chr "Hi"
##  $ Second: int [1:11] 10 11 12 13 14 15 16 17 18 19 ...
##  $ Third : num 1
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Lists & Data Frames

  • Connection: Data Frame = List of equal length vectors
typeof(x)
## [1] "list"
typeof(iris)
## [1] "list"

Lists & Data Frames

  • Connection: Data Frame = List of equal length vectors
iris[2]
##     Sepal.Width
## 1           3.5
## 2           3.0
## 3           3.2
## 4           3.1
## 5           3.6
## 6           3.9
## 7           3.4
## 8           3.4
## 9           2.9
## 10          3.1
## 11          3.7
## 12          3.4
## 13          3.0
## 14          3.0
## 15          4.0
## 16          4.4
## 17          3.9
## 18          3.5
## 19          3.8
## 20          3.8
## 21          3.4
## 22          3.7
## 23          3.6
## 24          3.3
## 25          3.4
## 26          3.0
## 27          3.4
## 28          3.5
## 29          3.4
## 30          3.2
## 31          3.1
## 32          3.4
## 33          4.1
## 34          4.2
## 35          3.1
## 36          3.2
## 37          3.5
## 38          3.6
## 39          3.0
## 40          3.4
## 41          3.5
## 42          2.3
## 43          3.2
## 44          3.5
## 45          3.8
## 46          3.0
## 47          3.8
## 48          3.2
## 49          3.7
## 50          3.3
## 51          3.2
## 52          3.2
## 53          3.1
## 54          2.3
## 55          2.8
## 56          2.8
## 57          3.3
## 58          2.4
## 59          2.9
## 60          2.7
## 61          2.0
## 62          3.0
## 63          2.2
## 64          2.9
## 65          2.9
## 66          3.1
## 67          3.0
## 68          2.7
## 69          2.2
## 70          2.5
## 71          3.2
## 72          2.8
## 73          2.5
## 74          2.8
## 75          2.9
## 76          3.0
## 77          2.8
## 78          3.0
## 79          2.9
## 80          2.6
## 81          2.4
## 82          2.4
## 83          2.7
## 84          2.7
## 85          3.0
## 86          3.4
## 87          3.1
## 88          2.3
## 89          3.0
## 90          2.5
## 91          2.6
## 92          3.0
## 93          2.6
## 94          2.3
## 95          2.7
## 96          3.0
## 97          2.9
## 98          2.9
## 99          2.5
## 100         2.8
## 101         3.3
## 102         2.7
## 103         3.0
## 104         2.9
## 105         3.0
## 106         3.0
## 107         2.5
## 108         2.9
## 109         2.5
## 110         3.6
## 111         3.2
## 112         2.7
## 113         3.0
## 114         2.5
## 115         2.8
## 116         3.2
## 117         3.0
## 118         3.8
## 119         2.6
## 120         2.2
## 121         3.2
## 122         2.8
## 123         2.8
## 124         2.7
## 125         3.3
## 126         3.2
## 127         2.8
## 128         3.0
## 129         2.8
## 130         3.0
## 131         2.8
## 132         3.8
## 133         2.8
## 134         2.8
## 135         2.6
## 136         3.0
## 137         3.4
## 138         3.1
## 139         3.0
## 140         3.1
## 141         3.1
## 142         3.1
## 143         2.7
## 144         3.2
## 145         3.3
## 146         3.0
## 147         2.5
## 148         3.0
## 149         3.4
## 150         3.0
iris[[2]]
##   [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5
##  [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2
##  [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
##  [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8
##  [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5
##  [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
## [109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2
## [127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2
## [145] 3.3 3.0 2.5 3.0 3.4 3.0
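
The list/data frame connection can be checked directly. A small sketch using the built-in iris data; the object names one_col_df and col_vec are illustrative only.

```r
# A data frame is a list whose elements (columns) all have the same length
is.list(iris)
length(iris)    # number of columns
lengths(iris)   # length of each column

# [ ] keeps a data frame, [[ ]] and $ return the column itself
one_col_df <- iris[2]
col_vec    <- iris[[2]]
class(one_col_df)
class(col_vec)
identical(col_vec, iris$Sepal.Width)
```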

Recap!

Dimension   Homogeneous     Heterogeneous
1d          Atomic Vector   List
2d          Matrix          Data Frame


Basic access via
  • Atomic vectors - x[ ]
  • Matrices - x[ , ]
  • Data Frames - x[ , ] or x$name
  • Lists - x[ ], x[[ ]], or x$name

Quick Examples

Reading Raw Data Into R

Plan:

  • Common raw data formats
  • Comma Separated Value (CSV) files
  • Asides: R projects and R packages
  • Read ‘clean’ delimited data
  • Excel, SAS, & SPSS data
  • Resources for JSON, databases, and APIs

Importing Data

How to read in data depends on raw/external data type!

  • Delimited data

    • Delimiter - Character (such as a ,) that separates data entries

    • Common delimiters and typical file extensions:

      • Comma: usually .csv
      • Space: usually .txt or .dat
      • Tab: usually .tsv or .txt
      • General: usually .txt or .dat

Importing Delimited Data: Standard R Methods

  • When you open R a few packages are loaded
  • R package
    • Collection of functions/datasets/etc. in one place
    • Packages exist to do almost anything
    • List of CRAN approved packages on R’s website
    • Plenty of other packages on places like GitHub

Importing Delimited Data: Standard R Methods

  • When you open R a few packages are loaded

  • utils package has family of read. functions ready for use!

Reading Delimited Data

  • Functions from read. family work well

  • Concerns:

    • poor default function behavior

      • (formerly, prior to R 4.0) strings are read as factors

      • row & column names can be troublesome

    • (Slightly) different behavior on different computers

    • Want most of the functions we use to ‘feel’ the same…
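
A sketch of the strings-as-factors concern using read.csv() from the utils package. The tiny CSV text is made up for illustration, and the text argument is used so that no file is needed.

```r
# Made-up CSV content, supplied as text instead of a file
csv_text <- "Treatment,Age\nP,68\nB,74"

# Strings kept as character (the default since R 4.0)
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_chr$Treatment)

# Strings converted to factors (the default before R 4.0)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_fct$Treatment)
```

Setting stringsAsFactors explicitly is one way to get the same result regardless of R version.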

Aside: R Packages

  • R package
    • Collection of functions in one place
    • Packages exist to do almost anything
    • List of CRAN approved packages on R’s website
    • Plenty of other packages on places like GitHub
  • “tidyverse” - collection of R packages that share common philosophies and are designed to work together!

Aside: R Packages

  • First time using a package
    • Must install package (download files)
    • Can use code or menus
install.packages("readr")

Aside: R Packages

  • Only install once!

  • Each session: load package using library() or require()

library("tidyverse")
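
One possible pattern, sketched here as a suggestion rather than a requirement: check whether a package is installed with requireNamespace() before attaching it, so a script fails with a helpful message instead of an error. The variable name has_readr is illustrative only.

```r
# TRUE if readr is installed; does not attach the package
has_readr <- requireNamespace("readr", quietly = TRUE)

if (has_readr) {
  library(readr)   # attach it for this session
} else {
  message("readr is not installed; run install.packages(\"readr\") once")
}
```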

Aside: R Packages

  • Can call functions without loading full library with ::

  • If not specified, most recently loaded package takes precedence

#stats::filter(...) calls time-series function from stats package
dplyr::filter(iris, Species == "virginica")
##    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1           6.3         3.3          6.0         2.5 virginica
## 2           5.8         2.7          5.1         1.9 virginica
## 3           7.1         3.0          5.9         2.1 virginica
## 4           6.3         2.9          5.6         1.8 virginica
## 5           6.5         3.0          5.8         2.2 virginica
## 6           7.6         3.0          6.6         2.1 virginica
## 7           4.9         2.5          4.5         1.7 virginica
## 8           7.3         2.9          6.3         1.8 virginica
## 9           6.7         2.5          5.8         1.8 virginica
## 10          7.2         3.6          6.1         2.5 virginica
## 11          6.5         3.2          5.1         2.0 virginica
## 12          6.4         2.7          5.3         1.9 virginica
## 13          6.8         3.0          5.5         2.1 virginica
## 14          5.7         2.5          5.0         2.0 virginica
## 15          5.8         2.8          5.1         2.4 virginica
## 16          6.4         3.2          5.3         2.3 virginica
## 17          6.5         3.0          5.5         1.8 virginica
## 18          7.7         3.8          6.7         2.2 virginica
## 19          7.7         2.6          6.9         2.3 virginica
## 20          6.0         2.2          5.0         1.5 virginica
## 21          6.9         3.2          5.7         2.3 virginica
## 22          5.6         2.8          4.9         2.0 virginica
## 23          7.7         2.8          6.7         2.0 virginica
## 24          6.3         2.7          4.9         1.8 virginica
## 25          6.7         3.3          5.7         2.1 virginica
## 26          7.2         3.2          6.0         1.8 virginica
## 27          6.2         2.8          4.8         1.8 virginica
## 28          6.1         3.0          4.9         1.8 virginica
## 29          6.4         2.8          5.6         2.1 virginica
## 30          7.2         3.0          5.8         1.6 virginica
## 31          7.4         2.8          6.1         1.9 virginica
## 32          7.9         3.8          6.4         2.0 virginica
## 33          6.4         2.8          5.6         2.2 virginica
## 34          6.3         2.8          5.1         1.5 virginica
## 35          6.1         2.6          5.6         1.4 virginica
## 36          7.7         3.0          6.1         2.3 virginica
## 37          6.3         3.4          5.6         2.4 virginica
## 38          6.4         3.1          5.5         1.8 virginica
## 39          6.0         3.0          4.8         1.8 virginica
## 40          6.9         3.1          5.4         2.1 virginica
## 41          6.7         3.1          5.6         2.4 virginica
## 42          6.9         3.1          5.1         2.3 virginica
## 43          5.8         2.7          5.1         1.9 virginica
## 44          6.8         3.2          5.9         2.3 virginica
## 45          6.7         3.3          5.7         2.5 virginica
## 46          6.7         3.0          5.2         2.3 virginica
## 47          6.3         2.5          5.0         1.9 virginica
## 48          6.5         3.0          5.2         2.0 virginica
## 49          6.2         3.4          5.4         2.3 virginica
## 50          5.9         3.0          5.1         1.8 virginica
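
The same precedence idea can be seen without installing anything. In this sketch a made-up function named sd is defined in the workspace; it hides stats::sd() until it is removed, while stats::sd() stays reachable through ::. The replacement function is purely illustrative.

```r
# A made-up function that hides stats::sd() when called as sd()
sd <- function(x) "my own sd"

masked_result   <- sd(c(1, 2, 3))         # finds the workspace version first
explicit_result <- stats::sd(c(1, 2, 3))  # :: asks for the stats version

masked_result
explicit_result

# Remove the made-up function so sd() means stats::sd() again
rm(sd)
```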

Reading Delimited Data

Base R (utils package) and tidyverse (readr package does the heavy lifting) functions for each type of delimiter:

Type of Delimiter           utils Function                     readr Function
Comma                       read.csv()                         read_csv()
Semicolon (, for decimal)   read.csv2()                        read_csv2()
Tab                         read.delim()                       read_tsv()
General                     read.table(sep = "<delimiter>")    read_delim()
White Space                 read.table(sep = "")               read_table() read_table2()
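
A sketch of the utils functions from the table, reading the same made-up data in three delimited forms. The text argument is used so that no files are needed; the object names are illustrative only.

```r
# Comma delimited, '.' as decimal mark
from_csv <- read.csv(text = "x,y\n1.5,a\n2.5,b")

# Semicolon delimited, ',' as decimal mark
from_csv2 <- read.csv2(text = "x;y\n1,5;a\n2,5;b")

# Tab delimited
from_tsv <- read.delim(text = "x\ty\n1.5\ta\n2.5\tb")

# All three produce the same data frame
from_csv
all.equal(from_csv, from_csv2)
all.equal(from_csv, from_tsv)
```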

Working Directory

  • Let’s read in the ‘neuralgia.csv’ file

  • By default, R looks in the working directory for the file

getwd()
## [1] "C:/repos/Basics-of-R-for-Data-Science-and-Statistics/slides"

Working Directory

  • Can change working directory via code or menus

setwd("C:/Users/jbpost2/repos/Basics-of-R-for-Data-Science-and-Statistics/datasets")
#or
setwd("C:\\Users\\jbpost2\\repos\\Basics-of-R-for-Data-Science-and-Statistics\\datasets")
#better to use R projects!

Reading a .csv File

With neuralgia.csv file in the working directory:

neuralgiaData <- read_csv("neuralgia.csv")
neuralgiaData
## # A tibble: 60 x 5
##   Treatment Sex     Age Duration Pain 
##   <chr>     <chr> <dbl>    <dbl> <chr>
## 1 P         F        68        1 No   
## 2 B         M        74       16 No   
## 3 P         F        67       30 No   
## 4 P         M        66       26 Yes  
## 5 B         F        67       28 No   
## # ... with 55 more rows

Reading a .csv File

  • Use full local path
neuralgiaData <- read_csv(
  "C:/Users/jbpost2/repos/Basics-of-R-for-Data-Science-and-Statistics/datasets/neuralgia.csv"
)

Reading a .csv File

  • Use relative path (../ moves up one folder)
neuralgiaData <- read_csv("../datasets/neuralgia.csv")
  • Working directory: “…/Basics-of-R-for-Data-Science-and-Statistics/slides”

  • File location: “…/Basics-of-R-for-Data-Science-and-Statistics/datasets/neuralgia.csv”

  • As long as others have the same folder structure, can share code with no path change needed!
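
A rough sketch of how the relative path resolves, using a throwaway folder structure in a temporary directory. The folder name course and the file toy.csv are made up for illustration; base read.csv() is used so nothing beyond base R is needed.

```r
# Build a throwaway folder structure that mimics slides/ and datasets/
root <- file.path(tempdir(), "course")
dir.create(file.path(root, "slides"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "datasets"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(root, "datasets", "toy.csv"),
          row.names = FALSE)

# Pretend we are working from the slides folder; setwd() returns the old directory
oldwd <- setwd(file.path(root, "slides"))

# ../ steps up to 'course', then datasets/ steps back down
toyData <- read.csv("../datasets/toy.csv")
toyData

# Restore the original working directory
setwd(oldwd)
```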

Aside: RStudio Project

  • Often have many files associated with an analysis

  • With multiple analyses things get cluttered…

  • Want to associate different

    • environments

    • histories

    • working directories

    • source documents

    with each analysis

  • Can use “Project” feature in RStudio

Aside: RStudio Project

  • Easy to create! Use an existing folder or create one:

  • Place all files for that analysis in that directory

  • Swap between projects using menu in top right

Reading a .csv File

  • Back to reading in data!

  • R can pull from URLs as well!

neuralgiaData <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/neuralgia.csv")
neuralgiaData
## # A tibble: 60 x 5
##   Treatment Sex     Age Duration Pain 
##   <chr>     <chr> <dbl>    <dbl> <chr>
## 1 P         F        68        1 No   
## 2 B         M        74       16 No   
## 3 P         F        67       30 No   
## 4 P         M        66       26 Yes  
## 5 B         F        67       28 No   
## # ... with 55 more rows

tibbles

  • Notice: fancy printing!

  • Checking column type is a basic data validation step

  • tidyverse data frames are called tibbles

class(neuralgiaData)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

tibbles

  • Behavior differs slightly from a standard data frame. No simplification when selecting a single column with [ ]!
neuralgiaData2 <- as.data.frame(neuralgiaData)
neuralgiaData2[,1]
##  [1] "P" "B" "P" "P" "B" "B" "A" "B" "B" "A" "A" "A" "B" "A" "P" "A" "P" "A" "P"
## [20] "B" "B" "A" "A" "A" "B" "P" "B" "B" "P" "P" "A" "A" "B" "B" "B" "A" "P" "B"
## [39] "B" "P" "P" "P" "A" "B" "A" "P" "P" "A" "B" "P" "P" "P" "B" "A" "P" "A" "P"
## [58] "A" "B" "A"
neuralgiaData[,1]
## # A tibble: 60 x 1
##   Treatment
##   <chr>    
## 1 P        
## 2 B        
## 3 P        
## 4 P        
## 5 B        
## # ... with 55 more rows

  • To return a column as a vector, use either dplyr::pull() or $

pull(neuralgiaData, Treatment) #or pull(neuralgiaData, 1)
##  [1] "P" "B" "P" "P" "B" "B" "A" "B" "B" "A" "A" "A" "B" "A" "P" "A" "P" "A" "P"
## [20] "B" "B" "A" "A" "A" "B" "P" "B" "B" "P" "P" "A" "A" "B" "B" "B" "A" "P" "B"
## [39] "B" "P" "P" "P" "A" "B" "A" "P" "P" "A" "B" "P" "P" "P" "B" "A" "P" "A" "P"
## [58] "A" "B" "A"
neuralgiaData$Treatment 
##  [1] "P" "B" "P" "P" "B" "B" "A" "B" "B" "A" "A" "A" "B" "A" "P" "A" "P" "A" "P"
## [20] "B" "B" "A" "A" "A" "B" "P" "B" "B" "P" "P" "A" "A" "B" "B" "B" "A" "P" "B"
## [39] "B" "P" "P" "P" "A" "B" "A" "P" "P" "A" "B" "P" "P" "P" "B" "A" "P" "A" "P"
## [58] "A" "B" "A"
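
A small sketch contrasting the subsetting behaviors, using a made-up three-row tibble (toy) in place of neuralgiaData. The [[ ]] form is not covered on the slides but behaves like $.

```r
library(tibble)
library(dplyr)

# Small made-up tibble standing in for neuralgiaData
toy <- tibble(Treatment = c("P", "B", "A"), Age = c(68, 74, 67))

# Single-bracket column selection keeps the tibble structure
is_tibble(toy[, 1])

# A plain data frame simplifies the same selection to a vector
is.vector(as.data.frame(toy)[, 1])

# pull(), $, and [[ ]] all return the underlying vector
identical(pull(toy, Treatment), toy$Treatment)
identical(toy[["Treatment"]], toy$Treatment)
```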

Reading Space Delimited Data

  • Reading clean delimited data pretty easy with the tidyverse!

  • Let’s read in the ‘chemical.txt’ file (space delimited)

  • read_table2() allows multiple white space characters between entries

read_table2("https://www4.stat.ncsu.edu/~online/datasets/chemical.txt")
## # A tibble: 19 x 4
##     temp  conc  time percent
##    <dbl> <dbl> <dbl>   <dbl>
##  1  -1    -1    -1      45.9
##  2   1    -1    -1      60.6
##  3  -1     1    -1      57.5
##  4   1     1    -1      58.6
##  5  -1    -1     1      53.3
##  6   1    -1     1      58  
##  7  -1     1     1      58.8
##  8   1     1     1      52.4
##  9  -2     0     0      46.9
## 10   2     0     0      55.4
## 11   0    -2     0      55  
## 12   0     2     0      57.5
## 13   0     0    -2      56.3
## 14   0     0     2      58.9
## 15   0     0     0      56.9
## 16   2    -3     0      61.1
## 17   2    -3     0      62.9
## 18  -1.4   2.6   0.7    60  
## 19  -1.4   2.6   0.7    60.6

Reading Tab Delimited Data

  • Reading clean delimited data pretty easy with the tidyverse!

  • Let’s read in the ‘crabs.txt’ file (tab delimited)

read_tsv("https://www4.stat.ncsu.edu/~online/datasets/crabs.txt")
## # A tibble: 173 x 6
##   color spine width satell weight     y
##   <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
## 1     3     3  28.3      8   3050     1
## 2     4     3  22.5      0   1550     0
## 3     2     1  26        9   2300     1
## 4     4     3  24.8      0   2100     0
## 5     4     3  26        4   2600     1
## # ... with 168 more rows

Reading Generic Delimited Data

  • Reading clean delimited data pretty easy with the tidyverse!

  • Let’s read in the ‘umps2012.txt’ file (‘>’ delimited)

  • In raw data, no column names provided

Reading Generic Delimited Data

read_delim(file, delim, col_names = TRUE, col_types = NULL, na = c("", "NA"), skip = 0, guess_max = min(1000, n_max), ...)

read_delim("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt", 
           delim = ">",
           col_names = c("Year", "Month", "Day", "Home", "Away", "HPUmpire"))
## # A tibble: 2,359 x 6
##    Year Month   Day Home  Away  HPUmpire        
##   <dbl> <dbl> <dbl> <chr> <chr> <chr>           
## 1  2012     4    12 MIN   LAA   D.J. Reyburn    
## 2  2012     4    12 SD    ARI   Marty Foster    
## 3  2012     4    12 WSH   CIN   Mike Everitt    
## 4  2012     4    12 PHI   MIA   Jeff Nelson     
## 5  2012     4    12 CHC   MIL   Fieldin Culbreth
## # ... with 2,354 more rows

Reading Delimited Data

  • How do readr functions determine the column types? From the help:

col_types
One of NULL, a cols() specification, or a string. See vignette("readr") for more details.

If NULL, all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself.

  • Other useful inputs:
    • skip = 0
    • col_names = TRUE
    • na = c("", "NA")
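
A hedged sketch of these inputs working together on a made-up '>'-delimited file. The file contents and column names are assumptions; the compact string "iicc" is one way to supply col_types instead of relying on guessing.

```r
library(readr)

# Made-up '>'-delimited file with a junk first line and no header row
tmp <- tempfile(fileext = ".txt")
writeLines(c("exported by some tool",
             "2012>4>MIN>007",
             "2012>5>NA>012"), tmp)

games <- read_delim(
  tmp,
  delim = ">",
  skip = 1,                                        # ignore the junk line
  col_names = c("Year", "Month", "Home", "Code"),  # supply the header
  col_types = "iicc",                              # integer, integer, character, character
  na = c("", "NA")                                 # strings treated as missing
)
games
```

Reading Code as character keeps the leading zeros, which guessing would likely drop by treating the column as numeric.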

Reading Fixed Field & Tricky Non-Standard Data

  • read_fwf()
    • reads in data where entries are very structured
  • read_file()
    • reads an entire file into a single string
  • read_lines()
    • reads a file into a character vector with one element per line
  • Usually parse the last two with regular expressions :(
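
A minimal sketch of two of these functions on a made-up fixed-width file (name in columns 1-5, score in columns 6-8). The layout and names are assumptions for illustration only.

```r
library(readr)

# Made-up fixed-width file: name in columns 1-5, score in columns 6-8
tmp <- tempfile(fileext = ".txt")
writeLines(c("Anna  91", "Bob   78"), tmp)

# read_fwf() splits each line by the supplied column widths
fixed <- read_fwf(tmp, fwf_widths(c(5, 3), c("name", "score")))
fixed

# read_lines() gives one string per line; a regular expression pulls out the trailing digits
rawLines <- read_lines(tmp)
scores <- as.numeric(regmatches(rawLines, regexpr("[0-9]+$", rawLines)))
scores
```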

Reading Data From Other Sources

Type of file         Package  Function
Delimited            readr    read_csv(), read_tsv(), read_table(), read_delim()
Excel (.xls, .xlsx)  readxl   read_excel()
SAS (.sas7bdat)      haven    read_sas()
SPSS (.sav)          haven    read_spss()

  • Look at resources for JSON, databases, and APIs

Excel Data

  • Read in censusEd.xlsx

  • Use read_excel() from readxl package!

    • Reads both xls and xlsx files

    • Detects format from extension given

    • Can’t pull from web though!

read_excel

#install package if necessary
library(readxl)
#reads first sheet by default
edData <- read_excel("../datasets/censusEd.xlsx")
edData
## # A tibble: 3,198 x 42
##   Area_name     STCOU EDU010187F EDU010187D EDU010187N1 EDU010187N2 EDU010188F
##   <chr>         <chr>      <dbl>      <dbl> <chr>       <chr>            <dbl>
## 1 UNITED STATES 00000          0   40024299 0000        0000                 0
## 2 ALABAMA       01000          0     733735 0000        0000                 0
## 3 Autauga, AL   01001          0       6829 0000        0000                 0
## 4 Baldwin, AL   01003          0      16417 0000        0000                 0
## 5 Barbour, AL   01005          0       5071 0000        0000                 0
## # ... with 3,193 more rows, and 35 more variables: EDU010188D <dbl>,
## #   EDU010188N1 <chr>, EDU010188N2 <chr>, EDU010189F <dbl>, EDU010189D <dbl>,
## #   EDU010189N1 <chr>, EDU010189N2 <chr>, EDU010190F <dbl>, EDU010190D <dbl>,
## #   EDU010190N1 <chr>, EDU010190N2 <chr>, EDU010191F <dbl>, EDU010191D <dbl>,
## #   EDU010191N1 <chr>, EDU010191N2 <chr>, EDU010192F <dbl>, EDU010192D <dbl>,
## #   EDU010192N1 <chr>, EDU010192N2 <chr>, EDU010193F <dbl>, EDU010193D <dbl>,
## #   EDU010193N1 <chr>, EDU010193N2 <chr>, EDU010194F <dbl>, EDU010194D <dbl>,
## #   EDU010194N1 <chr>, EDU010194N2 <chr>, EDU010195F <dbl>, EDU010195D <dbl>,
## #   EDU010195N1 <chr>, EDU010195N2 <chr>, EDU010196F <dbl>, EDU010196D <dbl>,
## #   EDU010196N1 <chr>, EDU010196N2 <chr>

Dealing with Excel Sheets

  • Can look at sheets available
excel_sheets("../datasets/censusEd.xlsx")
##  [1] "EDU01A" "EDU01B" "EDU01C" "EDU01D" "EDU01E" "EDU01F" "EDU01G" "EDU01H"
##  [9] "EDU01I" "EDU01J"
  • Specify sheet with name or integers (or NULL for 1st) using sheet =
read_excel("../datasets/censusEd.xlsx", sheet = "EDU01D")
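
One possible way to read every sheet at once, sketched with the example workbook that ships with readxl (datasets.xlsx) rather than censusEd.xlsx. The object names are made up.

```r
library(readxl)

# Example workbook shipped with the readxl package (not the course's censusEd.xlsx)
path <- readxl_example("datasets.xlsx")
sheets <- excel_sheets(path)

# Read every sheet into a named list of tibbles
allSheets <- lapply(sheets, function(s) read_excel(path, sheet = s))
names(allSheets) <- sheets

# Sheets can also be chosen by position
firstSheet <- read_excel(path, sheet = 1)
```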

SAS Data

  • SAS data has extension ‘.sas7bdat’

  • Read in smoke2003.sas7bdat

  • Use read_sas() from haven package

  • Not many options!

#install if necessary
library(haven)
smokeData <- read_sas("https://www4.stat.ncsu.edu/~online/datasets/smoke2003.sas7bdat")
smokeData
## # A tibble: 443 x 54
##    SEQN SDDSRVYR RIDSTATR RIDEXMON RIAGENDR RIDAGEYR RIDAGEMN RIDAGEEX RIDRETH1
##   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 21010        3        2        2        2       52      633      634        3
## 2 21012        3        2        2        1       63      765      766        4
## 3 21048        3        2        1        2       42      504      504        1
## 4 21084        3        2        1        2       57      692      693        3
## 5 21093        3        2        1        2       64      778      778        2
## # ... with 438 more rows, and 45 more variables: RIDRETH2 <dbl>,
## #   DMQMILIT <dbl>, DMDBORN <dbl>, DMDCITZN <dbl>, DMDYRSUS <dbl>,
## #   DMDEDUC3 <dbl>, DMDEDUC2 <dbl>, DMDEDUC <dbl>, DMDSCHOL <dbl>,
## #   DMDMARTL <dbl>, DMDHHSIZ <dbl>, INDHHINC <dbl>, INDFMINC <dbl>,
## #   INDFMPIR <dbl>, RIDEXPRG <dbl>, DMDHRGND <dbl>, DMDHRAGE <dbl>,
## #   DMDHRBRN <dbl>, DMDHREDU <dbl>, DMDHRMAR <dbl>, DMDHSEDU <dbl>,
## #   SIALANG <dbl>, SIAPROXY <dbl>, SIAINTRP <dbl>, FIALANG <dbl>,
## #   FIAPROXY <dbl>, FIAINTRP <dbl>, MIALANG <dbl>, MIAPROXY <dbl>,
## #   MIAINTRP <dbl>, AIALANG <dbl>, WTINT2YR <dbl>, WTMEC2YR <dbl>,
## #   SDMVPSU <dbl>, SDMVSTRA <dbl>, Gender <dbl>, Age <dbl>, IncomeGroup <chr>,
## #   Ethnicity <chr>, Education <dbl>, SMD070 <dbl>, SMQ077 <dbl>, SMD650 <dbl>,
## #   PacksPerDay <dbl>, lbdvid <dbl>

SPSS Data

  • SPSS data has extension “.sav”

  • Read in bodyFat.sav

  • Use read_spss() from haven package
  • Not many options!

bodyFatData <- read_spss("https://www4.stat.ncsu.edu/~online/datasets/bodyFat.sav")
bodyFatData
## # A tibble: 20 x 4
##        y    x1    x2    x3
##    <dbl> <dbl> <dbl> <dbl>
##  1  19.5  43.1  29.1  11.9
##  2  24.7  49.8  28.2  22.8
##  3  30.7  51.9  37    18.7
##  4  29.8  54.3  31.1  20.1
##  5  19.1  42.2  30.9  12.9
##  6  25.6  53.9  23.7  21.7
##  7  31.4  58.5  27.6  27.1
##  8  27.9  52.1  30.6  25.4
##  9  22.1  49.9  23.2  21.3
## 10  25.5  53.5  24.8  19.3
## 11  31.1  56.6  30    25.4
## 12  30.4  56.7  28.3  27.2
## 13  18.7  46.5  23    11.7
## 14  19.7  44.2  28.6  17.8
## 15  14.6  42.7  21.3  12.8
## 16  29.5  54.4  30.1  23.9
## 17  27.7  55.3  25.7  22.6
## 18  30.2  58.6  24.6  25.4
## 19  22.7  48.2  27.1  14.8
## 20  25.2  51    27.5  21.1

tidyverse

Notice the ease of use of the functions across the tidyverse so far:

  • function_name('path-to-file', options)

  • All functions read the data into a tibble

  • Good defaults that do the work for you

Quick Examples

Resources for Other Data Sources

JSON - JavaScript Object Notation

  • Used widely across the internet and databases

  • Can represent usual 2D data or hierarchical data

JSON - JavaScript Object Notation

  • Uses key-value pairs
[
  {
    "name": "Barry Sanders",
    "games": 153,
    "position": "RB"
  },
  {
    "name": "Joe Montana",
    "games": 192,
    "position": "QB"
  }
]
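
A short sketch of reading the two player records into R with jsonlite, assuming they are stored as a JSON array in a character string (playerJSON is a made-up name).

```r
library(jsonlite)

# The two records as a JSON array stored in a character string
playerJSON <- '[
  {"name": "Barry Sanders", "games": 153, "position": "RB"},
  {"name": "Joe Montana",   "games": 192, "position": "QB"}
]'

# An array of objects with the same keys simplifies to a data frame
players <- fromJSON(playerJSON)
players
```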

JSON - JavaScript Object Notation

Major R packages

  1. rjson

  2. RJSONIO

  3. jsonlite

    • many nice features

    • a little slower implementation

  4. tidyjson - new tidyverse package

jsonlite Package

jsonlite basic functions:

Function   Description
fromJSON   Reads JSON data from file path or character string. Converts and simplifies to R object
toJSON     Writes R object to JSON object
stream_in  Accepts a file connection - can read streaming JSON data
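
A quick sketch of the three functions on made-up data. A textConnection stands in for a file connection here, which is an assumption made so the example needs no external file.

```r
library(jsonlite)

df <- data.frame(name = c("a", "b"), value = c(1.5, 2), stringsAsFactors = FALSE)

# R object -> JSON text
jsonText <- toJSON(df)
jsonText

# JSON text -> R object
back <- fromJSON(jsonText)

# stream_in() reads newline-delimited JSON (one record per line) from a connection
con <- textConnection(c('{"id": 1}', '{"id": 2}'))
streamed <- stream_in(con, verbose = FALSE)
close(con)
streamed
```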

APIs - Application Programming Interfaces

A defined method for asking for information from a computer

  • Useful for getting data

  • Useful for allowing others to run your model without a GUI (like Shiny)

  • Many open APIs, just need key
  • Often just need to construct proper URL

APIs - Quick Example

Registered for a key at newsapi.org. An API for looking at news articles

  • Look at documentation for API (most have this!)

  • Example URL to obtain data is given

https://newsapi.org/v2/everything?q=bitcoin&apiKey=myKeyGoesHere

Example https://newsapi.org/

  • Can add in date for instance:

https://newsapi.org/v2/everything?q=bitcoin&from=2021-06-01&apiKey=myKeyGoesHere
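
Rather than typing the URL by hand, the query can be assembled in R. A sketch using modify_url() from httr; the key is the same placeholder as above, and no request is sent.

```r
library(httr)

# Placeholder key: substitute the key issued to you by the API provider
apiKey <- "myKeyGoesHere"

# modify_url() assembles and URL-encodes the query string
url <- modify_url(
  "https://newsapi.org/v2/everything",
  query = list(q = "bitcoin", from = "2021-06-01", apiKey = apiKey)
)
url
```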

Using R to Obtain the Data

  • Use GET from httr package (make sure to load package!)

  • Modify for what you have interest in!

library(httr)
#save the response so it can be inspected and parsed later
myData <- GET(paste0("http://newsapi.org/v2/everything?qInTitle=Juneteenth",
                     "&from=2021-06-01&language=en&apiKey=myKeyGoesHere"))

Returned data

  • Usually what you want is stored in something like content
str(myData, max.level = 1)
## List of 10
##  $ url        : chr "http://newsapi.org/v2/everything?qInTitle=tesla&from=2021-06-01&language=en&pageSize=100&apiKey=aa4b545bfbd64d4"| __truncated__
##  $ status_code: int 426
##  $ headers    :List of 13
##   ..- attr(*, "class")= chr [1:2] "insensitive" "list"
##  $ all_headers:List of 1
##  $ cookies    :'data.frame': 0 obs. of  7 variables:
##  $ content    : raw [1:255] 7b 22 73 74 ...
##  $ date       : POSIXct[1:1], format: "2021-08-07 05:38:26"
##  $ times      : Named num [1:6] 0 0.0363 0.0594 0.0595 0.1199 ...
##   ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
##  $ request    :List of 7
##   ..- attr(*, "class")= chr "request"
##  $ handle     :Class 'curl_handle' <externalptr> 
##  - attr(*, "class")= chr "response"

Parse with jsonlite

Common steps:

  • Grab the list element we want
  • Convert it to characters (it will have a JSON structure)
  • Convert it to a data frame with fromJSON from the jsonlite package
library(dplyr)
library(jsonlite)
parsed <- fromJSON(rawToChar(myData$content))
str(parsed, max.level = 1)
## List of 3
##  $ status : chr "error"
##  $ code   : chr "parameterInvalid"
##  $ message: chr "You are trying to request results too far in the past. Your plan permits you to request articles as far back as"| __truncated__
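
A sketch of the same parsing steps without making a request: made-up JSON is converted to raw bytes to stand in for myData$content. The field names status, articles, and title are assumptions about what a successful response might contain.

```r
library(jsonlite)

# Stand-in for myData$content: JSON text converted to raw bytes (no request is made)
fakeContent <- charToRaw('{"status": "ok", "articles": [{"title": "A"}, {"title": "B"}]}')

# Raw bytes -> character string -> R list
parsed <- fromJSON(rawToChar(fakeContent))

# Check the status field before trusting the payload
if (parsed$status == "ok") {
  articles <- parsed$articles   # data frame with one row per article
} else {
  stop("API returned an error: ", parsed$message)
}
articles
```

Checking the status first avoids silently working with an error message, like the one returned in the output above.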

APIs - Application Programming Interfaces

Access in R

  • Article here discusses accessing APIs generically with R

  • Same website gives a list of APIs

Resources for Other Data Sources

Databases

  • Collection of data, usually a bunch of related 2D tables

Many common database management systems

  • Oracle

  • SQL Server - Microsoft product

  • DB2 - IBM product

  • MySQL (open source) - Not as many features but popular

  • PostgreSQL (open source)

Basic SQL language constant across all - features differ

Example database structure

Source: oreilly.com

Databases - Common flow in R

  1. Connect to the database with DBI::dbConnect()
  • Need appropriate R package for database backend
    • RSQLite::SQLite() for RSQLite
    • RMySQL::MySQL() for RMySQL
    • RPostgreSQL::PostgreSQL() for RPostgreSQL
    • odbc::odbc() for Open Database Connectivity
    • bigrquery::bigquery() for google’s bigQuery
con <- DBI::dbConnect(RMySQL::MySQL(), 
  host = "hostname.website",
  user = "username",
  password = rstudioapi::askForPassword("DB password")
)

Databases - Common flow in R

  1. Connect to the database with DBI::dbConnect()
  • Need appropriate R package for database backend
  2. Use tbl() to reference a table in the database
tbl(con, "name_of_table")

Databases - Common flow in R

  1. Connect to the database with DBI::dbConnect()
  • Need appropriate R package for database backend
  2. Use tbl() to reference a table in the database
  3. Query the database with SQL or dplyr/dbplyr (we’ll learn dplyr soon!)
  4. Disconnect from database with dbDisconnect()
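
The full flow can be sketched end to end with an in-memory SQLite database standing in for a real server. Copying the built-in cars data in as a table is an assumption made so the example runs without credentials; it needs the DBI, RSQLite, dplyr, and dbplyr packages.

```r
library(DBI)
library(dplyr)   # tbl(), filter(), collect(); dbplyr must also be installed

# Connect: an in-memory SQLite database stands in for a real server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", cars)   # copy a built-in data frame in as a table

# Reference the table without pulling it into R
carsTbl <- tbl(con, "cars")

# Query with dplyr verbs (translated to SQL) or with SQL directly
fast <- carsTbl %>% filter(speed > 20) %>% collect()
fastSQL <- dbGetQuery(con, "SELECT * FROM cars WHERE speed > 20")

# Disconnect
dbDisconnect(con)
```

collect() is what actually brings the query result into R as a tibble; until then the work stays in the database.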

Recap

  • Read data from other sources
Type of file         Package  Function
Delimited            readr    read_csv(), read_tsv(), read_table(), read_delim()
Excel (.xls, .xlsx)  readxl   read_excel()
SAS (.sas7bdat)      haven    read_sas()
SPSS (.sav)          haven    read_spss()

  • Resources for JSON, databases, and APIs
  • Quick break!