Basic use of R for reading, manipulating, and plotting data!
R & RStudio installed
Explore the RStudio IDE (Integrated Development Environment)
Investigate common R objects and classes
Read in raw data
In RStudio, four main ‘areas’
Console (& Terminal)
Scripting and Viewing Window
Plots/Help (& Files/Packages)
Environment (& Connections/Git)
#simple math operations
# <-- is a comment - code not evaluated
3 + 7
## [1] 10
10 * exp(3) #exp is exponential function
## [1] 200.8554
log(pi^2) #log is natural log by default
## [1] 2.28946
mean(cars$speed)
## [1] 15.4
hist(cars$speed)
Created plots are stored in the Plots tab
Type help(...) into the console for documentation
help(seq)
help(data.frame)
Store data/info/function/etc. in R objects
Create an R object via <- (recommended) or =
#save for later
avg <- (5 + 7 + 6) / 3
#call avg object
avg
## [1] 6
#strings (text) can be saved as well
words <- c("Hello there!", "How are you?")
words
## [1] "Hello there!" "How are you?"
Use ls() to list the objects currently stored
ls()
## [1] "avg" "words"
Use rm() to remove an object
rm(avg)
ls()
## [1] "words"
Use rm(list=ls()) to remove all stored objects
Built-in objects exist, such as letters and cars
letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
head(cars, n = 3)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
data() shows available built-in datasets
R has strong Object Oriented Programming (OOP) tools
Object: data structure with attributes (often a ‘class’)
Method: procedures (often ‘functions’) that act on an object based on its attributes
R functions like plot() act differently depending on object class
class(cars)
## [1] "data.frame"
class(exp)
## [1] "function"
plot(cars)
plot(exp)
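The way plot() chooses its behavior from the class of its input can be imitated with a small S3 generic. This is only a minimal sketch: describe() and its methods are hypothetical names made up for illustration, not functions that ship with R.

```r
# describe() is a hypothetical generic, defined here only for illustration
describe <- function(x, ...) UseMethod("describe")

# fallback method used when no class-specific method exists
describe.default <- function(x, ...) {
  paste("An object of class", class(x)[1])
}

# method chosen automatically when x has class "data.frame"
describe.data.frame <- function(x, ...) {
  paste("A data frame with", nrow(x), "rows and", ncol(x), "columns")
}

cars_description <- describe(cars)  # dispatches to describe.data.frame
exp_description  <- describe(exp)   # no "function" method, so the default runs
cars_description
exp_description
```

The same call, describe(), runs different code for cars and exp, which is the mechanism that lets plot(cars) draw a scatterplot while plot(exp) draws a curve.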
vec <- c(1, 4, 10)
vec
## [1] 1 4 10
fit <- lm(dist ~ speed, data = cars)
fit
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
class(vec)
## [1] "numeric"
summary(vec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.5     4.0     5.0     7.0    10.0
class(fit)
## [1] "lm"
summary(fit)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
Many functions to help understand an R Object
class() describes the class attribute of an R object
class(cars)
## [1] "data.frame"
typeof() determines the (R internal) type or storage mode of any object
typeof(cars)
## [1] "list"
str() compactly displays the internal structure of an R object
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
RStudio provides a nice environment for coding
R has functions that can be used to create objects
Create an R Object with <-
Objects have attributes that determine how functions act
class(), typeof(), and str() help you understand your object
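To see how class() and typeof() can disagree, the sketch below applies both to a handful of objects. The name obj_list and the choice of objects are illustrative assumptions, not part of the original slides.

```r
# a few objects of different kinds (obj_list is just an illustrative name)
obj_list <- list(
  double_vec = c(1, 4, 10),
  int_vec    = 1:3,
  char_vec   = c("a", "b"),
  data_frame = cars,
  a_function = exp
)

# class(): what functions such as summary() dispatch on
# typeof(): how R stores the object internally
info <- data.frame(
  class  = sapply(obj_list, function(obj) class(obj)[1]),
  typeof = sapply(obj_list, typeof),
  stringsAsFactors = FALSE
)
info
```

A data frame reports class "data.frame" but type "list", which is why list-style access works on data frames.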
Understand data structures first: Five major types
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1d | Atomic Vector | List |
2d | Matrix | Data Frame |
Atomic vector: elements must be the same ‘type’
Create with the c() function (‘combine’)
#vectors (1 dimensional) objects
x <- c(17, 22, 1, 3, -3)
y <- c("cat", "dog", "bird", "frog")
x
## [1] 17 22 1 3 -3
y
## [1] "cat" "dog" "bird" "frog"
Many ‘functions’ output a numeric vector
Ex: seq()
Inputs = from, to, by (among others)
Output = a sequence of numbers
help(seq)
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
length.out = NULL, along.with = NULL, ...)
v <- seq(from = 1, to = 5, by = 1)
v
## [1] 1 2 3 4 5
str(v)
## num [1:5] 1 2 3 4 5
num says it is numeric
[1:5] implies one dimensional with length 5
Use : to create a sequence
1:20
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1:20/20
##  [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75
## [16] 0.80 0.85 0.90 0.95 1.00
1:20 + 1
## [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
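The examples above work because : is evaluated before arithmetic and because arithmetic is applied element by element. A short sketch of both points follows; the variable names are made up for illustration.

```r
# ':' binds tighter than '+' or '/', so the sequence is built first
shifted  <- 1:5 + 1      # same as (1:5) + 1
extended <- 1:(5 + 1)    # parentheses change the end point of the sequence

# dividing a sequence divides every element
fractions <- 1:20 / 20

# seq() can also be told how many values to return instead of a step size
five_points <- seq(from = 0, to = 1, length.out = 5)

shifted
extended
five_points
```

When in doubt about precedence, adding parentheses costs nothing and makes the intent explicit.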
Vectors useful to know about
Not usually useful for a dataset
Often consider as ‘building blocks’ for other data types
#populate vectors
x <- c(17, 3, 13, 11)
y <- rep(-3, times = 4)
z <- 1:4
#check 'type'
is.numeric(x)
## [1] TRUE
is.numeric(y)
## [1] TRUE
is.numeric(z)
## [1] TRUE
#check 'length'
length(x)
## [1] 4
length(y)
## [1] 4
length(z)
## [1] 4
Matrix: (think) columns are vectors of the same type and length
Create with the matrix() function (see help)
#populate vectors
x <- c(17, 3, 13, 11)
y <- rep(-3, times = 4)
z <- 1:4
#combine in a matrix
matrix(c(x, y, z), ncol = 3)
##      [,1] [,2] [,3]
## [1,]   17   -3    1
## [2,]    3   -3    2
## [3,]   13   -3    3
## [4,]   11   -3    4
x <- c("Hi", "There", "Friend", "!")
y <- c("a", "b", "c", "d")
z <- c("One", "Two", "Three", "Four")
is.character(x)
## [1] TRUE
matrix(c(x, y, z), nrow = 6)
##      [,1]     [,2]   
## [1,] "Hi"     "c"    
## [2,] "There"  "d"    
## [3,] "Friend" "One"  
## [4,] "!"      "Two"  
## [5,] "a"      "Three"
## [6,] "b"      "Four"
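In the output above, the values run down the first column before starting the second, because matrix() fills column by column unless told otherwise. The sketch below contrasts the default with byrow = TRUE and shows why a matrix cannot mix types; all object names are illustrative.

```r
vals <- 1:6

by_col <- matrix(vals, ncol = 2)                # default: fill down each column
by_row <- matrix(vals, ncol = 2, byrow = TRUE)  # fill across each row instead

by_col
by_row

# combining different types in one vector coerces everything to a common type,
# so a matrix built from this vector would hold only character values
mixed <- c(1, "a", TRUE)
class(mixed)
```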
Matrices are useful for some data, but datasets often have some numeric and some character variables:
Data frame: a collection (list) of vectors of the same length
Create with the data.frame() function
x <- c("a", "b", "c", "d", "e", "f")
y <- c(1, 3, 4, -1, 5, 6)
z <- 10:15
data.frame(x, y, z)
##   x  y  z
## 1 a  1 10
## 2 b  3 11
## 3 c  4 12
## 4 d -1 13
## 5 e  5 14
## 6 f  6 15
data.frame(char = x, data1 = y, data2 = z)
##   char data1 data2
## 1    a     1    10
## 2    b     3    11
## 3    c     4    12
## 4    d    -1    13
## 5    e     5    14
## 6    f     6    15
Perfect for most data sets!
Most functions that read 2D data store it as a data frame
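Unlike a matrix, each column of a data frame keeps its own type. The following sketch checks that on a small data frame; the values are made up, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
x <- c("a", "b", "c")
y <- c(1.5, 3, 4)
z <- 10:12

# stringsAsFactors = FALSE keeps text columns as character in any R version
df <- data.frame(char = x, data1 = y, data2 = z, stringsAsFactors = FALSE)

col_classes <- sapply(df, class)  # one class per column
col_classes
dim(df)                           # rows, then columns
```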
List: a vector that can have differing elements
Create with list()
list(1:3, rnorm(2), c("!", "?"))
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] 0.6233416 0.2943405
## 
## [[3]]
## [1] "!" "?"
list(seq = 1:3, normVals = rnorm(2), punctuation = c("!", "?"))
## $seq
## [1] 1 2 3
## 
## $normVals
## [1] -1.207736 -2.413757
## 
## $punctuation
## [1] "!" "?"
More flexible than a Data Frame!
Useful for more complex types of data
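One example of "more complex" data stored in a list is a fitted model. The sketch below refits the lm() model used earlier and inspects it as a list; nothing here is new beyond the calls already shown in the slides.

```r
fit <- lm(dist ~ speed, data = cars)

typeof(fit)        # the fitted model is stored internally as a list
class(fit)         # the class attribute is what summary() and plot() react to
names(fit)[1:3]    # a few of the named elements inside the list

# elements hold different kinds of things: here, a named numeric vector
slope <- fit$coefficients[["speed"]]
slope
```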
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1d | Atomic Vector | List |
2d | Matrix | Data Frame |
For most data analysis you’ll use data frames!
Next up: How do we access/change parts of our objects?
Atomic vectors: access elements with []
letters #built-in vector
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
letters[1] #R starts counting at 1!
## [1] "a"
letters[26]
## [1] "z"
Return elements using square brackets []
Can ‘feed’ in a vector of indices to []
letters[1:4]
## [1] "a" "b" "c" "d"
letters[c(5, 10, 15, 20, 25)]
## [1] "e" "j" "o" "t" "y"
x <- c(1, 2, 5); letters[x]
## [1] "a" "b" "e"
Use negative indices to return without
letters[-(1:4)]
##  [1] "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w"
## [20] "x" "y" "z"
x <- c(1, 2, 5); letters[-x]
##  [1] "c" "d" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
## [20] "w" "x" "y" "z"
Use square brackets with a comma [ , ]
Notice default row and column names!
mat <- matrix(c(1:4, 20:17), ncol = 2)
mat
##      [,1] [,2]
## [1,]    1   20
## [2,]    2   19
## [3,]    3   18
## [4,]    4   17
mat[c(2, 4), ]
##      [,1] [,2]
## [1,]    2   19
## [2,]    4   17
mat[, 1]
## [1] 1 2 3 4
mat[2, ]
## [1] 2 19
mat[2, 1]
## [1] 2
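Selecting a single row or column, as in mat[, 1] above, returns a plain vector rather than a matrix. If the 2D shape should be kept, base R's drop = FALSE argument can be added, as in this small sketch (object names are illustrative).

```r
mat <- matrix(c(1:4, 20:17), ncol = 2)

col_vec <- mat[, 1]                 # simplified to a vector
col_mat <- mat[, 1, drop = FALSE]   # stays a matrix with one column

is.matrix(col_vec)
is.matrix(col_mat)
dim(col_mat)
```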
The iris data frame:
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Data Frame is 2D similar to a matrix - access similarly!
Use square brackets with a comma [ , ]
iris[1:4, 2:4]
##   Sepal.Width Petal.Length Petal.Width
## 1         3.5          1.4         0.2
## 2         3.0          1.4         0.2
## 3         3.2          1.3         0.2
## 4         3.1          1.5         0.2
iris[1, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
iris[ , c("Sepal.Length", "Species")]
## Sepal.Length Species ## 1 5.1 setosa ## 2 4.9 setosa ## 3 4.7 setosa ## 4 4.6 setosa ## 5 5.0 setosa ## 6 5.4 setosa ## 7 4.6 setosa ## 8 5.0 setosa ## 9 4.4 setosa ## 10 4.9 setosa ## 11 5.4 setosa ## 12 4.8 setosa ## 13 4.8 setosa ## 14 4.3 setosa ## 15 5.8 setosa ## 16 5.7 setosa ## 17 5.4 setosa ## 18 5.1 setosa ## 19 5.7 setosa ## 20 5.1 setosa ## 21 5.4 setosa ## 22 5.1 setosa ## 23 4.6 setosa ## 24 5.1 setosa ## 25 4.8 setosa ## 26 5.0 setosa ## 27 5.0 setosa ## 28 5.2 setosa ## 29 5.2 setosa ## 30 4.7 setosa ## 31 4.8 setosa ## 32 5.4 setosa ## 33 5.2 setosa ## 34 5.5 setosa ## 35 4.9 setosa ## 36 5.0 setosa ## 37 5.5 setosa ## 38 4.9 setosa ## 39 4.4 setosa ## 40 5.1 setosa ## 41 5.0 setosa ## 42 4.5 setosa ## 43 4.4 setosa ## 44 5.0 setosa ## 45 5.1 setosa ## 46 4.8 setosa ## 47 5.1 setosa ## 48 4.6 setosa ## 49 5.3 setosa ## 50 5.0 setosa ## 51 7.0 versicolor ## 52 6.4 versicolor ## 53 6.9 versicolor ## 54 5.5 versicolor ## 55 6.5 versicolor ## 56 5.7 versicolor ## 57 6.3 versicolor ## 58 4.9 versicolor ## 59 6.6 versicolor ## 60 5.2 versicolor ## 61 5.0 versicolor ## 62 5.9 versicolor ## 63 6.0 versicolor ## 64 6.1 versicolor ## 65 5.6 versicolor ## 66 6.7 versicolor ## 67 5.6 versicolor ## 68 5.8 versicolor ## 69 6.2 versicolor ## 70 5.6 versicolor ## 71 5.9 versicolor ## 72 6.1 versicolor ## 73 6.3 versicolor ## 74 6.1 versicolor ## 75 6.4 versicolor ## 76 6.6 versicolor ## 77 6.8 versicolor ## 78 6.7 versicolor ## 79 6.0 versicolor ## 80 5.7 versicolor ## 81 5.5 versicolor ## 82 5.5 versicolor ## 83 5.8 versicolor ## 84 6.0 versicolor ## 85 5.4 versicolor ## 86 6.0 versicolor ## 87 6.7 versicolor ## 88 6.3 versicolor ## 89 5.6 versicolor ## 90 5.5 versicolor ## 91 5.5 versicolor ## 92 6.1 versicolor ## 93 5.8 versicolor ## 94 5.0 versicolor ## 95 5.6 versicolor ## 96 5.7 versicolor ## 97 5.7 versicolor ## 98 6.2 versicolor ## 99 5.1 versicolor ## 100 5.7 versicolor ## 101 6.3 virginica ## 102 5.8 virginica ## 103 7.1 virginica ## 104 6.3 virginica 
## 105 6.5 virginica ## 106 7.6 virginica ## 107 4.9 virginica ## 108 7.3 virginica ## 109 6.7 virginica ## 110 7.2 virginica ## 111 6.5 virginica ## 112 6.4 virginica ## 113 6.8 virginica ## 114 5.7 virginica ## 115 5.8 virginica ## 116 6.4 virginica ## 117 6.5 virginica ## 118 7.7 virginica ## 119 7.7 virginica ## 120 6.0 virginica ## 121 6.9 virginica ## 122 5.6 virginica ## 123 7.7 virginica ## 124 6.3 virginica ## 125 6.7 virginica ## 126 7.2 virginica ## 127 6.2 virginica ## 128 6.1 virginica ## 129 6.4 virginica ## 130 7.2 virginica ## 131 7.4 virginica ## 132 7.9 virginica ## 133 6.4 virginica ## 134 6.3 virginica ## 135 6.1 virginica ## 136 7.7 virginica ## 137 6.3 virginica ## 138 6.4 virginica ## 139 6.0 virginica ## 140 6.9 virginica ## 141 6.7 virginica ## 142 6.9 virginica ## 143 5.8 virginica ## 144 6.8 virginica ## 145 6.7 virginica ## 146 6.7 virginica ## 147 6.3 virginica ## 148 6.5 virginica ## 149 6.2 virginica ## 150 5.9 virginica
iris$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 ## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 ## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5 ## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1 ## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 ## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 ## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 ## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 ## [145] 6.7 6.7 6.3 6.5 6.2 5.9
Dollar sign allows easy access to a single column!
Most used method for accessing a single variable
RStudio fills in options after you type iris$
Use [ ] for multiple list elements
x <- list("HI", c(10:20), 1)
x
## [[1]]
## [1] "HI"
## 
## [[2]]
##  [1] 10 11 12 13 14 15 16 17 18 19 20
## 
## [[3]]
## [1] 1
x[2:3]
## [[1]]
##  [1] 10 11 12 13 14 15 16 17 18 19 20
## 
## [[2]]
## [1] 1
Use [[ ]] (or [ ]) for a single list element
x[1]
## [[1]]
## [1] "HI"
x[[1]]
## [1] "HI"
x[[2]]
## [1] 10 11 12 13 14 15 16 17 18 19 20
x[[2]][4:5]
## [1] 13 14
x <- list("HI", c(10:20), 1)
str(x)
## List of 3
##  $ : chr "HI"
##  $ : int [1:11] 10 11 12 13 14 15 16 17 18 19 ...
##  $ : num 1
x <- list(First = "Hi", Second = c(10:20), Third = 1)
x$Second
## [1] 10 11 12 13 14 15 16 17 18 19 20
str(x)
## List of 3
##  $ First : chr "Hi"
##  $ Second: int [1:11] 10 11 12 13 14 15 16 17 18 19 ...
##  $ Third : num 1
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
typeof(x)
## [1] "list"
typeof(iris)
## [1] "list"
iris[2]
## Sepal.Width ## 1 3.5 ## 2 3.0 ## 3 3.2 ## 4 3.1 ## 5 3.6 ## 6 3.9 ## 7 3.4 ## 8 3.4 ## 9 2.9 ## 10 3.1 ## 11 3.7 ## 12 3.4 ## 13 3.0 ## 14 3.0 ## 15 4.0 ## 16 4.4 ## 17 3.9 ## 18 3.5 ## 19 3.8 ## 20 3.8 ## 21 3.4 ## 22 3.7 ## 23 3.6 ## 24 3.3 ## 25 3.4 ## 26 3.0 ## 27 3.4 ## 28 3.5 ## 29 3.4 ## 30 3.2 ## 31 3.1 ## 32 3.4 ## 33 4.1 ## 34 4.2 ## 35 3.1 ## 36 3.2 ## 37 3.5 ## 38 3.6 ## 39 3.0 ## 40 3.4 ## 41 3.5 ## 42 2.3 ## 43 3.2 ## 44 3.5 ## 45 3.8 ## 46 3.0 ## 47 3.8 ## 48 3.2 ## 49 3.7 ## 50 3.3 ## 51 3.2 ## 52 3.2 ## 53 3.1 ## 54 2.3 ## 55 2.8 ## 56 2.8 ## 57 3.3 ## 58 2.4 ## 59 2.9 ## 60 2.7 ## 61 2.0 ## 62 3.0 ## 63 2.2 ## 64 2.9 ## 65 2.9 ## 66 3.1 ## 67 3.0 ## 68 2.7 ## 69 2.2 ## 70 2.5 ## 71 3.2 ## 72 2.8 ## 73 2.5 ## 74 2.8 ## 75 2.9 ## 76 3.0 ## 77 2.8 ## 78 3.0 ## 79 2.9 ## 80 2.6 ## 81 2.4 ## 82 2.4 ## 83 2.7 ## 84 2.7 ## 85 3.0 ## 86 3.4 ## 87 3.1 ## 88 2.3 ## 89 3.0 ## 90 2.5 ## 91 2.6 ## 92 3.0 ## 93 2.6 ## 94 2.3 ## 95 2.7 ## 96 3.0 ## 97 2.9 ## 98 2.9 ## 99 2.5 ## 100 2.8 ## 101 3.3 ## 102 2.7 ## 103 3.0 ## 104 2.9 ## 105 3.0 ## 106 3.0 ## 107 2.5 ## 108 2.9 ## 109 2.5 ## 110 3.6 ## 111 3.2 ## 112 2.7 ## 113 3.0 ## 114 2.5 ## 115 2.8 ## 116 3.2 ## 117 3.0 ## 118 3.8 ## 119 2.6 ## 120 2.2 ## 121 3.2 ## 122 2.8 ## 123 2.8 ## 124 2.7 ## 125 3.3 ## 126 3.2 ## 127 2.8 ## 128 3.0 ## 129 2.8 ## 130 3.0 ## 131 2.8 ## 132 3.8 ## 133 2.8 ## 134 2.8 ## 135 2.6 ## 136 3.0 ## 137 3.4 ## 138 3.1 ## 139 3.0 ## 140 3.1 ## 141 3.1 ## 142 3.1 ## 143 2.7 ## 144 3.2 ## 145 3.3 ## 146 3.0 ## 147 2.5 ## 148 3.0 ## 149 3.4 ## 150 3.0
iris[[2]]
## [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 ## [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 ## [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3 ## [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8 ## [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5 ## [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9 ## [109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2 ## [127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 ## [145] 3.3 3.0 2.5 3.0 3.4 3.0
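Because a data frame is stored as a list, the list access rules carry over: [ ] returns a smaller data frame and [[ ]] returns the column itself. A brief check, using illustrative object names:

```r
one_col_df  <- iris[2]     # single brackets: still a data frame with one column
one_col_vec <- iris[[2]]   # double brackets: the numeric vector itself

class(one_col_df)
class(one_col_vec)

# three equivalent ways to pull the same column out as a vector
same_as_dollar <- identical(one_col_vec, iris$Sepal.Width)
same_as_comma  <- identical(one_col_vec, iris[, "Sepal.Width"])
same_as_dollar
same_as_comma
```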
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1d | Atomic Vector | List |
2d | Matrix | Data Frame |
Atomic Vector: x[ ]
Matrix: x[ , ]
Data Frame: x[ , ] or x$name
List: x[ ], x[[ ]], or x$name
Plan:
How to read in data depends on raw/external data type!
Delimited data: a delimiter is a character (such as ,) that separates data entries
Comma: usually .csv
Space: usually .txt or .dat
Tab: usually .tsv or .txt
General: usually .txt or .dat
A few packages are loaded when R starts
The utils package has a family of read. functions ready for use!
Functions from the read. family work well
Concerns:
poor default function behavior
(formerly, prior to R 4.0) strings are read as factors
row & column names can be troublesome
(Slightly) different behavior on different computers
Want to have most of our functions we use ‘feel’ the same…
install.packages("tidyverse") #installs readr, dplyr, and the other tidyverse packages
Only install once!
Each session: load the package using library() or require()
library("tidyverse")
Can call functions without loading full library with ::
If not specified, the most recently loaded package takes precedence
#stats::filter(...) calls time-series function from stats package
dplyr::filter(iris, Species == "virginica")
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 6.3 3.3 6.0 2.5 virginica ## 2 5.8 2.7 5.1 1.9 virginica ## 3 7.1 3.0 5.9 2.1 virginica ## 4 6.3 2.9 5.6 1.8 virginica ## 5 6.5 3.0 5.8 2.2 virginica ## 6 7.6 3.0 6.6 2.1 virginica ## 7 4.9 2.5 4.5 1.7 virginica ## 8 7.3 2.9 6.3 1.8 virginica ## 9 6.7 2.5 5.8 1.8 virginica ## 10 7.2 3.6 6.1 2.5 virginica ## 11 6.5 3.2 5.1 2.0 virginica ## 12 6.4 2.7 5.3 1.9 virginica ## 13 6.8 3.0 5.5 2.1 virginica ## 14 5.7 2.5 5.0 2.0 virginica ## 15 5.8 2.8 5.1 2.4 virginica ## 16 6.4 3.2 5.3 2.3 virginica ## 17 6.5 3.0 5.5 1.8 virginica ## 18 7.7 3.8 6.7 2.2 virginica ## 19 7.7 2.6 6.9 2.3 virginica ## 20 6.0 2.2 5.0 1.5 virginica ## 21 6.9 3.2 5.7 2.3 virginica ## 22 5.6 2.8 4.9 2.0 virginica ## 23 7.7 2.8 6.7 2.0 virginica ## 24 6.3 2.7 4.9 1.8 virginica ## 25 6.7 3.3 5.7 2.1 virginica ## 26 7.2 3.2 6.0 1.8 virginica ## 27 6.2 2.8 4.8 1.8 virginica ## 28 6.1 3.0 4.9 1.8 virginica ## 29 6.4 2.8 5.6 2.1 virginica ## 30 7.2 3.0 5.8 1.6 virginica ## 31 7.4 2.8 6.1 1.9 virginica ## 32 7.9 3.8 6.4 2.0 virginica ## 33 6.4 2.8 5.6 2.2 virginica ## 34 6.3 2.8 5.1 1.5 virginica ## 35 6.1 2.6 5.6 1.4 virginica ## 36 7.7 3.0 6.1 2.3 virginica ## 37 6.3 3.4 5.6 2.4 virginica ## 38 6.4 3.1 5.5 1.8 virginica ## 39 6.0 3.0 4.8 1.8 virginica ## 40 6.9 3.1 5.4 2.1 virginica ## 41 6.7 3.1 5.6 2.4 virginica ## 42 6.9 3.1 5.1 2.3 virginica ## 43 5.8 2.7 5.1 1.9 virginica ## 44 6.8 3.2 5.9 2.3 virginica ## 45 6.7 3.3 5.7 2.5 virginica ## 46 6.7 3.0 5.2 2.3 virginica ## 47 6.3 2.5 5.0 1.9 virginica ## 48 6.5 3.0 5.2 2.0 virginica ## 49 6.2 3.4 5.4 2.3 virginica ## 50 5.9 3.0 5.1 1.8 virginica
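The effect of :: can be tried without installing anything. In this sketch a function defined in the workspace stands in for a second package that masks utils::head(); the replacement function is purely hypothetical and is removed again at the end.

```r
# a user-defined head() masks the version from the utils package
head <- function(x, ...) "my own head()"

masked   <- head(cars)                # finds the user-defined function first
explicit <- utils::head(cars, n = 2)  # :: asks for the utils version by name

# remove the user-defined function so head() means utils::head() again
rm(head)
restored <- head(cars, n = 2)

masked
explicit
```

The same pattern applies to dplyr::filter() and stats::filter(): naming the package removes any doubt about which function runs.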
Base R and tidyverse (the readr package does the heavy lifting) functions and their purpose:
Type of Delimiter | utils Function | readr Function |
---|---|---|
Comma | read.csv() | read_csv() |
Semicolon (, for decimal) | read.csv2() | read_csv2() |
Tab | read.delim() | read_tsv() |
General | read.table(sep = "") | read_delim() |
White Space | read.table(sep = "") | read_table() read_table2() |
Let’s read in the ‘neuralgia.csv’ file
By default, R looks in the working directory for the file
getwd()
## [1] "C:/repos/Basics-of-R-for-Data-Science-and-Statistics/slides"
setwd("C:/Users/jbpost2/repos/Basics-of-R-for-Data-Science-and-Statistics/datasets")
#or
setwd("C:\\Users\\jbpost2\\repos\\camp\\Basics-of-R-for-Data-Science-and-Statistics\\datasets")
#better to use R projects!
With the neuralgia.csv file in the working directory:
neuralgiaData <- read_csv("neuralgia.csv")
neuralgiaData
## # A tibble: 60 x 5
##   Treatment Sex     Age Duration Pain 
##   <chr>     <chr> <dbl>    <dbl> <chr>
## 1 P         F        68        1 No   
## 2 B         M        74       16 No   
## 3 P         F        67       30 No   
## 4 P         M        66       26 Yes  
## 5 B         F        67       28 No   
## # ... with 55 more rows
Or use the full path:
neuralgiaData <- read_csv(
  "C:/Users/jbpost2/repos/Basics-of-R-for-Data-Science-and-Statistics/datasets/neuralgia.csv"
)
Or use a relative path (../ moves up one folder):
neuralgiaData <- read_csv("../datasets/neuralgia.csv")
Working directory: “…/Basics-of-R-for-Data-Science-and-Statistics/slides”
File location: “…/Basics-of-R-for-Data-Science-and-Statistics/datasets/neuralgia.csv”
As long as others have the same folder structure, you can share code with no path change needed!
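The relative-path idea can be tried safely in a temporary folder. The sketch below builds a throwaway slides/datasets layout, then reads a file with ../ from inside slides. The folder name demo-project and the file cars3.csv are made up for illustration, and base R's read.csv() is used so that no extra package is needed.

```r
# build a throwaway project: <temp>/demo-project/slides and .../datasets
root <- file.path(tempdir(), "demo-project")
dir.create(file.path(root, "slides"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "datasets"), showWarnings = FALSE)

# save a tiny csv file into the datasets folder
write.csv(head(cars, 3), file.path(root, "datasets", "cars3.csv"),
          row.names = FALSE)

# pretend the analysis lives in 'slides'; setwd() returns the old directory
old_wd <- setwd(file.path(root, "slides"))

# ../ steps up to demo-project, then the path goes into datasets
dat <- read.csv("../datasets/cars3.csv")

# put the working directory back the way it was
setwd(old_wd)
dat
```

The read.csv() line never mentions where demo-project lives on disk, so the same line works on any machine that has the same two folders side by side.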
Often have many files associated with an analysis
With multiple analyses things get cluttered…
Want to associate different
environments
histories
working directories
source documents
with each analysis
Place all files for that analysis in that directory
Swap between projects using menu in top right
Back to reading in data!
R can pull from URLs as well!
neuralgiaData <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/neuralgia.csv")
neuralgiaData
## # A tibble: 60 x 5
##   Treatment Sex     Age Duration Pain 
##   <chr>     <chr> <dbl>    <dbl> <chr>
## 1 P         F        68        1 No   
## 2 B         M        74       16 No   
## 3 P         F        67       30 No   
## 4 P         M        66       26 Yes  
## 5 B         F        67       28 No   
## # ... with 55 more rows
tibbles
Notice: fancy printing!
Checking column type is a basic data validation step
tidyverse data frames are called tibbles
class(neuralgiaData)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Tibbles behave slightly differently than a standard data frame. No simplification!
neuralgiaData2 <- as.data.frame(neuralgiaData)
neuralgiaData2[,1]
##  [1] "P" "B" "P" "P" "B" "B" "A" "B" "B" "A" "A" "A" "B" "A" "P" "A" "P" "A" "P"
## [20] "B" "B" "A" "A" "A" "B" "P" "B" "B" "P" "P" "A" "A" "B" "B" "B" "A" "P" "B"
## [39] "B" "P" "P" "P" "A" "B" "A" "P" "P" "A" "B" "P" "P" "P" "B" "A" "P" "A" "P"
## [58] "A" "B" "A"
neuralgiaData[,1]
## # A tibble: 60 x 1
##   Treatment
##   <chr>    
## 1 P        
## 2 B        
## 3 P        
## 4 P        
## 5 B        
## # ... with 55 more rows
To return a single column of a tibble as a vector, use either dplyr::pull() or $
pull(neuralgiaData, Treatment) #or pull(neuralgiaData, 1)
##  [1] "P" "B" "P" "P" "B" "B" "A" "B" "B" "A" "A" "A" "B" "A" "P" "A" "P" "A" "P"
## [20] "B" "B" "A" "A" "A" "B" "P" "B" "B" "P" "P" "A" "A" "B" "B" "B" "A" "P" "B"
## [39] "B" "P" "P" "P" "A" "B" "A" "P" "P" "A" "B" "P" "P" "P" "B" "A" "P" "A" "P"
## [58] "A" "B" "A"
neuralgiaData$Treatment
##  [1] "P" "B" "P" "P" "B" "B" "A" "B" "B" "A" "A" "A" "B" "A" "P" "A" "P" "A" "P"
## [20] "B" "B" "A" "A" "A" "B" "P" "B" "B" "P" "P" "A" "A" "B" "B" "B" "A" "P" "B"
## [39] "B" "P" "P" "P" "A" "B" "A" "P" "P" "A" "B" "P" "P" "P" "B" "A" "P" "A" "P"
## [58] "A" "B" "A"
Reading clean delimited data is pretty easy with the tidyverse!
Let’s read in the ‘chemical.txt’ file (space delimited)
read_table2() allows multiple white space characters between entries
read_table2("https://www4.stat.ncsu.edu/~online/datasets/chemical.txt")
## # A tibble: 19 x 4
##     temp  conc  time percent
##    <dbl> <dbl> <dbl>   <dbl>
##  1  -1    -1    -1      45.9
##  2   1    -1    -1      60.6
##  3  -1     1    -1      57.5
##  4   1     1    -1      58.6
##  5  -1    -1     1      53.3
##  6   1    -1     1      58  
##  7  -1     1     1      58.8
##  8   1     1     1      52.4
##  9  -2     0     0      46.9
## 10   2     0     0      55.4
## 11   0    -2     0      55  
## 12   0     2     0      57.5
## 13   0     0    -2      56.3
## 14   0     0     2      58.9
## 15   0     0     0      56.9
## 16   2    -3     0      61.1
## 17   2    -3     0      62.9
## 18  -1.4   2.6   0.7    60  
## 19  -1.4   2.6   0.7    60.6
Let’s read in the ‘crabs.txt’ file (tab delimited)
read_tsv("https://www4.stat.ncsu.edu/~online/datasets/crabs.txt")
## # A tibble: 173 x 6
##   color spine width satell weight     y
##   <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
## 1     3     3  28.3      8   3050     1
## 2     4     3  22.5      0   1550     0
## 3     2     1  26        9   2300     1
## 4     4     3  24.8      0   2100     0
## 5     4     3  26        4   2600     1
## # ... with 168 more rows
Let’s read in the ‘umps2012.txt’ file (‘>’ delimited)
In raw data, no column names provided
read_delim(
file,
delim,
col_names = TRUE,
col_types = NULL,
na = c("", "NA"),
skip = 0,
guess_max = min(1000, n_max), ...
)
read_delim("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt",
           delim = ">",
           col_names = c("Year", "Month", "Day", "Home", "Away", "HPUmpire"))
## # A tibble: 2,359 x 6
##    Year Month   Day Home  Away  HPUmpire        
##   <dbl> <dbl> <dbl> <chr> <chr> <chr>           
## 1  2012     4    12 MIN   LAA   D.J. Reyburn    
## 2  2012     4    12 SD    ARI   Marty Foster    
## 3  2012     4    12 WSH   CIN   Mike Everitt    
## 4  2012     4    12 PHI   MIA   Jeff Nelson     
## 5  2012     4    12 CHC   MIL   Fieldin Culbreth
## # ... with 2,354 more rows
How do readr functions determine the column types? From the help:
col_types: One of NULL, a cols() specification, or a string. See vignette("readr") for more details.
If NULL, all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself.
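When the guessed types are not the ones wanted, col_types can be supplied directly. This is a minimal sketch that assumes the readr package is installed; the tiny file and its column names (id, value) are invented for the example and written to a temporary location.

```r
library(readr)

# write a small comma-delimited file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,4.5", "2,6"), tmp)

# default: column types are guessed from the data
guessed <- read_csv(tmp)

# explicit: state the type of each column with a cols() specification
typed <- read_csv(
  tmp,
  col_types = cols(
    id    = col_character(),  # keep the identifier as text
    value = col_double()
  )
)

sapply(guessed, class)
sapply(typed, class)
```

Stating the types up front also acts as a check: if a column cannot be parsed as requested, readr reports parsing problems instead of silently choosing another type.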
skip = 0
col_names = TRUE
na = c("", "NA")
regular expressions :(

Type of file | Package | Function |
---|---|---|
Delimited | readr | read_csv(), read_tsv(), read_table(), read_delim() |
Excel (.xls, .xlsx) | readxl | read_excel() |
SAS (.sas7bdat) | haven | read_sas() |
SPSS (.sav) | haven | read_spss() |
Read in censusEd.xlsx
Use read_excel() from the readxl package!
Reads both xls and xlsx files
Detects format from extension given
Can’t pull from web though!
#install package if necessary
library(readxl)
#reads first sheet by default
edData <- read_excel("../datasets/censusEd.xlsx")
edData
## # A tibble: 3,198 x 42 ## Area_name STCOU EDU010187F EDU010187D EDU010187N1 EDU010187N2 EDU010188F ## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> ## 1 UNITED STATES 00000 0 40024299 0000 0000 0 ## 2 ALABAMA 01000 0 733735 0000 0000 0 ## 3 Autauga, AL 01001 0 6829 0000 0000 0 ## 4 Baldwin, AL 01003 0 16417 0000 0000 0 ## 5 Barbour, AL 01005 0 5071 0000 0000 0 ## # ... with 3,193 more rows, and 35 more variables: EDU010188D <dbl>, ## # EDU010188N1 <chr>, EDU010188N2 <chr>, EDU010189F <dbl>, EDU010189D <dbl>, ## # EDU010189N1 <chr>, EDU010189N2 <chr>, EDU010190F <dbl>, EDU010190D <dbl>, ## # EDU010190N1 <chr>, EDU010190N2 <chr>, EDU010191F <dbl>, EDU010191D <dbl>, ## # EDU010191N1 <chr>, EDU010191N2 <chr>, EDU010192F <dbl>, EDU010192D <dbl>, ## # EDU010192N1 <chr>, EDU010192N2 <chr>, EDU010193F <dbl>, EDU010193D <dbl>, ## # EDU010193N1 <chr>, EDU010193N2 <chr>, EDU010194F <dbl>, EDU010194D <dbl>, ## # EDU010194N1 <chr>, EDU010194N2 <chr>, EDU010195F <dbl>, EDU010195D <dbl>, ## # EDU010195N1 <chr>, EDU010195N2 <chr>, EDU010196F <dbl>, EDU010196D <dbl>, ## # EDU010196N1 <chr>, EDU010196N2 <chr>
excel_sheets("../datasets/censusEd.xlsx")
##  [1] "EDU01A" "EDU01B" "EDU01C" "EDU01D" "EDU01E" "EDU01F" "EDU01G" "EDU01H"
##  [9] "EDU01I" "EDU01J"
Specify the sheet with a name or integer (NULL for 1st) using sheet =
read_excel("../datasets/censusEd.xlsx", sheet = "EDU01D")
SAS data has extension ‘.sas7bdat’
Read in smoke2003.sas7bdat
Use read_sas() from the haven package
Not many options!
#install if necessary
library(haven)
smokeData <- read_sas("https://www4.stat.ncsu.edu/~online/datasets/smoke2003.sas7bdat")
smokeData
## # A tibble: 443 x 54 ## SEQN SDDSRVYR RIDSTATR RIDEXMON RIAGENDR RIDAGEYR RIDAGEMN RIDAGEEX RIDRETH1 ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21010 3 2 2 2 52 633 634 3 ## 2 21012 3 2 2 1 63 765 766 4 ## 3 21048 3 2 1 2 42 504 504 1 ## 4 21084 3 2 1 2 57 692 693 3 ## 5 21093 3 2 1 2 64 778 778 2 ## # ... with 438 more rows, and 45 more variables: RIDRETH2 <dbl>, ## # DMQMILIT <dbl>, DMDBORN <dbl>, DMDCITZN <dbl>, DMDYRSUS <dbl>, ## # DMDEDUC3 <dbl>, DMDEDUC2 <dbl>, DMDEDUC <dbl>, DMDSCHOL <dbl>, ## # DMDMARTL <dbl>, DMDHHSIZ <dbl>, INDHHINC <dbl>, INDFMINC <dbl>, ## # INDFMPIR <dbl>, RIDEXPRG <dbl>, DMDHRGND <dbl>, DMDHRAGE <dbl>, ## # DMDHRBRN <dbl>, DMDHREDU <dbl>, DMDHRMAR <dbl>, DMDHSEDU <dbl>, ## # SIALANG <dbl>, SIAPROXY <dbl>, SIAINTRP <dbl>, FIALANG <dbl>, ## # FIAPROXY <dbl>, FIAINTRP <dbl>, MIALANG <dbl>, MIAPROXY <dbl>, ## # MIAINTRP <dbl>, AIALANG <dbl>, WTINT2YR <dbl>, WTMEC2YR <dbl>, ## # SDMVPSU <dbl>, SDMVSTRA <dbl>, Gender <dbl>, Age <dbl>, IncomeGroup <chr>, ## # Ethnicity <chr>, Education <dbl>, SMD070 <dbl>, SMQ077 <dbl>, SMD650 <dbl>, ## # PacksPerDay <dbl>, lbdvid <dbl>
SPSS data has extension “.sav”
Read in bodyFat.sav
Use read_spss() from the haven package
Not many options!
bodyFatData <- read_spss("https://www4.stat.ncsu.edu/~online/datasets/bodyFat.sav")
bodyFatData
## # A tibble: 20 x 4 ## y x1 x2 x3 ## <dbl> <dbl> <dbl> <dbl> ## 1 19.5 43.1 29.1 11.9 ## 2 24.7 49.8 28.2 22.8 ## 3 30.7 51.9 37 18.7 ## 4 29.8 54.3 31.1 20.1 ## 5 19.1 42.2 30.9 12.9 ## 6 25.6 53.9 23.7 21.7 ## 7 31.4 58.5 27.6 27.1 ## 8 27.9 52.1 30.6 25.4 ## 9 22.1 49.9 23.2 21.3 ## 10 25.5 53.5 24.8 19.3 ## 11 31.1 56.6 30 25.4 ## 12 30.4 56.7 28.3 27.2 ## 13 18.7 46.5 23 11.7 ## 14 19.7 44.2 28.6 17.8 ## 15 14.6 42.7 21.3 12.8 ## 16 29.5 54.4 30.1 23.9 ## 17 27.7 55.3 25.7 22.6 ## 18 30.2 58.6 24.6 25.4 ## 19 22.7 48.2 27.1 14.8 ## 20 25.2 51 27.5 21.1
tidyverse
Notice the ease of use of the functions across the tidyverse so far:
function_name('path-to-file', options)
All functions read the data into a tibble
Good defaults that do the work for you
JSON - JavaScript Object Notation
Used widely across the internet and databases
Can represent usual 2D data or hierarchical data
[
  {
    "name": "Barry Sanders",
    "games": 153,
    "position": "RB"
  },
  {
    "name": "Joe Montana",
    "games": 192,
    "position": "QB"
  }
]
Three major R packages
rjson
RJSONIO
jsonlite: many nice features, a little slower implementation
tidyjson - new tidyverse package
jsonlite package basic functions:
Function | Description |
---|---|
fromJSON | Reads JSON data from file path or character string. Converts and simplifies to R object |
toJSON | Writes R object to JSON object |
stream_in | Accepts a file connection - can read streaming JSON data |
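A quick round trip shows what "converts and simplifies" means in practice. This sketch assumes the jsonlite package is installed and uses a small JSON array of two players typed in as a character string rather than read from a file.

```r
library(jsonlite)

# a JSON array of two records, stored as a character string
json_txt <- '[
  {"name": "Barry Sanders", "games": 153, "position": "RB"},
  {"name": "Joe Montana",   "games": 192, "position": "QB"}
]'

# fromJSON() simplifies an array of similar records into a data frame
players <- fromJSON(json_txt)
players
class(players)

# toJSON() goes the other way, from an R object back to JSON text
json_again <- toJSON(players, pretty = TRUE)
json_again
```

Records with matching fields become rows of a data frame; more deeply nested JSON generally comes back as a list instead.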
A defined method for asking for information from a computer
Useful for getting data
Useful for allowing others to run your model without a GUI (like Shiny)
Registered for a key at newsapi.org. An API for looking at news articles
Look at documentation for API (most have this!)
Example URL to obtain data is given
https://newsapi.org/v2/everything?q=bitcoin&apiKey=myKeyGoesHere
https://newsapi.org/v2/everything?q=bitcoin&from=2021-06-01&apiKey=myKeyGoesHere
Use GET from the httr package (make sure to load package!)
Modify for what you have interest in!
library(httr)
myData <- GET("http://newsapi.org/v2/everything?qInTitle=Juneteenth&from=2021-06-01&language=en&apiKey=myKeyGoesHere")
content
str(myData, max.level = 1)
## List of 10
##  $ url        : chr "http://newsapi.org/v2/everything?qInTitle=tesla&from=2021-06-01&language=en&pageSize=100&apiKey=aa4b545bfbd64d4"| __truncated__
##  $ status_code: int 426
##  $ headers    :List of 13
##   ..- attr(*, "class")= chr [1:2] "insensitive" "list"
##  $ all_headers:List of 1
##  $ cookies    :'data.frame': 0 obs. of  7 variables:
##  $ content    : raw [1:255] 7b 22 73 74 ...
##  $ date       : POSIXct[1:1], format: "2021-08-07 05:38:26"
##  $ times      : Named num [1:6] 0 0.0363 0.0594 0.0595 0.1199 ...
##   ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
##  $ request    :List of 7
##   ..- attr(*, "class")= chr "request"
##  $ handle     :Class 'curl_handle' <externalptr> 
##  - attr(*, "class")= chr "response"
Common steps: convert the raw content to characters, then parse it with fromJSON from the jsonlite package
library(dplyr)
library(jsonlite)
parsed <- fromJSON(rawToChar(myData$content))
str(parsed, max.level = 1)
## List of 3
##  $ status : chr "error"
##  $ code   : chr "parameterInvalid"
##  $ message: chr "You are trying to request results too far in the past. Your plan permits you to request articles as far back as"| __truncated__
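Long request URLs are easy to mistype, so one option is to let httr assemble them from named pieces. The sketch below assumes the httr and jsonlite packages are installed; get_articles() is a hypothetical helper written for this example (it is defined but not called here, since calling it needs a real API key and a network connection).

```r
library(httr)

# endpoint and query parameters taken from the example URLs above
base_url   <- "https://newsapi.org/v2/everything"
query_args <- list(q = "bitcoin", from = "2021-06-01", apiKey = "myKeyGoesHere")

# modify_url() glues the pieces together and handles the ? and & separators
full_url <- modify_url(base_url, query = query_args)
full_url

# hypothetical helper: request, stop on an HTTP error, then parse the JSON body
get_articles <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # turns error status codes into an R error
  jsonlite::fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}
```

Checking the status before parsing matters: the response shown above has an error status, and its parsed content holds an error message rather than articles.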
Access in R
Article here discusses accessing APIs generically with R
Same website gives a list of APIs
Databases
Many common database management systems
Oracle
SQL Server - Microsoft product
DB2 - IBM product
MySQL (open source) - Not as many features but popular
PostgreSQL (open source)
Basic SQL language constant across all - features differ
Connect with DBI::dbConnect(), using a back-end such as:
RSQLite::SQLite() for RSQLite
RMySQL::MySQL() for RMySQL
RPostgreSQL::PostgreSQL() for RPostgreSQL
odbc::odbc() for Open Database Connectivity
bigrquery::bigquery() for Google’s BigQuery
con <- DBI::dbConnect(RMySQL::MySQL(),
  host = "hostname.website",
  user = "username",
  password = rstudioapi::askForPassword("DB password")
)
Use tbl() to reference a table in the database
tbl(con, "name_of_table")
Query with SQL or dplyr/dbplyr (we’ll learn dplyr soon!)
Disconnect with dbDisconnect()
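The connect, query, disconnect cycle can be practiced without a database server by using an in-memory SQLite database. This sketch assumes the DBI and RSQLite packages are installed; it sends plain SQL through DBI instead of using tbl(), and the table name cars is simply the built-in data set copied into the database.

```r
library(DBI)

# connect: ":memory:" creates a temporary SQLite database that lives in RAM
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# copy the built-in cars data frame into the database as a table
dbWriteTable(con, "cars", cars)
dbListTables(con)

# query with SQL; the result comes back as a regular data frame
fast <- dbGetQuery(con, "SELECT speed, dist FROM cars WHERE speed > 20")
fast

# disconnect when finished
dbDisconnect(con)
```

Swapping RSQLite::SQLite() for another back-end from the list above changes where the data lives, while the surrounding DBI calls stay the same.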
Type of file | Package | Function |
---|---|---|
Delimited | readr | read_csv(), read_tsv(), read_table(), read_delim() |
Excel (.xls, .xlsx) | readxl | read_excel() |
SAS (.sas7bdat) | haven | read_sas() |
SPSS (.sav) | haven | read_spss() |