Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
Base R Data Structures: Matrices
Common Data Structures
A data scientist needs to deal with data! We need to have a firm foundation in the ways that we can store our data in R
. This section goes through the most commonly used ‘built-in’ R
objects that we’ll use.
There are five major data structures used in
R
- Atomic Vector (1d)
- Matrix (2d)
- Array (nd)
- Data Frame (2d)
- List (1d)
Dimension | Homogeneous (elements all the same) | Heterogeneous (elements may differ) |
---|---|---|
1d | Atomic Vector | List |
2d | Matrix | Data Frame |
nd | Array |
Matrix
Vectors are useful to know about but generally not great for dealing with a dataset. When we think of a dataset, we often think about a spreadsheet having rows corresponding to observations and columns corresponding to variables.
Here the observations correspond to measurements on flowers. The variables that were measured on each flower are the columns.
We can see that a vector is not really useful for handling this kind of ‘standard’ data since it is 1D. However, we can think of vectors as the building blocks for more complicated types of data.
As matrices are homogeneous (all elements must be the same type), we can think of a matrix as a collection of vectors of the same type and length (although this is not formally how it works in R
). Think of each column of the matrix as a vector.
Creating a Matrix
- We can create a matrix using the
matrix()
function. Looking at the help formatrix
we see the function definition as:
matrix(data = NA,
nrow = 1,
ncol = 1,
byrow = FALSE,
dimnames = NULL)
where data
is
an optional data vector (including a list or expression vector). Non-atomic classed R objects are coerced by as.vector and all attributes discarded.
Ok, so really we need to supply a vector of data. Then the nrow
and ncol
arguments define how many rows and columns. The byrow
argument is a Boolean telling R
whether or not to fill in the matrix by row or by column (the default). dimnames
is an optional attribute. We’ll look at that shortly.
A basic call to matrix()
might look like this:
<- matrix(c(1, 3, 4, -1, 5, 6),
my_mat nrow = 3,
ncol = 2)
my_mat
[,1] [,2]
[1,] 1 -1
[2,] 3 5
[3,] 4 6
Notice how the values fill in the first column then the second column. This is the default behavior but can be changed via the byrow
argument.
<- matrix(c(1, 3, 4, -1, 5, 6),
my_mat nrow = 3,
ncol = 2,
byrow = TRUE)
my_mat
[,1] [,2]
[1,] 1 3
[2,] 4 -1
[3,] 5 6
Let’s think of a matrix as a collection of vectors of the same type and length (how you might think of a dataset). We could construct the matrix by giving appropriate vectors of the same type and length.
#populate vectors
<- rep(0.2, times = 6)
x <- c(1, 3, 4, -1, 5, 6) y
- Check they are the same
type
of element. In this case, as long as they are bothnumeric
, we can put them together and not lose any precision (integers
are coerced todoubles
if needed).
str(x)
num [1:6] 0.2 0.2 0.2 0.2 0.2 0.2
is.numeric(x)
[1] TRUE
str(y)
num [1:6] 1 3 4 -1 5 6
is.numeric(y)
[1] TRUE
- A
length()
function exists. We can use that to see the vectors have the same number of elements.
#check 'length'
length(x)
[1] 6
length(y)
[1] 6
- Now construct the matrix by combining the vectors together with
c()
.
<- matrix(c(x, y), ncol = 2)
my_mat2 my_mat2
[,1] [,2]
[1,] 0.2 1
[2,] 0.2 3
[3,] 0.2 4
[4,] 0.2 -1
[5,] 0.2 5
[6,] 0.2 6
As byrow = FALSE
is the default, the vectors become the columns!
That being said, the way to really think about the matrix is as a long vector we are giving dimensions to. In fact, if we coerce our matrix to a vector explicitly (via the as.vector()
function), we see the original data vector we passed in.
as.vector(my_mat2)
[1] 0.2 0.2 0.2 0.2 0.2 0.2 1.0 3.0 4.0 -1.0 5.0 6.0
- Matrices don’t have to be made up of numeric data. As long as the elements are all the same type, we are good to go!
<- c("Hi", "There", "!")
x <- c("a", "b", "c")
y <- c("One", "Two", "Three")
z matrix(c(x, y, z), nrow = 3)
[,1] [,2] [,3]
[1,] "Hi" "a" "One"
[2,] "There" "b" "Two"
[3,] "!" "c" "Three"
Notice that we don’t need to specify the number of columns. Generally, we need to only specify the
nrow
orncol
argument andR
figures the rest out!We do need to be careful about how
R
recycles things:
matrix(c(x, y, z), ncol = 2)
Warning in matrix(c(x, y, z), ncol = 2): data length [9] is not a sub-multiple
or multiple of the number of rows [5]
[,1] [,2]
[1,] "Hi" "c"
[2,] "There" "One"
[3,] "!" "Two"
[4,] "a" "Three"
[5,] "b" "Hi"
We were 1 element short of being able to fill a matrix with 2 columns and 5 rows (5 chosen to make sure all the data was included).
R
recycles the first element we passed to fill in the last value. This is a common thing thatR
does. It can be useful for shorthanding things but otherwise is just something we need to be aware of as a common error to be fixed!Let’s shorthand create a matrix of all 0’s.
matrix(0, nrow = 2, ncol = 2)
[,1] [,2]
[1,] 0 0
[2,] 0 0
Matrix Attributes
Similar to a vector, a matrix can have attributes (this is true of any R object!). The common attributes associated with a matrix are the dim
or dimensions and the dimnames
or dimension names.
<- as.matrix(iris[, 1:4])
my_iris head(my_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 5.1 3.5 1.4 0.2
[2,] 4.9 3.0 1.4 0.2
[3,] 4.7 3.2 1.3 0.2
[4,] 4.6 3.1 1.5 0.2
[5,] 5.0 3.6 1.4 0.2
[6,] 5.4 3.9 1.7 0.4
str(my_iris)
num [1:150, 1:4] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
attributes(my_iris)
$dim
[1] 150 4
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
We can access these attributes through functions designed for them:
dim(my_iris)
[1] 150 4
dimnames(my_iris)
[[1]]
NULL
[[2]]
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
Here we can name the rows and the columns of a matrix by assigning a list()
to the dimnames
attribute. (Lists are covered shortly.)
dimnames(my_iris) <- list(
1:150, #first list element is a vector for the row names
c("Sepal Length", "Sepal Width", "Petal Length", "Petal Width") #second element is a vector for column names
)head(my_iris)
Sepal Length Sepal Width Petal Length Petal Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
We can assign these when we create a matrix!
<- matrix(c(runif(10),
my_mat3 rnorm(10),
rgamma(10, shape = 1, scale = 1)),
ncol = 3,
dimnames = list(1:10, c("Uniform", "Normal", "Gamma")))
my_mat3
Uniform Normal Gamma
1 0.06155295 -0.63483451 0.431251428
2 0.02017257 0.20022334 0.110951185
3 0.91711098 -0.49676038 1.666985085
4 0.42297205 0.01052053 0.804936244
5 0.92170640 1.36541927 0.059788077
6 0.04137455 -0.17187418 1.042745038
7 0.39598334 -0.61119093 0.552589417
8 0.64166641 0.69802283 0.002147421
9 0.35465603 2.49344389 0.614674258
10 0.87806080 -0.76914847 0.350452391
Accessing Elements of a Matrix
We saw that []
allowed us to access the elements of a vector. We’ll use the same syntax to get at elements of a matrix. However, we now have two dimensions! That means we need to specify which row elements and which column elements. This is done by using square brackets with a comma in between our indices.
- Notice the default row names and column names! This gives us hints on this syntax!
<- matrix(c(1:4, 20:17), ncol = 2)
mat mat
[,1] [,2]
[1,] 1 20
[2,] 2 19
[3,] 3 18
[4,] 4 17
2, 2] mat[
[1] 19
- If we leave an index blank, we get the entirety of that index back.
1] mat[ ,
[1] 1 2 3 4
2, ] mat[
[1] 2 19
2:4, 1] mat[
[1] 2 3 4
c(2, 4), ] mat[
[,1] [,2]
[1,] 2 19
[2,] 4 17
Notice that R
simplifies the result where possible. That is, returns an atomic vector if you have only 1 dimension and a matrix if two.
Also, if you only give a single value in the []
then R uses the count of the value in the matrix (essentially treating the matrix elements as a long vector). These counts go down columns first.
- If you do have
dimnames
associated, then you can access elements with those.
<- matrix(c(1:4, 20:17), ncol = 2,
mat dimnames = list(NULL,
c("First", "Second"))
) mat
First Second
[1,] 1 20
[2,] 2 19
[3,] 3 18
[4,] 4 17
"First"] mat[,
[1] 1 2 3 4
Arrays
Arrays are the n-dimensional extension of matrices. Like matrices, they must have all elements of the same type.
<- array(1:24, dim = c(4, 2, 3))
my_array my_array
, , 1
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
, , 2
[,1] [,2]
[1,] 9 13
[2,] 10 14
[3,] 11 15
[4,] 12 16
, , 3
[,1] [,2]
[1,] 17 21
[2,] 18 22
[3,] 19 23
[4,] 20 24
Accessing elements is similar to vectors and matrices!
1, 1, 1] my_array[
[1] 1
4, 2, 1] my_array[
[1] 8
Arrays are often used with images in deep learning.
Recap!
A 2D object where all elements are of the same type
Access elements via
[ , ]
Not ideal for data since all elements must be the same type (see data frames!)
Use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!