![Three columns of shapes are shown next to each other. There are four values in each column. The first column contains numbers only. The second column contains character strings only. The third column contains True and False values only. The values are called elements and have an ordering associated with them, both a row ordering and a column ordering. The ordering is denoted by [ row index, column index ]. Leaving a row or column index blank is associated with the values for an entire row or column, respectively.](img/dfVisualF.png)
Base R Data Structures: Data Frames
Common Data Structures
A data scientist needs to deal with data! We need to have a firm foundation in the ways that we can store our data in R. This section goes through the most commonly used ‘built-in’ R objects that we’ll use.
There are five major data structures used in
R- Atomic Vector (1d)
- Matrix (2d)
- Array (nd)
- Data Frame (2d)
- List (1d)
| Dimension | Homogeneous (elements all the same) | Heterogeneous (elements may differ) |
|---|---|---|
| 1d | Atomic Vector | List |
| 2d | Matrix | Data Frame |
| nd | Array |
Data Frame
We’ve now gone through the common data structures that are homogeneous. Next we’ll take on the data frame. These are 2D objects where the columns can be of different types!
Data Frames
A collection (list) of vectors of the same length
Perfect for most data sets!
Creating a Data Frame
We can create a data frame using the data.frame() function. From the help:
data.frame(..., row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = FALSE)
The help isn’t super clear on what the first argument ... should be. Essentially, this syntax allows us to pass vectors (each of the same length) as the columns of our data frame.
x <- c("a", "b", "c", "d", "e", "f")
y <- c(1, 3, 4, -1, 5, 6)
z <- 10:15
my_df <- data.frame(x, y, z)
my_df x y z
1 a 1 10
2 b 3 11
3 c 4 12
4 d -1 13
5 e 5 14
6 f 6 15
Data Frame Attributes
Data frames have one attribute of note: names.
str(my_df)'data.frame': 6 obs. of 3 variables:
$ x: chr "a" "b" "c" "d" ...
$ y: num 1 3 4 -1 5 6
$ z: int 10 11 12 13 14 15
attributes(my_df)$names
[1] "x" "y" "z"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5 6
- Our data frame originally inherited the names from the data objects passed in.
- If we didn’t pass in objects with names, we get some default nonsense.
data.frame(1:5, c("a", "b", "c", "d", "e")) X1.5 c..a....b....c....d....e..
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
- We can set the names explicitly when we create the data frame.
x <- c("a", "b", "c", "d", "e", "f")
y <- c(1, 3, 4, -1, 5, 6)
z <- 10:15
my_df <- data.frame(char = x, data1 = y, data2 = z)
my_df char data1 data2
1 a 1 10
2 b 3 11
3 c 4 12
4 d -1 13
5 e 5 14
6 f 6 15
data.frame(number = 1:5, letter = c("a", "b", "c", "d", "e")) number letter
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
Note: A syntactically valid name consists of letters, numbers and the dot or underline characters. It must start with a letter or a dot not followed by a number.
You shouldn’t use
.with names generally. Periods are often used for methods and other things in programming.You also should avoid reserved words:
ifelserepeatwhilefunctionforinnextbreakTRUEFALSENULLInfNaNNANA_integer_NA_real_NA_complex_NA_character_(type?make.namesinto the console for more details)
Accessing Elements of a Data Frame
Let’s check out the ‘built-in’ iris data frame
str(iris)'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- This is a 2D structure and we can access just like a matrix!
iris[1:4, 2:4] #returns a data frame Sepal.Width Petal.Length Petal.Width
1 3.5 1.4 0.2
2 3.0 1.4 0.2
3 3.2 1.3 0.2
4 3.1 1.5 0.2
iris[1, ] #returns a data frame Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
iris[1:10, 1] #returns a vector [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
- Note again that
Rsimplifies objects sometimes! The last return was simplified to a vector. - This behavior actually comes from using the
[function. Everything you do inRis a function call! - We can avoid the behavior by adding the
drop = FALSEargument
iris[1:10, 1, drop = FALSE] #return a data frame Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.6
8 5.0
9 4.4
10 4.9
- Or we can call the
[function as a prefix function (like our usual function calls)
`[`(iris, 1:10, 1, drop = FALSE) Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.6
8 5.0
9 4.4
10 4.9
- Usually data frames have meaningful column names. We can use these for subsetting
iris[1:5 , c("Sepal.Length", "Species")] Sepal.Length Species
1 5.1 setosa
2 4.9 setosa
3 4.7 setosa
4 4.6 setosa
5 5.0 setosa
- The most common way to access a single column is via the
$operator. This returns a vector.
iris$Sepal.Length [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
- RStudio fills in options for us!
- Type
iris$
- If no choices - hit tab
- Hit tab again to choose
- Type
Quick R Video
Please pop this video out and watch it in the full panopto player!
Recap!
Data Frame (2D data structure)
Collection (list) of vectors of the same length
Create with
data.frame()functionAccess with
[ , ]or$Perfect for most data sets!
Most functions that read 2D data store it as a
data frame(ortibble- a special data frame covered later)
Use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!