Another really common type of data is Excel data. This is data that has a file extension of .xls or .xlsx. We often want to pull data from different sheets within these files.
readxl Package
The readxl package is part of the tidyverse (not loaded by default) that has functionality for reading in this type of data!
However, these types of files cannot be pulled from a URL. Instead, we’ll need to download the files and provide a path to them.
Note: If you switch to have your chunk output in your console, the working directory used during the interactive modifying and submitting of code from your .qmd file will use your usual working directory for your R session. This can be annoying! When you render it will use the .qmd file’s location as the working directory.
Reading From a Particular Sheet
We might want to programmatically look at the sheets available in the excel document. This can be done with the excel_sheets() function.
excel_sheets("data/Dry_Bean_Dataset.xlsx")
[1] "Dry_Beans_Dataset" "Citation_Request"
We can pull in data from a specific sheet with the name or via integers (or NULL for 1st)
# A tibble: 0 x 1
# i 1 variable:
# Citation Request :
KOKLU, M. and OZKAN, I.A., (2020), “Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques.” Computers and Electronics in Agriculture, 174, 105507. DOI: https://doi.org/10.1016/j.compag.2020.105507 <lgl>
Notice that didn’t read in correctly! There is only one entry there (the 1st cell, 1st column) and it is currently being treated as the column name. Similar to the read_csv() function we can use col_names = FALSE here (thanks coherent ecosystem!!).
# A tibble: 1 x 1
...1
<chr>
1 "Citation Request :\r\nKOKLU, M. and OZKAN, I.A., (2020), “Multiclass Classif~
We can see there are some special characters in there (like line break). If we use cat() it will print that out nicely.
cat(dplyr::pull(citation_dry_bean_data, 1))
Citation Request :
KOKLU, M. and OZKAN, I.A., (2020), “Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques.” Computers and Electronics in Agriculture, 174, 105507. DOI: https://doi.org/10.1016/j.compag.2020.105507
Reading Only Specific Cells
Occasionally, we might want to read only some cells on a particular sheet. This can be done by specifying the range argument!
Cells must be in a contiguous range
dry_bean_range <-read_excel("data/Dry_Bean_Dataset.xlsx", range =cell_cols("A:B") )dry_bean_range
# A tibble: 13,611 x 2
Area Perimeter
<dbl> <dbl>
1 28395 610.
2 28734 638.
3 29380 624.
4 30008 646.
5 30140 620.
6 30279 635.
7 30477 670.
8 30519 630.
9 30685 636.
10 30834 632.
# i 13,601 more rows
Recap!
The read_xl package provides nice functionality for reading in excel type data.
As it is part of the tidyverse it reads the data into a tibble
Functionality to read in from different sheets or to read in particular ranges of data
Use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!