In the spirit of loops, vectorized functions give us a way to execute code on an entire ‘vector’ at once (although we can be a bit more general than just vectors). This tends to speed up computation in comparison to basic loops in R!
This is because loops are comparatively slow in R. R is an interpreted language, which means it does a lot of the work of figuring out what to do for you at run time. (Think about function dispatch: it looks at the type of object and figures out which version of plot() or summary() to use.) In a loop, R repeats that work on every iteration. A vectorized operation still runs a loop under the hood, but that loop runs in compiled code, and because all the elements of a vector have the same type, R only has to figure things out once rather than repeatedly!
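To make the idea concrete, here is a minimal sketch comparing an explicit loop with the equivalent vectorized expression. The vectors x and y and the result names are made up for illustration. Both approaches give the same answer, but the vectorized version hands the whole job to compiled code in a single call.

```r
# Toy data (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)

# Loop version: R interprets each iteration separately
loop_sum <- numeric(length(x))  # preallocate the result vector
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorized version: one call, the looping happens in compiled code
vec_sum <- x + y

# Both approaches produce the same values
identical(loop_sum, vec_sum)
```

On vectors this small the timing difference is negligible; the gap tends to show up as the data grows.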
Vectorized Functions for Common Numeric Summaries
There are some ‘built-in’ vectorized functions that are quite useful to apply to a 2D type object:
colMeans(), rowMeans()
colSums(), rowSums()
colSds(), colVars(), colMedians() (must install the matrixStats package to get these)
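As a quick illustration of the base R functions in this list, the sketch below applies colMeans() and rowSums() to a small made-up matrix (the object m and its column names are hypothetical, not part of the batting data).

```r
# A small made-up matrix: 3 rows by 2 columns (filled column by column)
m <- matrix(c(1, 2, 3,
              4, 5, 6),
            nrow = 3, ncol = 2,
            dimnames = list(NULL, c("a", "b")))

col_means <- colMeans(m)  # one mean per column: a = 2, b = 5
row_sums  <- rowSums(m)   # one sum per row: 5, 7, 9

col_means
row_sums
```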
Let’s go back to our batting dataset from the previous note set.
        G         AB          R          H        X2B        X3B         HR
50.740488 139.241320  18.483496  36.388605   6.202024   1.247075   2.850150
If we install the matrixStats package (download the files from the internet), we can then use the colMedians() function to obtain the column medians in a quick fashion.
#install.packages("matrixStats") #only run this once on your machine!
library(matrixStats)
Warning: package 'matrixStats' was built under R version 4.1.3
colMedians(my_batting[, 3:9])
Error in colMedians(my_batting[, 3:9]): Argument 'x' must be a matrix or a vector.
Ah, this function requires the object passed to be a matrix or vector (homogeneous). Although every column of the data frame we pass is numeric, the function doesn’t check for that. No worries, we can convert to a matrix using as.matrix(). (Similar to the is. family of functions, there is an as. family of functions, read as ‘as dot’.)
colMedians(as.matrix(my_batting[, 3:9]))
[1] 34 46 4 8 1 0 0
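The is. and as. families can be tried out on a small example. The sketch below uses a made-up data frame df (hypothetical, not the batting data) to show the check and the conversion. It also shows apply() with median() as a base R way to get column medians when matrixStats is not installed. Note that apply() is a convenience wrapper around a loop, so it should not be expected to match the speed of colMedians().

```r
# Made-up data frame whose columns are all numeric
df <- data.frame(a = c(1, 5, 3), b = c(10, 20, 60))

is.matrix(df)        # FALSE: a data frame is a list of columns, not a matrix
m <- as.matrix(df)   # convert; all-numeric columns give a numeric matrix
is.matrix(m)         # TRUE
is.numeric(m)        # TRUE

# Base R alternative to colMedians(): apply median() over columns (MARGIN = 2)
col_medians <- apply(m, 2, median)
col_medians
```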
Let’s compare the speed of this code to the speed of a for loop!
The microbenchmark package allows for easy recording of computing time.
We just wrap the code we want to benchmark in the microbenchmark() function.
This repeatedly executes the code and reports summary stats on how long it took.
Here we will grab all the numeric columns from the data.
Some columns contain NA or missing values. We’ll add na.rm = TRUE to both function calls to ignore those values (this is where the for loop actually struggles in this case!)
#install.packages("microbenchmark") #run only once on your machine!
library(microbenchmark)
my_numeric_batting <- Batting[, 6:22] #get all numeric columns
vectorized_results <- microbenchmark(
  colMeans(my_numeric_batting, na.rm = TRUE)
)
loop_results <- microbenchmark(
  for (i in 1:17) {
    mean(my_numeric_batting[, i], na.rm = TRUE)
  }
)
Compare computational time
vectorized_results
Unit: milliseconds
                                       expr    min     lq     mean  median      uq     max neval
 colMeans(my_numeric_batting, na.rm = TRUE) 3.7646 3.8297 4.716681 3.92185 4.29885 11.0034   100
loop_results
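Unit: milliseconds
                                                            expr     min      lq     mean  median      uq     max neval
 for (i in 1:17) { mean(my_numeric_batting[, i], na.rm = TRUE) } 11.0972 15.1074 15.51984 15.6637 16.0935 44.9836   100
The vectorized call has a median time of about 3.9 milliseconds versus about 15.7 milliseconds for the loop, so it is roughly four times faster here.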
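Note that the benchmarked loop computes each mean and then discards it. A loop that is actually useful also has to store its results. The sketch below shows one way to do that and checks the answer against colMeans(). The data frame toy is a made-up stand-in for the batting data, so the code runs without any extra packages.

```r
# Made-up stand-in for the batting data, including a missing value
toy <- data.frame(G  = c(10, 20, NA),
                  AB = c(30, 40, 50),
                  H  = c(5, 15, 25))

# Loop version that keeps its results in a preallocated, named vector
loop_means <- numeric(ncol(toy))
names(loop_means) <- names(toy)
for (i in seq_along(toy)) {
  loop_means[i] <- mean(toy[, i], na.rm = TRUE)
}

# Vectorized version: a single call returns a named vector
vec_means <- colMeans(toy, na.rm = TRUE)

# The two approaches agree
all.equal(loop_means, vec_means)
```

The vectorized call needs no preallocation or indexing, which is part of why it is both shorter and faster.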
Vectorized ifelse
We saw the limitation of using standard if/then/else logic for manipulating a data set. The ifelse() function is a vectorized form of if/then/else logic.
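As a small warm-up, here is a sketch of ifelse() on a made-up vector. The wind speeds, the cutoff of 15, and the labels are all hypothetical and chosen only for illustration.

```r
# Made-up wind speeds, including a missing value
wind <- c(7.4, 12.6, 20.1, NA)

# if () only accepts a single TRUE/FALSE, so it cannot label a whole vector at once.
# ifelse() tests every element and picks the 'yes' or 'no' value for each one.
wind_cat <- ifelse(wind >= 15, "Windy", "Calm")
wind_cat
```

Elements where the condition is NA come back as NA, so missing values are carried through rather than silently assigned to a category.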
Let’s revisit our example that used the airquality dataset. We wanted to code a wind category variable: