In the spirit of loops, vectorized functions give us a way to execute code on an entire ‘vector’ at once (although we can be a bit more general than just vectors). This tends to speed up computation in comparison to basic loops in R!
This is because loops are comparatively slow in R. R is an interpreted language, which means it does a lot of the work of figuring out what to do for you. (Think about function dispatch: R looks at the type of object and figures out which version of plot() or summary() to use.) In a basic loop, R repeats that work on every iteration. A vectorized operation still runs a loop under the hood, typically in compiled code, but all of the elements of a vector have the same type. This means R can avoid figuring the same thing out repeatedly!
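As a minimal sketch of the difference, here is the same computation written as a basic loop and as a vectorized operation. The vector x and the squaring task are made up for illustration and are not part of the notes' data.

```r
# Hypothetical example vector (not from the notes)
x <- c(1.5, 2, 3.25, 4)

# Loop version: R handles one element per iteration
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: one call operates on the whole vector
squares_vec <- x^2

# Both approaches give the same answer
identical(squares_loop, squares_vec)
```

On a vector this small the timing difference is negligible; the gap tends to show up as the data get larger.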
Vectorized Functions for Common Numeric Summaries
There are some ‘built-in’ vectorized functions that are quite useful to apply to a two-dimensional object such as a matrix or a data frame:
colMeans(), rowMeans()
colSums(), rowSums()
colSds(), colVars(), colMedians() (must install the matrixStats package to get these)
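To see what the base R functions in this list return, here is a small sketch on a made-up matrix (the matrix m and its column names are assumptions for illustration only).

```r
# Small hypothetical matrix: 3 rows, 2 columns (filled column by column)
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3,
            dimnames = list(NULL, c("a", "b")))

colMeans(m)  # one mean per column: a = 2, b = 20
rowSums(m)   # one sum per row: 11, 22, 33

# These base R functions also accept a data frame whose columns are all numeric
colSums(as.data.frame(m))  # a = 6, b = 60
```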
Let’s go back to our batting dataset from the previous note set.
G AB R H X2B X3B HR
50.475469 137.928136 18.305890 35.993003 6.155200 1.221048 2.863367
If we install the matrixStats package (download the files from the internet), we can then use the colMedians() function to obtain the column medians in a quick fashion.
# install.packages("matrixStats") # only run this once on your machine!
library(matrixStats)
colMedians(my_batting[, 3:9])
Error in colMedians(my_batting[, 3:9]): Argument 'x' must be a matrix or a vector.
Ah, this function requires the object passed to be a matrix or vector (homogeneous). Although the data frame we pass is homogeneous (every column is numeric), the function doesn’t check for that. No worries, we can convert to a matrix using as.matrix(). Similar to the is. family of functions, there is an as. family of functions (read as ‘as dot’).
colMedians(as.matrix(my_batting[, 3:9]))
G AB R H X2B X3B HR
34 45 4 8 1 0 0
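The is. and as. families can be tried out without any extra packages. The sketch below uses a made-up data frame df (not the batting data) and, as a base R stand-in for colMedians(), applies median() to each column with apply().

```r
# Hypothetical data frame with only numeric columns
df <- data.frame(g = c(10, 34, 162), ab = c(20, 45, 600))

is.matrix(df)       # FALSE: a data frame is a list of columns
m <- as.matrix(df)  # coerce to a numeric matrix
is.matrix(m)        # TRUE
is.numeric(m)       # TRUE

# Base R alternative for column medians (margin 2 = columns)
apply(m, 2, median)  # g = 34, ab = 45
```

apply() is convenient here, but it is not one of the optimized column functions, so it is shown only to make the conversion step concrete.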
Let’s compare the speed of this code to the speed of a for loop!
The microbenchmark package allows for easy recording of computing time.
We just wrap the code we want to benchmark in the microbenchmark() function.
This repeatedly executes the code and reports summary statistics on how long it took.
Here we will grab all the numeric columns from the data.
Some columns contain NA or missing values. We’ll add na.rm = TRUE to both function calls to ignore those values (this is where the for loop actually struggles in this case!)
# install.packages("microbenchmark") # run only once on your machine!
library(microbenchmark)
my_numeric_batting <- Batting[, 6:22] # get all numeric columns
vectorized_results <- microbenchmark(
  colMeans(my_numeric_batting, na.rm = TRUE)
)
loop_results <- microbenchmark(
  for (i in 1:17) {
    mean(my_numeric_batting[, i], na.rm = TRUE)
  }
)
Compare computational time
vectorized_results
Unit: milliseconds
expr                                        min     lq       mean      median   uq       max     neval
colMeans(my_numeric_batting, na.rm = TRUE)  2.9402  3.00435  3.569698  3.07705  3.22875  8.3427  100
loop_results
Unit: milliseconds
expr                                                             min    lq        mean      median    uq        max      neval
for (i in 1:17) { mean(my_numeric_batting[, i], na.rm = TRUE) }  9.899  13.86045  14.13333  14.36675  14.61005  35.9559  100
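The benchmarked loop computes each mean but does not keep it. As a sanity check that the two approaches really do the same job, here is a sketch that stores the loop results and compares them to colMeans(). It uses the built-in airquality data rather than the batting data so that it runs without extra packages; the object names aq, vec_means, and loop_means are made up for this example.

```r
# airquality ships with base R; Ozone and Solar.R contain NA values
aq <- airquality

# Vectorized: one call handles every column
vec_means <- colMeans(aq, na.rm = TRUE)

# Loop: preallocate a result vector, then fill it one column at a time
loop_means <- numeric(ncol(aq))
names(loop_means) <- names(aq)
for (i in seq_len(ncol(aq))) {
  loop_means[i] <- mean(aq[, i], na.rm = TRUE)
}

# Check that both approaches agree (up to floating point tolerance)
all.equal(vec_means, loop_means)
```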
Vectorized ifelse
We saw the limitation of using standard if/then/else logic for manipulating a data set. The ifelse() function is a vectorized form of if/then/else logic.
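As a quick sketch of the idea on a toy vector: the wind speeds, the cutoff of 10, and the labels below are all made up for illustration and are not the categories used with the airquality data.

```r
# Hypothetical wind speeds (not the airquality data)
wind <- c(4.6, 9.7, 16.1, 12.0)

# if() expects a single TRUE/FALSE, so it cannot test a whole vector.
# ifelse() tests every element and picks a value for each one.
wind_label <- ifelse(wind >= 10, "windy", "calm")
wind_label  # "calm" "calm" "windy" "windy"
```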
Let’s revisit our example that used the airquality dataset. We wanted to code a wind category variable: