In the spirit of loops, vectorized functions give us a way to execute code on an entire ‘vector’ at once (although we can be a bit more general than just vectors). This tends to speed up computation in comparison to basic loops in R!
This is because explicit loops tend to be slow in R. R is an interpreted language, which means it does a lot of the work of figuring out what to do for you at run time. (Think about function dispatch: R looks at the type of object and figures out which version of plot() or summary() to use.) In a loop, that work is repeated on every iteration. A vectorized operation still runs a loop under the hood, but in compiled code, and because all the elements of a vector have the same type, R can avoid figuring the same thing out repeatedly!
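As a small illustration of the difference, here is the same computation written as an explicit loop and as a single vectorized call. The vector x is made up for this sketch and is not part of the notes' data.

```r
# Hypothetical numeric vector used only for illustration
x <- c(1.5, 2, 3.25, 4)

# Loop version: R evaluates the body once per element
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: one call handles the whole vector
squares_vec <- x^2

# Both approaches give the same values
all.equal(squares_loop, squares_vec)
```

On a vector this short the timing difference is negligible; the gap tends to show up as the number of elements grows.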
Vectorized Functions for Common Numeric Summaries
There are some ‘built-in’ vectorized functions that are quite useful to apply to a 2D type object:
colMeans(), rowMeans()
colSums(), rowSums()
colSds(), colVars(), colMedians() (must install the matrixStats package to get these)
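A quick sketch of the base R functions in this list, applied to a small made-up matrix (the values and the column names a and b are assumptions chosen for illustration):

```r
# Small made-up matrix: 3 rows, 2 columns (filled column by column)
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3,
            dimnames = list(NULL, c("a", "b")))

colMeans(m)  # mean of each column: a = 2, b = 20
rowMeans(m)  # mean of each row: 5.5, 11, 16.5
colSums(m)   # sum of each column: a = 6, b = 60
rowSums(m)   # sum of each row: 11, 22, 33
```

Each call returns one value per column (or per row) without writing a loop.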
Let’s go back to our batting dataset from the previous note set.
         G         AB          R          H        X2B        X3B         HR
 47.295798 129.389376  17.290953  33.766272   5.768674   1.154497   2.688300
If we install the matrixStats package (download the files from the internet), we can then use the colMedians() function to obtain the column medians in a quick fashion.
#install.packages("matrixStats") #only run this once on your machine!
library(matrixStats)
colMedians(my_batting[, 3:9])
Error in `colMedians()`:
! Argument 'x' must be a matrix or a vector
Ah, this function requires the object passed to be a matrix or vector (homogeneous). Although the columns of the data frame we pass are all numeric, colMedians() doesn’t check for that; it only accepts a matrix or vector. No worries, we can convert to a matrix using as.matrix(). (Similar to the is. family of functions, there is an as. family of functions, read as ‘as dot’.)
colMedians(as.matrix(my_batting[, 3:9]))
  G  AB   R   H X2B X3B  HR
 31  40   3   7   1   0   0
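To see the is. and as. families at work without the batting data, here is a minimal sketch on a made-up data frame. The names df, hits, and runs are assumptions for illustration, and the medians are computed with base R's apply() so the example runs without matrixStats installed.

```r
# Made-up data frame with two numeric columns
df <- data.frame(hits = c(7, 0, 12), runs = c(3, 1, 5))

is.matrix(df)       # FALSE: a data frame is a list of columns
m <- as.matrix(df)  # coerce to a numeric matrix
is.matrix(m)        # TRUE
is.numeric(m)       # TRUE: all elements share one type

# Base R way to get column medians from the matrix
apply(m, 2, median)  # hits = 7, runs = 3
```

Once the data is a matrix, a call like matrixStats::colMedians(m) should accept it in the same way as the batting example above.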
Let’s compare the speed of this code to the speed of a for loop!
The microbenchmark package allows for easy recording of computing time.
We just wrap the code we want to benchmark in the microbenchmark() function.
This repeatedly executes the code and reports summary statistics on how long it took.
Here we will grab all the numeric columns from the data.
Some columns contain NA or missing values. We’ll add na.rm = TRUE to both function calls to ignore those values (this is where the for loop actually struggles in this case!)
#install.packages("microbenchmark") #run only once on your machine!
library(microbenchmark)
my_numeric_batting <- Batting[, 6:22] #get all numeric columns
vectorized_results <- microbenchmark(colMeans(my_numeric_batting, na.rm = TRUE))
loop_results <- microbenchmark(
  for (i in 1:17) {
    mean(my_numeric_batting[, i], na.rm = TRUE)
  }
)
Compare the computation times:
vectorized_results
Unit: milliseconds
                                       expr      min       lq     mean   median     uq     max neval
 colMeans(my_numeric_batting, na.rm = TRUE) 4.317701 4.583202 5.456984 4.706802 4.9729 11.3953   100
loop_results
Unit: milliseconds
                                                            expr      min       lq     mean   median       uq     max neval
 for (i in 1:17) { mean(my_numeric_batting[, i], na.rm = TRUE) } 8.249702 8.663851 9.444096 8.937901 9.251451 40.6268   100
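In the run above, the median time was about 4.7 milliseconds for colMeans() versus about 8.9 milliseconds for the loop, so the vectorized call was roughly twice as fast. The sketch below repeats the comparison on made-up data so it can be run without the batting data or the microbenchmark package. The data frame fake, its size, and the use of base R's system.time() are assumptions for illustration. Timings vary by machine and by data, so no particular result is promised.

```r
# Made-up data: 17 numeric columns with some missing values in the first
set.seed(1)
n_rows <- 10000
fake <- as.data.frame(matrix(rnorm(n_rows * 17), ncol = 17))
fake[sample(n_rows, 100), 1] <- NA

# Vectorized: one call covers every column
vec_means <- colMeans(fake, na.rm = TRUE)

# Loop: one mean() call per column, storing each result
loop_means <- numeric(ncol(fake))
for (i in seq_along(fake)) {
  loop_means[i] <- mean(fake[, i], na.rm = TRUE)
}

# The two approaches agree on the answer
all.equal(unname(vec_means), loop_means)

# Rough timing with base R, repeating each approach 50 times
system.time(for (k in 1:50) colMeans(fake, na.rm = TRUE))
system.time(for (k in 1:50) for (i in seq_along(fake)) mean(fake[, i], na.rm = TRUE))
```

Note that the loop here stores its results, while the benchmarked loop above computes each mean and discards it.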
Vectorized ifelse
We saw the limitation of using standard if/then/else logic for manipulating a data set. The ifelse() function is a vectorized form of if/then/else logic.
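A minimal sketch of the idea on a made-up vector. The wind speeds, the cutoff of 15, and the labels are assumptions for illustration, not the categories used in the notes.

```r
# Made-up wind speeds, for illustration only
wind <- c(4.6, 9.7, 15.5, 20.1)

# if () expects a single TRUE/FALSE, so it cannot label a whole vector;
# ifelse() checks the condition for every element
wind_cat <- ifelse(wind >= 15, "Windy", "Calm")
wind_cat  # "Calm" "Calm" "Windy" "Windy"
```

The result has one label per element of wind, so it can be stored directly as a new column.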
Let’s revisit our example that used the airquality dataset. We wanted to code a wind category variable: