Control Flow: Vectorized Functions

Published

2026-05-27

Vectorized Functions

In the spirit of loops, vectorized functions give us a way to execute code on an entire ‘vector’ at once (although we can be a bit more general than just vectors). This tends to speed up computation in comparison to basic loops in R!

This is because explicit loops tend to be slow in R. R is an interpreted language, which means that it does a lot of the work of figuring out what to do for you, and it repeats that work on every iteration of a loop. (Think about function dispatch: it looks at the type of object and figures out which version of plot() or summary() to use.) A vectorized function still runs a loop under the hood, but that loop typically runs in compiled code, and every element of a vector has the same type. This means R can avoid figuring the same thing out repeatedly!
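As a minimal sketch of the difference, the code below adds up a small vector with an explicit loop and then with the vectorized sum() function. The vector x and its values are made up for illustration.

```r
# Hypothetical example vector; the values are arbitrary.
x <- c(2.5, 4, 1.5, 8)

# Loop version: R interprets each iteration separately.
total_loop <- 0
for (value in x) {
  total_loop <- total_loop + value
}

# Vectorized version: one call to sum(), which does its looping internally.
total_vectorized <- sum(x)

# Arithmetic operators are vectorized too: this doubles every element at once.
doubled <- x * 2

total_loop        # 16
total_vectorized  # 16
doubled           # 5 8 3 16
```

Both approaches give the same total; the vectorized call just hands the repetition to R's internals instead of spelling it out.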

Vectorized Functions for Common Numeric Summaries

There are some ‘built-in’ vectorized functions that are quite useful to apply to a 2D object (such as a matrix or a data frame with numeric columns):

  • colMeans(), rowMeans()
  • colSums(), rowSums()
  • colSds(), colVars(), colMedians() (must install the matrixStats package to get these)
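To see what the base R functions in this list return, here is a small sketch on a made-up 2 x 3 matrix (the object m and its row and column names are hypothetical, chosen only for illustration).

```r
# Hypothetical 2 x 3 matrix; matrix() fills it column by column.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

col_means <- colMeans(m)  # one mean per column: a = 1.5, b = 3.5, c = 5.5
row_sums  <- rowSums(m)   # one sum per row:     r1 = 9,  r2 = 12

col_means
row_sums
```

Each function returns a named vector with one entry per column (or row), so no loop over the columns is needed.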

Let’s go back to our batting dataset from the previous note set.

library(Lahman)
my_batting <- Batting[, c("playerID", "teamID", "G", "AB", "R", "H", "X2B", "X3B", "HR")]
head(my_batting)
   playerID teamID  G AB R H X2B X3B HR
1 aardsda01    SFN 11  0 0 0   0   0  0
2 aardsda01    CHN 45  2 0 0   0   0  0
3 aardsda01    CHA 25  0 0 0   0   0  0
4 aardsda01    BOS 47  1 0 0   0   0  0
5 aardsda01    SEA 73  0 0 0   0   0  0
6 aardsda01    SEA 53  0 0 0   0   0  0

We can apply the colMeans() function easily!

colMeans(my_batting[, 3:9])
         G         AB          R          H        X2B        X3B         HR 
 47.295798 129.389376  17.290953  33.766272   5.768674   1.154497   2.688300 

If we install the matrixStats package (download the files from the internet), we can then use the colMedians() function to obtain the column medians in a quick fashion.

#install.packages("matrixStats") #only run this once on your machine!
library(matrixStats)
colMedians(my_batting[, 3:9])
Error in `colMedians()`:
! Argument 'x' must be a matrix or a vector

Ah, this function requires the object passed to be a matrix or vector (homogeneous). Although the columns of the data frame we pass are all numeric, a data frame is not a matrix, and the function doesn’t convert it for us. No worries, we can convert to a matrix using as.matrix(). (Similar to the is. family of functions, there is an as. family of functions, read as ‘as dot’.)

colMedians(as.matrix(my_batting[, 3:9]))
  G  AB   R   H X2B X3B  HR 
 31  40   3   7   1   0   0 
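One thing to watch with as.matrix() is that a matrix can only hold a single type. The sketch below uses a tiny made-up data frame (df and its values are hypothetical) to show what happens when a character column is included in the conversion.

```r
# Hypothetical small data frame; the values are made up.
df <- data.frame(G = c(11, 45, 25),
                 AB = c(0, 2, 0),
                 teamID = c("SFN", "CHN", "CHA"))

is.matrix(df)  # FALSE: a data frame is a list of columns, not a matrix

# Converting only the numeric columns keeps the values numeric.
numeric_matrix <- as.matrix(df[, c("G", "AB")])
is.matrix(numeric_matrix)  # TRUE
typeof(numeric_matrix)     # "double"

# Including the character column coerces every entry to character.
mixed_matrix <- as.matrix(df)
typeof(mixed_matrix)       # "character"
```

This is why the example above subsets to the numeric columns before calling as.matrix().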

Let’s compare the speed of this code to the speed of a for loop!

  • The microbenchmark package allows for easy recording of computing time.

  • We just wrap the code we want to benchmark in the microbenchmark() function.

  • This repeatedly executes the code and reports summary stats on how long it took.

    • Here we will grab all the numeric columns from the data
    • Some columns contain NA or missing values. We’ll add na.rm = TRUE to both function calls to ignore those values (this is where the for loop actually struggles in this case!)
#install.packages("microbenchmark") #run only once on your machine!
library(microbenchmark)
my_numeric_batting <- Batting[, 6:22] #get all numeric columns
vectorized_results <- microbenchmark(
  colMeans(my_numeric_batting, na.rm = TRUE)
)

loop_results <- microbenchmark(
  for(i in 1:17){
    mean(my_numeric_batting[, i], na.rm = TRUE)
  }
)
  • Compare computational time
vectorized_results
Unit: milliseconds
                                       expr      min       lq     mean   median
 colMeans(my_numeric_batting, na.rm = TRUE) 4.317701 4.583202 5.456984 4.706802
     uq     max neval
 4.9729 11.3953   100
loop_results
Unit: milliseconds
                                                                expr      min
 for (i in 1:17) {     mean(my_numeric_batting[, i], na.rm = TRUE) } 8.249702
       lq     mean   median       uq     max neval
 8.663851 9.444096 8.937901 9.251451 40.6268   100
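Note that the benchmarked loop computes each mean but never stores it. As a rough sketch of a fairer comparison, the loop below saves its results in a preallocated vector so we can confirm that it agrees with colMeans(). The data frame df is a made-up stand-in with one missing value, not the batting data.

```r
# Hypothetical data frame with a missing value, for illustration only.
df <- data.frame(a = c(1, 2, NA), b = c(4, 6, 8))

# Loop version: preallocate the result, then fill it one column at a time.
loop_means <- numeric(ncol(df))
names(loop_means) <- names(df)
for (i in seq_len(ncol(df))) {
  loop_means[i] <- mean(df[, i], na.rm = TRUE)
}

# Vectorized version: one call handles every column.
vectorized_means <- colMeans(df, na.rm = TRUE)

# Both approaches should agree.
all.equal(loop_means, vectorized_means)
```

Storing the results adds a little work to the loop, so if anything this version would be expected to run slightly slower than the loop timed above.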

Vectorized ifelse

We saw the limitation of using standard if/then/else logic for manipulating a data set. The ifelse() function is a vectorized form of if/then/else logic.

Let’s revisit our example that used the airquality dataset. We wanted to code a wind category variable:

  • high wind days (15mph \(\leq\) wind)
  • windy days (10mph \(\leq\) wind < 15mph)
  • light wind days (6mph \(\leq\) wind < 10mph)
  • calm days (wind < 6mph)

The syntax for ifelse is:

ifelse(vector_condition, if_true_do_this, if_false_do_this)

A vector is returned!

ifelse(airquality$Wind >= 15, "HighWind", "Not HighWind")
  [1] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
  [6] "Not HighWind" "Not HighWind" "Not HighWind" "HighWind"     "Not HighWind"
 [11] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [16] "Not HighWind" "Not HighWind" "HighWind"     "Not HighWind" "Not HighWind"
 [21] "Not HighWind" "HighWind"     "Not HighWind" "Not HighWind" "HighWind"    
 [26] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [31] "Not HighWind" "Not HighWind" "Not HighWind" "HighWind"     "Not HighWind"
 [36] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [41] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [46] "Not HighWind" "Not HighWind" "HighWind"     "Not HighWind" "Not HighWind"
 [51] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [56] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [61] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [66] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [71] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [76] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [81] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [86] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [91] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
 [96] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
[101] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
[106] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
[111] "Not HighWind" "Not HighWind" "HighWind"     "Not HighWind" "Not HighWind"
[116] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
[121] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
[126] "Not HighWind" "Not HighWind" "Not HighWind" "HighWind"     "Not HighWind"
[131] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "HighWind"    
[136] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
[141] "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind" "Not HighWind"
[146] "Not HighWind" "Not HighWind" "HighWind"     "Not HighWind" "Not HighWind"
[151] "Not HighWind" "Not HighWind" "Not HighWind"
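The Wind column has no missing values, but it is worth knowing what ifelse() does when the condition is NA. A minimal sketch, using a made-up vector (wind is hypothetical, not part of airquality):

```r
# Hypothetical wind speeds, including one missing value.
wind <- c(7.4, NA, 20.1)

# Where the condition is NA, ifelse() returns NA rather than either label.
labels <- ifelse(wind >= 15, "HighWind", "Not HighWind")
labels
```

The result has one element per element of the condition, and the missing value stays missing instead of being silently placed in a category.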

We can use a second call to ifelse() to assign what to do in the FALSE condition!

ifelse(airquality$Wind >= 15, "HighWind",
          ifelse(airquality$Wind >= 10, "Windy",
                 ifelse(airquality$Wind >= 6, "LightWind", 
                        ifelse(airquality$Wind >= 0, "Calm", "Error"))))
  [1] "LightWind" "LightWind" "Windy"     "Windy"     "Windy"     "Windy"    
  [7] "LightWind" "Windy"     "HighWind"  "LightWind" "LightWind" "LightWind"
 [13] "LightWind" "Windy"     "Windy"     "Windy"     "Windy"     "HighWind" 
 [19] "Windy"     "LightWind" "LightWind" "HighWind"  "LightWind" "Windy"    
 [25] "HighWind"  "Windy"     "LightWind" "Windy"     "Windy"     "Calm"     
 [31] "LightWind" "LightWind" "LightWind" "HighWind"  "LightWind" "LightWind"
 [37] "Windy"     "LightWind" "LightWind" "Windy"     "Windy"     "Windy"    
 [43] "LightWind" "LightWind" "Windy"     "Windy"     "Windy"     "HighWind" 
 [49] "LightWind" "Windy"     "Windy"     "LightWind" "Calm"      "Calm"     
 [55] "LightWind" "LightWind" "LightWind" "Windy"     "Windy"     "Windy"    
 [61] "LightWind" "Calm"      "LightWind" "LightWind" "Windy"     "Calm"     
 [67] "Windy"     "Calm"      "LightWind" "Calm"      "LightWind" "LightWind"
 [73] "Windy"     "Windy"     "Windy"     "Windy"     "LightWind" "Windy"    
 [79] "LightWind" "Calm"      "Windy"     "LightWind" "LightWind" "Windy"    
 [85] "LightWind" "LightWind" "LightWind" "Windy"     "LightWind" "LightWind"
 [91] "LightWind" "LightWind" "LightWind" "Windy"     "LightWind" "LightWind"
 [97] "LightWind" "Calm"      "Calm"      "Windy"     "LightWind" "LightWind"
[103] "Windy"     "Windy"     "Windy"     "LightWind" "Windy"     "Windy"    
[109] "LightWind" "LightWind" "Windy"     "Windy"     "HighWind"  "Windy"    
[115] "Windy"     "LightWind" "Calm"      "LightWind" "Calm"      "LightWind"
[121] "Calm"      "LightWind" "LightWind" "LightWind" "Calm"      "Calm"     
[127] "Calm"      "LightWind" "HighWind"  "Windy"     "Windy"     "Windy"    
[133] "LightWind" "Windy"     "HighWind"  "LightWind" "Windy"     "Windy"    
[139] "LightWind" "Windy"     "Windy"     "Windy"     "LightWind" "Windy"    
[145] "LightWind" "Windy"     "Windy"     "HighWind"  "LightWind" "Windy"    
[151] "Windy"     "LightWind" "Windy"    

Whoa that was pretty easy! Nice.
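Nested ifelse() calls get harder to read as the number of categories grows. As an alternative sketch (a different technique from the one used above), base R's cut() function can bin a numeric vector in one call. The break points below are taken from the category definitions, and right = FALSE makes each interval closed on the left, so 6 falls in "LightWind" just as it does with the nested version.

```r
# Bin Wind into [0, 6), [6, 10), [10, 15), [15, Inf).
wind_cut <- cut(airquality$Wind,
                breaks = c(0, 6, 10, 15, Inf),
                labels = c("Calm", "LightWind", "Windy", "HighWind"),
                right = FALSE)

# The nested ifelse() version from above, for comparison.
wind_nested <- ifelse(airquality$Wind >= 15, "HighWind",
                      ifelse(airquality$Wind >= 10, "Windy",
                             ifelse(airquality$Wind >= 6, "LightWind",
                                    ifelse(airquality$Wind >= 0, "Calm", "Error"))))

# cut() returns a factor, so convert to character before comparing.
all(as.character(wind_cut) == wind_nested)
```

Two differences to keep in mind: cut() returns a factor rather than a character vector, and a value outside the breaks (such as a negative wind speed) would become NA instead of "Error".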

Let’s compare this to using a for loop speed-wise.

loopTime<-microbenchmark(
  for (i in seq_len(nrow(airquality))){
    if(airquality$Wind[i] >= 15){
       "HighWind"
    } else if (airquality$Wind[i] >= 10){
      "Windy"
    } else if (airquality$Wind[i] >= 6){
      "LightWind"
    } else if (airquality$Wind[i] >= 0){
      "Calm"
    } else{
      "Error"
    }
  }
, unit = "us")
vectorTime <- microbenchmark(
  ifelse(airquality$Wind >= 15, "HighWind",
         ifelse(airquality$Wind >= 10, "Windy",
                ifelse(airquality$Wind >= 6, "LightWind", 
                       ifelse(airquality$Wind >= 0, "Calm", "Error"))))
)

Compare!

loopTime
Unit: microseconds
                                                                                                                                                                                                                                                                                                                                 expr
 for (i in seq_len(nrow(airquality))) {     if (airquality$Wind[i] >= 15) {         "HighWind"     }     else if (airquality$Wind[i] >= 10) {         "Windy"     }     else if (airquality$Wind[i] >= 6) {         "LightWind"     }     else if (airquality$Wind[i] >= 0) {         "Calm"     }     else {         "Error"     } }
      min       lq     mean median       uq      max neval
 2977.901 3070.001 3474.214 3197.5 3516.101 7749.501   100
vectorTime
Unit: microseconds
                                                                                                                                                                                  expr
 ifelse(airquality$Wind >= 15, "HighWind", ifelse(airquality$Wind >=      10, "Windy", ifelse(airquality$Wind >= 6, "LightWind", ifelse(airquality$Wind >=      0, "Calm", "Error"))))
    min      lq     mean median      uq     max neval
 24.201 26.2015 32.15306 28.302 32.6005 105.301   100
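The loop timed above evaluates each label but throws it away. To actually build the variable with a loop, we would need to store each result, as in the sketch below (the names wind, loop_labels, and vector_labels are introduced here for illustration). It also lets us confirm that the two approaches produce the same answer.

```r
wind <- airquality$Wind

# Preallocate a character vector, then fill it one element at a time.
loop_labels <- character(length(wind))
for (i in seq_along(wind)) {
  loop_labels[i] <- if (wind[i] >= 15) {
    "HighWind"
  } else if (wind[i] >= 10) {
    "Windy"
  } else if (wind[i] >= 6) {
    "LightWind"
  } else if (wind[i] >= 0) {
    "Calm"
  } else {
    "Error"
  }
}

# Vectorized version from above.
vector_labels <- ifelse(wind >= 15, "HighWind",
                        ifelse(wind >= 10, "Windy",
                               ifelse(wind >= 6, "LightWind",
                                      ifelse(wind >= 0, "Calm", "Error"))))

identical(loop_labels, vector_labels)
```

Note that this loop would fail on a missing value, because if() cannot evaluate an NA condition, whereas ifelse() would simply return NA for that element.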

Note: There is an if_else() function from the dplyr package. It is stricter than ifelse() (for example, the true and false values must have compatible types) but otherwise is pretty similar.
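A short sketch of what that strictness looks like in practice, assuming the dplyr package is installed (the code is skipped otherwise). The wind vector is made up, and the exact wording of the error may differ between dplyr versions, so the sketch only records that an error occurred.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  # Hypothetical wind speeds, including one missing value.
  wind <- c(7.4, NA, 20.1)

  # The missing argument says what to return where the condition is NA.
  strict_labels <- dplyr::if_else(wind >= 15, "HighWind", "Not HighWind",
                                  missing = "Unknown")

  # Mixing a character value with a numeric value is an error in if_else();
  # ifelse() would quietly coerce everything to character instead.
  mixed_types <- tryCatch(
    dplyr::if_else(wind >= 15, "HighWind", 0),
    error = function(e) "error"
  )

  print(strict_labels)
  print(mixed_types)
}
```

The type check can catch mistakes early, which is the main reason to prefer if_else() when already working with dplyr.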

Recap!

  • Explicit loops tend to be slower than vectorized functions in R

  • Use vectorized functions if possible

  • Common vectorized functions

    • colMeans(), rowMeans()
    • colSums(), rowSums()
    • matrixStats::colSds(), matrixStats::colVars(), matrixStats::colMedians()
    • ifelse() or dplyr::if_else()
    • apply family (covered soon)
    • purrr package (covered in a bit)
