class: center, middle, inverse, title-slide

.title[
# Multiple Linear Regression Models
]
.author[
### Justin Post
]

---

# Recap

Given a model, we **fit** the model using data

- Must determine how well the model predicts on **new** data
- Create a test set or use CV
- Judge effectiveness using a **metric** on predictions made from the model

---

# Regression Modeling Ideas

For a set of observations `\(y_1,...,y_n\)`, we may want to predict a future value

- Often use the sample mean to do so, `\(\bar{y}\)` (an estimate of `\(E(Y)\)`)

---

# Regression Modeling Ideas

For a set of observations `\(y_1,...,y_n\)`, we may want to predict a future value

- Often use the sample mean to do so, `\(\bar{y}\)` (an estimate of `\(E(Y)\)`)

Now consider having pairs `\((x_1,y_1), (x_2,y_2),...,(x_n,y_n)\)`

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-2-1.svg" width="370px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

Often use a linear (in the parameters) model for prediction

`$$\mbox{SLR model: }E(Y|x) = \beta_0+\beta_1x$$`

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-3-1.svg" width="370px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

Can include more terms on the right-hand side (RHS)

`$$\mbox{Multiple Linear Regression Model: }E(Y|x) = \beta_0+\beta_1x+\beta_2x^2$$`

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-4-1.svg" width="370px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

Can include more terms on the right-hand side (RHS)

`$$\mbox{Multiple Linear Regression Model: }E(Y|x) = \beta_0+\beta_1x+\beta_2x^2+\beta_3x^3$$`

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-5-1.svg" width="370px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

- We model the mean response for a given `\(x\)` value
- With multiple predictors or `\(x\)`'s, we apply the same idea!

<img src="img/true_mlr.png" width="500px" style="display: block; margin: auto;" />
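---

# Regression Modeling Ideas

The single-`x` mean models from the previous slides can all be fit with `lm()` in R (syntax details come later in this deck). A minimal sketch with *simulated* data, since the data behind the plots above aren't shown:

```r
set.seed(1)  # arbitrary seed, just for reproducibility
x <- runif(n = 100, min = 0, max = 2)
y <- 1 + 2 * x - 1.5 * x^2 + rnorm(n = 100, sd = 0.3)  # made-up quadratic truth
sim_data <- data.frame(x = x, y = y)

mean_fit <- lm(y ~ 1, data = sim_data)           # E(Y) = beta_0, i.e., just ybar
slr_fit  <- lm(y ~ x, data = sim_data)           # E(Y|x) = beta_0 + beta_1 x
quad_fit <- lm(y ~ x + I(x^2), data = sim_data)  # adds a beta_2 x^2 term

coef(mean_fit)  # the intercept-only fit recovers mean(sim_data$y)
```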
---

# Regression Modeling Ideas

- Including a **main effect** for two predictors fits the best plane through the data

`$$\mbox{Multiple Linear Regression Model: } E(Y|x_1,x_2) = \beta_0+\beta_1x_1+\beta_2x_2$$`

<img src="img/best_mlr_plane.png" width="400px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

- Including **main effects** and an **interaction effect** allows for a more flexible surface

`$$\mbox{Multiple Linear Regression Model: } E(Y|x_1,x_2) = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2$$`

<img src="img/best_mlr_saddle.png" width="500px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

- Including **main effects** and an **interaction effect** allows for a more flexible surface
- Interaction effects allow for the **effect** of one variable to depend on the value of another
- Model fit previously gives
    + `\(\hat{y} = 19.005 - 0.791x_1 + 5.631x_2 - 12.918x_1x_2\)`

---

# Regression Modeling Ideas

- Including **main effects** and an **interaction effect** allows for a more flexible surface
- Interaction effects allow for the **effect** of one variable to depend on the value of another
- Model fit previously gives
    + `\(\hat{y} = 19.005 - 0.791x_1 + 5.631x_2 - 12.918x_1x_2\)`
    + For `\(x_1 = 0\)`, the slope on `\(x_2\)` is `\(5.631 + 0\cdot(-12.918) = 5.631\)`

---

# Regression Modeling Ideas

- Including **main effects** and an **interaction effect** allows for a more flexible surface
- Interaction effects allow for the **effect** of one variable to depend on the value of another
- Model fit previously gives
    + `\(\hat{y} = 19.005 - 0.791x_1 + 5.631x_2 - 12.918x_1x_2\)`
    + For `\(x_1 = 0\)`, the slope on `\(x_2\)` is `\(5.631 + 0\cdot(-12.918) = 5.631\)`
    + For `\(x_1 = 0.5\)`, the slope on `\(x_2\)` is `\(5.631 + 0.5\cdot(-12.918) = -0.828\)`

---

# Regression Modeling Ideas

- Including **main effects** and an **interaction effect** allows for a more flexible surface
- Interaction effects allow for the **effect** of one variable to depend on the value of another
- Model fit previously gives
    + `\(\hat{y} = 19.005 - 0.791x_1 + 5.631x_2 - 12.918x_1x_2\)`
    + For `\(x_1 = 0\)`, the slope on `\(x_2\)` is `\(5.631 + 0\cdot(-12.918) = 5.631\)`
    + For `\(x_1 = 0.5\)`, the slope on `\(x_2\)` is `\(5.631 + 0.5\cdot(-12.918) = -0.828\)`
    + For `\(x_1 = 1\)`, the slope on `\(x_2\)` is `\(5.631 + 1\cdot(-12.918) = -7.287\)`
- Similarly, the slope on `\(x_1\)` depends on `\(x_2\)`!
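---

# Regression Modeling Ideas

In R, these conditional slopes can be computed directly from the fitted coefficients. A minimal sketch, assuming a data frame `dat` with columns `y`, `x1`, and `x2` (hypothetical; the data behind the fit above aren't shown):

```r
# x1 * x2 expands to the main effects plus the interaction: x1 + x2 + x1:x2
int_fit <- lm(y ~ x1 * x2, data = dat)

# slope on x2 at a fixed value of x1 is beta_2 + x1 * beta_3
slope_on_x2 <- function(x1_value) {
  unname(coef(int_fit)["x2"] + x1_value * coef(int_fit)["x1:x2"])
}

slope_on_x2(0)    # just beta_2
slope_on_x2(0.5)  # the interaction term shifts the slope
slope_on_x2(1)
```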
---

# Regression Modeling Ideas

- Including **main effects** and an **interaction effect** allows for a more flexible surface
- Can also include higher-order polynomial terms

`$$\mbox{Multiple Linear Regression Model: } E(Y|x_1,x_2) = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2+\beta_4x_1^2$$`

<img src="img/best_mlr_comp.png" width="500px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

Can also include categorical variables through **dummy** or **indicator** variables

- Categorical variable with values of `\(Success\)` and `\(Failure\)`
- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

---
layout: false

# Regression Modeling Ideas

Can also include categorical variables through **dummy** or **indicator** variables

- Categorical variable with values of `\(Success\)` and `\(Failure\)`
- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-14-1.svg" width="370px" style="display: block; margin: auto auto auto 0;" />

---

# Regression Modeling Ideas

- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

`$$\mbox{Separate Intercept Model: }E(Y|x) = \beta_0+\beta_1x_1 + \beta_2x_2$$`

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-15-1.svg" width="370px" style="display: block; margin: auto auto auto 0;" />

---

# Regression Modeling Ideas

- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

`$$\mbox{Separate Intercept and Slopes Model: }E(Y|x) = \beta_0+\beta_1x_1 + \beta_2x_2+\beta_3x_1x_2$$`

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-16-1.svg" width="370px" style="display: block; margin: auto auto auto 0;" />

---

# Regression Modeling Ideas

- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

`$$\mbox{Separate Quadratics Model: }E(Y|x) = \beta_0+\beta_1x_2+\beta_2x_1+\beta_3x_1x_2+\beta_4x_1^2+\beta_5x_1^2x_2$$`

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-17-1.svg" width="370px" style="display: block; margin: auto auto auto 0;" />

---

# Regression Modeling Ideas

If your categorical variable has `\(k > 2\)` categories, define `\(k-1\)` dummy variables

- Categorical variable with values of "Assistant", "Contractor", "Executive"
- Define `\(x_2 = 0\)` if variable is `\(Executive\)` or `\(Contractor\)`
- Define `\(x_2 = 1\)` if variable is `\(Assistant\)`
- Define `\(x_3 = 0\)` if variable is `\(Contractor\)` or `\(Assistant\)`
- Define `\(x_3 = 1\)` if variable is `\(Executive\)`

---

# Regression Modeling Ideas

If your categorical variable has `\(k > 2\)` categories, define `\(k-1\)` dummy variables

- Categorical variable with values of "Assistant", "Contractor", "Executive"
- Define `\(x_2 = 0\)` if variable is `\(Executive\)` or `\(Contractor\)`
- Define `\(x_2 = 1\)` if variable is `\(Assistant\)`
- Define `\(x_3 = 0\)` if variable is `\(Contractor\)` or `\(Assistant\)`
- Define `\(x_3 = 1\)` if variable is `\(Executive\)`

`$$\mbox{Separate Intercepts Model: }E(Y|x) = \beta_0+ \beta_1x_1+\beta_2x_2+\beta_3x_3$$`

What is implied if `\(x_2\)` and `\(x_3\)` are both zero?
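---

# Regression Modeling Ideas

You rarely build these dummy variables by hand: R creates them automatically when a factor (or character) variable appears in a model formula. A small sketch with a made-up `job` variable, showing the design matrix `lm()` would use:

```r
job <- factor(c("Assistant", "Contractor", "Executive", "Assistant"))

# k = 3 levels produce k - 1 = 2 dummy columns (plus the intercept);
# R's default baseline is the first level alphabetically ("Assistant")
model.matrix(~ job)
```

Note the baseline here differs from the coding on the previous slides, which made `\(Contractor\)` the category with all dummies equal to zero; either choice gives the same fitted means.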
---

# Fitting an MLR Model

Big Idea: Trying to find the line, plane, saddle, etc. **of best fit** through points

- How do we do the fit?
    + Usually minimize the sum of squared residuals (errors)

---

# Fitting an MLR Model

Big Idea: Trying to find the line, plane, saddle, etc. **of best fit** through points

- How do we do the fit?
    + Usually minimize the sum of squared residuals (errors)
- Residual = observed - predicted, or `\(y_i-\hat{y}_i\)`

`$$\min\limits_{\hat{\beta}'s}\sum_{i=1}^{n}(y_i-(\hat\beta_0+\hat\beta_1x_{1i}+...+\hat\beta_px_{pi}))^2$$`

---

# Fitting an MLR Model

Big Idea: Trying to find the line, plane, saddle, etc. **of best fit** through points

- How do we do the fit?
    + Usually minimize the sum of squared residuals (errors)
- Residual = observed - predicted, or `\(y_i-\hat{y}_i\)`

`$$\min\limits_{\hat{\beta}'s}\sum_{i=1}^{n}(y_i-(\hat\beta_0+\hat\beta_1x_{1i}+...+\hat\beta_px_{pi}))^2$$`

- Closed-form results exist for easy calculation via software!

---

# Fitting a Linear Regression Model in R

- Use `lm()` and specify a `formula`: `LHS ~ RHS`
    + `y ~` implies `y` is modelled by a linear function of the RHS
    + RHS consists of terms separated by `+` operators
- `y ~ x1 + x2` gives `\(E(Y|x_1, x_2) = \beta_0+\beta_1x_1+\beta_2x_2\)`

---

# Fitting a Linear Regression Model in R

- Use `lm()` and specify a `formula`: `LHS ~ RHS`
    + `y ~` implies `y` is modelled by a linear function of the RHS
    + RHS consists of terms separated by `+` operators
- `y ~ x1 + x2` gives `\(E(Y|x_1, x_2) = \beta_0+\beta_1x_1+\beta_2x_2\)`
- `:` for interactions, `y ~ x1 + x2 + x1:x2` gives `\(E(Y|x_1, x_2) = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\)`

---

# Fitting a Linear Regression Model in R

- Use `lm()` and specify a `formula`: `LHS ~ RHS`
    + `y ~` implies `y` is modelled by a linear function of the RHS
    + RHS consists of terms separated by `+` operators
- `y ~ x1 + x2` gives `\(E(Y|x_1, x_2) = \beta_0+\beta_1x_1+\beta_2x_2\)`
- `:` for interactions, `y ~ x1 + x2 + x1:x2` gives `\(E(Y|x_1, x_2) = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\)`
- `*` denotes factor crossing: `a*b` is interpreted as `a + b + a:b`
- `y ~ x - 1` removes the intercept term

---

# Fitting a Linear Regression Model in R

- Use `lm()` and specify a `formula`: `LHS ~ RHS`
    + `y ~` implies `y` is modelled by a linear function of the RHS
    + RHS consists of terms separated by `+` operators
- `y ~ x1 + x2` gives `\(E(Y|x_1, x_2) = \beta_0+\beta_1x_1+\beta_2x_2\)`
- `:` for interactions, `y ~ x1 + x2 + x1:x2` gives `\(E(Y|x_1, x_2) = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\)`
- `*` denotes factor crossing: `a*b` is interpreted as `a + b + a:b`
- `y ~ x - 1` removes the intercept term
- `I()` can be used to include arithmetic expressions as predictors
    + `y ~ a + I(b+c)` implies a single predictor equal to the sum of `b` and `c`
    + `y ~ x + I(x^2)` implies `\(E(Y|x_1) = \beta_0+\beta_1x_1+\beta_2x_1^2\)`

---

# Fitting MLR Models

- Let's read in our `bike_data` and fit some MLR models

```r
library(tidyverse)
bike_data <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
bike_data <- bike_data |>
  mutate(log_selling_price = log(selling_price),
         log_km_driven = log(km_driven)) |>
  select(log_km_driven, year, log_selling_price, owner, everything())
bike_data
```

```
## # A tibble: 1,061 x 9
##   log_km_driven  year log_selling_price owner    name  selling_price seller_type
##           <dbl> <dbl>             <dbl> <chr>    <chr>         <dbl> <chr>      
## 1          5.86  2019             12.1  1st own~ Roya~        175000 Individual 
## 2          8.64  2017             10.7  1st own~ Hond~         45000 Individual 
## 3          9.39  2018             11.9  1st own~ Roya~        150000 Individual 
## 4         10.0   2015             11.1  1st own~ Yama~         65000 Individual 
## 5          9.95  2011              9.90 2nd own~ Yama~         20000 Individual 
## # i 1,056 more rows
## # i 2 more variables: km_driven <dbl>, ex_showroom_price <dbl>
```
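---

# Fitting MLR Models

Before fitting the models of interest, a quick check of the formula syntax from earlier against this data. A sketch (these two models are chosen purely for illustration, not analysis):

```r
# polynomial term included via I()
poly_fit <- lm(log_selling_price ~ log_km_driven + I(log_km_driven^2),
               data = bike_data)

# year * log_km_driven expands to year + log_km_driven + year:log_km_driven
int_fit <- lm(log_selling_price ~ year * log_km_driven, data = bike_data)

coef(poly_fit)
coef(int_fit)
```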
---

# Fitting MLR Models

- Create models with the same slope but intercepts differing by a categorical variable

```r
owner_fits <- lm(log_selling_price ~ owner + log_km_driven, data = bike_data)
coef(owner_fits)
```

```
##   (Intercept) owner2nd owner owner3rd owner owner4th owner  log_km_driven 
##   14.62423775    -0.06775874     0.08148045     0.20110313    -0.38930862 
```

---

# Fitting MLR Models

- Create a data frame for plotting

```r
x_values <- seq(from = min(bike_data$log_km_driven),
                to = max(bike_data$log_km_driven),
                length = 2)
pred_df <- data.frame(log_km_driven = rep(x_values, 4),
                      owner = c(rep("1st owner", 2), rep("2nd owner", 2),
                                rep("3rd owner", 2), rep("4th owner", 2)))
pred_df <- pred_df |>
  mutate(predictions = predict(owner_fits, newdata = pred_df))
pred_df
```

```
##   log_km_driven     owner predictions
## 1      5.857933 1st owner   12.343694
## 2     13.687677 1st owner    9.295507
## 3      5.857933 2nd owner   12.275935
## 4     13.687677 2nd owner    9.227748
## 5      5.857933 3rd owner   12.425174
## 6     13.687677 3rd owner    9.376987
## 7      5.857933 4th owner   12.544797
## 8     13.687677 4th owner    9.496610
```

---

# Fitting MLR Models

- Plot our different intercept models

```r
ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price, color = owner)) +
  geom_point() +
  geom_line(data = pred_df, aes(x = log_km_driven, y = predictions, color = owner))
```

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-21-1.svg" width="360px" style="display: block; margin: auto;" />

---

# Fitting MLR Models

- Create models with different slopes and intercepts

```r
owner_fits_full <- lm(log_selling_price ~ owner*log_km_driven, data = bike_data)
coef(owner_fits_full)
```

```
##                   (Intercept)                owner2nd owner 
##                   14.55347484                    0.63862406 
##                owner3rd owner                owner4th owner 
##                    0.82280649                    2.31991467 
##                 log_km_driven  owner2nd owner:log_km_driven 
##                   -0.38219492                   -0.06871037 
##  owner3rd owner:log_km_driven  owner4th owner:log_km_driven 
##                   -0.07295150                   -0.19192122 
```

---

# Fitting MLR Models

- Plot our different slopes and intercepts models

```r
ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price, color = owner)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

<img src="47-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-23-1.svg" width="360px" style="display: block; margin: auto;" />

---

# Choosing an MLR Model

- Given a bunch of predictors, there are tons of models you could fit! How to choose?
- Many variable selection methods exist...
- If you care mainly about prediction, just use *cross-validation* or a training/test split!
    + Compare predictions using some metric!
    + We'll see how to use `tidymodels` to do this in a coherent way shortly!

---

# Recap

- Multiple Linear Regression models are a common model used for a numeric response
- Generally fit via minimizing the sum of squared residuals or errors
    + Could fit using the sum of absolute deviations, or another metric
- Can include polynomial terms, interaction terms, and categorical variables
- A good metric for comparing models with a continuous response is the RMSE
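---

# Recap

To make the training/test comparison from the earlier slides concrete, here is a minimal sketch in base R (the 80/20 split and the seed are arbitrary choices for illustration):

```r
set.seed(42)  # arbitrary seed so the split is reproducible
n <- nrow(bike_data)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- bike_data[train_idx, ]
test  <- bike_data[-train_idx, ]

# refit both candidate models on the training data only
fit_main <- lm(log_selling_price ~ owner + log_km_driven, data = train)
fit_int  <- lm(log_selling_price ~ owner * log_km_driven, data = train)

# compare RMSE on the held-out test set; lower is better
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(test$log_selling_price, predict(fit_main, newdata = test))
rmse(test$log_selling_price, predict(fit_int,  newdata = test))
```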