class: center, middle, inverse, title-slide .title[ # Modeling Concepts: Inference vs Prediction ] .author[ ### Justin Post ] --- layout: false # What do we want to be able to do? Data Science! - Read in raw data and manipulate it - Combine data sources - Summarize data to glean insights - Apply common analysis methods - Communicate Effectively --- # Modeling Ideas What is a (statistical) model? - A mathematical representation of some phenomenon on which you've observed data - Form of the model can vary greatly! --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r bike_data <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv") bike_data <- bike_data |> mutate(log_selling_price = log(selling_price), log_km_driven = log(km_driven)) |> select(log_km_driven, log_selling_price, everything()) bike_data ``` ``` ## # A tibble: 1,061 x 9 ## log_km_driven log_selling_price name selling_price year seller_type owner ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> ## 1 5.86 12.1 Royal E~ 175000 2019 Individual 1st ~ ## 2 8.64 10.7 Honda D~ 45000 2017 Individual 1st ~ ## 3 9.39 11.9 Royal E~ 150000 2018 Individual 1st ~ ## 4 10.0 11.1 Yamaha ~ 65000 2015 Individual 1st ~ ## 5 9.95 9.90 Yamaha ~ 20000 2011 Individual 2nd ~ ## # i 1,056 more rows ## # i 2 more variables: km_driven <dbl>, ex_showroom_price <dbl> ``` --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + geom_smooth(method = "lm") ``` ``` ## `geom_smooth()` using formula = 'y ~ x' ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-2-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + stat_smooth(method = "lm", formula = y ~ x + I(x^2)) ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-3-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + stat_smooth(method = "lm", formula = y ~ x + I(x^2) + I(x^3)) ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-4-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + geom_smooth() ``` ``` ## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")' ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-5-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling (log) Selling Price ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + geom_line(data = preds, aes(x = log_km_driven, y = Tree_Predictions), color = "Blue", linewidth = 2) ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-7-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling Ideas What is a (statistical) model?
- A mathematical representation of some phenomenon on which you've observed data - Form of the model can vary greatly! **Statistical learning** - Inference, prediction/classification, and pattern finding - Supervised learning - a variable (or variables) represents an **output** or **response** of interest -- + May model the response and - Make **inference** on the model parameters - **predict** a value or **classify** an observation Our Goal: Understand what it means to be a good predictive model (not to make inference) --- # Training a Model - Once a class of models is chosen, we must define some criteria to **fit** (or train) the model **Simple Linear Regression (SLR) Model** `$$E(Y_i|x_i) = \beta_0+\beta_1x_i$$` --- # Training a Model - Once a class of models is chosen, we must define some criteria to **fit** (or train) the model **Simple Linear Regression (SLR) Model** `$$E(Y_i|x_i) = \beta_0+\beta_1x_i$$` - **Loss function** - Criteria used to fit or train a model - For a given **numeric** response value `\(y_i\)` and prediction `\(\hat{y}_i\)` `$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$` --- # Training a Model - Once a class of models is chosen, we must define some criteria to **fit** (or train) the model **Simple Linear Regression (SLR) Model** `$$E(Y_i|x_i) = \beta_0+\beta_1x_i$$` - **Loss function** - Criteria used to fit or train a model - For a given **numeric** response value `\(y_i\)` and prediction `\(\hat{y}_i\)` `$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$` - We try to minimize the total loss over all the observations used for training `$$\sum_{i=1}^{n} (y_i-\hat{y}_i)^2~~~~~~~~~~~~~~~~~~~~ \sum_{i=1}^{n} |y_i-\hat{y}_i|$$` --- # Training (Fitting) the SLR Model - Often use squared error loss (least squares regression) - Nice closed-form solutions for our estimates exist! `$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$` `$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$` --- # Training (Fitting) the SLR Model - Often use squared error loss (least squares regression) - Nice closed-form solutions for our estimates exist! `$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$` `$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$` ```r y <- bike_data$log_selling_price x <- bike_data$log_km_driven b1_hat <- sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2) b0_hat <- mean(y)-mean(x)*b1_hat c(round(b0_hat, 4), round(b1_hat, 4)) ``` ``` ## [1] 14.6356 -0.3911 ``` - Now we can find a prediction! Denoted as `\(\hat{y}\)` --- # Training (Fitting) the SLR Model in R - Use the `lm()` function to fit the model in R - Utilizes `formula` notation: `y ~ x` -> `response ~ model terms` ```r slr_fit <- lm(log_selling_price ~ log_km_driven, data = bike_data) slr_fit ``` ``` ## ## Call: ## lm(formula = log_selling_price ~ log_km_driven, data = bike_data) ## ## Coefficients: ## (Intercept) log_km_driven ## 14.6356 -0.3911 ``` --- # Inference Using the SLR Model in R - If we assume iid errors that are Normally distributed with the same variance, we can conduct inference! - Confidence intervals and hypothesis tests around the slope parameter - Use `summary()` (generic function!) on our fitted model ```r summary(slr_fit) ``` ``` ## ## Call: ## lm(formula = log_selling_price ~ log_km_driven, data = bike_data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.9271 -0.3822 -0.0337 0.3794 2.5656 ## ## Coefficients: ## Estimate Std.
Error t value Pr(>|t|) ## (Intercept) 14.63557 0.18455 79.31 <2e-16 *** ## log_km_driven -0.39109 0.01837 -21.29 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.5953 on 1059 degrees of freedom ## Multiple R-squared: 0.2997, Adjusted R-squared: 0.299 ## F-statistic: 453.2 on 1 and 1059 DF, p-value: < 2.2e-16 ``` --- # Inference Using the SLR Model in R - If we assume iid errors that are Normally distributed with the same variance, we can conduct inference! - Can use `anova()` to get Analysis of Variance information ```r anova(slr_fit) ``` ``` ## Analysis of Variance Table ## ## Response: log_selling_price ## Df Sum Sq Mean Sq F value Pr(>F) ## log_km_driven 1 160.58 160.583 453.19 < 2.2e-16 *** ## Residuals 1059 375.24 0.354 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- # Inference Using the SLR Model in R - If we assume iid errors that are Normally distributed with the same variance, we can conduct inference! - Residual diagnostics can be found via `plot()` on the fitted model <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-12-1.png" width="350px" style="display: block; margin: auto;" /> --- # Inference Using the SLR Model in R - If we assume iid errors that are Normally distributed with the same variance, we can conduct inference! - Residual diagnostics can be found via `plot()` on the fitted model <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-13-1.png" width="350px" style="display: block; margin: auto;" /> --- # Prediction Using the SLR Model in R - Can use the line for prediction with `predict()`! + Another generic function in R ```r predict ``` ``` ## function (object, ...) ## UseMethod("predict") ## <bytecode: 0x0000000015bef250> ## <environment: namespace:stats> ``` ```r predict.lm ``` ``` ## function (object, newdata, se.fit = FALSE, scale = NULL, df = Inf, ## interval = c("none", "confidence", "prediction"), level = 0.95, ## type = c("response", "terms"), terms = NULL, na.action = na.pass, ## pred.var = res.var/weights, weights = 1, ...) 
## { ## tt <- terms(object) ## if (!inherits(object, "lm")) ## warning("calling predict.lm(<fake-lm-object>) ...") ## if (missing(newdata) || is.null(newdata)) { ## mm <- X <- model.matrix(object) ## mmDone <- TRUE ## offset <- object$offset ## } ## else { ## Terms <- delete.response(tt) ## m <- model.frame(Terms, newdata, na.action = na.action, ## xlev = object$xlevels) ## if (!is.null(cl <- attr(Terms, "dataClasses"))) ## .checkMFClasses(cl, m) ## X <- model.matrix(Terms, m, contrasts.arg = object$contrasts) ## offset <- rep(0, nrow(X)) ## if (!is.null(off.num <- attr(tt, "offset"))) ## for (i in off.num) offset <- offset + eval(attr(tt, ## "variables")[[i + 1]], newdata) ## if (!is.null(object$call$offset)) ## offset <- offset + eval(object$call$offset, newdata) ## mmDone <- FALSE ## } ## n <- length(object$residuals) ## p <- object$rank ## p1 <- seq_len(p) ## piv <- if (p) ## qr.lm(object)$pivot[p1] ## if (p < ncol(X) && !(missing(newdata) || is.null(newdata))) ## warning("prediction from a rank-deficient fit may be misleading") ## beta <- object$coefficients ## predictor <- drop(X[, piv, drop = FALSE] %*% beta[piv]) ## if (!is.null(offset)) ## predictor <- predictor + offset ## interval <- match.arg(interval) ## if (interval == "prediction") { ## if (missing(newdata)) ## warning("predictions on current data refer to _future_ responses\n") ## if (missing(newdata) && missing(weights)) { ## w <- weights.default(object) ## if (!is.null(w)) { ## weights <- w ## warning("assuming prediction variance inversely proportional to weights used for fitting\n") ## } ## } ## if (!missing(newdata) && missing(weights) && !is.null(object$weights) && ## missing(pred.var)) ## warning("Assuming constant prediction variance even though model fit is weighted\n") ## if (inherits(weights, "formula")) { ## if (length(weights) != 2L) ## stop("'weights' as formula should be one-sided") ## d <- if (missing(newdata) || is.null(newdata)) ## model.frame(object) ## else newdata ## weights <- eval(weights[[2L]], d, environment(weights)) ## } ## } ## type <- match.arg(type) ## if (se.fit || interval != "none") { ## w <- object$weights ## res.var <- if (is.null(scale)) { ## r <- object$residuals ## rss <- sum(if (is.null(w)) r^2 else r^2 * w) ## df <- object$df.residual ## rss/df ## } ## else scale^2 ## if (type != "terms") { ## if (p > 0) { ## XRinv <- if (missing(newdata) && is.null(w)) ## qr.Q(qr.lm(object))[, p1, drop = FALSE] ## else X[, piv] %*% qr.solve(qr.R(qr.lm(object))[p1, ## p1]) ## ip <- drop(XRinv^2 %*% rep(res.var, p)) ## } ## else ip <- rep(0, n) ## } ## } ## if (type == "terms") { ## if (!mmDone) { ## mm <- model.matrix(object) ## mmDone <- TRUE ## } ## aa <- attr(mm, "assign") ## ll <- attr(tt, "term.labels") ## hasintercept <- attr(tt, "intercept") > 0L ## if (hasintercept) ## ll <- c("(Intercept)", ll) ## aaa <- factor(aa, labels = ll) ## asgn <- split(order(aa), aaa) ## if (hasintercept) { ## asgn$"(Intercept)" <- NULL ## avx <- colMeans(mm) ## termsconst <- sum(avx[piv] * beta[piv]) ## } ## nterms <- length(asgn) ## if (nterms > 0) { ## predictor <- matrix(ncol = nterms, nrow = NROW(X)) ## dimnames(predictor) <- list(rownames(X), names(asgn)) ## if (se.fit || interval != "none") { ## ip <- matrix(ncol = nterms, nrow = NROW(X)) ## dimnames(ip) <- list(rownames(X), names(asgn)) ## Rinv <- qr.solve(qr.R(qr.lm(object))[p1, p1]) ## } ## if (hasintercept) ## X <- sweep(X, 2L, avx, check.margin = FALSE) ## unpiv <- rep.int(0L, NCOL(X)) ## unpiv[piv] <- p1 ## for (i in seq.int(1L, nterms, length.out = 
nterms)) { ## iipiv <- asgn[[i]] ## ii <- unpiv[iipiv] ## iipiv[ii == 0L] <- 0L ## predictor[, i] <- if (any(iipiv > 0L)) ## X[, iipiv, drop = FALSE] %*% beta[iipiv] ## else 0 ## if (se.fit || interval != "none") ## ip[, i] <- if (any(iipiv > 0L)) ## as.matrix(X[, iipiv, drop = FALSE] %*% Rinv[ii, ## , drop = FALSE])^2 %*% rep.int(res.var, ## p) ## else 0 ## } ## if (!is.null(terms)) { ## predictor <- predictor[, terms, drop = FALSE] ## if (se.fit) ## ip <- ip[, terms, drop = FALSE] ## } ## } ## else { ## predictor <- ip <- matrix(0, n, 0L) ## } ## attr(predictor, "constant") <- if (hasintercept) ## termsconst ## else 0 ## } ## if (interval != "none") { ## tfrac <- qt((1 - level)/2, df) ## hwid <- tfrac * switch(interval, confidence = sqrt(ip), ## prediction = sqrt(ip + pred.var)) ## if (type != "terms") { ## predictor <- cbind(predictor, predictor + hwid %o% ## c(1, -1)) ## colnames(predictor) <- c("fit", "lwr", "upr") ## } ## else { ## if (!is.null(terms)) ## hwid <- hwid[, terms, drop = FALSE] ## lwr <- predictor + hwid ## upr <- predictor - hwid ## } ## } ## if (se.fit || interval != "none") { ## se <- sqrt(ip) ## if (type == "terms" && !is.null(terms) && !se.fit) ## se <- se[, terms, drop = FALSE] ## } ## if (missing(newdata) && !is.null(na.act <- object$na.action)) { ## predictor <- napredict(na.act, predictor) ## if (se.fit) ## se <- napredict(na.act, se) ## } ## if (type == "terms" && interval != "none") { ## if (missing(newdata) && !is.null(na.act)) { ## lwr <- napredict(na.act, lwr) ## upr <- napredict(na.act, upr) ## } ## list(fit = predictor, se.fit = se, lwr = lwr, upr = upr, ## df = df, residual.scale = sqrt(res.var)) ## } ## else if (se.fit) ## list(fit = predictor, se.fit = se, df = df, residual.scale = sqrt(res.var)) ## else predictor ## } ## <bytecode: 0x0000000034b37768> ## <environment: namespace:stats> ``` --- # Prediction Using the SLR Model in R - Can use the line for prediction with `predict()`! - Should supply fitted object and `newdata` - An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used. ```r predict(slr_fit, newdata = data.frame(log_km_driven = c(log(1000), log(10000), log(100000)))) ``` ``` ## 1 2 3 ## 11.93404 11.03353 10.13302 ``` ```r exp(predict(slr_fit, newdata = data.frame(log_km_driven = c(log(1000), log(10000), log(100000))))) ``` ``` ## 1 2 3 ## 152365.60 61915.64 25160.19 ``` --- # Quantifying How Well the Model Predicts We use a **loss** function to fit the model. We use a **metric** to evaluate the model! 
- Often use the same loss function for fitting and as the metric - For a given **numeric** response value `\(y_i\)` and prediction `\(\hat{y}_i\)` `$$(y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$` - Incorporate all points via `$$\frac{1}{n}\sum_{i=1}^{n} (y_i-\hat{y}_i)^2, \frac{1}{n}\sum_{i=1}^{n} |y_i-\hat{y}_i|$$` --- # Metric Function - For a numeric response, we commonly use squared error loss as our metric to evaluate a prediction `$$L(y_i,\hat{y}_i) = (y_i-\hat{y}_i)^2$$` - Use Root Mean Squared Error as a **metric** across all observations `$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$` --- # Commonly Used Metrics For prediction (numeric response) - Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) - Mean Absolute Error (MAE; also called Mean Absolute Deviation, MAD) `$$L(y_i,\hat{y}_i) = |y_i-\hat{y}_i|$$` - [Huber Loss](https://en.wikipedia.org/wiki/Huber_loss) + Doesn't penalize large errors as heavily as squared error loss --- # Commonly Used Metrics For prediction (numeric response) - Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) - Mean Absolute Error (MAE; also called Mean Absolute Deviation, MAD) `$$L(y_i,\hat{y}_i) = |y_i-\hat{y}_i|$$` - [Huber Loss](https://en.wikipedia.org/wiki/Huber_loss) + Doesn't penalize large errors as heavily as squared error loss For classification (categorical response) - Accuracy - log-loss - AUC - F1 Score --- # Evaluating our SLR Model - We could find our metric for our SLR model using the training data + Called **training error** ```r head(predict(slr_fit)) ``` ``` ## 1 2 3 4 5 6 ## 12.34461 11.25681 10.96222 10.70779 10.74337 10.33280 ``` ```r mean((bike_data$log_selling_price-predict(slr_fit))^2) ``` ``` ## [1] 0.3536708 ``` ```r sqrt(mean((bike_data$log_selling_price-predict(slr_fit))^2)) ``` ``` ## [1] 0.5947023 ``` - Doesn't tell us how well we do on data we haven't seen! --- # Training vs Test Sets Ideally we want our model to predict well for observations **it has yet to see**! - For *multiple* linear regression models, our training MSE will never increase (and typically decreases) as we add more variables to the model... - We'll need an independent **test** set to predict on (more on this shortly!) --- # Big Picture Modeling Supervised Learning methods try to relate predictors to a response variable through a model - Lots of common models - Regression models - Tree based methods - Naive Bayes - k Nearest Neighbors - ... - For a set of predictor values, each will produce some prediction we can call `\(\hat{y}\)` - Evaluate the model via a metric - Will use an independent test set or cross-validation to more accurately judge our model (a rough sketch of this idea follows)
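--- # Sketch: Evaluating the SLR Model on a Test Set - A minimal sketch of the test-set idea, not code from the original slides: the 80/20 split proportion, the seed, and the object names (`train`, `test`, `slr_train`) are illustrative choices, and the chunk reuses `bike_data` and the SLR model from the earlier chunks

```r
# Illustrative split: 80% of rows for training, 20% held out for testing
set.seed(10)                           # arbitrary seed for reproducibility
n <- nrow(bike_data)
train_index <- sample(seq_len(n), size = floor(0.8 * n))
train <- bike_data[train_index, ]
test  <- bike_data[-train_index, ]

# Fit the SLR model on the training set only
slr_train <- lm(log_selling_price ~ log_km_driven, data = train)

# Predict on the held-out test set and compute the metrics there
test_preds <- predict(slr_train, newdata = test)
c(RMSE = sqrt(mean((test$log_selling_price - test_preds)^2)),
  MAE  = mean(abs(test$log_selling_price - test_preds)))
```

- Comparing these test-set values to the training RMSE computed earlier gives a fairer picture of how the model predicts on data it has not seen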