class: center, middle, inverse, title-slide .title[ # Modeling Concepts: Inference vs Prediction ] .author[ ### Justin Post ] --- layout: false # What do we want to be able to do? Data Science! - Read in raw data and manipulate it - Combine data sources - Summarize data to glean insights - Apply common analysis methods - Communicate Effectively --- # Modeling Ideas What is a (statistical) model? - A mathematical representation of some phenomenon on which you've observed data - Form of the model can vary greatly! --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r bike_data <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv") bike_data <- bike_data |> mutate(log_selling_price = log(selling_price), log_km_driven = log(km_driven)) |> select(log_km_driven, log_selling_price, everything()) bike_data ``` ``` ## # A tibble: 1,061 x 9 ## log_km_driven log_selling_price name selling_price year seller_type owner ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> ## 1 5.86 12.1 Royal E~ 175000 2019 Individual 1st ~ ## 2 8.64 10.7 Honda D~ 45000 2017 Individual 1st ~ ## 3 9.39 11.9 Royal E~ 150000 2018 Individual 1st ~ ## 4 10.0 11.1 Yamaha ~ 65000 2015 Individual 1st ~ ## 5 9.95 9.90 Yamaha ~ 20000 2011 Individual 2nd ~ ## # i 1,056 more rows ## # i 2 more variables: km_driven <dbl>, ex_showroom_price <dbl> ``` --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + geom_smooth(method = "lm") ``` ``` ## `geom_smooth()` using formula = 'y ~ x' ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-2-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + stat_smooth(method = "lm", formula = y ~ x + I(x^2)) ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-3-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + stat_smooth(method = "lm", formula = y ~ x + I(x^2) + I(x^3)) ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-4-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling (log) Selling Price - First a visual on motorcycle sales data ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + geom_smooth() ``` ``` ## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")' ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-5-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling (log) Selling Price ```r ggplot(bike_data, aes(x = log_km_driven, y = log_selling_price)) + geom_point() + geom_line(data = preds, aes(x = log_km_driven, y = Tree_Predictions), color = "Blue", linewidth = 2) ``` <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-7-1.png" width="350px" style="display: block; margin: auto;" /> --- # Modeling Ideas What is a (statistical) model?
- A mathematical representation of some phenomenon on which you've observed data - Form of the model can vary greatly! **Statistical learning** - Inference, prediction/classification, and pattern finding - Supervised learning - a variable (or variables) represents an **output** or **response** of interest -- + May model the response and - Make **inference** on the model parameters - **predict** a value or **classify** an observation Our Goal: Understand what it means to be a good predictive model (not to make inference) --- # Training a Model - Once a class of models is chosen, we must define some criteria to **fit** (or train) the model **Simple Linear Regression (SLR) Model** `$$E(Y_i|x_i) = \beta_0+\beta_1x_i$$` --- # Training a Model - Once a class of models is chosen, we must define some criteria to **fit** (or train) the model **Simple Linear Regression (SLR) Model** `$$E(Y_i|x_i) = \beta_0+\beta_1x_i$$` - **Loss function** - Criteria used to fit or train a model - For a given **numeric** response value `\(y_i\)` and prediction `\(\hat{y}_i\)` `$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$` --- # Training a Model - Once a class of models is chosen, we must define some criteria to **fit** (or train) the model **Simple Linear Regression (SLR) Model** `$$E(Y_i|x_i) = \beta_0+\beta_1x_i$$` - **Loss function** - Criteria used to fit or train a model - For a given **numeric** response value `\(y_i\)` and prediction `\(\hat{y}_i\)` `$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$` - We try to minimize the total loss over all the observations used for training `$$\sum_{i=1}^{n} (y_i-\hat{y}_i)^2~~~~~~~~~~~~~~~~~~~~ \sum_{i=1}^{n} |y_i-\hat{y}_i|$$` --- # Training (Fitting) the SLR Model - Often use squared error loss (least squares regression) - Nice closed-form solutions for our estimates exist! `$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$` `$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$` --- # Training (Fitting) the SLR Model - Often use squared error loss (least squares regression) - Nice closed-form solutions for our estimates exist! `$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$` `$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$` ```r y <- bike_data$log_selling_price x <- bike_data$log_km_driven b1_hat <- sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2) b0_hat <- mean(y)-mean(x)*b1_hat c(round(b0_hat, 4), round(b1_hat, 4)) ``` ``` ## [1] 14.6356 -0.3911 ``` - Now we can find a prediction! Denoted as `\(\hat{y}\)` --- # Training (Fitting) the SLR Model in R - Use the `lm()` function to fit the model in R - Utilizes `formula` notation: `y ~ x` -> `response ~ model terms` ```r slr_fit <- lm(log_selling_price ~ log_km_driven, data = bike_data) slr_fit ``` ``` ## ## Call: ## lm(formula = log_selling_price ~ log_km_driven, data = bike_data) ## ## Coefficients: ## (Intercept) log_km_driven ## 14.6356 -0.3911 ``` --- # Inference Using the SLR Model in R - If we assume iid errors that are Normally distributed with the same variance, we can conduct inference! - Confidence intervals and hypothesis tests around the slope parameter - Use `summary()` (generic function!) on our fitted model ```r summary(slr_fit) ``` ``` ## ## Call: ## lm(formula = log_selling_price ~ log_km_driven, data = bike_data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.9271 -0.3822 -0.0337 0.3794 2.5656 ## ## Coefficients: ## Estimate Std.
Error t value Pr(>|t|) ## (Intercept) 14.63557 0.18455 79.31 <2e-16 *** ## log_km_driven -0.39109 0.01837 -21.29 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.5953 on 1059 degrees of freedom ## Multiple R-squared: 0.2997, Adjusted R-squared: 0.299 ## F-statistic: 453.2 on 1 and 1059 DF, p-value: < 2.2e-16 ``` --- # Inference Using the SLR Model in R - If we assume iid errors that are Normally distributed with the same variance, we can conduct inference! - Can use `anova()` to get Analysis of Variance information ```r anova(slr_fit) ``` ``` ## Analysis of Variance Table ## ## Response: log_selling_price ## Df Sum Sq Mean Sq F value Pr(>F) ## log_km_driven 1 160.58 160.583 453.19 < 2.2e-16 *** ## Residuals 1059 375.24 0.354 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- # Inference Using the SLR Model in R - If we assume iid errors that are Normally distributed with the same variance, we can conduct inference! - Residual diagnostics can be found via `plot()` on the fitted model <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-12-1.png" width="350px" style="display: block; margin: auto;" /> --- # Inference Using the SLR Model in R - If we assume iid errors that are Normally distributed with the same variance, we can conduct inference! - Residual diagnostics can be found via `plot()` on the fitted model <img src="data:image/png;base64,#44-Modeling_Concepts_Inference_Prediction_files/figure-html/unnamed-chunk-13-1.png" width="350px" style="display: block; margin: auto;" /> --- # Prediction Using the SLR Model in R - Can use the line for prediction with `predict()`! + Another generic function in R ```r predict ``` ``` ## function (object, ...) ## UseMethod("predict") ## <bytecode: 0x0000000015bef250> ## <environment: namespace:stats> ``` ```r predict.lm ``` ``` ## function (object, newdata, se.fit = FALSE, scale = NULL, df = Inf, ## interval = c("none", "confidence", "prediction"), level = 0.95, ## type = c("response", "terms"), terms = NULL, na.action = na.pass, ## pred.var = res.var/weights, weights = 1, ...) 
## { ## tt <- terms(object) ## if (!inherits(object, "lm")) ## warning("calling predict.lm(<fake-lm-object>) ...") ## if (missing(newdata) || is.null(newdata)) { ## mm <- X <- model.matrix(object) ## mmDone <- TRUE ## offset <- object$offset ## } ## else { ## Terms <- delete.response(tt) ## m <- model.frame(Terms, newdata, na.action = na.action, ## xlev = object$xlevels) ## if (!is.null(cl <- attr(Terms, "dataClasses"))) ## .checkMFClasses(cl, m) ## X <- model.matrix(Terms, m, contrasts.arg = object$contrasts) ## offset <- rep(0, nrow(X)) ## if (!is.null(off.num <- attr(tt, "offset"))) ## for (i in off.num) offset <- offset + eval(attr(tt, ## "variables")[[i + 1]], newdata) ## if (!is.null(object$call$offset)) ## offset <- offset + eval(object$call$offset, newdata) ## mmDone <- FALSE ## } ## n <- length(object$residuals) ## p <- object$rank ## p1 <- seq_len(p) ## piv <- if (p) ## qr.lm(object)$pivot[p1] ## if (p < ncol(X) && !(missing(newdata) || is.null(newdata))) ## warning("prediction from a rank-deficient fit may be misleading") ## beta <- object$coefficients ## predictor <- drop(X[, piv, drop = FALSE] %*% beta[piv]) ## if (!is.null(offset)) ## predictor <- predictor + offset ## interval <- match.arg(interval) ## if (interval == "prediction") { ## if (missing(newdata)) ## warning("predictions on current data refer to _future_ responses\n") ## if (missing(newdata) && missing(weights)) { ## w <- weights.default(object) ## if (!is.null(w)) { ## weights <- w ## warning("assuming prediction variance inversely proportional to weights used for fitting\n") ## } ## } ## if (!missing(newdata) && missing(weights) && !is.null(object$weights) && ## missing(pred.var)) ## warning("Assuming constant prediction variance even though model fit is weighted\n") ## if (inherits(weights, "formula")) { ## if (length(weights) != 2L) ## stop("'weights' as formula should be one-sided") ## d <- if (missing(newdata) || is.null(newdata)) ## model.frame(object) ## else newdata ## weights <- eval(weights[[2L]], d, environment(weights)) ## } ## } ## type <- match.arg(type) ## if (se.fit || interval != "none") { ## w <- object$weights ## res.var <- if (is.null(scale)) { ## r <- object$residuals ## rss <- sum(if (is.null(w)) r^2 else r^2 * w) ## df <- object$df.residual ## rss/df ## } ## else scale^2 ## if (type != "terms") { ## if (p > 0) { ## XRinv <- if (missing(newdata) && is.null(w)) ## qr.Q(qr.lm(object))[, p1, drop = FALSE] ## else X[, piv] %*% qr.solve(qr.R(qr.lm(object))[p1, ## p1]) ## ip <- drop(XRinv^2 %*% rep(res.var, p)) ## } ## else ip <- rep(0, n) ## } ## } ## if (type == "terms") { ## if (!mmDone) { ## mm <- model.matrix(object) ## mmDone <- TRUE ## } ## aa <- attr(mm, "assign") ## ll <- attr(tt, "term.labels") ## hasintercept <- attr(tt, "intercept") > 0L ## if (hasintercept) ## ll <- c("(Intercept)", ll) ## aaa <- factor(aa, labels = ll) ## asgn <- split(order(aa), aaa) ## if (hasintercept) { ## asgn$"(Intercept)" <- NULL ## avx <- colMeans(mm) ## termsconst <- sum(avx[piv] * beta[piv]) ## } ## nterms <- length(asgn) ## if (nterms > 0) { ## predictor <- matrix(ncol = nterms, nrow = NROW(X)) ## dimnames(predictor) <- list(rownames(X), names(asgn)) ## if (se.fit || interval != "none") { ## ip <- matrix(ncol = nterms, nrow = NROW(X)) ## dimnames(ip) <- list(rownames(X), names(asgn)) ## Rinv <- qr.solve(qr.R(qr.lm(object))[p1, p1]) ## } ## if (hasintercept) ## X <- sweep(X, 2L, avx, check.margin = FALSE) ## unpiv <- rep.int(0L, NCOL(X)) ## unpiv[piv] <- p1 ## for (i in seq.int(1L, nterms, length.out = 
nterms)) { ## iipiv <- asgn[[i]] ## ii <- unpiv[iipiv] ## iipiv[ii == 0L] <- 0L ## predictor[, i] <- if (any(iipiv > 0L)) ## X[, iipiv, drop = FALSE] %*% beta[iipiv] ## else 0 ## if (se.fit || interval != "none") ## ip[, i] <- if (any(iipiv > 0L)) ## as.matrix(X[, iipiv, drop = FALSE] %*% Rinv[ii, ## , drop = FALSE])^2 %*% rep.int(res.var, ## p) ## else 0 ## } ## if (!is.null(terms)) { ## predictor <- predictor[, terms, drop = FALSE] ## if (se.fit) ## ip <- ip[, terms, drop = FALSE] ## } ## } ## else { ## predictor <- ip <- matrix(0, n, 0L) ## } ## attr(predictor, "constant") <- if (hasintercept) ## termsconst ## else 0 ## } ## if (interval != "none") { ## tfrac <- qt((1 - level)/2, df) ## hwid <- tfrac * switch(interval, confidence = sqrt(ip), ## prediction = sqrt(ip + pred.var)) ## if (type != "terms") { ## predictor <- cbind(predictor, predictor + hwid %o% ## c(1, -1)) ## colnames(predictor) <- c("fit", "lwr", "upr") ## } ## else { ## if (!is.null(terms)) ## hwid <- hwid[, terms, drop = FALSE] ## lwr <- predictor + hwid ## upr <- predictor - hwid ## } ## } ## if (se.fit || interval != "none") { ## se <- sqrt(ip) ## if (type == "terms" && !is.null(terms) && !se.fit) ## se <- se[, terms, drop = FALSE] ## } ## if (missing(newdata) && !is.null(na.act <- object$na.action)) { ## predictor <- napredict(na.act, predictor) ## if (se.fit) ## se <- napredict(na.act, se) ## } ## if (type == "terms" && interval != "none") { ## if (missing(newdata) && !is.null(na.act)) { ## lwr <- napredict(na.act, lwr) ## upr <- napredict(na.act, upr) ## } ## list(fit = predictor, se.fit = se, lwr = lwr, upr = upr, ## df = df, residual.scale = sqrt(res.var)) ## } ## else if (se.fit) ## list(fit = predictor, se.fit = se, df = df, residual.scale = sqrt(res.var)) ## else predictor ## } ## <bytecode: 0x0000000034b37768> ## <environment: namespace:stats> ``` --- # Prediction Using the SLR Model in R - Can use the line for prediction with `predict()`! - Should supply fitted object and `newdata` - An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used. ```r predict(slr_fit, newdata = data.frame(log_km_driven = c(log(1000), log(10000), log(100000)))) ``` ``` ## 1 2 3 ## 11.93404 11.03353 10.13302 ``` ```r exp(predict(slr_fit, newdata = data.frame(log_km_driven = c(log(1000), log(10000), log(100000))))) ``` ``` ## 1 2 3 ## 152365.60 61915.64 25160.19 ``` --- # Quantifying How Well the Model Predicts We use a **loss** function to fit the model. We use a **metric** to evaluate the model! 
- Often use the same loss function for fitting and as the metric - For a given **numeric** response value `\(y_i\)` and prediction `\(\hat{y}_i\)` `$$(y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$` - Incorporate all points via `$$\frac{1}{n}\sum_{i=1}^{n} (y_i-\hat{y}_i)^2, \frac{1}{n}\sum_{i=1}^{n} |y_i-\hat{y}_i|$$` --- # Metric Function - For a numeric response, we commonly use squared error loss as our metric to evaluate a prediction `$$L(y_i,\hat{y}_i) = (y_i-\hat{y}_i)^2$$` - Use Root Mean Squared Error as a **metric** across all observations `$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$` --- # Commonly Used Metrics For prediction (numeric response) - Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) - Mean Absolute Error (MAE; also called Mean Absolute Deviation, MAD) `$$L(y_i,\hat{y}_i) = |y_i-\hat{y}_i|$$` - [Huber Loss](https://en.wikipedia.org/wiki/Huber_loss) + Doesn't penalize large errors as heavily as squared error loss --- # Commonly Used Metrics For prediction (numeric response) - Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) - Mean Absolute Error (MAE; also called Mean Absolute Deviation, MAD) `$$L(y_i,\hat{y}_i) = |y_i-\hat{y}_i|$$` - [Huber Loss](https://en.wikipedia.org/wiki/Huber_loss) + Doesn't penalize large errors as heavily as squared error loss For classification (categorical response) - Accuracy - log-loss - AUC - F1 Score --- # Evaluating our SLR Model - We could find our metric for our SLR model using the training data + Called **training error** ```r head(predict(slr_fit)) ``` ``` ## 1 2 3 4 5 6 ## 12.34461 11.25681 10.96222 10.70779 10.74337 10.33280 ``` ```r mean((bike_data$log_selling_price-predict(slr_fit))^2) ``` ``` ## [1] 0.3536708 ``` ```r sqrt(mean((bike_data$log_selling_price-predict(slr_fit))^2)) ``` ``` ## [1] 0.5947023 ``` - Doesn't tell us how well we do on data we haven't seen! --- # Training vs Test Sets Ideally we want our model to predict well for observations **it has yet to see**! - For *multiple* linear regression models, our training MSE will never increase (and typically decreases) as we add more variables to the model... - We'll need an independent **test** set to predict on (more on this shortly!) --- # Big Picture Modeling Supervised Learning methods try to relate predictors to a response variable through a model - Lots of common models - Regression models - Tree based methods - Naive Bayes - k Nearest Neighbors - ... - For a set of predictor values, each will produce some prediction we can call `\(\hat{y}\)` - Evaluate the model via a metric - Will use an independent test set or cross-validation to more accurately judge our model (a rough sketch of this idea follows)
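--- # Sketch: Evaluating the SLR Model on a Test Set - A minimal sketch of the test-set idea, not code from the original slides: the 80/20 split proportion, the seed, and the object names (`train`, `test`, `slr_train`) are illustrative choices, and the chunk reuses `bike_data` and the SLR model from the earlier chunks

```r
# Illustrative split: 80% of rows for training, 20% held out for testing
set.seed(10)                           # arbitrary seed for reproducibility
n <- nrow(bike_data)
train_index <- sample(seq_len(n), size = floor(0.8 * n))
train <- bike_data[train_index, ]
test  <- bike_data[-train_index, ]

# Fit the SLR model on the training set only
slr_train <- lm(log_selling_price ~ log_km_driven, data = train)

# Predict on the held-out test set and compute the metrics there
test_preds <- predict(slr_train, newdata = test)
c(RMSE = sqrt(mean((test$log_selling_price - test_preds)^2)),
  MAE  = mean(abs(test$log_selling_price - test_preds)))
```

- Comparing these test-set values to the training RMSE computed earlier gives a fairer picture of how the model predicts on data it has not seen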