layout: false
class: title-slide-section-red, middle

# Fitting and Evaluating Simple Linear Regression Models

Justin Post

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" alt = "" style="height: 60px;"/></div>

---

# Modeling Ideas

What is a (statistical) model?

- A mathematical representation of some phenomenon on which you've observed data

- Form of the model can vary greatly!

---

# Modeling Ideas

What is a (statistical) model?

- A mathematical representation of some phenomenon on which you've observed data

- Form of the model can vary greatly!

**Simple Linear Regression Model**

- `$$\mbox{response = intercept + slope*predictor + Error}$$`

`$$Y_i = \beta_0+\beta_1x_i+E_i$$`

- May make assumptions about how errors are observed

---

# Simple Linear Regression Model

- First a visual

```python
import pandas as pd
import numpy as np
import seaborn as sns

bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])
```

---

# Simple Linear Regression Model

- First a visual

```python
sns.regplot(x = bike_data["year"], y = bike_data["log_selling_price"])
```

<img src="data:image/png;base64,#25-Fitting_Evaluating_SLR_Models_files/figure-html/unnamed-chunk-3-1.svg" alt="A scatterplot between year (x) and log selling price (y) is shown. The points show a roughly linear, positive relationship and a line is overlaid roughly going through the center of the points." width="400px" style="display: block; margin: auto;" />

---

# Statistical Learning

**Statistical learning** - Inference, prediction/classification, and pattern finding

- Supervised learning - a variable (or variables) represents an **output** or **response** of interest

--

+ May model response and
    - Make **inference** on the model parameters
    - **predict** a value or **classify** an observation

Goal: Understand what it means to be a good predictive model

---

# Simple Linear Regression Model

Basic model for relating a numeric predictor to a numeric response

`$$\mbox{response = intercept + slope*predictor + Error}$$`

`$$Y_i = \beta_0+\beta_1x_i+E_i$$`

---
layout: false

# Simple Linear Regression Model

Basic model for relating a numeric predictor to a numeric response

`$$\mbox{response = intercept + slope*predictor + Error}$$`

`$$Y_i = \beta_0+\beta_1x_i+E_i$$`

Consider a data set on motorcycle sale prices

```python
import pandas as pd
bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
print(bike_data.columns)
```

```
## Index(['name', 'selling_price', 'year', 'seller_type', 'owner', 'km_driven',
##        'ex_showroom_price'],
##       dtype='object')
```

```python
bike_data.head()
```

```
##                                   name  ...  ex_showroom_price
## 0            Royal Enfield Classic 350  ...                NaN
## 1                            Honda Dio  ...                NaN
## 2  Royal Enfield Classic Gunmetal Grey  ...           148114.0
## 3    Yamaha Fazer FI V 2.0 [2016-2018]  ...            89643.0
## 4                Yamaha SZ [2013-2014]  ...                NaN
## 
## [5 rows x 7 columns]
```

---

# Find a 'Best' Fitting Line

- We define some criteria to **fit** (or train) the model (plot code sketched on the next slide)

`$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$`

`$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$`

.left45[

<img src="data:image/png;base64,#25-Fitting_Evaluating_SLR_Models_files/figure-html/unnamed-chunk-5-3.svg" alt="A scatterplot between year (x) and log selling price (y) is shown. The points show a roughly linear, positive relationship and a line is overlaid roughly going through the center of the points." width="350px" style="display: block; margin: auto;" />

]

.right45[

<img src="data:image/png;base64,#25-Fitting_Evaluating_SLR_Models_files/figure-html/unnamed-chunk-6-5.svg" alt="A scatterplot between log km driven (x) and log selling price (y) is shown. The points show a roughly linear, negative relationship and a line is overlaid roughly going through the center of the points." width="350px" style="display: block; margin: auto;" />

]
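---

# Find a 'Best' Fitting Line

- A minimal sketch of how plots like the two previous ones can be produced, assuming the `seaborn` setup from earlier (the side-by-side `matplotlib` layout is just one possible choice)

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize = (10, 4))

#Model 1: year as the predictor
sns.regplot(x = bike_data["year"], y = bike_data["log_selling_price"], ax = axes[0])

#Model 2: log km driven as the predictor
sns.regplot(x = bike_data["log_km_driven"], y = bike_data["log_selling_price"], ax = axes[1])
plt.show()
```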
---

# Training a Model

- We define some criteria to **fit** (or train) the model

- **Loss function** - Criteria used to fit or train a model

- For a given **numeric** response value, `\(y_i\)` and prediction, `\(\hat{y}_i\)`

`$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$`

---

# Training a Model

- We define some criteria to **fit** (or train) the model

- **Loss function** - Criteria used to fit or train a model

- For a given **numeric** response value, `\(y_i\)` and prediction, `\(\hat{y}_i\)`

`$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$`

- We try to optimize the loss over all the observations used for training

`$$\sum_{i=1}^{n} (y_i-\hat{y}_i)^2~~~~~~~~~~~~~~~~~~~~ \sum_{i=1}^{n} |y_i-\hat{y}_i|$$`

---

# Find a 'Best' Fitting Line

- In SLR, we often use squared error loss (least squares regression)

- Nice solutions for our estimates exist!

`$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$`

`$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$`

---

# Find a 'Best' Fitting Line

- In SLR, we often use squared error loss (least squares regression)

- Nice solutions for our estimates exist!

`$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$`

`$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$`

```python
y = bike_data['log_selling_price']
x = bike_data['log_km_driven']
b1hat = sum((x-x.mean())*(y-y.mean()))/sum((x-x.mean())**2)
b0hat = y.mean()-x.mean()*b1hat
print(round(b0hat, 4), round(b1hat, 4))
```

```
## 14.6356 -0.3911
```

---

# Find a 'Best' Fitting Line

- In SLR, we often use squared error loss (least squares regression)

- Nice solutions for our estimates exist!

`$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$`

`$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$`

```python
y = bike_data['log_selling_price']
x = bike_data['log_km_driven']
b1hat = sum((x-x.mean())*(y-y.mean()))/sum((x-x.mean())**2)
b0hat = y.mean()-x.mean()*b1hat
print(round(b0hat, 4), round(b1hat, 4))
```

```
## 14.6356 -0.3911
```

- These give us the values to use with `\(\hat{y}\)`!
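---

# Find a 'Best' Fitting Line

- A minimal sketch of using those estimates: plug the `b0hat` and `b1hat` computed above into `\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\)` for a few hypothetical values of log km driven

```python
import numpy as np

#hypothetical predictor values on the log km driven scale
new_x = np.array([10, 12, 14])

#plug into the fitted line: y-hat = b0hat + b1hat*x
yhat = b0hat + b1hat*new_x
print(yhat)
```

- These should agree with the predictions from software at the same `x` values (shown shortly)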
---

# Simple Linear Regression Model in Python

- Can use [`linear_model` from `sklearn` module](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to fit the model

- Note the requirements on the shape of `X` and the shape of `y` to pass to the `.fit()` method

```python
print(bike_data['log_km_driven'].shape)
```

```
## (1061,)
```

```python
print(bike_data['log_km_driven'].values.reshape(-1,1).shape)
```

```
## (1061, 1)
```

---

# Simple Linear Regression Model in Python

- Can use [`linear_model` from `sklearn` module](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to fit the model

```python
from sklearn import linear_model
reg = linear_model.LinearRegression() #Create a reg object
reg.fit(bike_data['log_km_driven'].values.reshape(-1,1), bike_data['log_selling_price'])
```

```python
print(reg.intercept_, reg.coef_)
```

```
## 14.6355682846293 [-0.39108654]
```

---

# Simple Linear Regression Model

- Can use the line for prediction with the `.predict()` method!

```python
print(reg.intercept_, reg.coef_)
```

```
## 14.6355682846293 [-0.39108654]
```

```python
pred1 = reg.predict(np.array([[10], [12], [14]]))
pred1 #each of these represents a 'y-hat' for the given value of x
```

```
## array([10.72470291, 9.94252984, 9.16035677])
```

<img src="data:image/png;base64,#25-Fitting_Evaluating_SLR_Models_files/figure-html/unnamed-chunk-14-7.svg" alt="A scatterplot between log km driven (x) and log selling price (y) is shown. The points show a roughly linear, negative relationship and a line is overlaid roughly going through the center of the points." width="300px" style="display: block; margin: auto;" />

---

# Recenter

Supervised Learning methods try to relate predictors to a response variable through a model

- Lots of common models
    - Regression models
    - Tree based methods
    - Naive Bayes
    - k Nearest Neighbors

- For a set of predictor values, each will produce some prediction we can call `\(\hat{y}\)`

---

# Recenter

Supervised Learning methods try to relate predictors to a response variable through a model

- Lots of common models
    - Regression models
    - Tree based methods
    - Naive Bayes
    - k Nearest Neighbors

- For a set of predictor values, each will produce some prediction we can call `\(\hat{y}\)` (see the sketch on the next slide)

Goal: Understand what it means to be a good predictive model. **How do we evaluate the model?**
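---

# Recenter

- Each of these model types follows the same fit-then-predict workflow; as a minimal sketch, assuming [`KNeighborsRegressor` from `sklearn.neighbors`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html), k Nearest Neighbors produces its own `\(\hat{y}\)` values

```python
from sklearn.neighbors import KNeighborsRegressor

#same predictor and response used for the SLR fit
X = bike_data['log_km_driven'].values.reshape(-1,1)
y = bike_data['log_selling_price']

#fit a kNN regression model (k = 10 is an arbitrary choice here)
knn = KNeighborsRegressor(n_neighbors = 10)
knn.fit(X, y)

#this model produces predictions (y-hats) just like the SLR model
knn_preds = knn.predict(X)
print(knn_preds[:5])
```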
---

# Quantifying How Well the Model Predicts

We use a **loss** function to fit the model. We use a **metric** to evaluate the model!

- Often use the same loss function for fitting and as the metric

- For a given **numeric** response value, `\(y_i\)` and prediction, `\(\hat{y}_i\)`

`$$(y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$`

- Incorporate all points via

`$$\frac{1}{n}\sum_{i=1}^{n} (y_i-\hat{y}_i)^2, \frac{1}{n}\sum_{i=1}^{n} |y_i-\hat{y}_i|$$`

---

# Metric Function

- For a numeric response, we commonly use squared error loss as our metric to evaluate a prediction

`$$L(y_i,\hat{y}_i) = (y_i-\hat{y}_i)^2$$`

- Use Root Mean Square Error as a **metric** across all observations

`$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$`

---

# Commonly Used Metrics

For prediction (numeric response)

- Mean Squared Error (MSE) or Root Mean Squared Error (RMSE)

- Mean Absolute Error (MAE, or MAD for mean absolute deviation)

`$$L(y_i,\hat{y}_i) = |y_i-\hat{y}_i|$$`

- [Huber Loss](https://en.wikipedia.org/wiki/Huber_loss)
    + Doesn't penalize large mistakes as much as MSE

---

# Commonly Used Metrics

For prediction (numeric response)

- Mean Squared Error (MSE) or Root Mean Squared Error (RMSE)

- Mean Absolute Error (MAE, or MAD for mean absolute deviation)

`$$L(y_i,\hat{y}_i) = |y_i-\hat{y}_i|$$`

- [Huber Loss](https://en.wikipedia.org/wiki/Huber_loss)
    + Doesn't penalize large mistakes as much as MSE

For classification (categorical response)

- Accuracy

- log-loss

- AUC

- F1 Score

---

# Evaluating our SLR Model

- We could find our metric for our SLR model using the training data...

- Import our [MSE metric](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) from `sklearn.metrics`

```python
import sklearn.metrics as metrics
pred = reg.predict(bike_data["log_km_driven"].values.reshape(-1,1))
print(np.sqrt(metrics.mean_squared_error(bike_data["log_selling_price"], pred)))
```

```
## 0.5947022682215317
```

```python
print(metrics.mean_absolute_error(bike_data["log_selling_price"], pred))
```

```
## 0.46886132002881753
```

---

# Useful for Comparison!

- Fit a competing model with `year` as the predictor

```python
reg1 = linear_model.LinearRegression() #Create a reg object
reg1.fit(bike_data['year'].values.reshape(-1,1), bike_data['log_selling_price'])
```

```
## LinearRegression()
```

```python
print(reg1.intercept_, reg1.coef_)
```

```
## -201.06317651252058 [0.10516552]
```

- Compare the performance on the training data...

```python
pred1 = reg1.predict(bike_data["year"].values.reshape(-1,1))
print(np.sqrt(metrics.mean_squared_error(bike_data["log_selling_price"], pred)),
      np.sqrt(metrics.mean_squared_error(bike_data["log_selling_price"], pred1)))
```

```
## 0.5947022682215317 0.5482751462879227
```

---

# Training vs Test Sets

Ideally we want our model to predict well for observations **it has yet to see**!

- For multiple linear regression models, our training MSE will always decrease as we add more variables to the model...

- We'll need an independent **test** set to predict on (more on this shortly!)

---

# Recap

- SLR is one type of model for a continuous (numeric) response

- SLR Model is fit using some criteria (usually least squares, squared error loss)

- Must determine a method to judge the model's effectiveness (a metric)
    + Metric function measures *loss* for each prediction
    + Combined over all observations

- To obtain a better understanding of the predictive power of a model, we need to apply our metric to predictions made on a different set of data than that used for training!
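---

# Training vs Test Sets

- To preview that last idea, a minimal sketch assuming [`train_test_split` from `sklearn.model_selection`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html): fit on the training data only, then compute the metric on the held-out test data

```python
from sklearn.model_selection import train_test_split

#hold out 20% of the data as a test set (random_state fixed for reproducibility)
X = bike_data['log_km_driven'].values.reshape(-1,1)
y = bike_data['log_selling_price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

#fit on the training data only
slr = linear_model.LinearRegression()
slr.fit(X_train, y_train)

#evaluate with RMSE on observations the model has yet to see
test_preds = slr.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, test_preds)))
```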