layout: false
class: title-slide-section-red, middle

# Fitting and Evaluating Simple Linear Regression Models

Justin Post

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Modeling Ideas

What is a (statistical) model?

- A mathematical representation of some phenomenon on which you've observed data

- The form of the model can vary greatly!

---

# Modeling Ideas

What is a (statistical) model?

- A mathematical representation of some phenomenon on which you've observed data

- The form of the model can vary greatly!

**Simple Linear Regression Model**

`$$\mbox{response = intercept + slope*predictor + Error}$$`

`$$Y_i = \beta_0+\beta_1x_i+E_i$$`

- We may make assumptions about how the errors are observed

---

# Simple Linear Regression Model

- First a visual

```python
import pandas as pd
import numpy as np
import seaborn as sns

bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])
```

---

# Simple Linear Regression Model

- First a visual

```python
sns.regplot(x = bike_data["year"], y = bike_data["log_selling_price"])
```

<img src="data:image/png;base64,#23-Fitting_Evaluating_SLR_Models_files/figure-html/unnamed-chunk-3-1.svg" width="400px" style="display: block; margin: auto;" />

---

# Statistical Learning

**Statistical learning** - Inference, prediction/classification, and pattern finding

- Supervised learning - a variable (or variables) represents an **output** or **response** of interest

--

  + May model the response and
      - Make **inference** on the model parameters
      - **Predict** a value or **classify** an observation

Goal: Understand what it means to be a good predictive model

---

# Simple Linear Regression Model

A basic model for relating a numeric predictor to a numeric response

`$$\mbox{response = intercept + slope*predictor + Error}$$`

`$$Y_i = \beta_0+\beta_1x_i+E_i$$`

---
layout: false

# Simple Linear Regression Model

A basic model for relating a numeric predictor to a numeric response

`$$\mbox{response = intercept + slope*predictor + Error}$$`

`$$Y_i = \beta_0+\beta_1x_i+E_i$$`

Consider a data set on motorcycle sale prices

```python
import pandas as pd
import numpy as np

bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
print(bike_data.columns)
```

```
## Index(['name', 'selling_price', 'year', 'seller_type', 'owner', 'km_driven',
##        'ex_showroom_price'],
##       dtype='object')
```

```python
bike_data.head()
#re-create the log-scale variables used on later slides (the re-read above dropped them)
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])
```

```
##                                   name  ...  ex_showroom_price
## 0            Royal Enfield Classic 350  ...                NaN
## 1                            Honda Dio  ...                NaN
## 2  Royal Enfield Classic Gunmetal Grey  ...           148114.0
## 3    Yamaha Fazer FI V 2.0 [2016-2018]  ...            89643.0
## 4                Yamaha SZ [2013-2014]  ...                NaN
## 
## [5 rows x 7 columns]
```
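---

# Simple Linear Regression Model

- To make the model form concrete, a small simulation sketch (not the bike data; the intercept `\(\beta_0 = 2\)`, slope `\(\beta_1 = 0.5\)`, and standard normal errors are all arbitrary choices) generates data from `\(Y_i = \beta_0+\beta_1x_i+E_i\)`

```python
rng = np.random.default_rng(42)  #seeded generator for reproducibility
x_sim = rng.uniform(0, 10, size = 50)
#beta0 = 2, beta1 = 0.5, and errors drawn as N(0, 1)
y_sim = 2 + 0.5*x_sim + rng.normal(0, 1, size = 50)
```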
---

# Find a 'Best' Fitting Line

- We define some criteria to **fit** (or train) the model

`$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$`

`$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$`

.left45[
<img src="data:image/png;base64,#23-Fitting_Evaluating_SLR_Models_files/figure-html/unnamed-chunk-5-3.svg" width="350px" style="display: block; margin: auto;" />
]

.right45[
<img src="data:image/png;base64,#23-Fitting_Evaluating_SLR_Models_files/figure-html/unnamed-chunk-6-5.svg" width="350px" style="display: block; margin: auto;" />
]

---

# Training a Model

- We define some criteria to **fit** (or train) the model

- **Loss function** - the criteria used to fit or train a model

- For a given **numeric** response value, `\(y_i\)`, and prediction, `\(\hat{y}_i\)`

`$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$`

---

# Training a Model

- We define some criteria to **fit** (or train) the model

- **Loss function** - the criteria used to fit or train a model

- For a given **numeric** response value, `\(y_i\)`, and prediction, `\(\hat{y}_i\)`

`$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$`

- We try to minimize the total loss over all the observations used for training

`$$\sum_{i=1}^{n} (y_i-\hat{y}_i)^2~~~~~~~~~~~~~~~~~~~~ \sum_{i=1}^{n} |y_i-\hat{y}_i|$$`

---

# Find a 'Best' Fitting Line

- In SLR, we often use squared error loss (least squares regression)

- Nice closed-form solutions for our estimates exist!

`$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$`

`$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$`

---

# Find a 'Best' Fitting Line

- In SLR, we often use squared error loss (least squares regression)

- Nice closed-form solutions for our estimates exist!

`$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$`

`$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$`

```python
y = bike_data['log_selling_price']
x = bike_data['log_km_driven']
b1hat = sum((x-x.mean())*(y-y.mean()))/sum((x-x.mean())**2)
b0hat = y.mean()-x.mean()*b1hat
print(round(b0hat, 4), round(b1hat, 4))
```

```
## 14.6356 -0.3911
```

---

# Find a 'Best' Fitting Line

- In SLR, we often use squared error loss (least squares regression)

- Nice closed-form solutions for our estimates exist!

`$$\hat{\beta}_0 = \bar{y}-\bar{x}\hat{\beta}_1$$`

`$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$`

```python
y = bike_data['log_selling_price']
x = bike_data['log_km_driven']
b1hat = sum((x-x.mean())*(y-y.mean()))/sum((x-x.mean())**2)
b0hat = y.mean()-x.mean()*b1hat
print(round(b0hat, 4), round(b1hat, 4))
```

```
## 14.6356 -0.3911
```

- These give us the values to use for `\(\hat{y}\)` (see the sketch on the next slide)!
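---

# Find a 'Best' Fitting Line

- A quick sketch tying the estimates back to the loss function: form `\(\hat{y}\)` from the `b0hat` and `b1hat` objects above, then total the squared error loss that least squares minimizes

```python
#predictions for each training observation from the fitted line
y_hat = b0hat + b1hat*x
#sum of squared errors - the quantity the least squares estimates minimize
sse = sum((y - y_hat)**2)
print(round(sse, 4))
```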
---

# Simple Linear Regression Model in Python

- Can use [`linear_model` from the `sklearn` module](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to fit the model

- Note the requirements on the shapes of `X` and `y` passed to the `.fit()` method

```python
print(bike_data['log_km_driven'].shape)
```

```
## (1061,)
```

```python
print(bike_data['log_km_driven'].values.reshape(-1,1).shape)
```

```
## (1061, 1)
```

---

# Simple Linear Regression Model in Python

- Can use [`linear_model` from the `sklearn` module](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to fit the model

```python
from sklearn import linear_model

reg = linear_model.LinearRegression() #create a regression object
reg.fit(bike_data['log_km_driven'].values.reshape(-1,1), bike_data['log_selling_price'])
```

```python
print(reg.intercept_, reg.coef_)
```

```
## 14.6355682846293 [-0.39108654]
```

---

# Simple Linear Regression Model

- Can use the fitted line for prediction with the `.predict()` method!

```python
print(reg.intercept_, reg.coef_)
```

```
## 14.6355682846293 [-0.39108654]
```

```python
pred1 = reg.predict(np.array([[10], [12], [14]]))
pred1 #each of these is a 'y-hat' for the given value of x
```

```
## array([10.72470291, 9.94252984, 9.16035677])
```

<img src="data:image/png;base64,#23-Fitting_Evaluating_SLR_Models_files/figure-html/unnamed-chunk-14-7.svg" width="300px" style="display: block; margin: auto;" />

---

# Recenter

Supervised learning methods try to relate predictors to a response variable through a model

- Lots of common models
    - Regression models
    - Tree-based methods
    - Naive Bayes
    - k Nearest Neighbors

- For a set of predictor values, each will produce some prediction we can call `\(\hat{y}\)`

---

# Recenter

Supervised learning methods try to relate predictors to a response variable through a model

- Lots of common models
    - Regression models
    - Tree-based methods
    - Naive Bayes
    - k Nearest Neighbors

- For a set of predictor values, each will produce some prediction we can call `\(\hat{y}\)` (see the sketch on the next slide)

Goal: Understand what it means to be a good predictive model. **How do we evaluate the model?**
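---

# Recenter

- As a sketch of that common interface, a k nearest neighbors model fit to the same data produces its own `\(\hat{y}\)` (this uses `sklearn`'s `KNeighborsRegressor`; `n_neighbors = 10` is an arbitrary choice)

```python
from sklearn.neighbors import KNeighborsRegressor

X = bike_data['log_km_driven'].values.reshape(-1,1)
knn = KNeighborsRegressor(n_neighbors = 10)
knn.fit(X, bike_data['log_selling_price'])
#two different models, each returning a y-hat for log_km_driven = 10
print(reg.predict(np.array([[10]])), knn.predict(np.array([[10]])))
```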
---

# Quantifying How Well the Model Predicts

We use a **loss** function to fit the model. We use a **metric** to evaluate the model!

- Often the same function is used as the loss (for fitting) and as the metric (for evaluation)

- For a given **numeric** response value, `\(y_i\)`, and prediction, `\(\hat{y}_i\)`

`$$(y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$`

- Incorporate all points via

`$$\frac{1}{n}\sum_{i=1}^{n} (y_i-\hat{y}_i)^2, \frac{1}{n}\sum_{i=1}^{n} |y_i-\hat{y}_i|$$`

---

# Metric Function

- For a numeric response, we commonly use squared error loss as our metric to evaluate a prediction

`$$L(y_i,\hat{y}_i) = (y_i-\hat{y}_i)^2$$`

- Use Root Mean Squared Error as a **metric** across all observations

`$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$`

---

# Commonly Used Metrics

For prediction (numeric response)

- Mean Squared Error (MSE) or Root Mean Squared Error (RMSE)

- Mean Absolute Error (MAE; also called MAD, mean absolute deviation)

`$$L(y_i,\hat{y}_i) = |y_i-\hat{y}_i|$$`

- [Huber loss](https://en.wikipedia.org/wiki/Huber_loss)
    + Doesn't penalize large mistakes as much as MSE

---

# Commonly Used Metrics

For prediction (numeric response)

- Mean Squared Error (MSE) or Root Mean Squared Error (RMSE)

- Mean Absolute Error (MAE; also called MAD, mean absolute deviation)

`$$L(y_i,\hat{y}_i) = |y_i-\hat{y}_i|$$`

- [Huber loss](https://en.wikipedia.org/wiki/Huber_loss)
    + Doesn't penalize large mistakes as much as MSE

For classification (categorical response)

- Accuracy
- Log-loss
- AUC
- F1 score

---

# Evaluating our SLR Model

- We could compute our metric for the SLR model using the training data...

- Import the [MSE metric](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) from `sklearn.metrics`

```python
import sklearn.metrics as metrics

pred = reg.predict(bike_data["log_km_driven"].values.reshape(-1,1))
print(np.sqrt(metrics.mean_squared_error(bike_data["log_selling_price"], pred)))
```

```
## 0.5947022682215317
```

```python
print(metrics.mean_absolute_error(bike_data["log_selling_price"], pred))
```

```
## 0.46886132002881753
```
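---

# Evaluating our SLR Model

- As a check, the same numbers come straight from the metric definitions; a quick sketch reusing the `pred` object from the previous slide

```python
resid = bike_data["log_selling_price"] - pred
#RMSE and MAE computed directly from their formulas
print(np.sqrt((resid**2).mean()), np.abs(resid).mean())
```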
---

# Useful for Comparison!

- Fit a competing model with `year` as the predictor

```python
reg1 = linear_model.LinearRegression() #create a second regression object
reg1.fit(bike_data['year'].values.reshape(-1,1), bike_data['log_selling_price'])
```
```python
print(reg1.intercept_, reg1.coef_)
```

```
## -201.06317651252067 [0.10516552]
```

- Compare the performance on the training data...

```python
pred1 = reg1.predict(bike_data["year"].values.reshape(-1,1))
print(np.sqrt(metrics.mean_squared_error(bike_data["log_selling_price"], pred)),
      np.sqrt(metrics.mean_squared_error(bike_data["log_selling_price"], pred1)))
```

```
## 0.5947022682215317 0.548275146287923
```

---

# Training vs Test Sets

Ideally we want our model to predict well for observations **it has yet to see**!

- For multiple linear regression models, training MSE never increases (and essentially always decreases) as we add more variables to the model...

- We'll need an independent **test** set to predict on (more on this shortly!)

---

# Recap

- SLR is one type of model for a numeric (continuous) response

- An SLR model is fit using some criteria (usually least squares, i.e., squared error loss)

- Must determine a method to judge the model's effectiveness (a metric)
    + The metric function measures the *loss* for each prediction
    + These are combined over all observations

- To better understand the predictive power of a model, we need to apply our metric to predictions made on a different set of data than that used for training (see the sketch on the final slide)!
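---

# Training vs Test Sets

- A preview sketch of that idea with `sklearn`'s `train_test_split` (the 80/20 split and `random_state = 42` are arbitrary choices): fit on one portion of the data and compute the metric on the held-out portion

```python
from sklearn.model_selection import train_test_split

X = bike_data['log_km_driven'].values.reshape(-1,1)
y = bike_data['log_selling_price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
test_fit = linear_model.LinearRegression().fit(X_train, y_train)
#RMSE computed on data the model never saw during training
print(np.sqrt(metrics.mean_squared_error(y_test, test_fit.predict(X_test))))
```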