layout: false
class: title-slide-section-red, middle

# Multiple Linear Regression
Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div>

---

# Recap

Given a model, we **fit** the model using data

- Must determine how well the model predicts on **new** data
- Create a test set
- Judge effectiveness using a **metric** on predictions made from the model

---

# Regression Modeling Ideas

For a set of observations `\(y_1,...,y_n\)`, we may want to predict a future value

- Often use the sample mean to do so, `\(\bar{y}\)` (an estimate of `\(E(Y)\)`)

---

# Regression Modeling Ideas

For a set of observations `\(y_1,...,y_n\)`, we may want to predict a future value

- Often use the sample mean to do so, `\(\bar{y}\)` (an estimate of `\(E(Y)\)`)

Now consider having pairs `\((x_1,y_1), (x_2,y_2),...,(x_n,y_n)\)`

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-2-1.svg" width="370px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

Often use a linear (in the parameters) model for prediction

`$$\mbox{SLR model: }E(Y|x) = \beta_0+\beta_1x$$`

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-3-1.svg" width="370px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

Can include more terms on the right-hand side (RHS)

`$$\mbox{Multiple Linear Regression Model: }E(Y|x) = \beta_0+\beta_1x+\beta_2x^2$$`

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-4-1.svg" width="370px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

Can include more terms on the right-hand side (RHS)

`$$\mbox{Multiple Linear Regression Model: }E(Y|x) = \beta_0+\beta_1x+\beta_2x^2+\beta_3x^3$$`

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-5-1.svg" width="370px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

- We model the mean response for a given `\(x\)` value
- With multiple predictors or `\(x\)`'s, the idea is the same!

<img src="img/true_mlr.png" width="500px" style="display: block; margin: auto;" />
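---

# Regression Modeling Ideas

To make "the same idea" concrete, here is a minimal sketch of fitting a two-predictor model with `sklearn` (the data and coefficients below are simulated and made up, not the example pictured on these slides):

```python
#Sketch: fit E(Y|x1,x2) = beta0 + beta1*x1 + beta2*x2 on simulated data
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
x1 = rng.uniform(0, 1, size = n)
x2 = rng.uniform(0, 1, size = n)
#true mean surface is 2 + 3*x1 - 1*x2 (made-up values)
y = 2 + 3 * x1 - 1 * x2 + rng.normal(scale = 0.25, size = n)

X = np.column_stack((x1, x2)) #one column per predictor
mlr = LinearRegression().fit(X, y)
print(mlr.intercept_, mlr.coef_) #estimates should be near 2 and [3, -1]
```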
<img src="data:image/png;base64,#img/true_mlr.png" width="500px" style="display: block; margin: auto;" /> --- # Regression Modeling Ideas - Including a **main effect** for two predictors fits the best plane through the data `$$\mbox{Multiple Linear Regression Model: } E(Y|x_1,x_2) = \beta_0+\beta_1x_1+\beta_2x_2$$` <img src="data:image/png;base64,#img/best_mlr_plane.png" width="400px" style="display: block; margin: auto;" /> --- # Regression Modeling Ideas - Including **main effects** and an **interaction effect** allows for a more flexible surface `$$\mbox{Multiple Linear Regression Model: } E(Y|x_1,x_2) = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2$$` <img src="data:image/png;base64,#img/best_mlr_saddle.png" width="500px" style="display: block; margin: auto;" /> --- # Regression Modeling Ideas - Including **main effects** and an **interaction effect** allows for a more flexible surface - Interaction effects allow for the **effect** of one variable to depend on the value of another - Model fit previously gives + `\(\hat{y}\)` = (19.005) + (-0.791)x1 + (5.631)x2 + (-12.918)x1x2 --- # Regression Modeling Ideas - Including **main effects** and an **interaction effect** allows for a more flexible surface - Interaction effects allow for the **effect** of one variable to depend on the value of another - Model fit previously gives + `\(\hat{y}\)` = (19.005) + (-0.791)x1 + (5.631)x2 + (-12.918)x1x2 + For `\(x_1\)` = 0, the slope on `\(x_2\)` is (5.631)`+0*` (-12.918) = 5.631 --- # Regression Modeling Ideas - Including **main effects** and an **interaction effect** allows for a more flexible surface - Interaction effects allow for the **effect** of one variable to depend on the value of another - Model fit previously gives + `\(\hat{y}\)` = (19.005) + (-0.791)x1 + (5.631)x2 + (-12.918)x1x2 + For `\(x_1\)` = 0, the slope on `\(x_2\)` is (5.631)`+0*` (-12.918) = 5.631 + For `\(x_1\)` = 0.5, the slope on `\(x_2\)` is (5.631)`+0.5*`(-12.918) = -0.828 --- # Regression Modeling Ideas - Including **main effects** and an **interaction effect** allows for a more flexible surface - Interaction effects allow for the **effect** of one variable to depend on the value of another - Model fit previously gives + `\(\hat{y}\)` = (19.005) + (-0.791)x1 + (5.631)x2 + (-12.918)x1x2 + For `\(x_1\)` = 0, the slope on `\(x_2\)` is (5.631)`+0*` (-12.918) = 5.631 + For `\(x_1\)` = 0.5, the slope on `\(x_2\)` is (5.631)`+0.5*`(-12.918) = -0.828 + For `\(x_1\)` = 1, the slope on `\(x_2\)` is (5.631)`+1*`(-12.918) = -7.286 - Similarly, the slope on `\(x_1\)` depends on `\(x_2\)`! 
---

# Regression Modeling Ideas

- Including **main effects** and an **interaction effect** allows for a more flexible surface
- Can also include higher-order polynomial terms

`$$\mbox{Multiple Linear Regression Model: } E(Y|x_1,x_2) = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2+\beta_4x_1^2$$`

<img src="img/best_mlr_comp.png" width="500px" style="display: block; margin: auto;" />

---

# Regression Modeling Ideas

Can also include categorical variables through **dummy** or **indicator** variables

- Categorical variable with values of `\(Success\)` and `\(Failure\)`
- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

---
layout: false

# Regression Modeling Ideas

Can also include categorical variables through **dummy** or **indicator** variables

- Categorical variable with values of `\(Success\)` and `\(Failure\)`
- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-14-1.svg" width="370px" style="display: block; margin: auto auto auto 0;" />

---

# Regression Modeling Ideas

- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

`$$\mbox{Separate Intercept Model: }E(Y|x) = \beta_0+\beta_1x_1 + \beta_2x_2$$`

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-15-1.svg" width="370px" style="display: block; margin: auto auto auto 0;" />

---

# Regression Modeling Ideas

- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

`$$\mbox{Separate Intercept and Slopes Model: }E(Y|x) = \beta_0+\beta_1x_1 + \beta_2x_2+\beta_3x_1x_2$$`

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-16-1.svg" width="370px" style="display: block; margin: auto auto auto 0;" />

---

# Regression Modeling Ideas

- Define `\(x_2 = 0\)` if variable is `\(Failure\)`
- Define `\(x_2 = 1\)` if variable is `\(Success\)`

`$$\mbox{Separate Quadratics Model: }E(Y|x) = \beta_0+ \beta_1x_2+\beta_2x_1+\beta_3x_1x_2+ \beta_4x_1^2 +\beta_5x_1^2x_2$$`

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-17-1.svg" width="370px" style="display: block; margin: auto auto auto 0;" />

---

# Regression Modeling Ideas

If your categorical variable has `\(k > 2\)` categories, define `\(k-1\)` dummy variables

- Categorical variable with values of "Assistant", "Contractor", "Executive"
- Define `\(x_2 = 0\)` if variable is `\(Executive\)` or `\(Contractor\)`
- Define `\(x_2 = 1\)` if variable is `\(Assistant\)`
- Define `\(x_3 = 0\)` if variable is `\(Contractor\)` or `\(Assistant\)`
- Define `\(x_3 = 1\)` if variable is `\(Executive\)`

---

# Regression Modeling Ideas

If your categorical variable has `\(k > 2\)` categories, define `\(k-1\)` dummy variables

- Categorical variable with values of "Assistant", "Contractor", "Executive"
- Define `\(x_2 = 0\)` if variable is `\(Executive\)` or `\(Contractor\)`
- Define `\(x_2 = 1\)` if variable is `\(Assistant\)`
- Define `\(x_3 = 0\)` if variable is `\(Contractor\)` or `\(Assistant\)`
- Define `\(x_3 = 1\)` if variable is `\(Executive\)`

`$$\mbox{Separate Intercepts Model: }E(Y|x) = \beta_0+ \beta_1x_1+\beta_2x_2+\beta_3x_3$$`

What is implied if `\(x_2\)` and `\(x_3\)` are both zero?
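---

# Regression Modeling Ideas

We don't usually build these columns by hand. As a sketch, `pandas` can create the `\(k-1\)` dummy variables for a small, hypothetical job-title variable (note `drop_first` uses the first level alphabetically, here "Assistant", as the baseline, rather than "Contractor" as on the previous slide):

```python
#Sketch: k-1 dummy columns for a 3-level categorical variable
import pandas as pd

jobs = pd.Series(["Assistant", "Contractor", "Executive", "Assistant"])
#drop_first = True keeps k-1 = 2 columns; the dropped level is the baseline
print(pd.get_dummies(jobs, drop_first = True))
```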
---

# Fitting an MLR Model

Big Idea: Trying to find the line, plane, saddle, etc. **of best fit** through the points

- How do we do the fit?
    + Usually minimize the sum of squared residuals (errors)

---

# Fitting an MLR Model

Big Idea: Trying to find the line, plane, saddle, etc. **of best fit** through the points

- How do we do the fit?
    + Usually minimize the sum of squared residuals (errors)
- Residual = observed - predicted, or `\(y_i-\hat{y}_i\)`

`$$\min\limits_{\hat{\beta}'s}\sum_{i=1}^{n}(y_i-(\hat\beta_0+\hat\beta_1x_{1i}+...+\hat\beta_px_{pi}))^2$$`

---

# Fitting an MLR Model

Big Idea: Trying to find the line, plane, saddle, etc. **of best fit** through the points

- How do we do the fit?
    + Usually minimize the sum of squared residuals (errors)
- Residual = observed - predicted, or `\(y_i-\hat{y}_i\)`

`$$\min\limits_{\hat{\beta}'s}\sum_{i=1}^{n}(y_i-(\hat\beta_0+\hat\beta_1x_{1i}+...+\hat\beta_px_{pi}))^2$$`

- Closed-form results exist for easy calculation via software!

---

# Fitting a Linear Regression Model in Python

- Use the `sklearn` package
- Create a `LinearRegression()` instance
- Use the `.fit()` method to fit the model

---

# Fitting a Linear Regression Model in Python

- Use the `sklearn` package
- Create a `LinearRegression()` instance
- Use the `.fit()` method to fit the model

```python
import pandas as pd
import numpy as np
bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
#create response and new predictor
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])
```

---

# Fitting a Linear Regression Model in Python

- Use the `sklearn` package
- Create a `LinearRegression()` instance
- Use the `.fit()` method to fit the model

```python
import pandas as pd
import numpy as np
bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
#create response and new predictor
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])
```

```python
from sklearn import linear_model
slr_fit = linear_model.LinearRegression() #Create a reg object
slr_fit.fit(bike_data['log_km_driven'].values.reshape(-1,1),
            bike_data['log_selling_price'].values)
```

```python
print(slr_fit.intercept_, slr_fit.coef_)
```

```
## 14.6355682846293 [-0.39108654]
```

---

# Fitting a Linear Regression Model in Python

```python
import matplotlib.pyplot as plt
preds = slr_fit.predict(bike_data['log_km_driven'].values.reshape(-1,1))
plt.scatter(bike_data['log_km_driven'].values.reshape(-1,1),
            bike_data['log_selling_price'].values)
plt.plot(bike_data['log_km_driven'].values.reshape(-1,1), preds, 'red')
plt.title("SLR Fit")
plt.xlabel('log_km_driven')
plt.ylabel('log_selling_price')
```

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-23-1.svg" width="340px" style="display: block; margin: auto;" />
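---

# Fitting a Linear Regression Model in Python

As a sanity check on the closed-form results mentioned earlier, this sketch reproduces the `sklearn` fit by solving the normal equations directly with `numpy` (continuing with the `bike_data` and `slr_fit` from the previous slides):

```python
#Sketch: least squares closed form, solving (X'X) beta = X'y
#the column of ones corresponds to the intercept
X = np.column_stack((np.ones(len(bike_data)), bike_data['log_km_driven']))
y = bike_data['log_selling_price'].values
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat) #should match slr_fit.intercept_ and slr_fit.coef_ above
```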
---

# Add Dummy Variables for a Categorical Predictor

- `get_dummies()` function from `pandas` is useful

```python
pd.get_dummies(data = bike_data['owner'])
```

```
##       1st owner  2nd owner  3rd owner  4th owner
## 0             1          0          0          0
## 1             1          0          0          0
## 2             1          0          0          0
## 3             1          0          0          0
## 4             0          1          0          0
## ...         ...        ...        ...        ...
## 1056          1          0          0          0
## 1057          1          0          0          0
## 1058          0          1          0          0
## 1059          1          0          0          0
## 1060          1          0          0          0
##
## [1061 rows x 4 columns]
```

---

# Add Dummy Variables for a Categorical Predictor

- `get_dummies()` function from `pandas` is useful

```python
pd.get_dummies(data = bike_data['owner'])
```

```
##       1st owner  2nd owner  3rd owner  4th owner
## 0             1          0          0          0
## 1             1          0          0          0
## 2             1          0          0          0
## 3             1          0          0          0
## 4             0          1          0          0
## ...         ...        ...        ...        ...
## 1056          1          0          0          0
## 1057          1          0          0          0
## 1058          0          1          0          0
## 1059          1          0          0          0
## 1060          1          0          0          0
##
## [1061 rows x 4 columns]
```

- If we use just the first column created, we'll have a binary variable: one owner vs more than one owner

---

# Fit a Model with Dummy Variables

- Add the binary variable and fit the model

```python
bike_data['one_owner'] = pd.get_dummies(data = bike_data['owner'])['1st owner']
mlr_fit = linear_model.LinearRegression() #Create a reg object
mlr_fit.fit(bike_data[['log_km_driven','one_owner']],
            bike_data['log_selling_price'].values)
```

```python
print(mlr_fit.intercept_, mlr_fit.coef_)
```

```
## 14.570541645783868 [-0.38893985 0.05002779]
```

---

# Fit a Model with Dummy Variables

Plot these fits on the same graph

```python
preds1 = mlr_fit.predict(bike_data.loc[bike_data['one_owner'] == 0, ['log_km_driven', 'one_owner']])
preds2 = mlr_fit.predict(bike_data.loc[bike_data['one_owner'] == 1, ['log_km_driven', 'one_owner']])
plt.scatter(bike_data['log_km_driven'].values.reshape(-1,1),
            bike_data['log_selling_price'].values,
            c = bike_data['one_owner'].values)
#one_owner == 0 rows are bikes with more than one owner
plt.plot(bike_data.loc[bike_data['one_owner'] == 0, ['log_km_driven']], preds1, c = 'orange', label = "Multiple owners")
plt.plot(bike_data.loc[bike_data['one_owner'] == 1, ['log_km_driven']], preds2, c = 'purple', label = "One owner")
plt.title("Different Intercept SLR Fit")
plt.xlabel('log_km_driven')
plt.ylabel('log_selling_price')
plt.legend()
```

---

# Fit a Model with Dummy Variables

Plot these fits on the same graph

<img src="26-Multiple_Linear_Regression_files/figure-html/unnamed-chunk-30-3.svg" width="450" style="display: block; margin: auto;" />

---

# Choosing an MLR Model

- Given a bunch of predictors, there are tons of models you could fit! How to choose?
- Many variable selection methods exist...
- If you care mainly about prediction, just use *cross-validation* or a training/test split!
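---

# Choosing an MLR Model

A sketch of the cross-validation route with `sklearn`, assuming the `bike_data` columns created earlier (`cross_val_score` reports the *negated* RMSE, so we flip the sign):

```python
#Sketch: 5-fold cross-validation RMSE for the two-predictor model
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    linear_model.LinearRegression(),
    bike_data[['log_km_driven', 'one_owner']],
    bike_data['log_selling_price'],
    cv = 5,
    scoring = 'neg_root_mean_squared_error')
print(-cv_scores.mean()) #average CV RMSE; lower is better
```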
---

# Compare Multiple Models

- We've seen how to split our data into a training and test set

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
  bike_data[["year", "log_km_driven", "one_owner"]],
  bike_data["log_selling_price"],
  test_size=0.20,
  random_state=42)
```

---

# Compare Multiple Models

- We've seen how to split our data into a training and test set

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
  bike_data[["year", "log_km_driven", "one_owner"]],
  bike_data["log_selling_price"],
  test_size=0.20,
  random_state=42)
```

- Fit competing models on the training set

```python
mlr_fit = linear_model.LinearRegression().fit(X_train[['log_km_driven','one_owner']], y_train)
slr_fit = linear_model.LinearRegression().fit(X_train['log_km_driven'].values.reshape(-1,1), y_train)
cat_fit = linear_model.LinearRegression().fit(X_train['one_owner'].values.reshape(-1,1), y_train)
```

---

# Compare Multiple Models

- Look at training RMSE for comparison

```python
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_train, mlr_fit.predict(X_train[["log_km_driven", "one_owner"]])))
```

```
## 0.5944388369922229
```

```python
np.sqrt(mean_squared_error(y_train, slr_fit.predict(X_train["log_km_driven"].values.reshape(-1,1))))
```

```
## 0.594953681655801
```

```python
np.sqrt(mean_squared_error(y_train, cat_fit.predict(X_train["one_owner"].values.reshape(-1,1))))
```

```
## 0.7014857593074377
```

---

# Compare Multiple Models

- What we care about is test set RMSE though!

```python
np.sqrt(mean_squared_error(y_test, mlr_fit.predict(X_test[["log_km_driven", "one_owner"]])))
```

```
## 0.5954962522276137
```

```python
np.sqrt(mean_squared_error(y_test, slr_fit.predict(X_test["log_km_driven"].values.reshape(-1,1))))
```

```
## 0.594318016111905
```

```python
np.sqrt(mean_squared_error(y_test, cat_fit.predict(X_test["one_owner"].values.reshape(-1,1))))
```

```
## 0.7319099074576736
```

---

# Recap

- Multiple linear regression is a common model for a numeric response
- Generally fit by minimizing the sum of squared residuals (errors)
    + Could instead minimize the sum of absolute deviations or another metric
- Can include polynomial terms, interaction terms, and categorical variables through dummy (indicator) variables
- A good metric for comparing models is the RMSE on a test set