class: center, middle, inverse, title-slide

.title[
# Regularized Regression
]
.author[
### Justin Post
]

---
layout: false
class: title-slide-section-red, middle

# Regularized Regression

Justin Post

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Regularization Methods

- Recall the LASSO model (like least squares but a penalty term added)

    + `\(\alpha\)` (>0) is called a tuning parameter

`$$\min\limits_{\beta's}\sum_{i=1}^{n}(y_i-(\beta_0+\beta_1x_{1i}+...+\beta_px_{pi}))^2 + \alpha\sum_{j=1}^{p}|\beta_j|$$`

- Sets coefficients to 0 as you 'shrink'!

---

# Tuning Parameter

- When choosing the tuning parameter, we are really considering a **family of models**!

- Let's recall an example we did

```python
import pandas as pd
import numpy as np
from sklearn import linear_model
from math import sqrt
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoCV, Lasso

fat_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/fat.csv")
fat_data.columns
```

```
## Index(['Unnamed: 0', 'brozek', 'siri', 'density', 'age', 'weight', 'height',
##        'adipos', 'free', 'neck', 'chest', 'abdom', 'hip', 'thigh', 'knee',
##        'ankle', 'biceps', 'forearm', 'wrist'],
##       dtype='object')
```

---

# Cleaning and Splitting the Data

- Drop some variables we don't want
- Remove any rows with missing values

```python
mod_fat_data = fat_data.drop(["Unnamed: 0", "siri", "density"], axis = 1).dropna()
X_train, X_test, y_train, y_test = train_test_split(
  mod_fat_data.drop("brozek", axis = 1),
  mod_fat_data["brozek"],
  test_size=0.20,
  random_state=41)
```

---

# Scale Data with Regularization

- Usually want to scale the data if using regularization methods

    + Subtract mean, divide by sd
    + Use the training means and sds for test set too!

```python
means = X_train.apply(np.mean, axis = 0)
stds = X_train.apply(np.std, axis = 0)
X_train = X_train.apply(lambda x: (x-np.mean(x))/np.std(x), axis = 0)
X_train.head()
```

```
##           age    weight    height  ...    biceps   forearm     wrist
## 120  0.540354  1.015051  1.153840  ...  0.610561  1.235346  1.389566
## 133  0.384191 -0.767741 -0.849386  ...  0.505550  0.307013 -0.955724
## 207  0.149947  0.600867  0.636879  ...  1.590663  1.430785  0.216921
## 49   0.149947 -1.830214 -0.849386  ... -1.874697 -1.403073 -1.488745
## 25  -1.411679 -0.686705  0.378398  ... -0.789584 -0.230442 -0.529308
## 
## [5 rows x 15 columns]
```

---

# Scale Data with Regularization

- Usually want to scale the data if using regularization methods

    + Subtract mean, divide by sd
    + Use the training means and sds for test set too!

```python
#quick function to standardize based off of a supplied mean and std
def my_std_fun(x, means, stds):
  return (x-means)/stds

#loop through the columns and use the function on each
for x in X_test.columns:
  X_test[x] = my_std_fun(X_test[x], means[x], stds[x])

X_test.head()
```

```
##           age    weight    height  ...    biceps   forearm     wrist
## 107  0.540354  0.897999  1.089220  ...  1.030604  0.453592  0.963150
## 143 -1.724004 -0.668697  0.572259  ... -0.579563 -0.719039  0.003713
## 167 -0.787028  1.672344  0.572259  ...  1.765681  2.163679  1.709378
## 29  -1.255516 -0.632681 -0.267804  ... -0.719577 -0.963337 -0.635912
## 30  -1.021272  0.132659  0.959980  ...  0.120510 -0.474741  0.216921
## 
## [5 rows x 15 columns]
```
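---

# Scale Data with Regularization

- The same standardize-with-training-values idea can be handled by scikit-learn's `StandardScaler`; below is a minimal sketch (not run on these slides), assuming the *unscaled* `X_train` and `X_test` from the split

```python
from sklearn.preprocessing import StandardScaler

#learn the means/sds from the training predictors only
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
#reuse the *training* means/sds on the test predictors
X_test_std = scaler.transform(X_test)
```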
---

# Fit a LASSO Model Using CV

```python
lasso_mod = LassoCV(cv=5, random_state=0) \
            .fit(X_train, y_train)
print(lasso_mod.alpha_)
```

```
## 0.0682784472098843
```

```python
print(np.array(list(zip(X_train.columns, lasso_mod.coef_))))
```

```
## [['age' '0.0352062230265082']
##  ['weight' '8.67668517157885']
##  ['height' '0.18524483241596934']
##  ['adipos' '0.0']
##  ['free' '-8.098358879411053']
##  ['neck' '0.0']
##  ['chest' '0.10123124957260958']
##  ['abdom' '1.5227501560784786']
##  ['hip' '0.0']
##  ['thigh' '0.5368436925906938']
##  ['knee' '0.2410290965097115']
##  ['ankle' '0.09972823212694173']
##  ['biceps' '0.12439075979914412']
##  ['forearm' '0.20197553831501175']
##  ['wrist' '0.0']]
```

---

# LASSO Fits Visual

```
## (-0.09950000000000003, 2.3095000000000003, -9.605158072913387, 10.14868287610137)
```

<img src="data:image/png;base64,#05-Regularized_Regression_files/figure-html/unnamed-chunk-8-1.svg" width="450px" style="display: block; margin: auto;" />

---

# Fit 'Best' Model by CV on All Training Data

```python
lasso_best = Lasso(lasso_mod.alpha_).fit(X_train, y_train)
```

- Predict on the test set (using the standardized test predictors!)

```python
lasso_pred = lasso_best.predict(X_test)
#could compare this to other 'best' models
np.sqrt(mean_squared_error(y_test, lasso_pred))
```

```
## 1.9916053642246037
```

---

# Penalized Regression or Regularized Regression

In linear regression, adding a penalty term to the loss function is called penalized regression or regularized regression.

- `\(L_1\)` penalty shrinks and does variable selection

`$$\min\limits_{\beta's}\sum_{i=1}^{n}(y_i-(\beta_0+\beta_1x_{1i}+...+\beta_px_{pi}))^2 + \alpha\sum_{j=1}^{p}|\beta_j|$$`

---

# Penalized Regression or Regularized Regression

In linear regression, adding a penalty term to the loss function is called penalized regression or regularized regression.

- `\(L_1\)` penalty shrinks and does variable selection

`$$\min\limits_{\beta's}\sum_{i=1}^{n}(y_i-(\beta_0+\beta_1x_{1i}+...+\beta_px_{pi}))^2 + \alpha\sum_{j=1}^{p}|\beta_j|$$`

- `\(L_2\)` penalty shrinks coefficients (works well for multicollinearity)

`$$\min\limits_{\beta's}\sum_{i=1}^{n}(y_i-(\beta_0+\beta_1x_{1i}+...+\beta_px_{pi}))^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$`

---
layout: false

# Penalized Regression or Regularized Regression

In linear regression, adding a penalty term to the loss function is called penalized regression or regularized regression.

- `\(L_1\)` penalty shrinks and does variable selection

`$$\min\limits_{\beta's}\sum_{i=1}^{n}(y_i-(\beta_0+\beta_1x_{1i}+...+\beta_px_{pi}))^2 + \alpha\sum_{j=1}^{p}|\beta_j|$$`

- `\(L_2\)` penalty shrinks coefficients (works well for multicollinearity)

`$$\min\limits_{\beta's}\sum_{i=1}^{n}(y_i-(\beta_0+\beta_1x_{1i}+...+\beta_px_{pi}))^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$`

- Using both the `\(L_1\)` and `\(L_2\)` penalties (the elastic net) combines the two approaches

`$$\min\limits_{\beta's}\sum_{i=1}^{n}(y_i-(\beta_0+\beta_1x_{1i}+...+\beta_px_{pi}))^2 + \alpha\sum_{j=1}^{p}|\beta_j|+ \lambda\sum_{j=1}^{p}\beta_j^2$$`

---

# Penalized Regression or Regularized Regression

- For MLR, these can be done via

    + `sklearn.linear_model.Lasso`
    + `sklearn.linear_model.Ridge`
    + `sklearn.linear_model.ElasticNet`

---

# Penalized Regression or Regularized Regression

- For MLR, these can be done via

    + `sklearn.linear_model.Lasso`
    + `sklearn.linear_model.Ridge`
    + `sklearn.linear_model.ElasticNet`

- `sklearn.linear_model.*CV` to easily use CV! (a quick `RidgeCV` sketch follows on the next slide)

- Tuning parameters for Elastic Net:

`$$\min\limits_{\beta's}\frac{1}{2n}\sum_{i=1}^{n}(y_i-(\beta_0+\beta_1x_{1i}+...+\beta_px_{pi}))^2$$`
`$$+\alpha*L1\_ratio\sum_{j=1}^{p}|\beta_j|+ 0.5*\alpha(1-L1\_ratio)\sum_{j=1}^{p}\beta_j^2$$`
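---

# Aside: Ridge Regression

- `Ridge` is listed above but not demonstrated; here is a minimal sketch (not run on these slides) using `RidgeCV` on the standardized data, with an illustrative grid of penalty values

```python
from sklearn.linear_model import RidgeCV

#candidate penalty values (illustrative grid, not tuned for this data)
alphas = np.logspace(-3, 2, 100)
ridge_mod = RidgeCV(alphas = alphas, cv = 5).fit(X_train, y_train)
print(ridge_mod.alpha_)

#ridge shrinks coefficients toward 0 but does not set them exactly to 0
ridge_pred = ridge_mod.predict(X_test)
np.sqrt(mean_squared_error(y_test, ridge_pred))
```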
---

# Elastic Net

```python
from sklearn.linear_model import ElasticNetCV

regr = ElasticNetCV(cv=5, random_state=0,
                    l1_ratio = [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.96, 0.98, 0.99, 1],
                    n_alphas = 50)
regr.fit(X_train, y_train)
```

```python
print(regr.alpha_)
```

```
## 0.0682784472098843
```

```python
print(regr.l1_ratio_)
```

```
## 1.0
```

---

# Elastic Net

- Refit on full training data with best tuning parameters

```python
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha = regr.alpha_, l1_ratio = regr.l1_ratio_)
en.fit(X_train, y_train)
```

```python
print(np.array(list(zip(X_train.columns, en.coef_))))
```

```
## [['age' '0.0352062230265082']
##  ['weight' '8.67668517157885']
##  ['height' '0.18524483241596934']
##  ['adipos' '0.0']
##  ['free' '-8.098358879411053']
##  ['neck' '0.0']
##  ['chest' '0.10123124957260958']
##  ['abdom' '1.5227501560784786']
##  ['hip' '0.0']
##  ['thigh' '0.5368436925906938']
##  ['knee' '0.2410290965097115']
##  ['ankle' '0.09972823212694173']
##  ['biceps' '0.12439075979914412']
##  ['forearm' '0.20197553831501175']
##  ['wrist' '0.0']]
```

---

# Compare on Test Set

```python
lasso_pred = lasso_best.predict(X_test)
en_pred = en.predict(X_test)
print([np.sqrt(mean_squared_error(y_test, lasso_pred)),
       np.sqrt(mean_squared_error(y_test, en_pred))])
```

```
## [1.9916053642246037, 1.9916053642246037]
```

---

# Regularized Logistic Regression

- Same ideas here!

- `sklearn.linear_model.LogisticRegression` can do all three penalized methods mentioned

    + `penalty` = 'l1', 'l2', 'elasticnet', or 'none'
    + `default='l2'`! (`C` is the inverse regularization strength, 1 by default)
    + For elastic net, `solver = 'saga'` and specify `l1_ratio`

---

# Regularized Logistic Regression

- Same ideas here!

- `sklearn.linear_model.LogisticRegression` can do all three penalized methods mentioned

    + `penalty` = 'l1', 'l2', 'elasticnet', or 'none'
    + `default='l2'`! (`C` is the inverse regularization strength, 1 by default)
    + For elastic net, `solver = 'saga'` and specify `l1_ratio` (a sketch follows on the next slide)

- `sklearn.linear_model.LogisticRegressionCV` for CV!
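---

# Regularized Logistic Regression

- The elastic net option is not demonstrated later, so here is a minimal sketch (not run on these slides); `l1_ratio = 0.5` is only an illustrative value and `y_train2` is the binary response built on the next slide

```python
from sklearn.linear_model import LogisticRegression

#elastic net penalty requires solver = 'saga' plus an l1_ratio
log_reg_en = LogisticRegression(penalty = "elasticnet",
                                solver = "saga",
                                l1_ratio = 0.5,
                                C = 1.0,
                                max_iter = 10000)
log_reg_en.fit(X_train, y_train2)
```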
---

# Quick Example

- Make a binary version of response

```python
y_train2 = y_train < 25
y_test2 = y_test < 25
```

---

# Quick Example

- Make a binary version of response

```python
y_train2 = y_train < 25
y_test2 = y_test < 25
```

- Fit `L2` regularized logistic regression

```python
from sklearn.linear_model import LogisticRegressionCV

log_reg_cv = LogisticRegressionCV(cv = 5,
                                  solver = "newton-cg",
                                  penalty = "l2",
                                  Cs = 250,
                                  scoring = "neg_log_loss",
                                  random_state = 10)
```

```python
log_reg_cv.fit(X_train, y_train2)
```

---

# Results

- Optimal regularization value (smaller means more regularized)

```python
log_reg_cv.C_
```

```
## array([1.29553694])
```

- Fit optimal model

```python
from sklearn.linear_model import LogisticRegression

log_reg_best_cv = LogisticRegression(solver = "newton-cg",
                                     penalty = "l2",
                                     C = log_reg_cv.C_[0],
                                     random_state = 5)
```

```python
log_reg_best_cv.fit(X_train, y_train2)
```

---

# Compare Coefficients

- Compare the non-regularized model with the regularized one:

```python
log_reg_full = LogisticRegression(solver = "newton-cg",
                                  penalty = "none",
                                  random_state = 0)
log_reg_full.fit(X_train, y_train2)
```

---

# Compare Coefficients

- Compare the non-regularized model with the regularized one:

```python
log_reg_full = LogisticRegression(solver = "newton-cg",
                                  penalty = "none",
                                  random_state = 0)
log_reg_full.fit(X_train, y_train2)
```

```python
for i in range(log_reg_full.coef_.shape[1]):
    print(X_train.columns[i], log_reg_full.coef_[:,i], log_reg_best_cv.coef_[:,i])
```

```
## age [-2.85330696] [-0.67147741]
## weight [-122.51657579] [-1.4770971]
## height [-6.48047944] [-0.10251239]
## adipos [-1.44034225] [-0.30237501]
## free [114.5091415] [3.52712708]
## neck [0.23516711] [-0.33668417]
## chest [-9.31213962] [-1.12477439]
## abdom [-14.39030691] [-1.61212997]
## hip [16.30752518] [0.25746399]
## thigh [-3.00002705] [-0.55292862]
## knee [-7.14568787] [-0.32024622]
## ankle [-0.62655066] [-0.35822733]
## biceps [-9.15749311] [-0.02451164]
## forearm [2.70724672] [-0.48905071]
## wrist [5.78592559] [0.45770954]
```

---

# Compare on Test Data

- Which model generalizes better?

```python
cv_proba_preds = log_reg_best_cv.predict_proba(X_test)
full_proba_preds = log_reg_full.predict_proba(X_test)
from sklearn.metrics import log_loss, accuracy_score
log_loss(y_test2, cv_proba_preds)
```

```
## 0.1681322951793145
```

```python
log_loss(y_test2, full_proba_preds)
```

```
## 0.5433586256880958
```

```python
log_loss(y_test2, np.array([[0,1] for _ in range(len(y_test2.values))]))
```

```
## 8.126770916449573
```

---

# Compare on Test Data

- Which model generalizes better?

```python
cv_preds = log_reg_best_cv.predict(X_test)
full_preds = log_reg_full.predict(X_test)
from sklearn.metrics import log_loss, accuracy_score
accuracy_score(y_test2, cv_preds)
```

```
## 0.9411764705882353
```

```python
accuracy_score(y_test2, full_preds)
```

```
## 0.9607843137254902
```

```python
accuracy_score(y_test2, np.array([1 for _ in range(len(y_test2.values))]))
```

```
## 0.7647058823529411
```

---

# Complicated Process

- Process often pretty involved

    + Split data
    + Create dummy variables, interaction terms, standardize data, etc.
    + Fit a model, often with CV
    + Choose best model
    + Predict on test set (using appropriate transformations from training set!)

---

# Complicated Process

- Process often pretty involved

    + Split data
    + Create dummy variables, interaction terms, standardize data, etc.
    + Fit a model, often with CV
    + Choose best model
    + Predict on test set (using appropriate transformations from training set!)

- If we were to fit a LASSO model, a Ridge Regression model, and an Elastic Net model, only the 'fit' part really has to change!

- Future: Put the process into a **pipeline** for ease! (a quick sketch follows on the next slide)
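---

# Pipeline Sketch

- A minimal sketch (not run on these slides) of chaining standardization and a CV'd LASSO fit with `sklearn.pipeline.Pipeline`; the step names are arbitrary and it assumes the *unscaled* `X_train`/`X_test` from the split

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

#chain the preprocessing and the model into one object
pipe = Pipeline([("scale", StandardScaler()),
                 ("lasso", LassoCV(cv = 5, random_state = 0))])

#the scaler's training means/sds are reused automatically at predict time
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)
np.sqrt(mean_squared_error(y_test, pipe_pred))
```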
---

# Recap

- Regularization can improve prediction and do variable selection at the same time

- Implemented for both MLR-type models and logistic regression-type models