September 26, 2014

\(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\newcommand{\bs}[1]{\boldsymbol{#1}}\)

Review

Linear Model, OLS, and Regularization

Notation

\(\bs{y}=\bs{X\beta}+\bs{\epsilon}\)

where

\(\bs{y}_{n\times 1} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad\) \(\bs{X}_{n\times p} = \begin{bmatrix} \bs{x_1}^T \\ \vdots \\ \bs{x_n}^T \end{bmatrix}\) = \(\begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ \vdots & & \ddots \\ x_{n1} & x_{n2} & \ldots & x_{np}\end{bmatrix}\),

\(\bs{\beta}_{p\times 1}=\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p\end{pmatrix}, \quad\) and \(\quad \bs{\epsilon}_{n\times 1} \sim \left[\bs{0}, \sigma^2 \bs{I}_n\right]\)

OLS

We want to minimize the L2 loss function (squared error loss): \[ \begin{aligned} RSS(\bs{\beta}) &= (\bs{y}-\bs{X\beta})^T(\bs{y}-\bs{X\beta}) \\ &= \sum_{i=1}^n (y_i-\bs{x_i}^T\bs{\beta})^2 \end{aligned} \] Solution: \[ \begin{aligned} \hat{\bs{\beta}}_{OLS} &= \argmin_{\bs{\beta}} RSS(\bs{\beta}) \\ &= \bs{(X^TX)^{-1}X^Ty} \end{aligned} \]
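
A minimal sketch of this closed form on simulated data (the variables, sizes, and seed below are illustrative, not from the original analysis):

    # Simulate a small design, then apply beta_hat = (X'X)^{-1} X'y
    set.seed(1)
    n <- 100; p <- 5
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(2, -1, 0.5, 0, 1)
    y <- drop(X %*% beta) + rnorm(n)

    # solve(A, b) solves A beta_hat = b without explicitly forming the inverse
    beta_hat <- solve(crossprod(X), crossprod(X, y))
    beta_hat

    coef(lm(y ~ X - 1))   # same fit via lm(), for comparison (no intercept column in X)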

Regularization

\[ \begin{aligned} \hat{\bs\beta}(\lambda) &= \argmin_{\bs\beta} \left[ RSS(\bs{\beta}) + \lambda J(\bs{\beta}) \right] \\ &= \argmin_{\bs{\beta}} RSS(\bs{\beta}) \text{ (subject to } J(\bs{\beta}) \le t) \end{aligned} \]

  • \(\lambda \ge 0\) or \(t \ge 0\) is the tuning parameter
  • Standardize variables first!
  • \(\beta_0\) is typically left out of the penalty
    • After standardizing, set \(\hat{\beta}_0=\bar{y}\) and fit the remaining coefficients without an intercept

Assessing Prediction Accuracy

Given a new input, \(\pmb{x}_0\), we assess our prediction \(\hat{f}(\pmb{x}_0)\) by:

  • Expected Prediction Error (EPE)
    • \[ \begin{aligned} EPE(\pmb{x}_0) &= E[(y_0 - \hat{f}(\pmb{x}_0))^2] \\ &= \text{Var}(\epsilon) + \text{Var}(\hat{f}(\pmb{x}_0)) + \text{Bias}(\hat{f}(\pmb{x}_0))^2 \\ &= \text{Var}(\epsilon) + MSE(\hat{f}(\pmb{x}_0)) \end{aligned} \]
    • \(\text{Var}(\epsilon)\): irreducible error variance
    • \(\text{Var}(\hat{f}(\pmb{x}_0))\): sample-to-sample variability of \(\hat{f}(\pmb{x}_0)\)
    • \(\text{Bias}(\hat{f}(\pmb{x}_0))\): average difference between \(\hat{f}(\pmb{x}_0)\) and \(f(\pmb{x}_0)\); a small Monte Carlo check of this decomposition is sketched below
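
A small Monte Carlo sketch of the decomposition at a fixed \(\pmb{x}_0\), assuming a simple simulated linear model (all quantities below are illustrative):

    # Repeatedly redraw training data, refit, and predict at a fixed x0, then
    # compare Var(epsilon) + Var(f_hat) + Bias(f_hat)^2 against the true f(x0).
    set.seed(2)
    p <- 3; n <- 50
    beta <- c(1, 2, -1)
    x0 <- rep(0.5, p)
    f_x0 <- sum(x0 * beta)         # true f(x0)
    sigma2 <- 1                    # irreducible error variance

    preds <- replicate(2000, {
      X <- matrix(rnorm(n * p), n, p)
      y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
      sum(x0 * coef(lm(y ~ X - 1)))          # f_hat(x0) from this training sample
    })

    var_fhat  <- var(preds)
    bias_fhat <- mean(preds) - f_x0
    sigma2 + var_fhat + bias_fhat^2          # estimated EPE(x0)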

Bias/Variance Trade-off


Source: Hastie, Tibshirani, & Friedman (2009)

Estimating Prediction Error

  • Randomly split data into training and test sets
    • Test set has \(m\) observations
  1. Estimate \(\hat{f}(\bs{x}) = \bs{x^T\hat{\beta}}\) using training data
  2. Estimate MSE using test data \[ \widehat{MSE}(\hat{f}) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{f}(x_i))^2 \]
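
A sketch of this split-and-evaluate procedure (simulated data; the sizes and seed are arbitrary):

    # Randomly hold out m observations as a test set, fit on the rest,
    # and estimate MSE on the held-out set.
    set.seed(3)
    n <- 200; p <- 10; m <- 50
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(rep(1, 3), rep(0, p - 3))
    y <- drop(X %*% beta) + rnorm(n)

    test <- sample(n, m)
    fit <- lm(y[-test] ~ X[-test, ] - 1)      # step 1: fit on the training data

    y_hat <- drop(X[test, ] %*% coef(fit))    # step 2: predict on the test data
    mean((y[test] - y_hat)^2)                 # estimated MSE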

Common Penalties \(J(\cdot)\)

Penalties

  • Ridge: \(\quad J_R(\bs{\beta}) = \left|\left|\bs{\beta}\right|\right|_2^2 = \bs{\beta}^T\bs{\beta} = \sum_{j=1}^p \beta_j^2\)

  • LASSO: \(\quad J_L(\bs{\beta}) = \left|\left|\bs{\beta}\right|\right|_1 = \sum_{j=1}^p \left| \beta_j \right|\)

Ridge vs. Lasso


Source: Hastie, Tibshirani, & Friedman (2009)

Penalties

  • \(\bs L_q\): \(\quad J(\bs{\beta}) = \sum_{j=1}^p \left|\beta_j\right|^q\)


Source: Hastie, Tibshirani, & Friedman (2009)

Penalties

  • Elastic Net (Zou and Hastie, 2005): \[ J_{EN}(\bs{\beta}) = \alpha J_{L}(\bs{\beta}) + (1-\alpha)J_{R}(\bs{\beta}) = \sum_{j=1}^p \left\{\alpha\left|\beta_j\right| + (1-\alpha)\beta_j^2\right\} \]


Penalty Overview

  • Ridge: L2 penalty
  • LASSO: L1 penalty
  • Elastic Net: Weighted average of L1 and L2 penalties
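
These penalties transcribe directly into R; a tiny sketch with an arbitrary coefficient vector:

    # Direct transcriptions of J_R, J_L, and J_EN from the formulas above
    ridge_pen <- function(beta) sum(beta^2)        # L2
    lasso_pen <- function(beta) sum(abs(beta))     # L1
    enet_pen  <- function(beta, alpha) alpha * lasso_pen(beta) + (1 - alpha) * ridge_pen(beta)

    b <- c(2, -1, 0, 0.5)
    c(ridge = ridge_pen(b), lasso = lasso_pen(b), enet = enet_pen(b, alpha = 0.5))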

Ridge

Minimize \[RSS(\bs{\beta}, \lambda)=(\bs{y-X\beta})^T(\bs{y-X\beta}) + \lambda\bs{\beta^T\beta}\]

Solution: \[\hat{\bs\beta}_{RIDGE} = (\bs{X^TX}+\lambda\bs{I})^{-1}\bs{X}^T\bs{y}\]
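
A minimal sketch of this closed form (assuming a standardized \(\bs{X}\) and centered \(\bs{y}\), so no intercept column; the helper name is made up here):

    # beta_hat_ridge = (X'X + lambda I)^{-1} X'y
    ridge_beta <- function(X, y, lambda) {
      p <- ncol(X)
      solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
    }

    # e.g. ridge_beta(scale(X), y - mean(y), lambda = 1)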

Ridge

Pros:

  • \((\bs{X^TX}+\lambda\bs{I})\) is nonsingular for \(\lambda > 0\)
  • Can be used on \(p > n\) problems
  • Attenuates strange coefficients due to multicollinearity
  • Closed form estimate of variance
    • \(Var(\bs{\hat{\beta}}) = \sigma^2 \bs{WX^TXW}\) where \(\bs{W}=(\bs{X^TX}+\lambda\bs{I})^{-1}\)
  • There always exists \(\lambda\) such that \(MSE(\bs{\hat{\beta}_{Ridge}}) < MSE(\bs{\hat{\beta}_{OLS}})\)
  • Empirically performs very well at prediction

Ridge

Cons:

  • No variable selection (nothing set to zero)
  • Low interpretability
  • Performs less well when all of the true coefficients are moderately large

LASSO

Minimize \[RSS(\bs{\beta})=(\bs{y-X\beta})^T(\bs{y-X\beta})\] subject to \[\sum_{j=1}^p \left| \beta_j \right| \le t\]

Solution:

  • Unlike ridge regression, there is no closed-form solution
  • Efficient fitting algorithms (e.g., LARS, coordinate descent) make the fit no more costly than ridge; see the sketch below
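
One widely used implementation is coordinate descent in the glmnet package; a sketch on small simulated data (all names, sizes, and the seed below are illustrative):

    library(glmnet)

    set.seed(4)
    n <- 100; p <- 20
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% c(rep(2, 5), rep(0, p - 5))) + rnorm(n)

    fit <- glmnet(X, y, alpha = 1)       # whole solution path (alpha = 1 is the L1 penalty)
    cv  <- cv.glmnet(X, y, alpha = 1)    # 10-fold CV over the same path

    plot(fit, xvar = "lambda")           # solution path
    coef(cv, s = "lambda.min")           # sparse coefficients at the CV-chosen lambda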

LASSO

Pros:

  • Does variable selection
    • Easier to interpret than Ridge
  • Can be used on \(p > n\) problems
  • Performs well when true model is sparse (most coefficients are zero)

LASSO

Cons:

  • For \(p>n\), LASSO selects at most \(n\) predictors before the model is saturated
  • For a correlated group of predictors, LASSO often selects only one and sets the rest to zero

Elastic Net

Minimize \[RSS(\bs{\beta})=(\bs{y-X\beta})^T(\bs{y-X\beta})\] subject to \[\sum_{j=1}^p \left\{(1-\alpha)\beta_j^2 + \alpha\left| \beta_j \right|\right\}\le t\]

Solution:

  • No closed-form solution
  • Fitting is no more computationally demanding than the LASSO; see the sketch below
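
The same glmnet interface covers the elastic net through its alpha argument (glmnet places a factor of 1/2 on the squared term, a minor reparameterization of the penalty above); a sketch, reusing X and y from the LASSO sketch:

    # 0 < alpha < 1 mixes the L1 and L2 penalties; lambda is still chosen by CV
    # (X and y as simulated in the LASSO sketch above)
    cv_enet <- cv.glmnet(X, y, alpha = 0.5)
    coef(cv_enet, s = "lambda.min")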

Elastic Net

Pros:

  • Performs well where LASSO does well
  • Alleviates issues where LASSO performs poorly
    • Can select groups of correlated variables
    • When \(p > n\), can select more than \(n\) predictors
  • Does shrinkage and variable selection

Elastic Net

Cons:

  • Empirically only does well when "close" to Ridge or LASSO

Examples

Example 1

True model: \(\bs{y}=\bs{X\beta} + \bs{\epsilon}\)

where:

  • \(\bs{\beta}=(\underbrace{1,...,1}_{15}, \underbrace{0,...,0}_{4985})^T\)
  • \(p=5000 > n=1000\)
  • Uncorrelated predictors:
    • \(\bs{X}_i \overset{\text{iid}}{\sim} N(\bs{0}, \bs I)\)
  • \(\bs{\epsilon} \overset{\text{iid}}{\sim} N(\bs{0},\bs I)\)
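
A sketch of how this setup can be generated and compared across \(\alpha\) values; the seed, \(\lambda\) grid, and test-set construction used for the slides are not given, so the table on the next slide reports the original results, not this sketch:

    # Sparse truth: 15 nonzero coefficients out of p = 5000, with n = 1000
    library(glmnet)

    set.seed(10)
    n <- 1000; p <- 5000
    beta <- c(rep(1, 15), rep(0, p - 15))

    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% beta) + rnorm(n)
    X_test <- matrix(rnorm(n * p), n, p)
    y_test <- drop(X_test %*% beta) + rnorm(n)

    # Test-set MSE across a grid of elastic-net mixing values (slow; shown for structure)
    sapply(c(0, 0.2, 0.4, 0.6, 0.8, 1), function(a) {
      cv <- cv.glmnet(X, y, alpha = a)
      mean((y_test - drop(predict(cv, newx = X_test, s = "lambda.min")))^2)
    })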

Example 1

Solution Path | Training data MSE (10-fold CV)

Example 1 - MSE on test set

\(\alpha\) MSE
\(\alpha=0\) (Ridge) 16.2645
\(\alpha=0.2\) 1.7551
\(\alpha=0.4\) 1.4761
\(\alpha=0.6\) 1.4428
\(\alpha=0.8\) 1.3376
\(\alpha=1\) (LASSO) 1.3276

Example 2

True model: \(\bs{y}=\bs{X\beta} + \bs{\epsilon}\)

where:

  • \(\bs{\beta}=(\underbrace{1,...,1}_{1500}, \underbrace{0,...,0}_{3500})^T\)
  • \(p=5000 > n=1000\)
  • Uncorrelated predictors:
    • \(\bs{X}_i \overset{\text{iid}}{\sim} N(\bs{0}, \bs I)\)
  • \(\bs{\epsilon} \overset{\text{iid}}{\sim} N(\bs{0},\bs I)\)

Example 2

Solution Path | Training data MSE (10-fold CV)

Example 2 - MSE on test set

\(\alpha\) MSE
\(\alpha=0\) (Ridge) 1490.7833
\(\alpha=0.2\) 1637.0238
\(\alpha=0.4\) 1641.1414
\(\alpha=0.6\) 1633.5475
\(\alpha=0.8\) 1641.1414
\(\alpha=1\) (LASSO) 1641.1414

Example 3

True model: \(\bs{y}=\bs{X\beta} + \bs{\epsilon}\)

where

  • \(\bs{\beta}=(10,10,5,5,\underbrace{1,...,1}_{10},\underbrace{0,...,0}_{36})^T\)
  • \(p=50\)
  • \(n=100\)
  • Correlated predictors: \(Cov(\bs{X})_{ij} = (0.7)^{|i-j|}\)
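
The correlated design is the only new ingredient here; a sketch using MASS::mvrnorm (n, p, \(\bs\beta\), and the covariance are as stated on this slide; the seed is assumed):

    # AR(1)-type correlation: Cov(X)_ij = 0.7^|i-j|
    library(MASS)

    set.seed(30)
    n <- 100; p <- 50
    Sigma <- 0.7^abs(outer(1:p, 1:p, "-"))
    X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

    beta <- c(10, 10, 5, 5, rep(1, 10), rep(0, 36))
    y <- drop(X %*% beta) + rnorm(n)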

Example 3

Solution Path | Training data MSE (10-fold CV)

Example 3 - MSE on test set

\(\alpha\) MSE
\(\alpha=0\) (Ridge) 6.7541
\(\alpha=0.2\) 1.2461
\(\alpha=0.4\) 1.4948
\(\alpha=0.6\) 1.4219
\(\alpha=0.8\) 1.4985
\(\alpha=1\) (LASSO) 1.7152

Summary

  • We can do \(p>n\) problems!
  • No penalty is universally superior
  • Ridge and LASSO fit into the Elastic Net framework
    • Need cross-validation to choose \(\lambda\) and \(\alpha\) (see the sketch below)
  • Several versions of LASSO exist
    • Grouped LASSO, Adaptive LASSO
    • To be discussed next week?
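
cv.glmnet cross-validates \(\lambda\) for a fixed \(\alpha\) only; a common approach, sketched below with an arbitrary \(\alpha\) grid and simulated data, adds an outer loop over \(\alpha\) that reuses the same fold assignment:

    # Choose alpha and lambda jointly: keep the folds fixed across alphas
    # so the CV errors are comparable.
    library(glmnet)

    set.seed(5)
    n <- 100; p <- 20
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% c(rep(2, 5), rep(0, p - 5))) + rnorm(n)

    alphas <- seq(0, 1, by = 0.2)
    foldid <- sample(rep(1:10, length.out = n))

    cv_fits <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a, foldid = foldid))
    cv_err  <- sapply(cv_fits, function(f) min(f$cvm))

    best <- which.min(cv_err)
    alphas[best]                      # chosen alpha
    cv_fits[[best]]$lambda.min        # chosen lambda at that alpha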