September 26, 2014

\(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\newcommand{\bs}[1]{\boldsymbol{#1}}\)

Review

Linear Model, OLS, and Regularization

Notation

\(\bs{y}=\bs{X\beta}+\bs{\epsilon}\)

where

\(\bs{y}_{n\times 1} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad\) \(\bs{X}_{n\times p} = \begin{bmatrix} \bs{x_1}^T \\ \vdots \\ \bs{x_n}^T \end{bmatrix}\) = \(\begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ \vdots & & \ddots \\ x_{n1} & x_{n2} & \ldots & x_{np}\end{bmatrix}\),

\(\bs{\beta}_{p\times 1}=\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p\end{pmatrix}, \quad\) and \(\quad \bs{\epsilon}_{n\times 1} \sim \left[\bs{0}, \sigma^2 \bs{I}_n\right]\)

OLS

We want to minimize the L2 loss function (squared error loss): \[ \begin{aligned} RSS(\bs{\beta}) &= (\bs{y}-\bs{X\beta})^T(\bs{y}-\bs{X\beta}) \\ &= \sum_{i=1}^n (y_i-\bs{x_i}^T\bs{\beta})^2 \end{aligned} \] Solution: \[ \begin{aligned} \hat{\bs{\beta}}_{OLS} &= \argmin_{\bs{\beta}} RSS(\bs{\beta}) \\ &= \bs{(X^TX)^{-1}X^Ty} \end{aligned} \]
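
A minimal sketch of this closed form on simulated data (the variables, sizes, and seed below are illustrative, not from the original analysis):

    # Simulate a small design, then apply beta_hat = (X'X)^{-1} X'y
    set.seed(1)
    n <- 100; p <- 5
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(2, -1, 0.5, 0, 1)
    y <- drop(X %*% beta) + rnorm(n)

    # solve(A, b) solves A beta_hat = b without explicitly forming the inverse
    beta_hat <- solve(crossprod(X), crossprod(X, y))
    beta_hat

    coef(lm(y ~ X - 1))   # same fit via lm(), for comparison (no intercept column in X)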

Regularization

\[ \begin{aligned} \hat{\bs\beta}(\lambda) &= \argmin_{\bs\beta} \left[ RSS(\bs{\beta}) + \lambda J(\bs{\beta}) \right] \\ &= \argmin_{\bs{\beta}} RSS(\bs{\beta}) \text{ (subject to } J(\bs{\beta}) \le t) \end{aligned} \]

  • \(\lambda \ge 0\) or \(t \ge 0\) is the tuning parameter
  • Standardize variables first!
  • \(\beta_0\) is typically left out of the penalty
    • After standardizing, set \(\hat{\beta}_0=\bar{y}\) and fit the remaining coefficients without an intercept

Assessing Prediction Accuracy

Given a new input, \(\pmb{x}_0\), we assess our prediction \(\hat{f}(\pmb{x}_0)\) by:

  • Expected Prediction Error (EPE)
    • \[ \begin{aligned} EPE(\pmb{x}_0) &= E[(y_0 - \hat{f}(\pmb{x}_0))^2] \\ &= \text{Var}(\epsilon) + \text{Var}(\hat{f}(\pmb{x}_0)) + \text{Bias}(\hat{f}(\pmb{x}_0))^2 \\ &= \text{Var}(\epsilon) + MSE(\hat{f}(\pmb{x}_0)) \end{aligned} \]
    • \(\text{Var}(\epsilon)\): irreducible error variance
    • \(\text{Var}(\hat{f}(\pmb{x}_0))\): sample-to-sample variability of \(\hat{f}(\pmb{x}_0)\)
    • \(\text{Bias}(\hat{f}(\pmb{x}_0))\): average difference between \(\hat{f}(\pmb{x}_0)\) and \(f(\pmb{x}_0)\); a small Monte Carlo check of this decomposition is sketched below
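
A small Monte Carlo sketch of the decomposition at a fixed \(\pmb{x}_0\), assuming a simple simulated linear model (all quantities below are illustrative):

    # Repeatedly redraw training data, refit, and predict at a fixed x0, then
    # compare Var(epsilon) + Var(f_hat) + Bias(f_hat)^2 against the true f(x0).
    set.seed(2)
    p <- 3; n <- 50
    beta <- c(1, 2, -1)
    x0 <- rep(0.5, p)
    f_x0 <- sum(x0 * beta)         # true f(x0)
    sigma2 <- 1                    # irreducible error variance

    preds <- replicate(2000, {
      X <- matrix(rnorm(n * p), n, p)
      y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
      sum(x0 * coef(lm(y ~ X - 1)))          # f_hat(x0) from this training sample
    })

    var_fhat  <- var(preds)
    bias_fhat <- mean(preds) - f_x0
    sigma2 + var_fhat + bias_fhat^2          # estimated EPE(x0)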

Bias/Variance Trade-off


Source: Hastie, Tibshirani, & Friedman (2009)

Estimating Prediction Error

  • Randomly split data into training and test sets
    • Test set has \(m\) observations
  1. Estimate \(\hat{f}(\bs{x}) = \bs{x^T\hat{\beta}}\) using training data
  2. Estimate MSE using test data \[ \widehat{MSE}(\hat{f}) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{f}(x_i))^2 \]
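
A sketch of this split-and-evaluate procedure (simulated data; the sizes and seed are arbitrary):

    # Randomly hold out m observations as a test set, fit on the rest,
    # and estimate MSE on the held-out set.
    set.seed(3)
    n <- 200; p <- 10; m <- 50
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(rep(1, 3), rep(0, p - 3))
    y <- drop(X %*% beta) + rnorm(n)

    test <- sample(n, m)
    fit <- lm(y[-test] ~ X[-test, ] - 1)      # step 1: fit on the training data

    y_hat <- drop(X[test, ] %*% coef(fit))    # step 2: predict on the test data
    mean((y[test] - y_hat)^2)                 # estimated MSE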

Common Penalties \(J(\cdot)\)

Penalties

  • Ridge: \(\quad J_R(\bs{\beta}) = \left|\left|\bs{\beta}\right|\right|_2^2 = \bs{\beta}^T\bs{\beta} = \sum_{j=1}^p \beta_j^2\)

  • LASSO: \(\quad J_L(\bs{\beta}) = \left|\left|\bs{\beta}\right|\right|_1 = \sum_{j=1}^p \left| \beta_j \right|\)

Ridge vs. Lasso


Source: Hastie, Tibshirani, & Friedman (2009)

Penalties

  • \(\bs L_q\): \(\quad J(\bs{\beta}) = \sum_{j=1}^p \left|\beta_j\right|^q\)


Source: Hastie, Tibshirani, & Friedman (2009)

Penalties

  • Elastic Net (Zou and Hastie, 2005): \[ J_{EN}(\bs{\beta}) = \alpha J_{L}(\bs{\beta}) + (1-\alpha)J_{R}(\bs{\beta}) = \sum_{j=1}^p \left\{\alpha\left|\beta_j\right| + (1-\alpha)\beta_j^2\right\} \]


Penalty Overview

  • Ridge: L2 penalty
  • LASSO: L1 penalty
  • Elastic Net: Weighted average of L1 and L2 penalties
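
These penalties transcribe directly into R; a tiny sketch with an arbitrary coefficient vector:

    # Direct transcriptions of J_R, J_L, and J_EN from the formulas above
    ridge_pen <- function(beta) sum(beta^2)        # L2
    lasso_pen <- function(beta) sum(abs(beta))     # L1
    enet_pen  <- function(beta, alpha) alpha * lasso_pen(beta) + (1 - alpha) * ridge_pen(beta)

    b <- c(2, -1, 0, 0.5)
    c(ridge = ridge_pen(b), lasso = lasso_pen(b), enet = enet_pen(b, alpha = 0.5))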

Ridge

Minimize \[RSS(\bs{\beta}, \lambda)=(\bs{y-X\beta})^T(\bs{y-X\beta}) + \lambda\bs{\beta^T\beta}\]

Solution: \[\hat{\bs\beta}_{RIDGE} = (\bs{X^TX}+\lambda\bs{I})^{-1}\bs{X}^T\bs{y}\]
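
A minimal sketch of this closed form (assuming a standardized \(\bs{X}\) and centered \(\bs{y}\), so no intercept column; the helper name is made up here):

    # beta_hat_ridge = (X'X + lambda I)^{-1} X'y
    ridge_beta <- function(X, y, lambda) {
      p <- ncol(X)
      solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
    }

    # e.g. ridge_beta(scale(X), y - mean(y), lambda = 1)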

Ridge

Pros:

  • \((\bs{X^TX}+\lambda\bs{I})\) is nonsingular for \(\lambda > 0\)
  • Can be used on \(p > n\) problems
  • Attenuates strange coefficients due to multicollinearity
  • Closed form estimate of variance
    • \(Var(\bs{\hat{\beta}}) = \sigma^2 \bs{WX^TXW}\) where \(\bs{W}=(\bs{X^TX}+\lambda\bs{I})^{-1}\)
  • There always exists \(\lambda\) such that \(MSE(\bs{\hat{\beta}_{Ridge}}) < MSE(\bs{\hat{\beta}_{OLS}})\)
  • Empirically performs very well at prediction

Ridge

Cons:

  • No variable selection (nothing set to zero)
  • Low interpretability
  • Performs less well when all of the true coefficients are moderately large

LASSO

Minimize \[RSS(\bs{\beta})=(\bs{y-X\beta})^T(\bs{y-X\beta})\] subject to \[\sum_{j=1}^p \left| \beta_j \right| \le t\]

Solution:

  • Unlike ridge regression, there is no closed-form solution
  • Efficient fitting algorithms (e.g., LARS, coordinate descent) make the fit no more costly than ridge; see the sketch below
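
One widely used implementation is coordinate descent in the glmnet package; a sketch on small simulated data (all names, sizes, and the seed below are illustrative):

    library(glmnet)

    set.seed(4)
    n <- 100; p <- 20
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% c(rep(2, 5), rep(0, p - 5))) + rnorm(n)

    fit <- glmnet(X, y, alpha = 1)       # whole solution path (alpha = 1 is the L1 penalty)
    cv  <- cv.glmnet(X, y, alpha = 1)    # 10-fold CV over the same path

    plot(fit, xvar = "lambda")           # solution path
    coef(cv, s = "lambda.min")           # sparse coefficients at the CV-chosen lambda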

LASSO

Pros:

  • Does variable selection
    • Easier to interpret than Ridge
  • Can be used on \(p > n\) problems
  • Performs well when true model is sparse (most coefficients are zero)

LASSO

Cons:

  • For \(p>n\), LASSO selects at most \(n\) predictors before the model is saturated
  • For a correlated group of predictors, LASSO often selects only one and sets the rest to zero

Elastic Net

Minimize \[RSS(\bs{\beta})=(\bs{y-X\beta})^T(\bs{y-X\beta})\] subject to \[\sum_{j=1}^p \left\{(1-\alpha)\beta_j^2 + \alpha\left| \beta_j \right|\right\}\le t\]

Solution:

  • No closed-form solution
  • Fitting is no more computationally demanding than the LASSO; see the sketch below
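
The same glmnet interface covers the elastic net through its alpha argument (glmnet places a factor of 1/2 on the squared term, a minor reparameterization of the penalty above); a sketch, reusing X and y from the LASSO sketch:

    # 0 < alpha < 1 mixes the L1 and L2 penalties; lambda is still chosen by CV
    # (X and y as simulated in the LASSO sketch above)
    cv_enet <- cv.glmnet(X, y, alpha = 0.5)
    coef(cv_enet, s = "lambda.min")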

Elastic Net

Pros:

  • Performs well where LASSO does well
  • Alleviates issues where LASSO performs poorly
    • Can select groups of correlated variables
    • When \(p > n\), can select more than \(n\) predictors
  • Does shrinkage and variable selection

Elastic Net

Cons:

  • Empirically only does well when "close" to Ridge or LASSO

Examples

Example 1

True model: \(\bs{y}=\bs{X\beta} + \bs{\epsilon}\)

where:

  • \(\bs{\beta}=(\underbrace{1,...,1}_{15}, \underbrace{0,...,0}_{4985})^T\)
  • \(p=5000 > n=1000\)
  • Uncorrelated predictors:
    • \(\bs{X}_i \overset{\text{iid}}{\sim} N(\bs{0}, \bs I)\)
  • \(\bs{\epsilon} \overset{\text{iid}}{\sim} N(\bs{0},\bs I)\)
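
A sketch of how this setup can be generated and compared across \(\alpha\) values; the seed, \(\lambda\) grid, and test-set construction used for the slides are not given, so the table on the next slide reports the original results, not this sketch:

    # Sparse truth: 15 nonzero coefficients out of p = 5000, with n = 1000
    library(glmnet)

    set.seed(10)
    n <- 1000; p <- 5000
    beta <- c(rep(1, 15), rep(0, p - 15))

    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% beta) + rnorm(n)
    X_test <- matrix(rnorm(n * p), n, p)
    y_test <- drop(X_test %*% beta) + rnorm(n)

    # Test-set MSE across a grid of elastic-net mixing values (slow; shown for structure)
    sapply(c(0, 0.2, 0.4, 0.6, 0.8, 1), function(a) {
      cv <- cv.glmnet(X, y, alpha = a)
      mean((y_test - drop(predict(cv, newx = X_test, s = "lambda.min")))^2)
    })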

Example 1

Solution Path | Training data MSE (10-fold CV)

Example 1 - MSE on test set

\(\alpha\) MSE
\(\alpha=0\) (Ridge) 16.2645
\(\alpha=0.2\) 1.7551
\(\alpha=0.4\) 1.4761
\(\alpha=0.6\) 1.4428
\(\alpha=0.8\) 1.3376
\(\alpha=1\) (LASSO) 1.3276

Example 2

True model: \(\bs{y}=\bs{X\beta} + \bs{\epsilon}\)

where:

  • \(\bs{\beta}=(\underbrace{1,...,1}_{1500}, \underbrace{0,...,0}_{3500})^T\)
  • \(p=5000 > n=1000\)
  • Uncorrelated predictors:
    • \(\bs{X}_i \overset{\text{iid}}{\sim} N(\bs{0}, \bs I)\)
  • \(\bs{\epsilon} \overset{\text{iid}}{\sim} N(\bs{0},\bs I)\)

Example 2

Solution Path | Training data MSE (10-fold CV)

Example 2 - MSE on test set

\(\alpha\) MSE
\(\alpha=0\) (Ridge) 1490.7833
\(\alpha=0.2\) 1637.0238
\(\alpha=0.4\) 1641.1414
\(\alpha=0.6\) 1633.5475
\(\alpha=0.8\) 1641.1414
\(\alpha=1\) (LASSO) 1641.1414

Example 3

True model: \(\bs{y}=\bs{X\beta} + \bs{\epsilon}\)

where

  • \(\bs{\beta}=(10,10,5,5,\underbrace{1,...,1}_{10},\underbrace{0,...,0}_{36})^T\)
  • \(p=50\)
  • \(n=100\)
  • Correlated predictors: \(Cov(\bs{X})_{ij} = (0.7)^{|i-j|}\)
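
The correlated design is the only new ingredient here; a sketch using MASS::mvrnorm (n, p, \(\bs\beta\), and the covariance are as stated on this slide; the seed is assumed):

    # AR(1)-type correlation: Cov(X)_ij = 0.7^|i-j|
    library(MASS)

    set.seed(30)
    n <- 100; p <- 50
    Sigma <- 0.7^abs(outer(1:p, 1:p, "-"))
    X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

    beta <- c(10, 10, 5, 5, rep(1, 10), rep(0, 36))
    y <- drop(X %*% beta) + rnorm(n)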

Example 3

Solution Path | Training data MSE (10-fold CV)

Example 3 - MSE on test set

\(\alpha\) MSE
\(\alpha=0\) (Ridge) 6.7541
\(\alpha=0.2\) 1.2461
\(\alpha=0.4\) 1.4948
\(\alpha=0.6\) 1.4219
\(\alpha=0.8\) 1.4985
\(\alpha=1\) (LASSO) 1.7152

Summary

  • We can do \(p>n\) problems!
  • No penalty is universally superior
  • Ridge and LASSO fit into the Elastic Net framework
    • Need cross-validation to choose \(\lambda\) and \(\alpha\) (see the sketch below)
  • Several versions of LASSO exist
    • Grouped LASSO, Adaptive LASSO
    • To be discussed next week?
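
cv.glmnet cross-validates \(\lambda\) for a fixed \(\alpha\) only; a common approach, sketched below with an arbitrary \(\alpha\) grid and simulated data, adds an outer loop over \(\alpha\) that reuses the same fold assignment:

    # Choose alpha and lambda jointly: keep the folds fixed across alphas
    # so the CV errors are comparable.
    library(glmnet)

    set.seed(5)
    n <- 100; p <- 20
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% c(rep(2, 5), rep(0, p - 5))) + rnorm(n)

    alphas <- seq(0, 1, by = 0.2)
    foldid <- sample(rep(1:10, length.out = n))

    cv_fits <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a, foldid = foldid))
    cv_err  <- sapply(cv_fits, function(f) min(f$cvm))

    best <- which.min(cv_err)
    alphas[best]                      # chosen alpha
    cv_fits[[best]]$lambda.min        # chosen lambda at that alpha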