September 26, 2014
\(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\newcommand{\bs}[1]{\boldsymbol{#1}}\)
\(\bs{y}=\bs{X\beta}+\bs{\epsilon}\)
where
\(\bs{y}_{n\times 1} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad \bs{X}_{n\times p} = \begin{bmatrix} \bs{x_1}^T \\ \vdots \\ \bs{x_n}^T \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np}\end{bmatrix}\),
\(\bs{\beta}_{p\times 1}=\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p\end{pmatrix}, \quad\) and \(\quad \bs{\epsilon}_{n\times 1} \sim \left[\bs{0}, \sigma^2 \bs{I}_n\right]\)
We want to minimize the L2 loss function (squared error loss): \[ \begin{aligned} RSS(\bs{\beta}) &= (\bs{y}-\bs{X\beta})^T(\bs{y}-\bs{X\beta}) \\ &= \sum_{i=1}^n (y_i-\bs{x_i}^T\bs{\beta})^2 \end{aligned} \] Solution: \[ \begin{aligned} \hat{\bs{\beta}}_{OLS} &= \argmin_{\bs{\beta}} RSS(\bs{\beta}) \\ &= \bs{(X^TX)^{-1}X^Ty} \end{aligned} \]
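As a sanity check on the closed form, here is a minimal numerical sketch (not from the original slides; data and names are illustrative): it simulates from the model above and recovers \(\hat{\bs\beta}_{OLS}\) two ways.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5

# Simulate y = X beta + eps with eps ~ (0, sigma^2 I)
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ beta + rng.normal(scale=1.0, size=n)

# Closed form: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically stabler equivalent via least squares
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_hat, beta_lstsq)
```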
\[ \begin{aligned} \hat{\bs\beta}(\lambda) &= \argmin_{\bs\beta} \left[ RSS(\bs{\beta}) + \lambda J(\bs{\beta}) \right] \\ &= \argmin_{\bs{\beta}} RSS(\bs{\beta}) \text{ (subject to } J(\bs{\beta}) \le t) \end{aligned} \]
Given a new input, \(\bs{x}_0\), we assess our prediction \(\hat{f}(\bs{x}_0)\) by its expected squared prediction error, which decomposes into irreducible error, squared bias, and variance: \[ \begin{aligned} E\left[\left(y_0 - \hat{f}(\bs{x}_0)\right)^2\right] &= \sigma^2 + \left[E\hat{f}(\bs{x}_0) - f(\bs{x}_0)\right]^2 + \text{Var}\big(\hat{f}(\bs{x}_0)\big) \end{aligned} \] Regularization accepts some bias in exchange for a larger reduction in variance.
Source: Hastie, Tibshirani, & Friedman (2009)
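To make the decomposition concrete, here is a hedged Monte Carlo sketch (settings and names are illustrative, not from the source): it redraws the training noise many times and estimates the squared bias and variance of a penalized (ridge-type) prediction at a fixed \(\bs{x}_0\), anticipating the ridge estimator defined below.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, lam = 50, 10, 1.0, 5.0
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))          # fixed design across replications
x0 = rng.normal(size=p)              # new input
f_x0 = x0 @ beta                     # true mean response at x0

# Ridge estimator as a linear map of y: (X^T X + lam I)^{-1} X^T
A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

preds = []
for _ in range(2000):                # repeated draws of the training noise
    y = X @ beta + rng.normal(scale=sigma, size=n)
    preds.append(x0 @ (A @ y))
preds = np.array(preds)

# Estimated squared bias and variance (irreducible sigma^2 comes on top)
bias2 = (preds.mean() - f_x0) ** 2
var = preds.var()
print(f"bias^2 = {bias2:.4f}, variance = {var:.4f}, sum = {bias2 + var:.4f}")
```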
Ridge: \(\quad J_R(\bs{\beta}) = \left|\left|\bs{\beta}\right|\right|_2^2 = \bs{\beta}^T\bs{\beta} = \sum_{j=1}^p \beta_j^2\)
LASSO: \(\quad J_L(\bs{\beta}) = \left|\left|\bs{\beta}\right|\right|_1 = \sum_{j=1}^p \left| \beta_j \right|\)
[Two figures omitted; source: Hastie, Tibshirani, & Friedman (2009)]
Minimize \[RSS(\bs{\beta}, \lambda)=(\bs{y-X\beta})^T(\bs{y-X\beta}) + \lambda\bs{\beta^T\beta}\]
Solution: \[\hat{\bs\beta}_{RIDGE} = (\bs{X^TX}+\lambda\bs{I})^{-1}\bs{X}^T\bs{y}\]
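A quick hedged check of this formula against a library implementation (names and data are illustrative; scikit-learn's Ridge minimizes the same objective when the intercept is dropped):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n, p, lam = 100, 5, 2.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

# Closed form: (X^T X + lam I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# scikit-learn's Ridge solves the same penalized objective (no intercept here)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

assert np.allclose(beta_closed, beta_sklearn)
```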
Pros:

- Closed-form solution; \(\bs{X^TX}+\lambda\bs{I}\) is invertible for any \(\lambda>0\), even when \(p>n\) or the predictors are highly collinear
- Shrinking coefficients reduces variance, which can substantially lower prediction error

Cons:

- No variable selection: all \(p\) coefficients stay nonzero, so the model is harder to interpret
- The tuning parameter \(\lambda\) must be chosen, e.g., by cross-validation
Minimize \[RSS(\bs{\beta})=(\bs{y-X\beta})^T(\bs{y-X\beta})\] subject to \[\sum_{i=1}^p \left| \beta_i \right| \le t\]
Solution: no closed form in general, since the \(\ell_1\) penalty is not differentiable at zero; the solution is computed numerically (e.g., by quadratic programming, LARS, or coordinate descent)
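A brief sketch of the sparsity this produces (illustrative names and settings; scikit-learn's coordinate-descent Lasso, which scales the RSS term by \(1/2n\) in its Lagrangian form):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]           # only the first 3 predictors matter
y = X @ beta + rng.normal(size=n)

# scikit-learn minimizes (1/2n)*RSS + alpha*||beta||_1,
# the Lagrangian form of the constrained problem above
fit = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(fit.coef_)                      # noise coefficients are typically exactly 0
```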
Pros:

- Performs variable selection: the \(\ell_1\) penalty shrinks some coefficients exactly to zero, giving sparse, interpretable models
- Tends to predict well when the true coefficient vector is sparse

Cons:

- No closed-form solution
- Selects at most \(n\) variables when \(p>n\), and among groups of highly correlated predictors tends to pick one arbitrarily
Minimize \[RSS(\bs{\beta})=(\bs{y-X\beta})^T(\bs{y-X\beta})\] subject to \[\sum_{i=1}^p \left\{(1-\alpha)\beta_i^2 + \alpha\left| \beta_i \right|\right\}\le t\]
Solution: no closed form; computed numerically (e.g., by coordinate descent over a grid of tuning-parameter values)
Pros:

- Compromise between ridge and LASSO: still performs variable selection, but correlated predictors tend to enter or leave the model together
- Can select more than \(n\) variables when \(p>n\)

Cons:

- Two tuning parameters (\(\alpha\) and \(t\), or equivalently \(\lambda\)) must be chosen
- Applies two layers of shrinkage, which can over-shrink coefficients unless corrected
True model: \(\bs{y}=\bs{X\beta} + \bs{\epsilon}\)
where:
[Figures omitted: solution paths and training-data MSE (10-fold CV)]
| \(\alpha\) | MSE |
|---|---|
| \(\alpha=0\) (Ridge) | 16.2645 |
| \(\alpha=0.2\) | 1.7551 |
| \(\alpha=0.4\) | 1.4761 |
| \(\alpha=0.6\) | 1.4428 |
| \(\alpha=0.8\) | 1.3376 |
| \(\alpha=1\) (LASSO) | 1.3276 |
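The original simulation settings are not shown, so the following is only a hedged sketch of how such a table can be produced: an elastic net fit over a grid of \(\alpha\) values, scored by 10-fold CV MSE. Note that scikit-learn calls this mixing parameter `l1_ratio` (same convention: 0 = ridge, 1 = LASSO) and uses `alpha` for the overall penalty strength \(\lambda\).

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)])
y = X @ beta + rng.normal(size=n)

# Grid over the mixing parameter; pure l1_ratio=0 is unreliable in
# sklearn's coordinate descent, so the grid starts just above zero.
for a in [0.01, 0.2, 0.4, 0.6, 0.8, 1.0]:
    model = ElasticNet(alpha=0.1, l1_ratio=a, max_iter=10_000)
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    print(f"alpha = {a:4.2f}: 10-fold CV MSE = {mse:.4f}")
```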
True model: \(\bs{y}=\bs{X\beta} + \bs{\epsilon}\)
where:
[Figures omitted: solution paths and training-data MSE (10-fold CV)]
| \(\alpha\) | MSE |
|---|---|
| \(\alpha=0\) (Ridge) | 1490.7833 |
| \(\alpha=0.2\) | 1637.0238 |
| \(\alpha=0.4\) | 1641.1414 |
| \(\alpha=0.6\) | 1633.5475 |
| \(\alpha=0.8\) | 1641.1414 |
| \(\alpha=1\) (LASSO) | 1641.1414 |
True model: \(\bs{y}=\bs{X\beta} + \bs{\epsilon}\)
where:
[Figures omitted: solution paths and training-data MSE (10-fold CV)]
| \(\alpha\) | MSE |
|---|---|
| \(\alpha=0\) (Ridge) | 6.7541 |
| \(\alpha=0.2\) | 1.2461 |
| \(\alpha=0.4\) | 1.4948 |
| \(\alpha=0.6\) | 1.4219 |
| \(\alpha=0.8\) | 1.4985 |
| \(\alpha=1\) (LASSO) | 1.7152 |