import pandas as pd
import numpy as np
fat_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/fat.csv")Putting it all together
In this section we’ll look at an example of fitting a few different models (MLR models and LASSO models) and choosing the overall best one via test error.
Process: 1. Read the data in (would then want to explore data, clearn it, etc. We’ll skip this part) 2. Split the data into a training and test set 3. For each model type, select a best model using the training set (we’ll use cross-validation but you could split the training set into a training and validation set instead) 4. Compare the best models on the test set. Select the model with the lowest error (with considerations for simplicity)
1. Read in the fat data set
Description
Age, weight, height, and 10 body circumference measurements are recorded for 252 men. Each man’s percentage of body fat was accurately estimated by an underwater weighing technique.
- brozek, Percent body fat using Brozek’s equation, 457/Density - 414.2
 - siri, Percent body fat using Siri’s equation, 495/Density - 450
 - density, Density (gm/\(cm^3\))
 - age, Age (yrs)
 - weight, Weight (lbs)
 - height, Height (inches)
 - adipos, Adiposity index = Weight/Height\(^2\) (kg/\(m^2\))
 - free, Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek’s formula (lbs)
 - neck, Neck circumference (cm)
 - chest, Chest circumference (cm)
 - abdom, Abdomen circumference (cm) at the umbilicus and level with the iliac crest
 - hip, Hip circumference (cm)
 - thigh, Thigh circumference (cm)
 - knee, Knee circumference (cm)
 - ankle, Ankle circumference (cm)
 - biceps, Extended biceps circumference (cm)
 - forearm, Forearm circumference (cm)
 - wrist, Wrist circumference (cm) distal to the styloid processes
 
Source Johnson R. Journal of Statistics Education v.4, n.1 (1996)
We’ll attempt to model the percent body fat (Brozek masure used as the gold standard) using easily obtainable measurements.
fat_data.columnsIndex(['Unnamed: 0', 'brozek', 'siri', 'density', 'age', 'weight', 'height',
       'adipos', 'free', 'neck', 'chest', 'abdom', 'hip', 'thigh', 'knee',
       'ankle', 'biceps', 'forearm', 'wrist'],
      dtype='object')
fat_data.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 19 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  252 non-null    int64  
 1   brozek      252 non-null    float64
 2   siri        252 non-null    float64
 3   density     252 non-null    float64
 4   age         252 non-null    int64  
 5   weight      252 non-null    float64
 6   height      252 non-null    float64
 7   adipos      252 non-null    float64
 8   free        252 non-null    float64
 9   neck        252 non-null    float64
 10  chest       252 non-null    float64
 11  abdom       252 non-null    float64
 12  hip         252 non-null    float64
 13  thigh       252 non-null    float64
 14  knee        252 non-null    float64
 15  ankle       252 non-null    float64
 16  biceps      252 non-null    float64
 17  forearm     252 non-null    float64
 18  wrist       252 non-null    float64
dtypes: float64(17), int64(2)
memory usage: 37.5 KB
fat_data.head()| Unnamed: 0 | brozek | siri | density | age | weight | height | adipos | free | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 12.6 | 12.3 | 1.0708 | 23 | 154.25 | 67.75 | 23.7 | 134.9 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 | 
| 1 | 2 | 6.9 | 6.1 | 1.0853 | 22 | 173.25 | 72.25 | 23.4 | 161.3 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 | 
| 2 | 3 | 24.6 | 25.3 | 1.0414 | 22 | 154.00 | 66.25 | 24.7 | 116.0 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 | 
| 3 | 4 | 10.9 | 10.4 | 1.0751 | 26 | 184.75 | 72.25 | 24.9 | 164.7 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 | 
| 4 | 5 | 27.8 | 28.7 | 1.0340 | 24 | 184.25 | 71.25 | 25.6 | 133.1 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 | 
Drop the observations where we have missing values for simplicity (and only select the columns we are going to use).
mod_fat_data = fat_data.drop(["Unnamed: 0", "siri", "density"], axis = 1).dropna()
mod_fat_data.head()| brozek | age | weight | height | adipos | free | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.6 | 23 | 154.25 | 67.75 | 23.7 | 134.9 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 | 
| 1 | 6.9 | 22 | 173.25 | 72.25 | 23.4 | 161.3 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 | 
| 2 | 24.6 | 22 | 154.00 | 66.25 | 24.7 | 116.0 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 | 
| 3 | 10.9 | 26 | 184.75 | 72.25 | 24.9 | 164.7 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 | 
| 4 | 27.8 | 24 | 184.25 | 71.25 | 25.6 | 133.1 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 | 
There aren’t any categorical predictors so we don’t need to create any dummy variables.
We do however want to standardize our predictors when using the LASSO generally. If we have predictors on vastly different scales (say one takes on values from 0 to 1 and the other from 0 to 10000), this can affect our fitting and penalization. Instead we usually standardize our numerical predictors. We do so by subtracting off the mean and dividing by the standard deviation.
For example, if we had \(x\) with values 0, 3, 6, 10, 11, we would create a new column, say \(x_{std}\) that takes each observation and subtracts off the mean (\(\frac{0+3+6+10+11}{5} = 6\)) and divides each observation by the standard deviation (\(\sqrt{\frac{1}{5-1}\left((0-6)^2+(3-6)^2+(6-6)^2+(10-6)^2+(11-6)^2\right)} = 4.64\)). This would yield a new set of values given by -1.447, -0.7234, 0, 0.964, 1.206.
from math import sqrt
temp = np.array([0, 3, 6, 10, 11])
print(temp.mean(), temp.std(ddof = 1))
(temp-temp.mean())/temp.std(ddof = 1)6.0 4.636809247747852
array([-1.29399328, -0.64699664,  0.        ,  0.86266219,  1.07832773])
However, we need to split our data before we standardize! We want to standardize only the training set values using the training set means and standard deviations. We’ll then use those same values to standardize the test set before we test our models!
2. Training and Test Split
First, let’s just read in all the functions we’ll need from sklearn
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoCV, LassoX_train, X_test, y_train, y_test = train_test_split(
  mod_fat_data.drop("brozek", axis = 1),
  mod_fat_data["brozek"],
  test_size=0.20,
  random_state=41)3. Fit and Select Models on Training Data
First, we can standardize our variables. We’ll save the values used for standardization so we can standardize the test set with those as well.
means = X_train.apply(np.mean, axis = 0)
means| 0 | |
|---|---|
| age | 45.079602 | 
| weight | 178.316667 | 
| height | 70.036070 | 
| adipos | 25.403483 | 
| free | 143.059204 | 
| neck | 37.876617 | 
| chest | 100.808458 | 
| abdom | 92.559701 | 
| hip | 99.855224 | 
| thigh | 59.358706 | 
| knee | 38.607960 | 
| ankle | 23.018408 | 
| biceps | 32.155721 | 
| forearm | 28.671642 | 
| wrist | 18.196517 | 
stds = X_train.apply(np.std, axis = 0)
stds| 0 | |
|---|---|
| age | 12.807167 | 
| weight | 27.765423 | 
| height | 3.868759 | 
| adipos | 3.421550 | 
| free | 17.150096 | 
| neck | 2.274781 | 
| chest | 8.342636 | 
| abdom | 10.502178 | 
| hip | 6.760037 | 
| thigh | 5.063393 | 
| knee | 2.403715 | 
| ankle | 1.560810 | 
| biceps | 2.856846 | 
| forearm | 2.046680 | 
| wrist | 0.938050 | 
X_train = X_train.apply(lambda x: (x-np.mean(x))/np.std(x), axis = 0)
X_train| age | weight | height | adipos | free | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 120 | 0.540354 | 1.015051 | 1.153840 | 0.232794 | 0.445525 | 1.285127 | 0.418518 | 0.632278 | 0.627922 | -0.169591 | 0.287904 | 1.013315 | 0.610561 | 1.235346 | 1.389566 | 
| 133 | 0.384191 | -0.767741 | -0.849386 | -0.176377 | -1.507817 | -0.033681 | -0.048960 | -0.300862 | -1.117631 | -0.643582 | -1.251380 | -1.613526 | 0.505550 | 0.307013 | -0.955724 | 
| 207 | 0.149947 | 0.600867 | 0.636879 | 0.203568 | -0.551554 | 1.021366 | 0.226732 | 0.832237 | 0.272894 | 0.264900 | 0.329506 | 0.180414 | 1.590663 | 1.430785 | 0.216921 | 
| 49 | 0.149947 | -1.830214 | -0.849386 | -1.520797 | -1.274582 | -1.704172 | -2.086685 | -2.110010 | -1.872064 | -1.729810 | -1.750607 | -0.716556 | -1.874697 | -1.403073 | -1.488745 | 
| 25 | -1.411679 | -0.686705 | 0.378398 | -1.023946 | 0.515495 | -0.956847 | -1.343515 | -1.224480 | -0.496332 | -0.860827 | -0.793755 | -0.332140 | -0.789584 | -0.230442 | -0.529308 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 80 | 1.711573 | -0.524633 | -0.590905 | -0.088697 | -1.676912 | 0.230081 | -0.372599 | 0.308536 | -0.407575 | -0.900326 | -0.169721 | 0.436691 | -0.964603 | -0.719039 | 0.856545 | 
| 226 | 0.774597 | -0.317541 | -0.461665 | 0.057435 | 0.066518 | -0.297443 | 0.106866 | -0.138990 | -0.407575 | -0.544834 | -0.044914 | -0.268071 | 0.435543 | 0.307013 | 0.643337 | 
| 140 | -0.396622 | -0.056425 | 0.249158 | -0.234830 | -0.539892 | -0.209522 | -0.264719 | 0.051446 | 0.258102 | -0.051094 | 0.412711 | -0.908764 | -0.474552 | -0.377021 | -1.701953 | 
| 163 | -0.865110 | -1.380014 | 0.119917 | -1.637703 | -1.414523 | -0.824966 | -1.391462 | -0.872172 | -1.517037 | -1.374317 | -1.251380 | -1.677595 | -1.349642 | -1.207635 | -1.808557 | 
| 192 | -0.240459 | 0.546843 | 0.119917 | 0.583513 | 1.139399 | 1.241167 | 0.490438 | 0.146665 | 0.420823 | 0.245151 | -0.003312 | 1.077384 | 0.645564 | 0.697890 | 0.536733 | 
201 rows × 15 columns
#now each column has mean 0 and sd 1
X_train.apply(np.mean, axis = 0)
X_train.apply(np.std, axis = 0)| 0 | |
|---|---|
| age | 1.0 | 
| weight | 1.0 | 
| height | 1.0 | 
| adipos | 1.0 | 
| free | 1.0 | 
| neck | 1.0 | 
| chest | 1.0 | 
| abdom | 1.0 | 
| hip | 1.0 | 
| thigh | 1.0 | 
| knee | 1.0 | 
| ankle | 1.0 | 
| biceps | 1.0 | 
| forearm | 1.0 | 
| wrist | 1.0 | 
MLR Models
Below I’ll consider some basic MLR models to compare. Even though we don’t need to use CV on those to determine a tuning parameter, we can use CV on those to determine the best MLR model (using the training set alone). Similarly, we’ll then find a ‘best’ LASSO model with tuning parameter selected by CV. Then we’ll compare the two ‘best’ models on the test set to see the overall winner!
# Full model
cv_full_model = cross_validate(
    LinearRegression(),
    X_train,
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
cv_mlr1= cross_validate(
    LinearRegression(),
    X_train[["weight", "wrist", "abdom"]],
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
cv_mlr2 = cross_validate(
    LinearRegression(),
    X_train[["weight", "height", "neck", "chest", "abdom", "biceps", "forearm", "wrist"]],
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")print(np.sqrt(-sum(cv_full_model['test_score'])),
      np.sqrt(-sum(cv_mlr1['test_score'])),
      np.sqrt(-sum(cv_mlr2['test_score'])))6.101775440661614 9.044740234075288 9.680673752851906
Looks like the full model is the best here!
mlr_best = LinearRegression().fit(X_train, y_train)
print(mlr_best.intercept_)
print(np.array(list(zip(X_train.columns, mlr_best.coef_))))19.045273631840807
[['age' '0.1338989120135575']
 ['weight' '9.34842058078133']
 ['height' '0.19894030597107948']
 ['adipos' '-0.9776707934156488']
 ['free' '-8.827474152930117']
 ['neck' '0.11485215326542098']
 ['chest' '0.5751829606181673']
 ['abdom' '1.2206574175508305']
 ['hip' '0.18340049047065232']
 ['thigh' '0.8379939074493237']
 ['knee' '0.16205656779585353']
 ['ankle' '0.2436973560143484']
 ['biceps' '0.17243823308972106']
 ['forearm' '0.30301942051892716']
 ['wrist' '-0.013949037814783072']]
LASSO model
lasso_mod = LassoCV(cv=5, random_state=0) \
    .fit(X_train,
         y_train)Just to look at the tuning parameters and CV errors:
np.set_printoptions(suppress = True)
fit_info = np.array(list(zip(lasso_mod.alphas_, np.mean(lasso_mod.mse_path_, axis = 1))))
fit_info[fit_info[:,1].argsort()][0:10,].round(4)array([[0.0683, 2.1017],
       [0.0732, 2.106 ],
       [0.0785, 2.1139],
       [0.0842, 2.1278],
       [0.0637, 2.1427],
       [0.0903, 2.1461],
       [0.0968, 2.17  ],
       [0.1038, 2.2005],
       [0.1113, 2.229 ],
       [0.0594, 2.233 ]])
print(lasso_mod.alpha_)
print(lasso_mod.intercept_)
print(np.array(list(zip(X_train.columns, lasso_mod.coef_))))0.06827844720988437
19.045273631840807
[['age' '0.03520622302650939']
 ['weight' '8.676685171578885']
 ['height' '0.18524483241596834']
 ['adipos' '0.0']
 ['free' '-8.098358879411064']
 ['neck' '0.0']
 ['chest' '0.1012312495726051']
 ['abdom' '1.5227501560784638']
 ['hip' '0.0']
 ['thigh' '0.536843692590691']
 ['knee' '0.2410290965097099']
 ['ankle' '0.0997282321269406']
 ['biceps' '0.12439075979914176']
 ['forearm' '0.20197553831501108']
 ['wrist' '0.0']]
Fit that best model for comparison on the test set.
lasso_best = Lasso(lasso_mod.alpha_).fit(X_train,y_train)4. Compare on the Test Set
We need to standardize the test set using our training values. Then we can find the test set predictions and RMSE!
We’ll see how to have python do this for us later. For now, let’s do it manually!
#quick function to standardize based off of a supplied mean and std
def my_std_fun(x, means, stds):
    return(x-means)/stds
#loop through the columns and use the function on each
for x in X_test.columns:
    X_test[x] = my_std_fun(X_test[x], means[x], stds[x])
X_test.head()| age | weight | height | adipos | free | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 107 | 0.540354 | 0.897999 | 1.089220 | 0.174341 | 1.355141 | 1.812651 | 1.101755 | 0.860802 | 0.124966 | -0.702830 | 0.038291 | 0.244483 | 1.030604 | 0.453592 | 0.963150 | 
| 143 | -1.724004 | -0.668697 | 0.572259 | -1.111626 | 0.049026 | -1.044767 | -1.043850 | -1.472047 | -0.880946 | -0.643582 | -1.043368 | -0.204002 | -0.579563 | -0.719039 | 0.003713 | 
| 167 | -0.787028 | 1.672344 | 0.572259 | 1.431082 | 2.136478 | 2.647896 | 0.885996 | 0.746540 | 1.027328 | 0.778390 | 1.286359 | 1.013315 | 1.765681 | 2.163679 | 1.709378 | 
| 29 | -1.255516 | -0.632681 | -0.267804 | -0.468642 | 0.153981 | -0.517244 | -0.408559 | -0.862650 | -0.170890 | -0.090593 | -1.376186 | -0.268071 | -0.719577 | -0.963337 | -0.635912 | 
| 30 | -1.021272 | 0.132659 | 0.959980 | -0.527095 | 0.970303 | 0.361962 | -0.036974 | -0.367514 | -0.008169 | -0.367087 | 0.038291 | 6.971758 | 0.120510 | -0.474741 | 0.216921 | 
Now find our predictions and test set RMSE.
mlr_pred = mlr_best.predict(X_test)
lasso_pred = lasso_best.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, mlr_pred)), np.sqrt(mean_squared_error(y_test, lasso_pred)))1.7547240606525405 1.9916053642246054