import pandas as pd
import numpy as np
= pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/fat.csv") fat_data
Putting it all together
In this section we’ll look at an example of fitting a few different models (MLR models and LASSO models) and choosing the overall best one via test error.
Process:

1. Read the data in (we would then want to explore the data, clean it, etc., but we'll skip that part).
2. Split the data into a training and test set.
3. For each model type, select a best model using the training set (we'll use cross-validation, but you could split the training set into a training and validation set instead).
4. Compare the best models on the test set. Select the model with the lowest error (with considerations for simplicity).
1. Read in the fat data set
Description
Age, weight, height, and 10 body circumference measurements are recorded for 252 men. Each man’s percentage of body fat was accurately estimated by an underwater weighing technique.
- brozek, Percent body fat using Brozek’s equation, 457/Density - 414.2
- siri, Percent body fat using Siri’s equation, 495/Density - 450
- density, Density (gm/\(cm^3\))
- age, Age (yrs)
- weight, Weight (lbs)
- height, Height (inches)
- adipos, Adiposity index = Weight/Height\(^2\) (kg/\(m^2\))
- free, Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek’s formula (lbs)
- neck, Neck circumference (cm)
- chest, Chest circumference (cm)
- abdom, Abdomen circumference (cm) at the umbilicus and level with the iliac crest
- hip, Hip circumference (cm)
- thigh, Thigh circumference (cm)
- knee, Knee circumference (cm)
- ankle, Ankle circumference (cm)
- biceps, Extended biceps circumference (cm)
- forearm, Forearm circumference (cm)
- wrist, Wrist circumference (cm) distal to the styloid processes
Source: Johnson R., Journal of Statistics Education, v. 4, n. 1 (1996)
We’ll attempt to model the percent body fat (the Brozek measure, used as the gold standard here) using easily obtainable measurements.
fat_data.columns
Index(['Unnamed: 0', 'brozek', 'siri', 'density', 'age', 'weight', 'height',
'adipos', 'free', 'neck', 'chest', 'abdom', 'hip', 'thigh', 'knee',
'ankle', 'biceps', 'forearm', 'wrist'],
dtype='object')
fat_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 252 non-null int64
1 brozek 252 non-null float64
2 siri 252 non-null float64
3 density 252 non-null float64
4 age 252 non-null int64
5 weight 252 non-null float64
6 height 252 non-null float64
7 adipos 252 non-null float64
8 free 252 non-null float64
9 neck 252 non-null float64
10 chest 252 non-null float64
11 abdom 252 non-null float64
12 hip 252 non-null float64
13 thigh 252 non-null float64
14 knee 252 non-null float64
15 ankle 252 non-null float64
16 biceps 252 non-null float64
17 forearm 252 non-null float64
18 wrist 252 non-null float64
dtypes: float64(17), int64(2)
memory usage: 37.5 KB
fat_data.head()
Unnamed: 0 | brozek | siri | density | age | weight | height | adipos | free | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 12.6 | 12.3 | 1.0708 | 23 | 154.25 | 67.75 | 23.7 | 134.9 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
1 | 2 | 6.9 | 6.1 | 1.0853 | 22 | 173.25 | 72.25 | 23.4 | 161.3 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
2 | 3 | 24.6 | 25.3 | 1.0414 | 22 | 154.00 | 66.25 | 24.7 | 116.0 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
3 | 4 | 10.9 | 10.4 | 1.0751 | 26 | 184.75 | 72.25 | 24.9 | 164.7 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
4 | 5 | 27.8 | 28.7 | 1.0340 | 24 | 184.25 | 71.25 | 25.6 | 133.1 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
Drop the observations where we have missing values for simplicity (and only select the columns we are going to use).
= fat_data.drop(["Unnamed: 0", "siri", "density"], axis = 1).dropna()
mod_fat_data mod_fat_data.head()
brozek | age | weight | height | adipos | free | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12.6 | 23 | 154.25 | 67.75 | 23.7 | 134.9 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
1 | 6.9 | 22 | 173.25 | 72.25 | 23.4 | 161.3 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
2 | 24.6 | 22 | 154.00 | 66.25 | 24.7 | 116.0 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
3 | 10.9 | 26 | 184.75 | 72.25 | 24.9 | 164.7 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
4 | 27.8 | 24 | 184.25 | 71.25 | 25.6 | 133.1 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
There aren’t any categorical predictors so we don’t need to create any dummy variables.
We do, however, generally want to standardize our predictors when using the LASSO. If we have predictors on vastly different scales (say one takes on values from 0 to 1 and another from 0 to 10000), this can affect our fitting and penalization, since the penalty treats all coefficients on a common scale. To avoid this, we usually standardize our numerical predictors by subtracting off the mean and dividing by the standard deviation.
For example, if we had \(x\) with values 0, 3, 6, 10, 11, we would create a new column, say \(x_{std}\), by taking each observation, subtracting off the mean (\(\frac{0+3+6+10+11}{5} = 6\)), and dividing by the standard deviation (\(\sqrt{\frac{1}{5-1}\left((0-6)^2+(3-6)^2+(6-6)^2+(10-6)^2+(11-6)^2\right)} \approx 4.64\)). This would yield a new set of values given by -1.294, -0.647, 0, 0.863, 1.078.
from math import sqrt
temp = np.array([0, 3, 6, 10, 11])
print(temp.mean(), temp.std(ddof = 1))
(temp - temp.mean())/temp.std(ddof = 1)
6.0 4.636809247747852
array([-1.29399328, -0.64699664, 0. , 0.86266219, 1.07832773])
However, we need to split our data before we standardize! We want to standardize only the training set values using the training set means and standard deviations. We’ll then use those same values to standardize the test set before we test our models!
2. Training and Test Split
First, let’s import all of the functions we’ll need from sklearn.
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoCV, Lasso
X_train, X_test, y_train, y_test = train_test_split(
    mod_fat_data.drop("brozek", axis = 1),
    mod_fat_data["brozek"],
    test_size=0.20,
    random_state=41)
3. Fit and Select Models on Training Data
First, we can standardize our variables. We’ll save the values used for standardization so we can standardize the test set with those as well.
means = X_train.apply(np.mean, axis = 0)
means
0 | |
---|---|
age | 45.079602 |
weight | 178.316667 |
height | 70.036070 |
adipos | 25.403483 |
free | 143.059204 |
neck | 37.876617 |
chest | 100.808458 |
abdom | 92.559701 |
hip | 99.855224 |
thigh | 59.358706 |
knee | 38.607960 |
ankle | 23.018408 |
biceps | 32.155721 |
forearm | 28.671642 |
wrist | 18.196517 |
stds = X_train.apply(np.std, axis = 0)
stds
0 | |
---|---|
age | 12.807167 |
weight | 27.765423 |
height | 3.868759 |
adipos | 3.421550 |
free | 17.150096 |
neck | 2.274781 |
chest | 8.342636 |
abdom | 10.502178 |
hip | 6.760037 |
thigh | 5.063393 |
knee | 2.403715 |
ankle | 1.560810 |
biceps | 2.856846 |
forearm | 2.046680 |
wrist | 0.938050 |
X_train = X_train.apply(lambda x: (x - np.mean(x))/np.std(x), axis = 0)
X_train
age | weight | height | adipos | free | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120 | 0.540354 | 1.015051 | 1.153840 | 0.232794 | 0.445525 | 1.285127 | 0.418518 | 0.632278 | 0.627922 | -0.169591 | 0.287904 | 1.013315 | 0.610561 | 1.235346 | 1.389566 |
133 | 0.384191 | -0.767741 | -0.849386 | -0.176377 | -1.507817 | -0.033681 | -0.048960 | -0.300862 | -1.117631 | -0.643582 | -1.251380 | -1.613526 | 0.505550 | 0.307013 | -0.955724 |
207 | 0.149947 | 0.600867 | 0.636879 | 0.203568 | -0.551554 | 1.021366 | 0.226732 | 0.832237 | 0.272894 | 0.264900 | 0.329506 | 0.180414 | 1.590663 | 1.430785 | 0.216921 |
49 | 0.149947 | -1.830214 | -0.849386 | -1.520797 | -1.274582 | -1.704172 | -2.086685 | -2.110010 | -1.872064 | -1.729810 | -1.750607 | -0.716556 | -1.874697 | -1.403073 | -1.488745 |
25 | -1.411679 | -0.686705 | 0.378398 | -1.023946 | 0.515495 | -0.956847 | -1.343515 | -1.224480 | -0.496332 | -0.860827 | -0.793755 | -0.332140 | -0.789584 | -0.230442 | -0.529308 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
80 | 1.711573 | -0.524633 | -0.590905 | -0.088697 | -1.676912 | 0.230081 | -0.372599 | 0.308536 | -0.407575 | -0.900326 | -0.169721 | 0.436691 | -0.964603 | -0.719039 | 0.856545 |
226 | 0.774597 | -0.317541 | -0.461665 | 0.057435 | 0.066518 | -0.297443 | 0.106866 | -0.138990 | -0.407575 | -0.544834 | -0.044914 | -0.268071 | 0.435543 | 0.307013 | 0.643337 |
140 | -0.396622 | -0.056425 | 0.249158 | -0.234830 | -0.539892 | -0.209522 | -0.264719 | 0.051446 | 0.258102 | -0.051094 | 0.412711 | -0.908764 | -0.474552 | -0.377021 | -1.701953 |
163 | -0.865110 | -1.380014 | 0.119917 | -1.637703 | -1.414523 | -0.824966 | -1.391462 | -0.872172 | -1.517037 | -1.374317 | -1.251380 | -1.677595 | -1.349642 | -1.207635 | -1.808557 |
192 | -0.240459 | 0.546843 | 0.119917 | 0.583513 | 1.139399 | 1.241167 | 0.490438 | 0.146665 | 0.420823 | 0.245151 | -0.003312 | 1.077384 | 0.645564 | 0.697890 | 0.536733 |
201 rows × 15 columns
#now each column has mean 0 and sd 1
X_train.apply(np.mean, axis = 0)
X_train.apply(np.std, axis = 0)
0 | |
---|---|
age | 1.0 |
weight | 1.0 |
height | 1.0 |
adipos | 1.0 |
free | 1.0 |
neck | 1.0 |
chest | 1.0 |
abdom | 1.0 |
hip | 1.0 |
thigh | 1.0 |
knee | 1.0 |
ankle | 1.0 |
biceps | 1.0 |
forearm | 1.0 |
wrist | 1.0 |
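As an aside, scikit-learn's StandardScaler packages up this same pattern: it learns the column means and standard deviations from the training set only and then reuses them to transform the test set. A minimal sketch (applied to the unstandardized predictors, and not used in the rest of this example):

from sklearn.preprocessing import StandardScaler

#learn the training means/SDs and standardize the training predictors
#(note: the result is a NumPy array rather than a DataFrame)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
#reuse those same training means/SDs on the test predictors
X_test_scaled = scaler.transform(X_test)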
MLR Models
Below I’ll consider some basic MLR models to compare. Even though MLR models have no tuning parameter that requires CV, we can still use CV (on the training set alone) to determine the best MLR model. Similarly, we’ll find a ‘best’ LASSO model with its tuning parameter selected by CV. Then we’ll compare the two ‘best’ models on the test set to see the overall winner!
# Full model
cv_full_model = cross_validate(
    LinearRegression(),
    X_train,
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
# Model with three predictors
cv_mlr1 = cross_validate(
    LinearRegression(),
    X_train[["weight", "wrist", "abdom"]],
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
# Model with eight predictors
cv_mlr2 = cross_validate(
    LinearRegression(),
    X_train[["weight", "height", "neck", "chest", "abdom", "biceps", "forearm", "wrist"]],
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
print(np.sqrt(-sum(cv_full_model['test_score'])),
      np.sqrt(-sum(cv_mlr1['test_score'])),
      np.sqrt(-sum(cv_mlr2['test_score'])))
6.101775440661614 9.044740234075288 9.680673752851906
Looks like the full model is the best here!
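(Aside: cross_validate returns one negative MSE per fold, and taking the square root of the sum is fine for ranking models evaluated on the same folds. If you want a number on the scale of the response, the square root of the average fold MSE is the more natural summary; a quick sketch for the full model:)

#per-fold MSEs for the full model (the scores are negative MSEs)
fold_mses = -cv_full_model['test_score']
#root of the average fold MSE, on the scale of brozek
print(np.sqrt(fold_mses.mean()))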
mlr_best = LinearRegression().fit(X_train, y_train)
print(mlr_best.intercept_)
print(np.array(list(zip(X_train.columns, mlr_best.coef_))))
19.045273631840807
[['age' '0.1338989120135575']
['weight' '9.34842058078133']
['height' '0.19894030597107948']
['adipos' '-0.9776707934156488']
['free' '-8.827474152930117']
['neck' '0.11485215326542098']
['chest' '0.5751829606181673']
['abdom' '1.2206574175508305']
['hip' '0.18340049047065232']
['thigh' '0.8379939074493237']
['knee' '0.16205656779585353']
['ankle' '0.2436973560143484']
['biceps' '0.17243823308972106']
['forearm' '0.30301942051892716']
['wrist' '-0.013949037814783072']]
LASSO model
lasso_mod = LassoCV(cv=5, random_state=0) \
    .fit(X_train, y_train)
Just to look at the tuning parameters and CV errors:
np.set_printoptions(suppress = True)
fit_info = np.array(list(zip(lasso_mod.alphas_, np.mean(lasso_mod.mse_path_, axis = 1))))
fit_info[fit_info[:, 1].argsort()][0:10,].round(4)
array([[0.0683, 2.1017],
[0.0732, 2.106 ],
[0.0785, 2.1139],
[0.0842, 2.1278],
[0.0637, 2.1427],
[0.0903, 2.1461],
[0.0968, 2.17 ],
[0.1038, 2.2005],
[0.1113, 2.229 ],
[0.0594, 2.233 ]])
print(lasso_mod.alpha_)
print(lasso_mod.intercept_)
print(np.array(list(zip(X_train.columns, lasso_mod.coef_))))
0.06827844720988437
19.045273631840807
[['age' '0.03520622302650939']
['weight' '8.676685171578885']
['height' '0.18524483241596834']
['adipos' '0.0']
['free' '-8.098358879411064']
['neck' '0.0']
['chest' '0.1012312495726051']
['abdom' '1.5227501560784638']
['hip' '0.0']
['thigh' '0.536843692590691']
['knee' '0.2410290965097099']
['ankle' '0.0997282321269406']
['biceps' '0.12439075979914176']
['forearm' '0.20197553831501108']
['wrist' '0.0']]
Notice that the LASSO zeroed out several coefficients (adipos, neck, hip, and wrist), so it also performs some variable selection for us. Now fit that best model for comparison on the test set.
lasso_best = Lasso(lasso_mod.alpha_).fit(X_train, y_train)
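(Note: LassoCV already refits on the full training set at the selected alpha, which is why lasso_mod.coef_ could be printed above; refitting a plain Lasso here just makes the model we carry to the test set explicit. A quick sanity-check sketch:)

#the refit Lasso should essentially reproduce LassoCV's final coefficients
print(np.allclose(lasso_best.coef_, lasso_mod.coef_))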
4. Compare on the Test Set
We need to standardize the test set using our training values. Then we can find the test set predictions and RMSE!
We’ll see how to have Python do this for us later. For now, let’s do it manually!
#quick function to standardize based off of a supplied mean and std
def my_std_fun(x, means, stds):
    return (x - means)/stds

#loop through the columns and use the function on each
for x in X_test.columns:
    X_test[x] = my_std_fun(X_test[x], means[x], stds[x])
X_test.head()
age | weight | height | adipos | free | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
107 | 0.540354 | 0.897999 | 1.089220 | 0.174341 | 1.355141 | 1.812651 | 1.101755 | 0.860802 | 0.124966 | -0.702830 | 0.038291 | 0.244483 | 1.030604 | 0.453592 | 0.963150 |
143 | -1.724004 | -0.668697 | 0.572259 | -1.111626 | 0.049026 | -1.044767 | -1.043850 | -1.472047 | -0.880946 | -0.643582 | -1.043368 | -0.204002 | -0.579563 | -0.719039 | 0.003713 |
167 | -0.787028 | 1.672344 | 0.572259 | 1.431082 | 2.136478 | 2.647896 | 0.885996 | 0.746540 | 1.027328 | 0.778390 | 1.286359 | 1.013315 | 1.765681 | 2.163679 | 1.709378 |
29 | -1.255516 | -0.632681 | -0.267804 | -0.468642 | 0.153981 | -0.517244 | -0.408559 | -0.862650 | -0.170890 | -0.090593 | -1.376186 | -0.268071 | -0.719577 | -0.963337 | -0.635912 |
30 | -1.021272 | 0.132659 | 0.959980 | -0.527095 | 0.970303 | 0.361962 | -0.036974 | -0.367514 | -0.008169 | -0.367087 | 0.038291 | 6.971758 | 0.120510 | -0.474741 | 0.216921 |
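(As an aside, since means and stds are pandas Series indexed by the column names, the loop above could be replaced by one vectorized expression. A sketch, meant to be used instead of the loop, with the result stored under a hypothetical new name:)

#pandas aligns the Series index with the DataFrame columns, so every column
#is centered and scaled by its training mean and standard deviation at once
X_test_std = (X_test - means) / stds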
Now find our predictions and test set RMSE.
mlr_pred = mlr_best.predict(X_test)
lasso_pred = lasso_best.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, mlr_pred)), np.sqrt(mean_squared_error(y_test, lasso_pred)))
1.7547240606525405 1.9916053642246054
The full MLR model has the lower test RMSE, so following step 4 of our process we’d select it as our overall best model.