Published

2025-03-31

Putting it all together

In this section we’ll look at an example of fitting a few different models (multiple linear regression (MLR) models and LASSO models) and choosing the overall best one via test error.

Process:

1. Read the data in (we would then want to explore the data, clean it, etc., but we’ll skip that part here)
2. Split the data into a training and test set
3. For each model type, select a best model using the training set (we’ll use cross-validation, but you could instead split the training set into a training and validation set)
4. Compare the best models on the test set and select the one with the lowest test error (with some consideration for simplicity)
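Before working through the real data, here is a minimal runnable sketch of that four-step process on a small synthetic data set (the data, candidate models, and variable names here are just placeholders, not the fat analysis below):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error

# 1. 'read in' some synthetic data (a stand-in for a real data set)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size = (100, 3)), columns = ["x1", "x2", "x3"])
toy_y = 2*toy_X["x1"] - toy_X["x2"] + rng.normal(scale = 0.5, size = 100)

# 2. split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size = 0.20, random_state = 0)

# 3. get a CV error on the training set for each candidate model
candidates = {"mlr": LinearRegression(), "lasso": Lasso(alpha = 0.1)}
for name, model in candidates.items():
    cv = cross_validate(model, X_tr, y_tr, cv = 5, scoring = "neg_mean_squared_error")
    print(name, "CV RMSE:", np.sqrt(-cv["test_score"].mean()))

# 4. refit each model on the full training set and compare test RMSE
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name, "test RMSE:", np.sqrt(mean_squared_error(y_te, pred)))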

1. Read in the fat data set

Description

Age, weight, height, and 10 body circumference measurements are recorded for 252 men. Each man’s percentage of body fat was accurately estimated by an underwater weighing technique.

  • brozek, Percent body fat using Brozek’s equation, 457/Density - 414.2
  • siri, Percent body fat using Siri’s equation, 495/Density - 450
  • density, Density (gm/\(cm^3\))
  • age, Age (yrs)
  • weight, Weight (lbs)
  • height, Height (inches)
  • adipos, Adiposity index = Weight/Height\(^2\) (kg/\(m^2\))
  • free, Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek’s formula (lbs)
  • neck, Neck circumference (cm)
  • chest, Chest circumference (cm)
  • abdom, Abdomen circumference (cm) at the umbilicus and level with the iliac crest
  • hip, Hip circumference (cm)
  • thigh, Thigh circumference (cm)
  • knee, Knee circumference (cm)
  • ankle, Ankle circumference (cm)
  • biceps, Extended biceps circumference (cm)
  • forearm, Forearm circumference (cm)
  • wrist, Wrist circumference (cm) distal to the styloid processes

Source: Johnson, R. (1996). Journal of Statistics Education, 4(1).

We’ll attempt to model the percent body fat (the Brozek measure, used as the gold standard here) using easily obtainable measurements.

import pandas as pd
import numpy as np
fat_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/fat.csv")
fat_data.columns
Index(['Unnamed: 0', 'brozek', 'siri', 'density', 'age', 'weight', 'height',
       'adipos', 'free', 'neck', 'chest', 'abdom', 'hip', 'thigh', 'knee',
       'ankle', 'biceps', 'forearm', 'wrist'],
      dtype='object')
fat_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 19 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  252 non-null    int64  
 1   brozek      252 non-null    float64
 2   siri        252 non-null    float64
 3   density     252 non-null    float64
 4   age         252 non-null    int64  
 5   weight      252 non-null    float64
 6   height      252 non-null    float64
 7   adipos      252 non-null    float64
 8   free        252 non-null    float64
 9   neck        252 non-null    float64
 10  chest       252 non-null    float64
 11  abdom       252 non-null    float64
 12  hip         252 non-null    float64
 13  thigh       252 non-null    float64
 14  knee        252 non-null    float64
 15  ankle       252 non-null    float64
 16  biceps      252 non-null    float64
 17  forearm     252 non-null    float64
 18  wrist       252 non-null    float64
dtypes: float64(17), int64(2)
memory usage: 37.5 KB
fat_data.head()
Unnamed: 0 brozek siri density age weight height adipos free neck chest abdom hip thigh knee ankle biceps forearm wrist
0 1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5 59.0 37.3 21.9 32.0 27.4 17.1
1 2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7 58.7 37.3 23.4 30.5 28.9 18.2
2 3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2 59.6 38.9 24.0 28.8 25.2 16.6
3 4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2 60.1 37.3 22.8 32.4 29.4 18.2
4 5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9 63.2 42.2 24.0 32.2 27.7 17.7

For simplicity, drop any observations with missing values and keep only the columns we are going to use.

mod_fat_data = fat_data.drop(["Unnamed: 0", "siri", "density"], axis = 1).dropna()
mod_fat_data.head()
brozek age weight height adipos free neck chest abdom hip thigh knee ankle biceps forearm wrist
0 12.6 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5 59.0 37.3 21.9 32.0 27.4 17.1
1 6.9 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7 58.7 37.3 23.4 30.5 28.9 18.2
2 24.6 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2 59.6 38.9 24.0 28.8 25.2 16.6
3 10.9 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2 60.1 37.3 22.8 32.4 29.4 18.2
4 27.8 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9 63.2 42.2 24.0 32.2 27.7 17.7

There aren’t any categorical predictors, so we don’t need to create any dummy variables.
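If the data did include a categorical predictor, pandas’ get_dummies() is one way to create dummy variables. A quick illustration with a made-up toy column (not part of the fat data):

# hypothetical example only: the fat data has no categorical columns
toy = pd.DataFrame({"sex": ["M", "F", "M"], "age": [23, 35, 41]})
pd.get_dummies(toy, columns = ["sex"], drop_first = True)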

We do, however, generally want to standardize our predictors when using the LASSO. If predictors are on vastly different scales (say one takes values from 0 to 1 and another from 0 to 10,000), the penalty treats their coefficients very unevenly, which can distort the fit. To avoid this, we standardize each numerical predictor by subtracting off its mean and dividing by its standard deviation.

For example, if we had \(x\) with values 0, 3, 6, 10, 11, we would create a new column, say \(x_{std}\), by taking each observation, subtracting off the mean (\(\frac{0+3+6+10+11}{5} = 6\)), and dividing by the standard deviation (\(\sqrt{\frac{1}{5-1}\left((0-6)^2+(3-6)^2+(6-6)^2+(10-6)^2+(11-6)^2\right)} \approx 4.64\)). This yields the new values -1.294, -0.647, 0, 0.863, 1.078.

from math import sqrt
temp = np.array([0, 3, 6, 10, 11])
print(temp.mean(), temp.std(ddof = 1))
(temp-temp.mean())/temp.std(ddof = 1)
6.0 4.636809247747852
array([-1.29399328, -0.64699664,  0.        ,  0.86266219,  1.07832773])

However, we need to split our data before we standardize! We want to standardize only the training set values using the training set means and standard deviations. We’ll then use those same values to standardize the test set before we test our models!
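We’ll carry out this standardization manually below so every step is visible, but for reference, scikit-learn’s StandardScaler implements the same fit-on-training, transform-both pattern (it uses the population standard deviation, i.e. np.std’s default, which matches the manual calculation we do below). A small sketch with toy values:

from sklearn.preprocessing import StandardScaler
toy_train = np.array([[0.], [3.], [6.], [10.], [11.]])   # toy 'training' column
toy_test = np.array([[2.], [8.]])                        # toy 'test' values
scaler = StandardScaler().fit(toy_train)    # stores the training mean and sd
print(scaler.transform(toy_train).ravel())  # training values standardized
print(scaler.transform(toy_test).ravel())   # test values standardized with the *training* mean and sd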

2. Training and Test Split

First, let’s import all of the functions we’ll need from sklearn.

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoCV, Lasso
X_train, X_test, y_train, y_test = train_test_split(
  mod_fat_data.drop("brozek", axis = 1),
  mod_fat_data["brozek"],
  test_size=0.20,
  random_state=41)

3. Fit and Select Models on Training Data

First, we standardize our predictors. We’ll save the training means and standard deviations so we can later standardize the test set with those same values.

means = X_train.apply(np.mean, axis = 0)
means
age 45.079602
weight 178.316667
height 70.036070
adipos 25.403483
free 143.059204
neck 37.876617
chest 100.808458
abdom 92.559701
hip 99.855224
thigh 59.358706
knee 38.607960
ankle 23.018408
biceps 32.155721
forearm 28.671642
wrist 18.196517

stds = X_train.apply(np.std, axis = 0)
stds
age 12.807167
weight 27.765423
height 3.868759
adipos 3.421550
free 17.150096
neck 2.274781
chest 8.342636
abdom 10.502178
hip 6.760037
thigh 5.063393
knee 2.403715
ankle 1.560810
biceps 2.856846
forearm 2.046680
wrist 0.938050

# standardize each training column (np.std defaults to the population sd, ddof = 0)
X_train = X_train.apply(lambda x: (x - np.mean(x))/np.std(x), axis = 0)
X_train
age weight height adipos free neck chest abdom hip thigh knee ankle biceps forearm wrist
120 0.540354 1.015051 1.153840 0.232794 0.445525 1.285127 0.418518 0.632278 0.627922 -0.169591 0.287904 1.013315 0.610561 1.235346 1.389566
133 0.384191 -0.767741 -0.849386 -0.176377 -1.507817 -0.033681 -0.048960 -0.300862 -1.117631 -0.643582 -1.251380 -1.613526 0.505550 0.307013 -0.955724
207 0.149947 0.600867 0.636879 0.203568 -0.551554 1.021366 0.226732 0.832237 0.272894 0.264900 0.329506 0.180414 1.590663 1.430785 0.216921
49 0.149947 -1.830214 -0.849386 -1.520797 -1.274582 -1.704172 -2.086685 -2.110010 -1.872064 -1.729810 -1.750607 -0.716556 -1.874697 -1.403073 -1.488745
25 -1.411679 -0.686705 0.378398 -1.023946 0.515495 -0.956847 -1.343515 -1.224480 -0.496332 -0.860827 -0.793755 -0.332140 -0.789584 -0.230442 -0.529308
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
80 1.711573 -0.524633 -0.590905 -0.088697 -1.676912 0.230081 -0.372599 0.308536 -0.407575 -0.900326 -0.169721 0.436691 -0.964603 -0.719039 0.856545
226 0.774597 -0.317541 -0.461665 0.057435 0.066518 -0.297443 0.106866 -0.138990 -0.407575 -0.544834 -0.044914 -0.268071 0.435543 0.307013 0.643337
140 -0.396622 -0.056425 0.249158 -0.234830 -0.539892 -0.209522 -0.264719 0.051446 0.258102 -0.051094 0.412711 -0.908764 -0.474552 -0.377021 -1.701953
163 -0.865110 -1.380014 0.119917 -1.637703 -1.414523 -0.824966 -1.391462 -0.872172 -1.517037 -1.374317 -1.251380 -1.677595 -1.349642 -1.207635 -1.808557
192 -0.240459 0.546843 0.119917 0.583513 1.139399 1.241167 0.490438 0.146665 0.420823 0.245151 -0.003312 1.077384 0.645564 0.697890 0.536733

201 rows × 15 columns

#now each column has mean 0 and sd 1
X_train.apply(np.mean, axis = 0)
X_train.apply(np.std, axis = 0)
age 1.0
weight 1.0
height 1.0
adipos 1.0
free 1.0
neck 1.0
chest 1.0
abdom 1.0
hip 1.0
thigh 1.0
knee 1.0
ankle 1.0
biceps 1.0
forearm 1.0
wrist 1.0

MLR Models

Below I’ll consider a few basic MLR models to compare. Even though MLR has no tuning parameter to select, we can still use CV (on the training set alone) to pick the best of these MLR models. Similarly, we’ll find a ‘best’ LASSO model with its tuning parameter selected by CV. Then we’ll compare the two ‘best’ models on the test set to see the overall winner!

# Full model
cv_full_model = cross_validate(
    LinearRegression(),
    X_train,
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
# Candidate model 1: three predictors
cv_mlr1 = cross_validate(
    LinearRegression(),
    X_train[["weight", "wrist", "abdom"]],
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
# Candidate model 2: eight predictors
cv_mlr2 = cross_validate(
    LinearRegression(),
    X_train[["weight", "height", "neck", "chest", "abdom", "biceps", "forearm", "wrist"]],
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
print(np.sqrt(-sum(cv_full_model['test_score'])),
      np.sqrt(-sum(cv_mlr1['test_score'])),
      np.sqrt(-sum(cv_mlr2['test_score'])))
6.101775440661614 9.044740234075288 9.680673752851906

Looks like the full model is the best here!
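One note on the numbers above: they are square roots of the summed fold MSEs, which is fine for ranking models evaluated on the same folds but is not on the RMSE scale. To report something interpretable as a CV RMSE, you could average the fold MSEs instead, e.g.:

# CV RMSE for the full model: average the fold MSEs, then take the square root
np.sqrt(-cv_full_model['test_score'].mean())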

mlr_best = LinearRegression().fit(X_train, y_train)
print(mlr_best.intercept_)
print(np.array(list(zip(X_train.columns, mlr_best.coef_))))
19.045273631840807
[['age' '0.1338989120135575']
 ['weight' '9.34842058078133']
 ['height' '0.19894030597107948']
 ['adipos' '-0.9776707934156488']
 ['free' '-8.827474152930117']
 ['neck' '0.11485215326542098']
 ['chest' '0.5751829606181673']
 ['abdom' '1.2206574175508305']
 ['hip' '0.18340049047065232']
 ['thigh' '0.8379939074493237']
 ['knee' '0.16205656779585353']
 ['ankle' '0.2436973560143484']
 ['biceps' '0.17243823308972106']
 ['forearm' '0.30301942051892716']
 ['wrist' '-0.013949037814783072']]

LASSO model

# LassoCV chooses the penalty (alpha) by 5-fold CV on the training set
lasso_mod = LassoCV(cv = 5, random_state = 0).fit(X_train, y_train)

Just to look at the candidate tuning parameter values (first column) and their average CV MSEs (second column), showing the ten best:

np.set_printoptions(suppress = True)
fit_info = np.array(list(zip(lasso_mod.alphas_, np.mean(lasso_mod.mse_path_, axis = 1))))
fit_info[fit_info[:,1].argsort()][0:10,].round(4)
array([[0.0683, 2.1017],
       [0.0732, 2.106 ],
       [0.0785, 2.1139],
       [0.0842, 2.1278],
       [0.0637, 2.1427],
       [0.0903, 2.1461],
       [0.0968, 2.17  ],
       [0.1038, 2.2005],
       [0.1113, 2.229 ],
       [0.0594, 2.233 ]])
print(lasso_mod.alpha_)
print(lasso_mod.intercept_)
print(np.array(list(zip(X_train.columns, lasso_mod.coef_))))
0.06827844720988437
19.045273631840807
[['age' '0.03520622302650939']
 ['weight' '8.676685171578885']
 ['height' '0.18524483241596834']
 ['adipos' '0.0']
 ['free' '-8.098358879411064']
 ['neck' '0.0']
 ['chest' '0.1012312495726051']
 ['abdom' '1.5227501560784638']
 ['hip' '0.0']
 ['thigh' '0.536843692590691']
 ['knee' '0.2410290965097099']
 ['ankle' '0.0997282321269406']
 ['biceps' '0.12439075979914176']
 ['forearm' '0.20197553831501108']
 ['wrist' '0.0']]
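Notice that the LASSO has zeroed out several coefficients (adipos, neck, hip, and wrist), so it is doing some variable selection for us in addition to shrinking the remaining coefficients.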

Fit that best LASSO model (using the selected alpha) for comparison on the test set.

lasso_best = Lasso(alpha = lasso_mod.alpha_).fit(X_train, y_train)

4. Compare on the Test Set

We need to standardize the test set using our training values. Then we can find the test set predictions and RMSE!

We’ll see how to have Python do this for us later. For now, let’s do it manually!

# quick function to standardize based on a supplied mean and std
def my_std_fun(x, means, stds):
    return (x - means)/stds
# loop through the columns and standardize each one with the training values
for x in X_test.columns:
    X_test[x] = my_std_fun(X_test[x], means[x], stds[x])
X_test.head()
age weight height adipos free neck chest abdom hip thigh knee ankle biceps forearm wrist
107 0.540354 0.897999 1.089220 0.174341 1.355141 1.812651 1.101755 0.860802 0.124966 -0.702830 0.038291 0.244483 1.030604 0.453592 0.963150
143 -1.724004 -0.668697 0.572259 -1.111626 0.049026 -1.044767 -1.043850 -1.472047 -0.880946 -0.643582 -1.043368 -0.204002 -0.579563 -0.719039 0.003713
167 -0.787028 1.672344 0.572259 1.431082 2.136478 2.647896 0.885996 0.746540 1.027328 0.778390 1.286359 1.013315 1.765681 2.163679 1.709378
29 -1.255516 -0.632681 -0.267804 -0.468642 0.153981 -0.517244 -0.408559 -0.862650 -0.170890 -0.090593 -1.376186 -0.268071 -0.719577 -0.963337 -0.635912
30 -1.021272 0.132659 0.959980 -0.527095 0.970303 0.361962 -0.036974 -0.367514 -0.008169 -0.367087 0.038291 6.971758 0.120510 -0.474741 0.216921

Now find our predictions and test set RMSE.

mlr_pred = mlr_best.predict(X_test)
lasso_pred = lasso_best.predict(X_test)

# test set RMSE: MLR first, then LASSO
print(np.sqrt(mean_squared_error(y_test, mlr_pred)), np.sqrt(mean_squared_error(y_test, lasso_pred)))
1.7547240606525405 1.9916053642246054
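The full MLR model wins here: its test RMSE of about 1.75 (percentage points of body fat) beats the LASSO’s roughly 1.99, so we’d select it as our overall best model.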