Published

2025-03-31

Putting it all together

In this section we’ll look at an example of fitting a few different models (multiple linear regression (MLR) models and LASSO models) and choosing the overall best one via test error.

Process:

1. Read the data in (we would then want to explore the data, clean it, etc., but we’ll skip that part here)
2. Split the data into a training and test set
3. For each model type, select a best model using the training set (we’ll use cross-validation, but you could instead split the training set into a training and validation set)
4. Compare the best models on the test set and select the one with the lowest test error (with some consideration for simplicity)
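Before working through the real data, here is a minimal runnable sketch of that four-step process on a small synthetic data set (the data, candidate models, and variable names here are just placeholders, not the fat analysis below):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error

# 1. 'read in' some synthetic data (a stand-in for a real data set)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size = (100, 3)), columns = ["x1", "x2", "x3"])
toy_y = 2*toy_X["x1"] - toy_X["x2"] + rng.normal(scale = 0.5, size = 100)

# 2. split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size = 0.20, random_state = 0)

# 3. get a CV error on the training set for each candidate model
candidates = {"mlr": LinearRegression(), "lasso": Lasso(alpha = 0.1)}
for name, model in candidates.items():
    cv = cross_validate(model, X_tr, y_tr, cv = 5, scoring = "neg_mean_squared_error")
    print(name, "CV RMSE:", np.sqrt(-cv["test_score"].mean()))

# 4. refit each model on the full training set and compare test RMSE
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name, "test RMSE:", np.sqrt(mean_squared_error(y_te, pred)))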

1. Read in the fat data set

Description

Age, weight, height, and 10 body circumference measurements are recorded for 252 men. Each man’s percentage of body fat was accurately estimated by an underwater weighing technique.

  • brozek, Percent body fat using Brozek’s equation, 457/Density - 414.2
  • siri, Percent body fat using Siri’s equation, 495/Density - 450
  • density, Density (gm/\(cm^3\))
  • age, Age (yrs)
  • weight, Weight (lbs)
  • height, Height (inches)
  • adipos, Adiposity index = Weight/Height\(^2\) (kg/\(m^2\))
  • free, Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek’s formula (lbs)
  • neck, Neck circumference (cm)
  • chest, Chest circumference (cm)
  • abdom, Abdomen circumference (cm) at the umbilicus and level with the iliac crest
  • hip, Hip circumference (cm)
  • thigh, Thigh circumference (cm)
  • knee, Knee circumference (cm)
  • ankle, Ankle circumference (cm)
  • biceps, Extended biceps circumference (cm)
  • forearm, Forearm circumference (cm)
  • wrist, Wrist circumference (cm) distal to the styloid processes

Source: Johnson, R. (1996). Journal of Statistics Education, 4(1).

We’ll attempt to model the percent body fat (the Brozek measure, used as the gold standard here) using easily obtainable measurements.

import pandas as pd
import numpy as np
fat_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/fat.csv")
fat_data.columns
Index(['Unnamed: 0', 'brozek', 'siri', 'density', 'age', 'weight', 'height',
       'adipos', 'free', 'neck', 'chest', 'abdom', 'hip', 'thigh', 'knee',
       'ankle', 'biceps', 'forearm', 'wrist'],
      dtype='object')
fat_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 19 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  252 non-null    int64  
 1   brozek      252 non-null    float64
 2   siri        252 non-null    float64
 3   density     252 non-null    float64
 4   age         252 non-null    int64  
 5   weight      252 non-null    float64
 6   height      252 non-null    float64
 7   adipos      252 non-null    float64
 8   free        252 non-null    float64
 9   neck        252 non-null    float64
 10  chest       252 non-null    float64
 11  abdom       252 non-null    float64
 12  hip         252 non-null    float64
 13  thigh       252 non-null    float64
 14  knee        252 non-null    float64
 15  ankle       252 non-null    float64
 16  biceps      252 non-null    float64
 17  forearm     252 non-null    float64
 18  wrist       252 non-null    float64
dtypes: float64(17), int64(2)
memory usage: 37.5 KB
fat_data.head()
Unnamed: 0 brozek siri density age weight height adipos free neck chest abdom hip thigh knee ankle biceps forearm wrist
0 1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5 59.0 37.3 21.9 32.0 27.4 17.1
1 2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7 58.7 37.3 23.4 30.5 28.9 18.2
2 3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2 59.6 38.9 24.0 28.8 25.2 16.6
3 4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2 60.1 37.3 22.8 32.4 29.4 18.2
4 5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9 63.2 42.2 24.0 32.2 27.7 17.7

For simplicity, drop any observations with missing values and keep only the columns we are going to use.

mod_fat_data = fat_data.drop(["Unnamed: 0", "siri", "density"], axis = 1).dropna()
mod_fat_data.head()
brozek age weight height adipos free neck chest abdom hip thigh knee ankle biceps forearm wrist
0 12.6 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5 59.0 37.3 21.9 32.0 27.4 17.1
1 6.9 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7 58.7 37.3 23.4 30.5 28.9 18.2
2 24.6 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2 59.6 38.9 24.0 28.8 25.2 16.6
3 10.9 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2 60.1 37.3 22.8 32.4 29.4 18.2
4 27.8 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9 63.2 42.2 24.0 32.2 27.7 17.7

There aren’t any categorical predictors, so we don’t need to create any dummy variables.
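If the data did include a categorical predictor, pandas’ get_dummies() is one way to create dummy variables. A quick illustration with a made-up toy column (not part of the fat data):

# hypothetical example only: the fat data has no categorical columns
toy = pd.DataFrame({"sex": ["M", "F", "M"], "age": [23, 35, 41]})
pd.get_dummies(toy, columns = ["sex"], drop_first = True)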

We do, however, generally want to standardize our predictors when using the LASSO. If predictors are on vastly different scales (say one takes values from 0 to 1 and another from 0 to 10,000), the penalty treats their coefficients very unevenly, which can distort the fit. To avoid this, we standardize each numerical predictor by subtracting off its mean and dividing by its standard deviation.

For example, if we had \(x\) with values 0, 3, 6, 10, 11, we would create a new column, say \(x_{std}\), by taking each observation, subtracting off the mean (\(\frac{0+3+6+10+11}{5} = 6\)), and dividing by the standard deviation (\(\sqrt{\frac{1}{5-1}\left((0-6)^2+(3-6)^2+(6-6)^2+(10-6)^2+(11-6)^2\right)} \approx 4.64\)). This yields the new values -1.294, -0.647, 0, 0.863, 1.078.

from math import sqrt
temp = np.array([0, 3, 6, 10, 11])
print(temp.mean(), temp.std(ddof = 1))
(temp-temp.mean())/temp.std(ddof = 1)
6.0 4.636809247747852
array([-1.29399328, -0.64699664,  0.        ,  0.86266219,  1.07832773])

However, we need to split our data before we standardize! We want to standardize only the training set values using the training set means and standard deviations. We’ll then use those same values to standardize the test set before we test our models!
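We’ll carry out this standardization manually below so every step is visible, but for reference, scikit-learn’s StandardScaler implements the same fit-on-training, transform-both pattern (it uses the population standard deviation, i.e. np.std’s default, which matches the manual calculation we do below). A small sketch with toy values:

from sklearn.preprocessing import StandardScaler
toy_train = np.array([[0.], [3.], [6.], [10.], [11.]])   # toy 'training' column
toy_test = np.array([[2.], [8.]])                        # toy 'test' values
scaler = StandardScaler().fit(toy_train)    # stores the training mean and sd
print(scaler.transform(toy_train).ravel())  # training values standardized
print(scaler.transform(toy_test).ravel())   # test values standardized with the *training* mean and sd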

2. Training and Test Split

First, let’s import all of the functions we’ll need from sklearn.

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoCV, Lasso
X_train, X_test, y_train, y_test = train_test_split(
  mod_fat_data.drop("brozek", axis = 1),
  mod_fat_data["brozek"],
  test_size=0.20,
  random_state=41)

3. Fit and Select Models on Training Data

First, we standardize our predictors. We’ll save the training means and standard deviations so we can later standardize the test set with those same values.

means = X_train.apply(np.mean, axis = 0)
means
age 45.079602
weight 178.316667
height 70.036070
adipos 25.403483
free 143.059204
neck 37.876617
chest 100.808458
abdom 92.559701
hip 99.855224
thigh 59.358706
knee 38.607960
ankle 23.018408
biceps 32.155721
forearm 28.671642
wrist 18.196517

stds = X_train.apply(np.std, axis = 0)
stds
age 12.807167
weight 27.765423
height 3.868759
adipos 3.421550
free 17.150096
neck 2.274781
chest 8.342636
abdom 10.502178
hip 6.760037
thigh 5.063393
knee 2.403715
ankle 1.560810
biceps 2.856846
forearm 2.046680
wrist 0.938050

# standardize each training column (np.std defaults to the population sd, ddof = 0)
X_train = X_train.apply(lambda x: (x - np.mean(x))/np.std(x), axis = 0)
X_train
age weight height adipos free neck chest abdom hip thigh knee ankle biceps forearm wrist
120 0.540354 1.015051 1.153840 0.232794 0.445525 1.285127 0.418518 0.632278 0.627922 -0.169591 0.287904 1.013315 0.610561 1.235346 1.389566
133 0.384191 -0.767741 -0.849386 -0.176377 -1.507817 -0.033681 -0.048960 -0.300862 -1.117631 -0.643582 -1.251380 -1.613526 0.505550 0.307013 -0.955724
207 0.149947 0.600867 0.636879 0.203568 -0.551554 1.021366 0.226732 0.832237 0.272894 0.264900 0.329506 0.180414 1.590663 1.430785 0.216921
49 0.149947 -1.830214 -0.849386 -1.520797 -1.274582 -1.704172 -2.086685 -2.110010 -1.872064 -1.729810 -1.750607 -0.716556 -1.874697 -1.403073 -1.488745
25 -1.411679 -0.686705 0.378398 -1.023946 0.515495 -0.956847 -1.343515 -1.224480 -0.496332 -0.860827 -0.793755 -0.332140 -0.789584 -0.230442 -0.529308
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
80 1.711573 -0.524633 -0.590905 -0.088697 -1.676912 0.230081 -0.372599 0.308536 -0.407575 -0.900326 -0.169721 0.436691 -0.964603 -0.719039 0.856545
226 0.774597 -0.317541 -0.461665 0.057435 0.066518 -0.297443 0.106866 -0.138990 -0.407575 -0.544834 -0.044914 -0.268071 0.435543 0.307013 0.643337
140 -0.396622 -0.056425 0.249158 -0.234830 -0.539892 -0.209522 -0.264719 0.051446 0.258102 -0.051094 0.412711 -0.908764 -0.474552 -0.377021 -1.701953
163 -0.865110 -1.380014 0.119917 -1.637703 -1.414523 -0.824966 -1.391462 -0.872172 -1.517037 -1.374317 -1.251380 -1.677595 -1.349642 -1.207635 -1.808557
192 -0.240459 0.546843 0.119917 0.583513 1.139399 1.241167 0.490438 0.146665 0.420823 0.245151 -0.003312 1.077384 0.645564 0.697890 0.536733

201 rows × 15 columns

#now each column has mean 0 and sd 1
X_train.apply(np.mean, axis = 0)
X_train.apply(np.std, axis = 0)
age 1.0
weight 1.0
height 1.0
adipos 1.0
free 1.0
neck 1.0
chest 1.0
abdom 1.0
hip 1.0
thigh 1.0
knee 1.0
ankle 1.0
biceps 1.0
forearm 1.0
wrist 1.0

MLR Models

Below I’ll consider a few basic MLR models to compare. Even though MLR has no tuning parameter to select, we can still use CV (on the training set alone) to pick the best of these MLR models. Similarly, we’ll find a ‘best’ LASSO model with its tuning parameter selected by CV. Then we’ll compare the two ‘best’ models on the test set to see the overall winner!

# Full model
cv_full_model = cross_validate(
    LinearRegression(),
    X_train,
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
# Candidate model 1: three predictors
cv_mlr1 = cross_validate(
    LinearRegression(),
    X_train[["weight", "wrist", "abdom"]],
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
# Candidate model 2: eight predictors
cv_mlr2 = cross_validate(
    LinearRegression(),
    X_train[["weight", "height", "neck", "chest", "abdom", "biceps", "forearm", "wrist"]],
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")
print(np.sqrt(-sum(cv_full_model['test_score'])),
      np.sqrt(-sum(cv_mlr1['test_score'])),
      np.sqrt(-sum(cv_mlr2['test_score'])))
6.101775440661614 9.044740234075288 9.680673752851906

Looks like the full model is the best here!
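One note on the numbers above: they are square roots of the summed fold MSEs, which is fine for ranking models evaluated on the same folds but is not on the RMSE scale. To report something interpretable as a CV RMSE, you could average the fold MSEs instead, e.g.:

# CV RMSE for the full model: average the fold MSEs, then take the square root
np.sqrt(-cv_full_model['test_score'].mean())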

mlr_best = LinearRegression().fit(X_train, y_train)
print(mlr_best.intercept_)
print(np.array(list(zip(X_train.columns, mlr_best.coef_))))
19.045273631840807
[['age' '0.1338989120135575']
 ['weight' '9.34842058078133']
 ['height' '0.19894030597107948']
 ['adipos' '-0.9776707934156488']
 ['free' '-8.827474152930117']
 ['neck' '0.11485215326542098']
 ['chest' '0.5751829606181673']
 ['abdom' '1.2206574175508305']
 ['hip' '0.18340049047065232']
 ['thigh' '0.8379939074493237']
 ['knee' '0.16205656779585353']
 ['ankle' '0.2436973560143484']
 ['biceps' '0.17243823308972106']
 ['forearm' '0.30301942051892716']
 ['wrist' '-0.013949037814783072']]

LASSO model

# LassoCV chooses the penalty (alpha) by 5-fold CV on the training set
lasso_mod = LassoCV(cv = 5, random_state = 0).fit(X_train, y_train)

Just to look at the candidate tuning parameter values (first column) and their average CV MSEs (second column), showing the ten best:

np.set_printoptions(suppress = True)
fit_info = np.array(list(zip(lasso_mod.alphas_, np.mean(lasso_mod.mse_path_, axis = 1))))
fit_info[fit_info[:,1].argsort()][0:10,].round(4)
array([[0.0683, 2.1017],
       [0.0732, 2.106 ],
       [0.0785, 2.1139],
       [0.0842, 2.1278],
       [0.0637, 2.1427],
       [0.0903, 2.1461],
       [0.0968, 2.17  ],
       [0.1038, 2.2005],
       [0.1113, 2.229 ],
       [0.0594, 2.233 ]])
print(lasso_mod.alpha_)
print(lasso_mod.intercept_)
print(np.array(list(zip(X_train.columns, lasso_mod.coef_))))
0.06827844720988437
19.045273631840807
[['age' '0.03520622302650939']
 ['weight' '8.676685171578885']
 ['height' '0.18524483241596834']
 ['adipos' '0.0']
 ['free' '-8.098358879411064']
 ['neck' '0.0']
 ['chest' '0.1012312495726051']
 ['abdom' '1.5227501560784638']
 ['hip' '0.0']
 ['thigh' '0.536843692590691']
 ['knee' '0.2410290965097099']
 ['ankle' '0.0997282321269406']
 ['biceps' '0.12439075979914176']
 ['forearm' '0.20197553831501108']
 ['wrist' '0.0']]
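Notice that the LASSO has zeroed out several coefficients (adipos, neck, hip, and wrist), so it is doing some variable selection for us in addition to shrinking the remaining coefficients.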

Fit that best LASSO model (using the selected alpha) for comparison on the test set.

lasso_best = Lasso(alpha = lasso_mod.alpha_).fit(X_train, y_train)

4. Compare on the Test Set

We need to standardize the test set using our training values. Then we can find the test set predictions and RMSE!

We’ll see how to have Python do this for us later. For now, let’s do it manually!

# quick function to standardize based on a supplied mean and std
def my_std_fun(x, means, stds):
    return (x - means)/stds
# loop through the columns and standardize each one with the training values
for x in X_test.columns:
    X_test[x] = my_std_fun(X_test[x], means[x], stds[x])
X_test.head()
age weight height adipos free neck chest abdom hip thigh knee ankle biceps forearm wrist
107 0.540354 0.897999 1.089220 0.174341 1.355141 1.812651 1.101755 0.860802 0.124966 -0.702830 0.038291 0.244483 1.030604 0.453592 0.963150
143 -1.724004 -0.668697 0.572259 -1.111626 0.049026 -1.044767 -1.043850 -1.472047 -0.880946 -0.643582 -1.043368 -0.204002 -0.579563 -0.719039 0.003713
167 -0.787028 1.672344 0.572259 1.431082 2.136478 2.647896 0.885996 0.746540 1.027328 0.778390 1.286359 1.013315 1.765681 2.163679 1.709378
29 -1.255516 -0.632681 -0.267804 -0.468642 0.153981 -0.517244 -0.408559 -0.862650 -0.170890 -0.090593 -1.376186 -0.268071 -0.719577 -0.963337 -0.635912
30 -1.021272 0.132659 0.959980 -0.527095 0.970303 0.361962 -0.036974 -0.367514 -0.008169 -0.367087 0.038291 6.971758 0.120510 -0.474741 0.216921

Now find our predictions and test set RMSE.

mlr_pred = mlr_best.predict(X_test)
lasso_pred = lasso_best.predict(X_test)

# test set RMSE: MLR first, then LASSO
print(np.sqrt(mean_squared_error(y_test, mlr_pred)), np.sqrt(mean_squared_error(y_test, lasso_pred)))
1.7547240606525405 1.9916053642246054
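The full MLR model wins here: its test RMSE of about 1.75 (percentage points of body fat) beats the LASSO’s roughly 1.99, so we’d select it as our overall best model.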