Prediction and Training/Test Set Ideas

layout: false
class: title-slide-section-red, middle

# Prediction and Training/Test Set Ideas
Justin Post

---
layout: true

---

# Predictive Modeling Idea

- Choose form of model

- Fit model to data using some algorithm 
    + Usually can be written as a problem where we minimize some loss function 
    
- Evaluate the model using a metric
    + RMSE very common for a numeric response

---

# Predictive Modeling Idea

- Choose form of model

- Fit model to data using some algorithm 
    + Usually can be written as a problem where we minimize some loss function

- Evaluate the model using a metric
    + RMSE very common for a numeric response

- Ideally we want our model to predict well for observations **it has yet to see**!

---

# Training vs Test Sets

- Evaluation of predictions over the observations used to *fit or train the model* is called the **training (set) error**
- Using RMSE as our metric:

`$$\mbox{Training RMSE} = \sqrt{\frac{1}{\mbox{# of obs used to fit model}}\sum_{\mbox{obs used to fit model}}(y-\hat{y})^2}$$`

---

# Training vs Test Sets

- Evaluation of predictions over the observations used to *fit or train the model* is called the **training (set) error**
- Using RMSE as our metric:

`$$\mbox{Training RMSE} = \sqrt{\frac{1}{\mbox{# of obs used to fit model}}\sum_{\mbox{obs used to fit model}}(y-\hat{y})^2}$$`

- If we only consider this, we'll have no idea how the model will fare on data it hasn't seen!

---

# Training vs Test Sets

One method is to split the data into a **training set** and **test set**
- On the training set we can fit (or train) our models
- We can then predict for the test set observations and judge effectiveness with our metric

---

# Example of Fitting and Evaluating Models

Consider our data set on motorcycle sale prices

```python
import pandas as pd
import numpy as np
bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])
print(bike_data.columns)
```

```
## Index(['name', 'selling_price', 'year', 'seller_type', 'owner', 'km_driven',
##        'ex_showroom_price', 'log_selling_price', 'log_km_driven'],
##       dtype='object')
```

---

# Example of Fitting and Evaluating Models

- Response variable of `log_selling_price = ln(selling_price)`

- Consider three linear regression models:
`$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$`
`$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$`
`$$\mbox{Model 3: log_selling_price = intercept + slope*log_km_driven + slope*year + Error}$$`
---

layout: false

# Fitting the Models with `sklearn`

```python
from sklearn import linear_model
reg1 = linear_model.LinearRegression() #Create a reg object
reg2 = linear_model.LinearRegression() #Create a reg object
reg3 = linear_model.LinearRegression() #Create a reg object
reg1.fit(bike_data['year'].values.reshape(-1,1), bike_data['log_selling_price']) 
reg2.fit(bike_data['log_km_driven'].values.reshape(-1,1), bike_data['log_selling_price']) 
reg3.fit(bike_data[['year', 'log_km_driven']], bike_data['log_selling_price']) 
```

```python
print(reg1.intercept_, reg1.coef_)
```

```
## -201.06317651252067 [0.10516552]
```

```python
print(reg2.intercept_, reg2.coef_)
```

```
## 14.6355682846293 [-0.39108654]
```

```python
print(reg3.intercept_, reg3.coef_)
```

```
## -148.79329107788155 [ 0.0803366  -0.22686129]
```

---

# Example of Fitting and Evaluating Models

- Now we have the fitted models.  Want to use them to predict the response
`$$\mbox{Model 1: } \widehat{\mbox{log_selling_price}} = -201.06 + 0.105*\mbox{year}$$`
`$$\mbox{Model 2: } \widehat{\mbox{log_selling_price}} = 14.64 -0.391*\mbox{log_km_driven}$$`

`$$\mbox{Model 3: } \widehat{\mbox{log_selling_price}} = -148.79 + 0.080*\mbox{year}-0.227*\mbox{log_km_driven}$$`

---

# Example of Fitting and Evaluating Models

`$$\mbox{Model 3: } \widehat{\mbox{log_selling_price}} = -148.79 + 0.080*\mbox{year}-0.227*\mbox{log_km_driven}$$`

- Use the `.predict()` method

```python
pred1 = reg1.predict(bike_data['year'].values.reshape(-1,1))
pred2 = reg2.predict(bike_data['log_km_driven'].values.reshape(-1,1))
pred3 = reg3.predict(bike_data[['year', 'log_km_driven']])
pd.DataFrame(zip(pred1, pred2, pred3, bike_data['log_selling_price']), 
             columns = ["Model1", "Model2", "Model3", "Actual"])
```

```
##          Model1     Model2     Model3     Actual
## 0     11.266005  12.344609  12.077366  12.072541
## 1     11.055674  11.256811  11.285683  10.714418
## 2     11.160839  10.962225  11.195136  11.918391
## 3     10.845343  10.707789  10.806533  11.082143
## 4     10.424681  10.743366  10.505825   9.903488
## ...         ...        ...        ...        ...
## 1056  10.319515   9.503589   9.706319   9.740969
## 1057  10.529846  10.566601  10.483624   9.680344
## 1058  10.635012  10.543589  10.550612   9.615805
## 1059  10.214350  10.381310  10.135130   9.392662
## 1060  10.109184  10.164638   9.929107   9.210340
## 
## [1061 rows x 4 columns]
```

---

# Example of Fitting and Evaluating Models

- Find **training** RMSE

```python
from sklearn.metrics import mean_squared_error
RMSE1 = np.sqrt(mean_squared_error(y_true = bike_data['log_selling_price'], y_pred = pred1))
RMSE2 = np.sqrt(mean_squared_error(bike_data['log_selling_price'], pred2))
RMSE3 = np.sqrt(mean_squared_error(bike_data['log_selling_price'], pred3))
print(round(RMSE1, 3), round(RMSE2, 3), round(RMSE3, 3))
```

```
## 0.548 0.595 0.511
```

- Estimate of RMSE is too **optimistic** compared to how the model would perform with new data!  Redo with train/test split!

---

# Train/Test Split

- `sklearn` has a function to make splitting data easy
- Commonly use 80/20 or 70/30 split

---

# Train/Test Split

- `sklearn` has a function to make splitting data easy
- Commonly use 80/20 or 70/30 split

```python
from sklearn.model_selection import train_test_split
#Function will return a list with four things:
#Test/train for predictors (X)
#Test/train for response (y)
X_train, X_test, y_train, y_test = train_test_split(
  bike_data[["year", "log_km_driven"]],
  bike_data["log_selling_price"], 
  test_size=0.20, 
  random_state=422)
```

---

# Fit or Train Model

- We then fit the model on the training set

```python
reg1 = linear_model.LinearRegression() #Create a reg object
reg2 = linear_model.LinearRegression() #Create a reg object
reg3 = linear_model.LinearRegression() #Create a reg object
reg1.fit(X_train['year'].values.reshape(-1,1), y_train.values) 
reg2.fit(X_train['log_km_driven'].values.reshape(-1,1), y_train.values) 
reg3.fit(X_train[['year', 'log_km_driven']], y_train.values) 
```

---

# Fit or Train Model

- We then fit the model on the training set

- Can look at training RMSE if we want

```python
train_RMSE1 = np.sqrt(mean_squared_error(y_train.values, 
                                         reg1.predict(X_train['year'].values.reshape(-1,1))))
train_RMSE2 = np.sqrt(mean_squared_error(y_train.values, 
                                         reg2.predict(X_train['log_km_driven'].values.reshape(-1,1))))
train_RMSE3 = np.sqrt(mean_squared_error(y_train.values, 
                                         reg3.predict(X_train[['year', 'log_km_driven']])))
print(round(train_RMSE1, 3), round(train_RMSE2, 3), round(train_RMSE3, 3))
```

```
## 0.557 0.593 0.516
```

---

# Test Error

- Now we look at predictions on the test set
    + Test data **not** used when training model

```python
test_RMSE1 = np.sqrt(mean_squared_error(y_test.values, 
                                         reg1.predict(X_test['year'].values.reshape(-1,1))))
test_RMSE2 = np.sqrt(mean_squared_error(y_test.values, 
                                         reg2.predict(X_test['log_km_driven'].values.reshape(-1,1))))
test_RMSE3 = np.sqrt(mean_squared_error(y_test.values, 
                                         reg3.predict(X_test[['year', 'log_km_driven']])))
print(round(test_RMSE1, 3), round(test_RMSE2, 3), round(test_RMSE3, 3))
```

```
## 0.513 0.603 0.491
```

- When choosing a model, if the RMSE values were 'close', we'd want to consider the interpretability of the model (and perhaps the assumptions if we wanted to do inference too!)

---

# Recap

- Choose form of model 
- Fit model to data using some algorithm 
    + Usually can be written as a problem where we minimize some loss function 
- Evaluate the model using a metric
    + RMSE very common for a numeric response

- Ideally we want our model to predict well for observations **it has yet to see**!