layout: false
class: title-slide-section-red, middle

# Cross-Validation
Justin Post

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Recap

- Judge the model's effectiveness at predicting by using a metric that compares the predictions to the observed values

- Often split data into a training and test set

    + Perhaps 70/30 or 80/20

- Next: Cross-validation as an alternative to just train/test (and why we might do both!)

---

# Issues with Training vs Test Sets

Why might we not want to just do a basic training/test split?

- If we don't have much data, we aren't using it all when fitting the models

- Data is randomly split into training/test

    + May just get a weird split by chance
    + Makes metric evaluation a somewhat variable measurement, depending on the number of data points

---

# Issues with Training vs Test Sets

Why might we not want to just do a basic training/test split?

- If we don't have much data, we aren't using it all when fitting the models

- Data is randomly split into training/test

    + May just get a weird split by chance
    + Makes metric evaluation a somewhat variable measurement, depending on the number of data points

- Instead, we could split the data multiple ways, do the fitting/testing process on each split, and combine the results!

    + Idea of cross-validation!
    + A less variable measurement of your metric that uses all the data
    + Higher computational cost!

---

# Cross-validation

Common method for assessing a predictive model

<img src="data:image/png;base64,#img/cv.png" width="600px" style="display: block; margin: auto;" />

---

# Cross-Validation Idea

`\(k\)` fold Cross-Validation (CV)

- Split data into k folds

- Train model on first k-1 folds, test on the kth fold to find the metric value

- Train model on first k-2 folds and the kth fold, test on the (k-1)st fold to find the metric value

- ...

---

# Cross-Validation Idea

`\(k\)` fold Cross-Validation (CV)

- Split data into k folds

- Train model on first k-1 folds, test on the kth fold to find the metric value

- Train model on first k-2 folds and the kth fold, test on the (k-1)st fold to find the metric value

- ...

Find the CV error

- Combine the test metrics across the test folds

    + For example, average all of the MSE metrics

- **Key = no predictions used in the value of the metric were found on data that were used to train that model!**
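---

# Cross-Validation Idea

- To make the steps concrete, below is a minimal sketch of 5 fold CV done "by hand" (the small synthetic data set and the names used here are made up purely for illustration; the `cross_validate()` function shown shortly automates all of this)

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#toy data just for illustration
rng = np.random.default_rng(10)
X = rng.normal(size = (100, 1))
y = 2 + 3 * X[:, 0] + rng.normal(size = 100)

#split the row indices into 5 folds
kfold = KFold(n_splits = 5, shuffle = True, random_state = 10)
fold_mses = []
for train_idx, test_idx in kfold.split(X):
    #fit on the 4 training folds
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    #find the metric on the held-out fold
    fold_mses.append(mean_squared_error(y[test_idx], fit.predict(X[test_idx])))

#combine across the folds (here, average the MSE values)
print(np.mean(fold_mses))
```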
---

# CV on MLR Models

- Let's consider our three linear regression models

`$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$`
`$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$`
`$$\mbox{Model 3: log_selling_price = intercept + slope*log_km_driven + slope*year + Error}$$`

```python
import pandas as pd
import numpy as np
bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])
```

---

# CV on MLR Models

- Can use the CV error to choose between these models
- In `scikit-learn`, use the `cross_validate()` function from the `model_selection` submodule

    + Uses a `scoring` [input](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) to determine the metric

---

# CV on MLR Models

- Can use the CV error to choose between these models
- In `scikit-learn`, use the `cross_validate()` function from the `model_selection` submodule

    + Uses a `scoring` [input](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) to determine the metric

```python
from sklearn.model_selection import cross_validate
from sklearn import linear_model
reg1 = linear_model.LinearRegression()
cv1 = cross_validate(reg1,
                     bike_data["year"].values.reshape(-1, 1),
                     bike_data["log_selling_price"].values,
                     cv = 5,
                     scoring = 'neg_mean_squared_error',
                     return_train_score = True)
print(cv1.keys())
```

```
## dict_keys(['fit_time', 'score_time', 'test_score', 'train_score'])
```

```python
print(cv1)
```

```
## {'fit_time': array([0.00300002, 0.00100112, 0.00099754, 0.00153255, 0.00099945]),
##  'score_time': array([0.        , 0.00100136, 0.00100136, 0.001019  , 0.00100279]),
##  'test_score': array([-0.33432825, -0.39699181, -0.22164746, -0.20264027, -0.38421905]),
##  'train_score': array([-0.2926948 , -0.2784276 , -0.32125059, -0.32566269, -0.28087055])}
```

---

# CV on MLR Models

- Can use the CV error to choose between these models
- In `scikit-learn`, use the `cross_validate()` function from the `model_selection` submodule

    + Uses a `scoring` [input](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) to determine the metric

```python
print(cv1["test_score"]) #negative MSE for each test fold
```

```
## [-0.33432825 -0.39699181 -0.22164746 -0.20264027 -0.38421905]
```

```python
round(np.sqrt(-sum(cv1["test_score"])/5), 4) #average the fold MSEs, then square root to get the CV RMSE
```

```
## 0.5549
```

---

# CV on MLR Models

- Fit our other models

```python
reg2 = linear_model.LinearRegression()
cv2 = cross_validate(reg2,
                     bike_data["log_km_driven"].values.reshape(-1, 1),
                     bike_data["log_selling_price"].values,
                     cv = 5,
                     scoring = 'neg_mean_squared_error')
reg3 = linear_model.LinearRegression()
cv3 = cross_validate(reg3,
                     bike_data[["year", "log_km_driven"]],
                     bike_data["log_selling_price"].values,
                     cv = 5,
                     scoring = 'neg_mean_squared_error')
```

---

# CV on MLR Models

- Compare the CV RMSE values

```python
print(round(np.sqrt(-sum(cv1["test_score"])/5), 4),
      round(np.sqrt(-sum(cv2["test_score"])/5), 4),
      round(np.sqrt(-sum(cv3["test_score"])/5), 4))
```

```
## 0.5549 0.6021 0.518
```

- Now we would refit the 'best' model on the full data set!
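---

# CV on MLR Models

- A quick sketch of that refit, continuing the code above (the `best_model` name is just illustrative)

```python
#Model 3 had the lowest CV RMSE, so refit it using all of the data
best_model = linear_model.LinearRegression()
best_model.fit(bike_data[["year", "log_km_driven"]],
               bike_data["log_selling_price"].values)
#fitted intercept and slopes from the full-data fit
print(best_model.intercept_, best_model.coef_)
```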
---

# Recap

Cross-validation gives a way to use more of the data while still seeing how the model does on test data

- Commonly 5 fold or 10 fold CV is done

- Once a best model is chosen, that model is refit on the entire data set

- **We'll see how CV can be used to select tuning parameters for certain models**

    + In this case, we often use both CV and a train/test split together! (a rough sketch of that combined workflow follows)
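---

# Recap

- A preview of how a train/test split and CV might be used together (continuing the `bike_data` example; the split proportion and the names below are just illustrative, not a prescribed workflow)

```python
from sklearn.model_selection import train_test_split, cross_validate
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import numpy as np

#set aside a test set up front
X = bike_data[["year", "log_km_driven"]]
y = bike_data["log_selling_price"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

#use CV on the training data only (e.g., to compare candidate models or tuning parameters)
cv_fit = cross_validate(linear_model.LinearRegression(), X_train, y_train,
                        cv = 5, scoring = 'neg_mean_squared_error')
print(np.sqrt(-cv_fit['test_score'].mean()))  #CV RMSE from the training data

#refit the chosen model on all of the training data, then check it on the test set
final = linear_model.LinearRegression().fit(X_train, y_train)
print(np.sqrt(mean_squared_error(y_test, final.predict(X_test))))  #test set RMSE
```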