class: center, middle, inverse, title-slide

.title[
# Bagging Trees & Random Forests
]
.author[
### Justin Post
]

---
layout: false
class: title-slide-section-red, middle

# Bagging Trees & Random Forests

Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div>

---

# Recap

- MLR, Penalized MLR, & Regression Trees
    + Commonly used models for a numeric response
- Logistic Regression, Penalized Logistic Regression, & Classification Trees
    + Commonly used models for a binary response

- MLR & Logistic Regression are more structured (linear)
- Trees are easier to read but more variable (non-linear)

---

# Prediction with Tree Based Methods

If we care mostly about prediction, not interpretation

- Often use **bootstrapping** to get multiple samples to fit on
- Can average across many fitted trees
    + Decreases variance over an individual tree fit

Major ensemble tree methods

1. Bagging (bootstrap aggregation)
2. Random Forests (extends the idea of bagging - includes bagging as a special case)
3. Boosting (*slow* training of trees)

---

# Bagging

Bagging = Bootstrap Aggregation - a general method

Bootstrapping

- resample from the data (non-parametric) or from a fitted model (parametric)
- for non-parametric bootstrapping
    + treat the sample as the population
    + resample with replacement
    + can get the same observation multiple times
- the method of estimation is applied to each resample
- traditionally used to obtain standard errors (measures of variability) or construct confidence intervals

---

# Non-Parametric Bootstrapping

<img src="img/bootstrap-sample.png" width="800px" style="display: block; margin: auto;" />

---

# Bagging

Process for Regression Trees:

1. Create a bootstrap sample (same size as the actual sample)
    + `sample(data, size = n, replace = TRUE)`
2. Train a tree on this sample (no pruning necessary)
    + Call the prediction for a given set of `\(x\)` values `\(\hat{y}^{*1}(x)\)`
3. Repeat B = 1000 times (books often say 100; there is no set rule)
    + Obtain `\(\hat{y}^{*j}(x)\)`, `\(j = 1, ..., B\)`
4. Final prediction is the average of these predictions
    + `\(\hat{y}(x) = \frac{1}{B}\sum_{j=1}^{B} \hat{y}^{*j}(x)\)`

A minimal hand-rolled sketch of these steps is on the next slide.
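---

# Bagging

A minimal sketch of steps 1-4 by hand, using a small simulated data set rather than the bike data (which is introduced later). This is illustrative only; `RandomForestRegressor` with `max_features = None` does the same job for us below.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 200
X = rng.uniform(0, 10, size = (n, 1))         # one toy predictor
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, n)   # noisy response

B = 1000
preds = np.zeros((B, n))
for j in range(B):
    idx = rng.integers(0, n, size = n)                  # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # unpruned tree on the resample
    preds[j] = tree.predict(X)                          # y-hat^{*j}(x)

y_hat = preds.mean(axis = 0)                  # final prediction: average over the B trees
```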
---

# Bagging

For Classification Trees:

1. Create a bootstrap sample (same size as the actual sample)
    + `sample(data, size = n, replace = TRUE)`
2. Train a tree on this sample (no pruning necessary)
    + Call the class prediction for a given set of `\(x\)` values `\(\hat{y}^{*1}(x)\)`
3. Repeat B = 1000 times (books often say 100; there is no set rule)
    + Obtain `\(\hat{y}^{*j}(x)\)`, `\(j = 1, ..., B\)`
4. (One option) Use **majority vote** as the final classification prediction (i.e. use the most common prediction made across all bootstrap trees)

A small voting sketch is on the next slide.
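---

# Bagging

A minimal sketch (not from the original example) of the majority vote step, assuming the class labels are coded as nonnegative integers:

```python
import numpy as np

# predicted class labels: one row per bootstrap tree, one column per observation
class_preds = np.array([[0, 1, 1],   # tree 1
                        [1, 1, 0],   # tree 2
                        [1, 1, 1]])  # tree 3

# majority vote: most common label in each column (ties go to the smaller label)
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, class_preds)
print(votes)   # [1 1 1]
```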
---
layout: false

# Bagging Example

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

bike_data = pd.read_csv("data/bikeDetails.csv")

#create response and new predictor
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])

#add a categorical predictor via a dummy variable
bike_data["one_owner"] = pd.get_dummies(bike_data["owner"]).iloc[:,0]
pd.get_dummies(bike_data["owner"])
```

```
##       1st owner  2nd owner  3rd owner  4th owner
## 0             1          0          0          0
## 1             1          0          0          0
## 2             1          0          0          0
## 3             1          0          0          0
## 4             0          1          0          0
## ...         ...        ...        ...        ...
## 1056          1          0          0          0
## 1057          1          0          0          0
## 1058          0          1          0          0
## 1059          1          0          0          0
## 1060          1          0          0          0
##
## [1061 rows x 4 columns]
```

---

# Bagging Example

- We can use the `RandomForestRegressor` function with `max_features` set to `None`
- No tuning parameters are really needed, though `max_depth` or `min_samples_leaf` can be set as before
- The default is to train 100 trees (bootstrap samples)

```python
from sklearn.ensemble import RandomForestRegressor
bag_tree = RandomForestRegressor(max_features = None, n_estimators = 500)
bag_tree.fit(bike_data[['log_km_driven', 'year']], bike_data['log_selling_price'])
```

- Still predict with `.predict()`

```python
print(bag_tree.predict(np.array([[9.5, 1990], [9.5, 2015], [10.6, 1990], [10.6, 2015]])))
```

```
## [10.72006819 10.75964713  9.78477023 10.43076735]
```

```python
print(np.exp(bag_tree.predict(np.array([[9.5, 1990], [9.5, 2015], [10.6, 1990], [10.6, 2015]]))))
```

```
## [45254.98858725 47082.05121112 17761.17607055 33886.34290019]
```

---

# Bagging Example

- Can look at variable importance measures

```python
bag_tree.feature_importances_
```

```
## array([0.422601, 0.577399])
```

```python
plt.barh(bike_data.columns[[8,2]], bag_tree.feature_importances_); plt.xlabel("Importance"); plt.show()
```

<img src="08-Bagging_And_Random_Forests_files/figure-html/unnamed-chunk-9-1.svg" width="350px" style="display: block; margin: auto;" />

---

# Bagging Example

- Fit the bagged tree model with the categorical predictor included

```python
bag_tree2 = RandomForestRegressor(max_features = None, n_estimators = 500)
bag_tree2.fit(bike_data[['log_km_driven', 'year', 'one_owner']], bike_data['log_selling_price'])
```

- Compare predictions between the two models

```python
to_predict = np.array([[9.5, 1990, 1], [9.5, 1990, 0], [9.5, 2000, 1], [9.5, 2000, 0]])
pred_compare = pd.DataFrame(zip(bag_tree.predict(to_predict[:,0:2]), bag_tree2.predict(to_predict)),
                            columns = ["No Cat", "Cat"])
pd.concat([pred_compare, np.exp(pred_compare)], axis = 1)
```

```
##       No Cat        Cat        No Cat           Cat
## 0  10.720068  10.255931  45254.988587  28450.793742
## 1  10.720068  11.119023  45254.988587  67442.005845
## 2   9.842763   9.690384  18821.644366  16161.442855
## 3   9.842763  10.094008  18821.644366  24197.571505
```

---

# Variable Importance

```python
plt.barh(bike_data.columns[[8,2, 9]], bag_tree2.feature_importances_); plt.xlabel("Importance"); plt.show()
```

<img src="08-Bagging_And_Random_Forests_files/figure-html/unnamed-chunk-13-5.svg" width="400px" style="display: block; margin: auto;" />
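---

# Variable Importance

The impurity-based `feature_importances_` used above are quick to compute but can favor predictors with many possible split points. A minimal sketch (not part of the original example) of an alternative, scikit-learn's `permutation_importance`, applied to the fitted `bag_tree2`:

```python
from sklearn.inspection import permutation_importance

X = bike_data[['log_km_driven', 'year', 'one_owner']]
y = bike_data['log_selling_price']

# drop in importance when each predictor is shuffled, averaged over 10 shuffles
perm = permutation_importance(bag_tree2, X, y, n_repeats = 10, random_state = 0)
for name, imp in zip(X.columns, perm.importances_mean):
    print(name, round(imp, 3))
```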
---

# Compare CV Error of Bagged Tree to Other Models

```python
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

bag_cv = cross_validate(RandomForestRegressor(n_estimators = 500, max_depth = 4, min_samples_leaf = 10),
                        bike_data[['log_km_driven', 'year', 'one_owner']],
                        bike_data['log_selling_price'],
                        cv = 5,
                        scoring = "neg_mean_squared_error")
```

```python
rtree_tune = GridSearchCV(DecisionTreeRegressor(),
                          {'max_depth': range(2,15), 'min_samples_leaf': [3, 10, 50, 100]},
                          cv = 5,
                          scoring = "neg_mean_squared_error") \
    .fit(bike_data[['log_km_driven', 'year', 'one_owner']], bike_data['log_selling_price'])
rtree_cv = cross_validate(rtree_tune.best_estimator_,
                          bike_data[['log_km_driven', 'year', 'one_owner']],
                          bike_data['log_selling_price'],
                          cv = 5,
                          scoring = "neg_mean_squared_error")
```

```python
mlr_cv = cross_validate(LinearRegression(),
                        bike_data[['log_km_driven', 'year', 'one_owner']],
                        bike_data['log_selling_price'],
                        cv = 5,
                        scoring = "neg_mean_squared_error")
```

```python
print(np.sqrt([-sum(bag_cv['test_score'])/5, -sum(rtree_cv['test_score'])/5, -sum(mlr_cv['test_score'])/5]))
```

```
## [0.50540748 0.51469632 0.51735197]
```
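---

# Compare CV Error of Bagged Tree to Other Models

A small aside (not in the original code): recent versions of scikit-learn (0.22+) also provide a `"neg_root_mean_squared_error"` scorer. Averaging per-fold RMSE is not numerically identical to taking the square root of the averaged MSE as above, but the values are usually close.

```python
# RMSE scorer applied to the MLR fit; uses bike_data and LinearRegression from earlier slides
rmse_cv = cross_validate(LinearRegression(),
                         bike_data[['log_km_driven', 'year', 'one_owner']],
                         bike_data['log_selling_price'],
                         cv = 5,
                         scoring = "neg_root_mean_squared_error")
print(-rmse_cv['test_score'].mean())   # average RMSE across the 5 folds
```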
---

# Prediction with Tree Based Methods

If we care mostly about prediction, not interpretation

- Often use **bootstrapping** to get multiple samples to fit on
- Can average across many fitted trees
    + Decreases variance over an individual tree fit

Major ensemble tree methods

1. Bagging (bootstrap aggregation)
2. Random Forests (extends the idea of bagging - includes bagging as a special case)
3. Boosting (*slow* training of trees)

---

# Random Forests

- Uses the same idea as bagging
    + Create multiple trees from bootstrap samples
    + Average the results

Difference:

- Don't use all the predictors!
    + Consider splits using a random subset of predictors each time

But why?

- If a really strong predictor exists, every bootstrap tree will probably use it for the first split (2nd split, etc.)
    + Makes the bagged tree predictions more correlated
    + Correlation --> smaller reduction in variance from aggregation

---

# Random Forests

By randomly selecting a subset of predictors, a good predictor or two won't dominate the tree fits

- Rules of thumb: use `num_features` `\(= \sqrt{\mbox{# predictors}}\)` (classification) or `num_features` `\(= \mbox{# predictors}/3\)` (regression) randomly selected predictors
- If `num_features` = number of predictors, then you have bagging!
    + Default for `RandomForestRegressor()`
- Better to determine `num_features` via CV (or another measure)

---

# Random Forests

- Select the best random forest model using `GridSearchCV()`

```python
parameters = {"max_features": range(1,4),
              "max_depth": [3, 4, 5, 10, 15],
              'min_samples_leaf': [3, 10, 50, 100]}
rf_tune = GridSearchCV(RandomForestRegressor(n_estimators = 500),
                       parameters,
                       cv = 5,
                       scoring = "neg_mean_squared_error")
rf_tune.fit(bike_data[['log_km_driven', 'year', 'one_owner']], bike_data['log_selling_price'])
```

```python
print(rf_tune.best_estimator_)
```

```
## RandomForestRegressor(max_depth=5, max_features=2, min_samples_leaf=3,
##                       n_estimators=500)
```

---

# Random Forests

Compare all model CV errors

```python
rf_cv = cross_validate(rf_tune.best_estimator_,
                       bike_data[['log_km_driven', 'year', 'one_owner']],
                       bike_data['log_selling_price'],
                       cv = 5,
                       scoring = "neg_mean_squared_error")
print(np.sqrt([-sum(bag_cv['test_score'])/5, -sum(rtree_cv['test_score'])/5,
               -sum(mlr_cv['test_score'])/5, -sum(rf_cv['test_score'])/5]))
```

```
## [0.50540748 0.51469632 0.51735197 0.50338755]
```

---

# Recap

Averaging many trees can greatly improve prediction

- Comes at a loss of interpretability
    + Variable importance measures can be used

Bagging

- Fit many trees on bootstrap samples and combine their predictions in some way

Random Forest

- Do bagging, but randomly select the predictors considered for each split
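---

# Random Forests

An appendix-style sketch (not in the original deck): directly comparing the rule-of-thumb `num_features` choices to full bagging with 5-fold CV RMSE, using objects from the earlier slides (`bike_data`, `RandomForestRegressor`, `cross_validate`, `np`).

```python
# Compare rule-of-thumb feature-subset sizes (max_features) to bagging (all predictors)
p = 3  # number of predictors used in this example
settings = {"bagging (all p)": None, "sqrt(p)": "sqrt", "p/3": max(1, p // 3)}

for label, mf in settings.items():
    cv = cross_validate(RandomForestRegressor(n_estimators = 500, max_features = mf),
                        bike_data[['log_km_driven', 'year', 'one_owner']],
                        bike_data['log_selling_price'],
                        cv = 5,
                        scoring = "neg_mean_squared_error")
    print(label, np.sqrt(-cv['test_score'].mean()))
```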