class: center, middle, inverse, title-slide

.title[
# Modeling Data Recap
]
.author[
### Justin Post
]

---
layout: false
class: title-slide-section-red, middle

# Modeling Data Recap
Justin Post

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Recap

- Programming in Python

- 5 V's of Big Data

    + Volume
    + Variety
    + Velocity
    + Veracity (Variability)
    + Value

- Understanding of the Big Data pipeline and basics of handling Big Data

    + Databases/Data Lakes/Data Warehouses/etc.
    + SQL basics
    + Hadoop
    + Spark

- Now: Modeling (Big) Data

---

# Common Uses for Data

Four major goals with data:

1. Description (EDA)

2. Inference

3. Prediction/Classification

4. Pattern Finding

---

# Statistical Learning

**Statistical learning** - Inference, prediction/classification, and pattern finding

- Supervised learning - a variable (or variables) represents an **output** or **response** of interest

    + May model the response and

        - Make **inference** on the model parameters

        - **Predict** a value or **classify** an observation

<img src="data:image/png;base64,#img/tree.png" width="450px" style="display: block; margin: auto;" />

---

# What is a Statistical Model?

- A mathematical representation of some phenomenon on which you've observed data

- Predictive model used to:

    + *Predict* a **numeric response**
    + *Classify* an observation into a **category**

---

# What is a Statistical Model?

- A mathematical representation of some phenomenon on which you've observed data

- Predictive model used to:

    + *Predict* a **numeric response**
    + *Classify* an observation into a **category**

- Common Supervised Learning Models

    + Least Squares Regression
    + Penalized regression
    + Generalized linear models
    + Regression/classification trees
    + Random forests, boosting, bagging

... and many more - tons of models! (A small `sklearn` sketch follows on the next slide.)
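---

# Quick Example: Fitting Models in `sklearn`

A minimal sketch, assuming simulated data (not from the course notebooks): the `sklearn` estimators below all share the same `fit()`/`predict()` workflow, so swapping between model types is mostly a one-line change

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Simulated data purely for illustration: 3 numeric predictors, 1 numeric response
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2 + X @ np.array([1.5, -0.5, 0.0]) + rng.normal(scale=1.0, size=200)

for model in [LinearRegression(), DecisionTreeRegressor(max_depth=3)]:
    model.fit(X, y)              # fit/train the model
    y_hat = model.predict(X)     # predicted responses (the y-hat values)
    print(type(model).__name__, y_hat[:3].round(2))
```

Each fitted model yields predicted responses that we can then evaluate with a model metric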
---

# Fitting a Model

Given a model, we **fit** or **train** it using the data

<img src="data:image/png;base64,#img/slr.png" width="500px" style="display: block; margin: auto;" />

---

# Fitting a Model

Given a model, we **fit** or **train** it using the data

<img src="data:image/png;base64,#img/slr.png" width="500px" style="display: block; margin: auto;" />

- Models can be used to yield predicted responses for each observation, call these `\(\hat{y}_i\)`

---

# Quantifying How Well the Model Predicts

Need a way to quantify how well our prediction is doing (a model metric)

- For a numeric response, we commonly use squared error loss to evaluate a prediction

`$$L(y_i,\hat{y}_i) = (y_i-\hat{y}_i)^2$$`

---

# Quantifying How Well the Model Predicts

Need a way to quantify how well our prediction is doing (a model metric)

- For a numeric response, we commonly use squared error loss to evaluate a prediction

`$$L(y_i,\hat{y}_i) = (y_i-\hat{y}_i)^2$$`

- Use Root Mean Square Error as a metric across all observations

`$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$`

---

# Quantifying How Well the Model Predicts

Need a way to quantify how well our prediction is doing (a model metric)

- For classification (binary response here), we can look at accuracy

- Accuracy

`$$\frac{\mbox{Sum of correct predictions}}{\mbox{Total number of predictions}}$$`

---

# Training vs Test Sets

Ideally we want our model to predict well for observations **it has yet to see**

---

# Training vs Test Sets

Ideally we want our model to predict well for observations **it has yet to see**

- Error computed from predictions on the observations used to fit or train the model is called the **training (set) error**

`$$\mbox{Training RMSE} = \sqrt{\frac{1}{\mbox{# of obs used to fit model}}\sum_{\mbox{obs used to fit model}}(y-\hat{y})^2}$$`

- If we only consider this, we'll have no idea how the model will fare on data it hasn't seen!

---

# Training vs Test Sets

One method is to split the data into a **training set** and **test set**

- On the training set we can fit (or train) our models

- We can then predict for the test set observations and judge effectiveness with RMSE

<img src="data:image/png;base64,#img/trainingtest.png" width="600px" style="display: block; margin: auto;" />

---

# Issues with Training vs Test Sets

Why may we not want to just do a basic training/test set?

- If we don't have much data, we aren't using it all when fitting the models

- Data is randomly split into training/test, so results can depend on the particular split we happen to get

- Instead, we could consider splitting the data multiple ways and averaging the test error over the results!

---

# Cross-Validation Idea

`\(k\)` fold Cross-Validation (CV)

- Split data into k folds

- Train model on first k-1 folds, find test error on kth fold

- Train model on first k-2 folds and kth fold, find test error on (k-1)st fold

- ...

Find CV error by combining test errors appropriately

---

# Cross-Validation Idea

`\(k\)` fold Cross-Validation (CV)

- Split data into k folds

- Train model on first k-1 folds, find test error on kth fold

- Train model on first k-2 folds and kth fold, find test error on (k-1)st fold

- ...

Find CV error by combining test errors appropriately

- Key = no predictions used in the RMSE were done on data used to train that model!

- Once a best model is chosen, the model is refit on the entire data set (a minimal `sklearn` CV sketch follows on the next slide)
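---

# Quick Example: k-fold CV in `sklearn`

A minimal 5-fold CV sketch, assuming simulated data (an illustration, not the course's exact code): each fold is held out once, the model is trained on the remaining folds, and the held-out errors are combined

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Simulated data purely for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2 + X @ np.array([1.5, -0.5, 0.0]) + rng.normal(scale=1.0, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mses = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # train on k-1 folds
    resid = y[test_idx] - model.predict(X[test_idx])            # predict the held-out fold
    fold_mses.append(np.mean(resid ** 2))                       # squared error loss on that fold

cv_rmse = np.sqrt(np.mean(fold_mses))  # one common way to combine the fold errors
print(round(cv_rmse, 3))
```

`cross_val_score()` from `sklearn.model_selection` wraps this loop into a single call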
---

# May Use Both Training/Test & CV

- Recall: the LASSO model is similar to an MLR model but shrinks coefficients and may set some to 0

    + Tuning parameter must be chosen (usually by CV)

---

# May Use Both Training/Test & CV

- Recall: the LASSO model is similar to an MLR model but shrinks coefficients and may set some to 0

    + Tuning parameter must be chosen (usually by CV)

- Training/Test split gives us a way to validate our model's performance

- CV can be used on the training set to select **tuning parameters**

    + Helps determine the 'best' model for a class of models

- With many competing model types, compare best models on the test set via our metric

---

# Plan

- May want to review the videos/notebooks from earlier

- Learn more supervised learning methods

- Implement in `sklearn` and `pyspark`

- Consider nuances of different loss functions and model metrics

- See how to use model **pipelines** (a minimal `sklearn` sketch follows on the next slide)

<img src="data:image/png;base64,#img/pipeline1.png" width="750px" style="display: block; margin: auto;" />
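---

# Quick Example: Pipeline + CV Tuning in `sklearn`

A minimal sketch, assuming simulated data (an illustration, not the course's exact code): a training/test split, a **pipeline** (standardize, then LASSO), CV on the training set to choose the tuning parameter, and a test set RMSE for the chosen model

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Simulated data purely for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 3 + X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso())])
grid = GridSearchCV(pipe,
                    param_grid={"lasso__alpha": [0.01, 0.1, 1.0]},  # candidate tuning parameters
                    cv=5,
                    scoring="neg_root_mean_squared_error")
grid.fit(X_train, y_train)  # CV happens on the training set only

test_rmse = np.sqrt(mean_squared_error(y_test, grid.predict(X_test)))
print(grid.best_params_, round(test_rmse, 3))
```

By default `GridSearchCV` refits the best tuning parameter on the full training set, and the held-out test set gives the final check of performance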