Weeks 12 & 13 Overview
That wraps up the content for weeks 10 & 11. Now it's time for some practice! Head back to our Moodle site to check out your assessment for this week.
It’s time to start modeling data!
Statistical modeling helps us (and others) understand, interpret, and make informed decisions based on data. We will spend much of our time on techniques for choosing the best-performing model from a set of candidates. We will also learn how to model data in the tidy framework using tidymodels. Just as the tidyverse provides a consistent grammar for data wrangling, tidymodels aims to do the same for machine learning. It presents itself as a one-stop shop for everything ML-related, from preprocessing data to training and evaluating models. It's an ecosystem of its own.
Weeks 12 & 13 Additional Readings/Learning Materials
- tidymodels homepage for you to explore
- short video on testing vs training data sets
- Linear regression with multiple predictors
- TidyTuesday LASSO demonstration
- Logistic Regression: sections 9.1-9.3
- Regression and Classification Trees article
- Communicating a story with data (Hans Rosling)
- ISLR book: read sections (Big picture ideas) 2.1, 2.2, (Basic regression) 3.1, 3.2, 3.3, (Basic logistic regression) 4.1, 4.2, 4.3, (CV) 5.1, (Regularized models) 6.2
- Chapters 4-15 of the tidymodels book (especially chapter 9)
- Google’s open course materials have information about these models, loss functions, metrics, etc.: https://developers.google.com/machine-learning/crash-course/classification/video-lecture
- (Optional) Elements of Statistical Learning (see table of contents - similar readings, just more advanced)
- (Optional) tidymodels blog
Learning Objectives
Upon completion of these two weeks, students will be able to:
Linear Regression Models
describe the idea of supervised learning and compare and contrast it with unsupervised learning
conduct an exploratory data analysis
utilize the lm() function along with formula notation in R to fit linear models
- lay out the basic simple linear regression model and explain how the models are commonly fit to data in R
- access elements of a fitted lm object
- find predictions using a simple linear regression model and provide standard errors, confidence bounds, and prediction bounds
- utilize the lm() function and formula notation in R to fit multiple linear regression models and polynomial regression models
- define the terms polynomial regression and multiple linear regression
- explain how a multiple linear regression model is usually fit and what effect adding more predictor terms has on the model
- find predictions using a multiple linear regression model and provide standard errors, confidence bounds, and prediction bounds
- explain how adding categorical predictors changes a linear model
- interpret the coefficients of a general linear model and use it for prediction purposes
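The objectives above can be sketched in base R. This is a minimal illustration using the built-in mtcars data; the choice of dataset, predictors, and the polynomial term are all illustrative, not prescribed:

```r
# Simple linear regression with lm() and formula notation
fit_slr <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression with an added polynomial term
fit_mlr <- lm(mpg ~ wt + hp + I(hp^2), data = mtcars)

# Access elements of the fitted lm object
coef(fit_slr)
summary(fit_mlr)$r.squared

# Predictions for new data, with standard errors,
# confidence bounds, and prediction bounds
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(110, 150))
predict(fit_mlr, newdata = new_cars, se.fit = TRUE)
predict(fit_mlr, newdata = new_cars, interval = "confidence")
predict(fit_mlr, newdata = new_cars, interval = "prediction")
```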
describe common methods used to select between models
define prediction error, training sets, and test sets
- explain why splitting data into training and test sets is needed
- discuss the nature/behavior of predicting on your training set vs predicting on a test set
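A minimal sketch of a training/test split using the rsample package (part of tidymodels); the 0.75 proportion, seed, and model are arbitrary choices for illustration:

```r
library(rsample)

set.seed(10)
split <- initial_split(mtcars, prop = 0.75)
train <- training(split)
test  <- testing(split)

fit <- lm(mpg ~ wt + hp, data = train)

# RMSE on the training set is typically optimistic;
# RMSE on the held-out test set better estimates prediction error
sqrt(mean((train$mpg - predict(fit, train))^2))
sqrt(mean((test$mpg  - predict(fit, test))^2))
```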
use a linear regression or logistic regression model to perform classification
- describe the type of scenario where logistic regression may be a reasonable modeling choice
- define the logistic regression model and state its advantages for modeling binary outcomes
- create visuals to help explore binary outcome data appropriate for logistic regression
- interpret model coefficients from a logistic regression model
- fit logistic regression models in R and use them for prediction purposes (probability, link, odds, or classification)
- predict for new data using a logistic regression model
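As a quick sketch of these ideas, base R's glm() fits a logistic regression with the binomial family. Here the binary transmission variable (am) in mtcars serves as an illustrative outcome:

```r
# Logistic regression: model a binary outcome on the log-odds scale
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds (link) scale
coef(fit_logit)

# Predict on the link (log-odds) and probability scales
head(predict(fit_logit, type = "link"))
probs <- predict(fit_logit, type = "response")

# Classify with a 0.5 probability cutoff (an illustrative choice)
preds <- ifelse(probs > 0.5, 1, 0)
table(predicted = preds, observed = mtcars$am)
```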
Nonlinear & Ensemble Models
fit and interpret regression and classification trees in R
- describe the terms regression tree and classification tree
- explain the difference between using a tree-based method and using a linear method
- provide visuals of tree fits in R
- predict using a tree fit in R
- roughly break down the steps used in fitting a regression tree
- explain the term “greedy” algorithm
- prune a fitted tree and describe why this is often needed
- compare and contrast the fitting and pruning of regression trees vs classification trees
- describe the pros and cons of using tree-based methods
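One common way to fit and prune trees in R is the rpart package; this is a minimal sketch, and the predictors and cp value are illustrative only:

```r
library(rpart)

# Fit a regression tree (method = "anova"); use method = "class"
# for a classification tree
tree_fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")

# Visualize the fitted tree
plot(tree_fit)
text(tree_fit)

# Predict using the fitted tree
predict(tree_fit, newdata = mtcars[1:3, ])

# Inspect the complexity parameter table, then prune to guard
# against overfitting
printcp(tree_fit)
pruned <- prune(tree_fit, cp = 0.05)
```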
select a final model using cross validation
fit and interpret bagged tree and random forests models in R
- explain the term ensemble methods and how they can be applied to tree-based methods
- describe why ensemble methods can often improve predictions
- give the cons of using ensemble methods
- investigate variable importance measures for ensemble trees
- outline the bagging algorithm
- outline the random forest algorithm
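The randomForest package gives a compact sketch of both algorithms; the seed and tuning values below are illustrative, not recommendations:

```r
library(randomForest)

set.seed(10)

# Random forest: each split considers a random subset of predictors
rf_fit <- randomForest(mpg ~ ., data = mtcars,
                       ntree = 500, importance = TRUE)

# Bagging is the special case where every split may use all predictors
bag_fit <- randomForest(mpg ~ ., data = mtcars,
                        mtry = ncol(mtcars) - 1)

# Investigate variable importance measures
importance(rf_fit)
varImpPlot(rf_fit)
```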
Tidymodels Framework and Model Fitting
Use the tidymodels framework to fit and evaluate models
- describe how recipes work and why they are useful when fitting models
- explain how tidymodels specifies a model by choosing a model type and a computational engine
- utilize workflows for fitting models
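The pieces above fit together as recipe, model specification, and workflow. This is a minimal sketch; the preprocessing step, engine, and 5-fold CV setup are illustrative choices:

```r
library(tidymodels)

# Recipe: declare the model formula and preprocessing steps
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# Model specification: model type plus engine
spec <- linear_reg() |>
  set_engine("lm")

# Workflow: bundle the recipe and model together
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Evaluate with 5-fold cross-validation
folds <- vfold_cv(mtcars, v = 5)
res <- fit_resamples(wf, resamples = folds,
                     metrics = metric_set(rmse, rsq))
collect_metrics(res)
```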
Compare and contrast using a training/test set, using cross-validation only, and using both the training set with CV and a test set
Describe and explain the pros and cons of commonly used model metrics
Explain the difference between a loss function and a model metric
Use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!