Weeks 12 & 13 Overview
That wraps up the content for weeks 10 & 11. Now it's time for some practice! Head back to our Moodle site to check out your assessment for this week.
It’s time to start modeling data!
Statistical modeling helps us (and others) understand, interpret, and make informed decisions based on data. We will spend much of our time on techniques for choosing the best-performing model from a set of candidates. We will also learn how to model data in the tidy framework using tidymodels. Just as the tidyverse provides a consistent grammar for data wrangling, tidymodels aims to do the same for machine learning. It presents itself as a one-stop shop for everything ML-related, from preprocessing data to training and evaluating models. It's an ecosystem of its own.
Weeks 12 & 13 Additional Readings/Learning Materials
- tidymodels homepage for you to explore
- short video on testing vs training data sets
- Linear regression with multiple predictors
- TidyTuesday LASSO demonstration
- Logistic Regression: sections 9.1-9.3
- Regression and Classification Trees article
- Communicating a story with data (Hans Rosling)
- ISLR book: read sections (Big picture ideas) 2.1, 2.2, (Basic regression) 3.1, 3.2, 3.3, (Basic logistic regression) 4.1, 4.2, 4.3, (CV) 5.1, (Regularized models) 6.2
- Chapters 4-15 of the tidymodels book (especially chapter 9)
- Google’s open course materials have information about these models, loss functions, metrics, etc.: https://developers.google.com/machine-learning/crash-course/classification/video-lecture
- (Optional) Elements of Statistical Learning (see table of contents - similar readings, just more advanced)
- (Optional) tidymodels blog
Learning Objectives
Upon completion of these two weeks, students will be able to:
Linear Regression Models
describe the idea of supervised learning and compare and contrast it with unsupervised learning
conduct an exploratory data analysis
utilize the lm() function along with formula notation in R to fit linear models
- lay out the basic simple linear regression model and explain how the models are commonly fit to data in R
- access elements of a fitted lm object
- find predictions using a simple linear regression model and provide standard errors, confidence bounds, and prediction bounds
- utilize the lm() function and formula notation in R to fit multiple linear regression models and polynomial regression models
- define the terms polynomial regression and multiple linear regression
- explain how a multiple linear regression model is usually fit and what effect adding more predictor terms has on the model
- find predictions using a multiple linear regression model and provide standard errors, confidence bounds, and prediction bounds
- explain how adding categorical predictors changes a linear model
- interpret the coefficients of a general linear model and use it for prediction purposes
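The objectives above can be sketched in base R. This is a minimal illustration using the built-in mtcars data; the choice of dataset, predictors, and the polynomial term are all illustrative, not prescribed:

```r
# Simple linear regression with lm() and formula notation
fit_slr <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression with an added polynomial term
fit_mlr <- lm(mpg ~ wt + hp + I(hp^2), data = mtcars)

# Access elements of the fitted lm object
coef(fit_slr)
summary(fit_mlr)$r.squared

# Predictions for new data, with standard errors,
# confidence bounds, and prediction bounds
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(110, 150))
predict(fit_mlr, newdata = new_cars, se.fit = TRUE)
predict(fit_mlr, newdata = new_cars, interval = "confidence")
predict(fit_mlr, newdata = new_cars, interval = "prediction")
```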
describe common methods used to select between models
define prediction error, training sets, and test sets
- explain why splitting data into training and test sets is needed
- discuss the nature/behavior of predicting on your training set vs predicting on a test set
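A minimal sketch of a training/test split using the rsample package (part of tidymodels); the 0.75 proportion, seed, and model are arbitrary choices for illustration:

```r
library(rsample)

set.seed(10)
split <- initial_split(mtcars, prop = 0.75)
train <- training(split)
test  <- testing(split)

fit <- lm(mpg ~ wt + hp, data = train)

# RMSE on the training set is typically optimistic;
# RMSE on the held-out test set better estimates prediction error
sqrt(mean((train$mpg - predict(fit, train))^2))
sqrt(mean((test$mpg  - predict(fit, test))^2))
```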
use a linear regression or logistic regression model to perform classification
- describe the type of scenario where logistic regression may be a reasonable modeling choice
- define the logistic regression model and state its advantages for modeling binary outcomes
- create visuals to help explore binary outcome data appropriate for logistic regression
- interpret model coefficients from a logistic regression model
- fit logistic regression models in R and use them for prediction purposes (probability, link, odds, or classification)
- predict for new data using a logistic regression model
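As a quick sketch of these ideas, base R's glm() fits a logistic regression with the binomial family. Here the binary transmission variable (am) in mtcars serves as an illustrative outcome:

```r
# Logistic regression: model a binary outcome on the log-odds scale
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds (link) scale
coef(fit_logit)

# Predict on the link (log-odds) and probability scales
head(predict(fit_logit, type = "link"))
probs <- predict(fit_logit, type = "response")

# Classify with a 0.5 probability cutoff (an illustrative choice)
preds <- ifelse(probs > 0.5, 1, 0)
table(predicted = preds, observed = mtcars$am)
```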
Nonlinear & Ensemble Models
fit and interpret regression and classification trees in R
- describe the terms regression tree and classification tree
- explain the difference between using a tree-based method and using a linear method
- provide visuals of tree fits in R
- predict using a tree fit in R
- roughly break down the steps used in fitting a regression tree
- explain the term “greedy” algorithm
- prune a fitted tree and describe why this is often needed
- compare and contrast the fitting and pruning of regression trees vs classification trees
- describe the pros and cons of using tree-based methods
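One common way to fit and prune trees in R is the rpart package; this is a minimal sketch, and the predictors and cp value are illustrative only:

```r
library(rpart)

# Fit a regression tree (method = "anova"); use method = "class"
# for a classification tree
tree_fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")

# Visualize the fitted tree
plot(tree_fit)
text(tree_fit)

# Predict using the fitted tree
predict(tree_fit, newdata = mtcars[1:3, ])

# Inspect the complexity parameter table, then prune to guard
# against overfitting
printcp(tree_fit)
pruned <- prune(tree_fit, cp = 0.05)
```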
select a final model using cross validation
fit and interpret bagged tree and random forests models in R
- explain the term ensemble methods and how they can be applied to tree-based methods
- describe why ensemble methods can often improve predictions
- give the cons of using ensemble methods
- investigate variable importance measures for ensemble trees
- outline the bagging algorithm
- outline the random forest algorithm
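The randomForest package gives a compact sketch of both algorithms; the seed and tuning values below are illustrative, not recommendations:

```r
library(randomForest)

set.seed(10)

# Random forest: each split considers a random subset of predictors
rf_fit <- randomForest(mpg ~ ., data = mtcars,
                       ntree = 500, importance = TRUE)

# Bagging is the special case where every split may use all predictors
bag_fit <- randomForest(mpg ~ ., data = mtcars,
                        mtry = ncol(mtcars) - 1)

# Investigate variable importance measures
importance(rf_fit)
varImpPlot(rf_fit)
```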
Tidymodels Framework and Model Fitting
Use the tidymodels framework to fit and evaluate models
- describe how recipes work and why they are useful when fitting models
- explain how tidymodels specifies a model by choosing a model type and a computational engine
- utilize workflows for fitting models
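The pieces above fit together as recipe, model specification, and workflow. This is a minimal sketch; the preprocessing step, engine, and 5-fold CV setup are illustrative choices:

```r
library(tidymodels)

# Recipe: declare the model formula and preprocessing steps
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# Model specification: model type plus engine
spec <- linear_reg() |>
  set_engine("lm")

# Workflow: bundle the recipe and model together
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Evaluate with 5-fold cross-validation
folds <- vfold_cv(mtcars, v = 5)
res <- fit_resamples(wf, resamples = folds,
                     metrics = metric_set(rmse, rsq))
collect_metrics(res)
```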
Compare and contrast using a training/test set, using cross-validation only, and using both the training set with CV and a test set
Describe and explain the pros and cons of commonly used model metrics
Explain the difference between a loss function and a model metric
Use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!