Analysis of Big Data
Welcome to ST 554 - Analysis of Big Data (with python)
In this course we’ll look at common issues, analyses, and software used with big data. We’ll discuss the major aspects of the commonly cited ‘5 V’s of Big Data’:
Volume, Variety, Velocity, Veracity (Variability), and Value
The course covers
- Programming in python
- Understanding how big data is managed
- Predictive modeling in python and with Spark via pyspark
- Handling streaming data
Using python as our programming language, we’ll learn about using Jupyter notebooks to share and document our work. We’ll use pyspark as our interface to the Spark software, which is commonly used to handle big data.
Course Learning Outcomes
At the end of this course students will be able to
- explain the steps and purpose of python programs (CO 1)
- efficiently read in, combine, and manipulate data in python (CO 2)
- utilize help and other resources to customize programs (CO 3)
- write programs using good programming practices (CO 4)
- explore, manage, and solve common problems with big data (CO 5)
Weekly To-do List
Generally speaking, each week will have a few videos to watch and readings to do as well as corresponding homework assignments (see the syllabus on Moodle for homework policies).
- There will be two exams and the exam windows (days when you can take the exams) are available on the syllabus and course schedule.
- There will be three projects, the third of which will count as the final for the course. These will require a reasonably substantial time commitment.
Getting Help!
To obtain course help there are a number of options:
- Discussion Forum on Moodle - This should be used for any question you feel comfortable asking and having others view. The TA, other students, and I will answer questions on this board. This will be the fastest way to receive a response!
- E-mail - If there is a question that you don’t feel comfortable asking the whole class, you can use e-mail. The TA and I will be checking daily (during the regular work week).
- Zoom Office Hour Sessions - These sessions can be used to share screens and have multiple users. You can do text chat, voice, and video. They are great for a class like this!
Spring 2025 Course Schedule
Topic/Week | Learning Materials | Assignments |
---|---|---|
Week 1 1/6-1/10 | 01 - Course Goals & Other Resources; 02 - Basic Use of Python; 03 - Modules; 04 - JupyterLab Notebooks & Markdown; 05 - List Basics and Strings; 06 - Numeric Types and Booleans; 07 - Common Uses for Data | HW 1 due W, 1/15 |
Week 2 1/13-1/17 | 08 - User Defined Functions; 09 - Control Flow; 10 - Lists and Tuples; 11 - Dictionaries; 12 - Numpy | HW 2 due W, 1/22 |
Week 3 1/21-1/24 (Off M) | 13 - Exploratory Data Analysis Concepts; 14 - Pandas Series; 15 - Pandas DataFrames; 16 - Pandas for Reading Data; 17 - Numeric Summaries | HW 3 due W, 1/29 |
Week 4 1/27-1/31 | 18 - More Function Writing; 19 - Plotting with Matplotlib; 20 - Plotting with pandas; 21 - Error Handling | HW 4 due W, 2/5 |
Week 5 2/3-2/7 | 22 - Big Recap!; 23 - Fitting and Evaluating SLR Models; 24 - Prediction and Training/Test Set Ideas; 25 - Cross-Validation; 26 - Multiple Linear Regression; 27 - LASSO | Exam 1 Th/F 2/6-2/7 - covers weeks 1-4; Project 1 due W, 2/19 |
Week 6 2/10-2/14 (Off T) | No new material. Project work time! | |
Week 7 2/17-2/21 | 01 - Big Data Basics; 02 - The Role of Statistics in Big Data; 03 - Databases & SQL; 04 - SQL Joins; 05 - SQL Resources; 06 - Data Pipelines & Storage | HW 5 due W, 2/26 |
Week 8 2/24-2/28 | 07 - Legacy Software: HDFS; 08 - Connecting to our JupyterHub Environment; 09 - Spark for Big Data | HW 6 due W, 3/5 |
Week 9 3/3-3/7 | 10 - pyspark: RDDs; 11 - pyspark: pandas-on-Spark; 12 - pyspark: Spark SQL | Project 2 due W, 3/19 |
Week 10 3/10-3/14 | No new material - spring break | |
Week 11 3/17-3/21 | 01 - Modeling Recap; 02 - Modeling Example; 03 - Logistic Regression Basics; 04 - Logistic Regression Extensions; 05 - Regularized Regression | HW 7 due W, 3/26 |
Week 12 3/24-3/28 | 06 - Loss Functions & Model Performance; 07 - Classification & Regression Trees; 08 - Bagging Trees & Random Forests; 09 - kNN | HW 8 due W, 4/2 |
Week 13 3/31-4/4 | 10 - Spark MLlib Basics; 11 - Model Pipelines in MLlib; 12 - MLflow; 13 - MLOps | HW 9 due W, 4/9 |
Week 14 4/7-4/11 | 01 - Streaming Data Concepts; 02 - Common Streaming Tasks; 03 - Spark Structured Streaming; 04 - Reading & Writing Streams with Spark Structured Streaming | Exam 2 Th/F 4/10-4/11 - covers weeks 1-13 (emphasis on 5-13) |
Week 15 4/14-4/18 | 05 - Transformations, Windowing, & Aggregations; 06 - Streaming Joins | Final Project due M, 4/28 |
Week 16 4/21-4/22 | No new material. | |