Analysis of Big Data

Published

2025-03-31

Welcome to ST 554 - Analysis of Big Data (with Python)

In this course we’ll look at common issues, analyses, and software used with big data. We’ll organize the discussion around the commonly cited ‘5 V’s of Big Data’:

Volume, Variety, Velocity, Veracity (Variability), and Value

The course covers

Using Python as our programming language, we’ll learn to use Jupyter notebooks to share and document our work. We’ll use pyspark as our interface to Spark, software that is commonly used to handle big data.

Course Learning Outcomes

At the end of this course students will be able to

  • explain the steps and purpose of Python programs (CO 1)
  • efficiently read in, combine, and manipulate data in Python (CO 2)
  • utilize help and other resources to customize programs (CO 3)
  • write programs using good programming practices (CO 4)
  • explore, manage, and solve common problems with big data (CO 5)

Weekly To-do List

Generally speaking, each week will have a few videos to watch and readings to do, along with a corresponding homework assignment (see the syllabus on Moodle for homework policies).

  • There will be two exams. The exam windows (days when you can take the exams) are listed on the syllabus and course schedule.
  • There will be three projects, the third of which will count as the final for the course. These will require a reasonably substantial time commitment.

Getting Help!

To obtain course help there are a number of options:

  • Discussion Forum on Moodle - This should be used for any question you feel comfortable asking and having others view. The TA, other students, and I will answer questions on this board. This will be the fastest way to receive a response!
  • E-mail - If you have a question that you don’t feel comfortable asking in front of the whole class, you can use e-mail. The TA and I will check e-mail daily (during the regular work week).
  • Zoom Office Hour Sessions - These sessions can be used to share screens and have multiple users. You can do text chat, voice, and video. They are great for a class like this!

Spring 2025 Course Schedule

Week 1 (1/6-1/10)
Learning Materials:
  • 01 - Course Goals & Other Resources
  • 02 - Basic Use of Python
  • 03 - Modules
  • 04 - JupyterLab Notebooks & Markdown
  • 05 - List Basics and Strings
  • 06 - Numeric Types and Booleans
  • 07 - Common Uses for Data
Assignments: HW 1 due W, 1/15

Week 2 (1/13-1/17)
Learning Materials:
  • 08 - User Defined Functions
  • 09 - Control Flow
  • 10 - Lists and Tuples
  • 11 - Dictionaries
  • 12 - Numpy
Assignments: HW 2 due W, 1/22

Week 3 (1/21-1/24, Off M)
Learning Materials:
  • 13 - Exploratory Data Analysis Concepts
  • 14 - Pandas Series
  • 15 - Pandas DataFrames
  • 16 - Pandas for Reading Data
  • 17 - Numeric Summaries
Assignments: HW 3 due W, 1/29

Week 4 (1/27-1/31)
Learning Materials:
  • 18 - More Function Writing
  • 19 - Plotting with Matplotlib
  • 20 - Plotting with pandas
  • 21 - Error Handling
Assignments: HW 4 due W, 2/5

Week 5 (2/3-2/7)
Learning Materials:
  • 22 - Big Recap!
  • 23 - Fitting and Evaluating SLR Models
  • 24 - Prediction and Training/Test Set Ideas
  • 25 - Cross-Validation
  • 26 - Multiple Linear Regression
  • 27 - LASSO
Assignments:
  • Exam 1 Th/F 2/6-2/7 - covers weeks 1-4
  • Project 1 due W, 2/19

Week 6 (2/10-2/14, Off T)
Learning Materials: No new material. Project work time!

Week 7 (2/17-2/21)
Learning Materials:
  • 01 - Big Data Basics
  • 02 - The Role of Statistics in Big Data
  • 03 - Databases & SQL
  • 04 - SQL Joins
  • 05 - SQL Resources
  • 06 - Data Pipelines & Storage
Assignments: HW 5 due W, 2/26

Week 8 (2/24-2/28)
Learning Materials:
  • 07 - Legacy Software: HDFS
  • 08 - Connecting to our JupyterHub Environment
  • 09 - Spark for Big Data
Assignments: HW 6 due W, 3/5

Week 9 (3/3-3/7)
Learning Materials:
  • 10 - pyspark: RDDs
  • 11 - pyspark: pandas-on-Spark
  • 12 - pyspark: Spark SQL
Assignments: Project 2 due W, 3/19

Week 10 (3/10-3/14)
Learning Materials: No new material - spring break

Week 11 (3/17-3/21)
Learning Materials:
  • 01 - Modeling Recap
  • 02 - Modeling Example
  • 03 - Logistic Regression Basics
  • 04 - Logistic Regression Extensions
  • 05 - Regularized Regression
Assignments: HW 7 due W, 3/26

Week 12 (3/24-3/28)
Learning Materials:
  • 06 - Loss Functions & Model Performance
  • 07 - Classification & Regression Trees
  • 08 - Bagging Trees & Random Forests
  • 09 - kNN
Assignments: HW 8 due W, 4/2

Week 13 (3/31-4/4)
Learning Materials:
  • 10 - Spark MLlib Basics
  • 11 - Model Pipelines in MLlib
  • 12 - MLflow
  • 13 - MLOps
Assignments: HW 9 due W, 4/9

Week 14 (4/7-4/11)
Learning Materials:
  • 01 - Streaming Data Concepts
  • 02 - Common Streaming Tasks
  • 03 - Spark Structured Streaming
  • 04 - Reading & Writing Streams with Spark Structured Streaming
Assignments: Exam 2 Th/F 4/10-4/11 - covers weeks 1-13 (emphasis on 5-13)

Week 15 (4/14-4/18)
Learning Materials:
  • 05 - Transformations, Windowing, & Aggregations
  • 06 - Streaming Joins
Assignments: Final Project due M, 4/28

Week 16 (4/21-4/22)
Learning Materials: No new material.