`pyspark`: Resilient Distributed Datasets

Published

2026-01-29

The video below describes the underlying data structure in Spark, the resilient distributed dataset (RDD). While we rarely work with RDDs directly, it is useful to have an idea of what they are and what functionality they provide.

I highly recommend watching the video using the ‘full’ Panopto player. There is a ‘pop out’ button in the bottom right of the video to enter this viewer.

The pyspark code used in the notes, along with the example worked at the end of the notes, is available in this notebook. You’ll need to download this .ipynb file and upload it to your JupyterHub environment. Make sure that the kernel used to run the notebook is a pyspark kernel!

Remember, if you are off campus you should log in to the VPN and then you can access our JupyterHub.

If you are not an NC State student, you can download Docker and get access to Spark through a Jupyter notebook interface reasonably quickly!

Notes

Additional Readings for Week 9

pyspark

Looking for More?

Data Engineering Topics to Learn

Use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!