Legacy Big Data Software: Hadoop Distributed File System
The video below discusses Hadoop, a legacy big data framework, along with the common MapReduce operations that Hadoop lets you perform on big data. Although Hadoop is no longer widely used, understanding the terminology and ideas that made it popular is still important.
I highly recommend watching the video using the ‘full’ Panopto player. There is a ‘pop out’ button in the bottom right of the video to enter this viewer.
The Charles Dickens text analyzed can be found here.
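Word counting is the usual introductory MapReduce example, so here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It is not the code from the video and does not use Hadoop; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and the two sample lines are a stand-in for whichever Dickens text is analyzed.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each word's list of 1s into a single count."""
    return {word: sum(values) for word, values in grouped.items()}


def word_count(lines):
    """Run map over every line, then shuffle and reduce the results."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


# Illustrative input only (an assumption, not the file used in the video).
sample_lines = [
    "It was the best of times,",
    "it was the worst of times,",
]
counts = word_count(sample_lines)
print(counts["times"])
```

In a real cluster the map and reduce steps run in parallel on different machines and the framework handles the shuffle; here everything runs in one process so the data flow is easy to follow.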
Notes
Additional Readings for Week 8
Hadoop
- Hadoop Tutorial, Hadoop Architecture, Wikipedia article on Hadoop, What is Hadoop?, Another Intro to Hadoop
- Hadoop YARN: Docs, Article
- HDFS & the Cloud: Article 1, Book (not free though)
- S3 storage
- Azure Blob Storage
MapReduce
- MapReduce Examples: Article 1, Article 2
- A good intro through an example using PySpark
- Hadoop vs Spark (IBM)
Spark
- What is Spark?
- Basic Spark Tutorial
- Spark APIs (databricks)
- More on DAGs