class: center, middle, inverse, title-slide

.title[
# Streaming Data Concepts
]
.author[
### Justin Post
]

---
layout: false
class: title-slide-section-red, middle

# Streaming Data Concepts

Justin Post

---

# Recap

- 5 V's of Big Data
    + Volume
    + Variety
    + Velocity
    + Veracity (Variability)
    + Value

- Understanding of the Big Data pipeline and basics of handling Big Data
    + Databases/Data Lakes/Data Warehouses/etc.
    + Hadoop
    + Spark

- Modeling data
    + Machine learning algorithms
    + Tuning and testing models

Now: Common issues seen on data with velocity

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Batch data

Batch data

- data that updates only at certain times
- can often be much larger in volume

Example: Update to a database at the end of each hour/day

- update inventory status
- update employee time/roster
- update electricity usage and populate/send bills

---

# Streaming Data

Streaming data

- data that is generated over time (usually continuously)
- often smaller amounts of data

---

# Streaming Data

Streaming data

- data that is generated over time (usually continuously)
- often smaller amounts of data

Commonly a stream of logs that record events

- temperature sensors
- customers using a web app
- in-game player activity/clicks
- financial trading

---

# Streaming Data

Streaming data

- data that is generated over time (usually continuously)
- often small amounts of data with high velocity

Commonly a stream of logs that record events

- temperature sensors
- customers using a web app
- in-game player activity/clicks
- financial trading

Data streams are often in an unstructured or semi-structured format

- JSON data or XML key-value pairs

---

# Data Producers and Consumers

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#img/kafka_producers_consumers.png" alt="https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka.html" width="650px" />
<p
class="caption">https://docs.cloudera.com/documentation/kafka/1-2-x/topics/kafka.html</p>
</div>

---

# Example Streaming Setup

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#img/nest-kafka.png" alt="https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html" width="650px" />
<p class="caption">https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html</p>
</div>

---

# Example Streaming Setup

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#img/SparkStreamingExample.PNG" alt="https://www.youtube.com/watch?v=Mxr408U_gqo&t=2s" width="700px" />
<p class="caption">https://www.youtube.com/watch?v=Mxr408U_gqo&t=2s</p>
</div>

---

# Batch vs Streaming Analysis

Can process a stream in real-time, batch process at intervals, or do a combined approach

---

# Batch vs Streaming Analysis

Can process a stream in real-time, batch process at intervals, or do a combined approach

- Usually processed sequentially and incrementally as the records come in
- Often analyzed over windows of time
- Also stored for later batch processing

---

# Batch vs Streaming Analysis

Can process a stream in real-time, batch process at intervals, or do a combined approach

- Usually processed sequentially and incrementally as the records come in
- Often analyzed over windows of time
- Also stored for later batch processing

[Example](https://www.infoworld.com/article/3646589/what-is-streaming-data-event-stream-processing-explained.html)

- Acoustic monitor on a machine
- Stream process detects an abnormal squeak and issues an alert
- Batch process invokes a model to predict time to failure based on the squeak progression
    + Schedule maintenance for the machine before it is likely to fail

---

# Common Issues Raised by Streaming Data

Preprocessing/Sending alerts

+ Missing data from a sensor
+ Tracking a fleet of vehicles on
speed, geo-fences, etc.

---

# Common Issues Raised by Streaming Data

Preprocessing/Sending alerts

+ Missing data from a sensor
+ Tracking a fleet of vehicles on speed, geo-fences, etc.

Combining data streams and dealing with time intervals

+ Uber request

---

# Common Issues Raised by Streaming Data

Preprocessing/Sending alerts

+ Missing data from a sensor
+ Tracking a fleet of vehicles on speed, geo-fences, etc.

Combining data streams and dealing with time intervals

+ Uber request

Detecting trends, counting, and averages

+ Algorithmic stock market trading
+ Trending Twitter posts

---

# Common Issues Raised by Streaming Data

Preprocessing/Sending alerts

+ Missing data from a sensor
+ Tracking a fleet of vehicles on speed, geo-fences, etc.

Combining data streams and dealing with time intervals

+ Uber request

Detecting trends, counting, and averages

+ Algorithmic stock market trading
+ Trending Twitter posts

Updating or using predictive models

+ Product recommendations

---

# Summarizing Streaming Data Over Windows

Often want to summarize/find trends/etc. over certain windows of time

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#img/structured-streaming-window.png" alt="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" width="800px" />
<p class="caption">https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html</p>
</div>

---

# Recap

- Important information can be gleaned from streaming data

- Dealing with data as it comes in over time creates a number of common use cases
    + Preprocessing/Sending alerts
    + Combining data streams and dealing with time intervals
    + Detecting trends, counting, and averages (over certain windows or buckets of time)
    + Updating or using predictive models
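
---

# Sketch: Averages Over Windows of Time

The recap mentions counting and averaging over windows or buckets of time. Below is a minimal plain-Python sketch of that idea, not tied to Spark or Kafka. The function name `tumbling_window_means`, the 10-second window, and the temperature readings are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def tumbling_window_means(events, window_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping windows.

    Returns {window_start: mean of the values that fell in that window}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Process records sequentially and incrementally, one at a time
    for timestamp, value in events:
        # Start of the window this record belongs to
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
        counts[window_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Hypothetical sensor readings: (seconds since start, temperature)
readings = [(0, 20.0), (4, 22.0), (11, 21.0), (15, 23.0), (27, 30.0)]
window_means = tumbling_window_means(readings, window_seconds=10)
print(window_means)
```

Only a running sum and count are kept per window, so the full stream never needs to be held in memory. A real streaming engine would also have to decide when a window is complete and how to treat late records, which this sketch ignores.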