class: center, middle, inverse, title-slide

.title[
# Big Data Basics
]
.author[
### Justin Post
]

---

# Where are we?

- Foundation in Python through `JupyterLab`

Now:

- What is Big Data?
- What are common things we'll run into with Big Data?
- What issues are we going to try and tackle?
- What software can be useful?

---

# What is Big Data?

Useful definition:

- Big data = data that you can't handle 'normally'

<!--too big to fit in memory, data being added constantly, etc.-->

---

# What is Big Data?

Useful definition:

- Big data = data that you can't handle 'normally'

- Big Data usually requires learning new tools

- [Want to feel overwhelmed?](https://mattturck.com/landscape/mad2024.pdf)

---

# What are Common Attributes of Big Data?

- Fifth **V** = [Value](https://www.ibm.com/think/topics/big-data-analytics)

<img src="data:image/png;base64,#img/fourvs.jpg" width="650px" style="display: block; margin: auto;" />

---

# What are We Going to Consider?

- Basics of Big Data storage

<br><br><br>
<!--storage systems, file types, splitting the data up, backups/fault recovery-->

- Basics of Big Data querying

<br><br><br>
<!--querying data in a Data Lake, Data Warehouse, and Database-->

- Basics of handling streaming data

<br><br><br>
<!--Create our own streaming data and do some basic responses to it-->

- Summarizing and modeling Big Data

<br><br><br>
<!--sampling from big data, mapreduce for counting/basic stuff, sufficient statistics and updating models-->

---

# Data and Model Pipelines

In the end, we should have a reasonable idea of how data comes in, gets transformed/combined/etc., and can be used to build models or predict an outcome!

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#img/Data-Processing-Pipelines-patterns-informatica.png" alt="Image from informatica.com" width="700px" />
<p class="caption">Image from informatica.com</p>
</div>

---

# Recap

- Big Data requires learning new tools & considering different algorithms
- Storage and retrieval of data is important
- Modeling and summarizing data can be done
- Should consider overall process/pipeline of data from start to finish
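
---

# Sketch: Summarizing Data in Chunks

One way to summarize data that is too big to fit in memory is to read it in pieces and keep only a few running totals (sufficient statistics). The sketch below is illustrative only: the helper names `chunks` and `summarize` are hypothetical, and a small `range` stands in for a large data source.

```python
from typing import Iterable, Iterator, List, Tuple


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield lists of at most `size` items (a stand-in for reading a file in pieces)."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # leftover items that did not fill a whole chunk
        yield chunk


def summarize(values: Iterable[float], chunk_size: int = 1000) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) while holding only one chunk in memory."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for chunk in chunks(values, chunk_size):
        # Update the sufficient statistics; the chunk itself can then be discarded.
        n += len(chunk)
        total += sum(chunk)
        total_sq += sum(x * x for x in chunk)
    if n < 2:
        raise ValueError("need at least two values to compute a sample variance")
    mean = total / n
    variance = (total_sq - n * mean ** 2) / (n - 1)
    return n, mean, variance


# Assumed toy data: the integers 1 through 10, processed three at a time.
n, mean, variance = summarize(range(1, 11), chunk_size=3)
print(n, mean, variance)
```

The sum-of-squares formula is simple but can lose precision on very large or poorly scaled data, so treat this as a starting point rather than a recommended implementation.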