class: center, middle, inverse, title-slide

.title[
# Spark MLlib Basics
]
.author[
### Justin Post
]

---
layout: false
class: title-slide-section-red, middle

# Spark MLlib Basics

Justin Post

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Big Picture

- We've studied the idea of data pipelines
- We've looked at considerations for doing (supervised) modeling and how to judge those models

<div style = "float:left">
Next up:
<ul>
  <li> Using Spark to do our modeling</li>
  <li> Understanding model pipelines</li>
  <li> Documenting the model building process</li>
  <li> Practical considerations for ML and big data</li>
  <li> Streaming Data</li>
</ul>
</div>

<div style = "float:right">
<img src="data:image/png;base64,#img/ml-engineering.jpg" width="450px" style="display: block; margin: auto;" />
</div>

---

# Spark Recap

Create a **Spark Session** in `pyspark`

Defines:

- Cluster and workers
- Spark coordinator (i.e. the **Driver**)
- Name of the app

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('my_app').getOrCreate()
```

---

# Spark Recap

Spark handles big data and is fault tolerant!

- Turns transformations and actions into a directed acyclic graph (DAG) that allows computation to be picked back up if something fails

<img src="data:image/png;base64,#img/dag_visual.jpg" width="450px" style="display: block; margin: auto;" />

---

# Spark Recap

- [All transformations in Spark](https://spark.apache.org/docs/latest/rdd-programming-guide.html) are _lazy_

- **Transformations** are built up and computation done only when needed

- Makes computation faster!

    + Spark can realize that a dataset created through `map` will be used in a `reduce` and return only the result of the `reduce` rather than the larger mapped dataset

<img src="data:image/png;base64,#img/sparkLazy.jpg" width="400px" style="display: block; margin: auto;" />

---

# Spark Recap

Two major DataFrame APIs in `pyspark`

- [pandas-on-Spark](https://spark.apache.org/docs/3.3.1/api/python/reference/pyspark.pandas/index.html) DataFrames through the `pyspark.pandas` module
- [Spark SQL](https://spark.apache.org/docs/3.3.1/api/python/reference/pyspark.sql.html) DataFrames through the `pyspark.sql` module

---

# Spark Recap

Two major DataFrame APIs in `pyspark`

- [pandas-on-Spark](https://spark.apache.org/docs/3.3.1/api/python/reference/pyspark.pandas/index.html) DataFrames through the `pyspark.pandas` module
- [Spark SQL](https://spark.apache.org/docs/3.3.1/api/python/reference/pyspark.sql.html) DataFrames through the `pyspark.sql` module

Recommended to use Spark SQL for machine learning!

- Common actions to return data

    + `show(n)`, `take(n)`, `collect()`

- Common transformations done with [SQL like functions](https://spark.apache.org/docs/3.3.1/api/python/reference/pyspark.sql/functions.html)

```python
from pyspark.sql.functions import *

df.withColumn("Age_cat", when(df.Age > 75, "75+")
                         .when(df.Age >= 70, "70-75")
                         .otherwise("<70"))
```

---

# Spark `MLlib`

`MLlib` allows for fitting ML models in Spark!

<img src="data:image/png;base64,#img/mllib.png" width="500px" style="display: block; margin: auto;" />

- Syntax of model fitting, CV, etc. very similar to `sklearn`!

---

# Spark `MLlib`

- Two major components:

    + Transformers (Create polynomials, standardize data, etc.)
    + Estimators (Models)

<img src="data:image/png;base64,#img/pipeline1.png" width="650px" style="display: block; margin: auto;" />

---

# Spark `MLlib`

- Two major components:

    + Transformers (Create polynomials, standardize data, **models**, etc.)
    + Estimators (Models)

(a short sketch of this fit/transform pattern follows on the next slide)

<img src="data:image/png;base64,#img/pipeline2.png" width="650px" style="display: block; margin: auto;" />

---
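# Spark `MLlib`

A minimal sketch of the two components in action, assuming a Spark SQL DataFrame `bike` that already has a vector-valued `features` column (the DataFrame and column names here are illustrative, not from the slides):

```python
from pyspark.ml.feature import StandardScaler

# StandardScaler is an estimator: .fit() learns the scaling statistics
scaler = StandardScaler(inputCol = "features", outputCol = "scaled_features")
scaler_fit = scaler.fit(bike)

# The fitted object acts as a transformer: .transform() rescales the data
scaled_bike = scaler_fit.transform(bike)
```

---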
# How to fit an ML Model in Spark?

- Setting up response and predictors is different:

    + Create a `label` column which represents the response
    + Create a `features` column with all of the predictors in it!

---

# How to fit an ML Model in Spark?

- Setting up response and predictors is different:

    + Create a `label` column which represents the response
    + Create a `features` column with all of the predictors in it!

- Many functions with a `.transform()` method

```python
from pyspark.ml.feature import SQLTransformer

sqlTrans = SQLTransformer(
  statement = "SELECT year, log(km_driven) as log_km_driven FROM __THIS__")
sqlTrans.transform(bike)
```

---

# How to fit an ML Model in Spark?

- Setting up response and predictors is different:

    + Create a `label` column which represents the response
    + Create a `features` column with all of the predictors in it!

- Many functions with a `.transform()` method

```python
from pyspark.ml.feature import SQLTransformer

sqlTrans = SQLTransformer(
  statement = "SELECT year, log(km_driven) as log_km_driven FROM __THIS__")
sqlTrans.transform(bike)
```

- Models and CV functions have a `.fit()` method (once fitted, a `.transform()` method too!)

```python
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(regParam = 0, elasticNetParam = 0).fit(...)
```

---

# Jump Into Pyspark!

- Go through a basic example of fitting a linear regression model in Spark `MLlib` (a minimal sketch follows the recap)

---

# Recap

- Setting up response and predictors:

    + Create a `label` column which represents the response
    + Create a `features` column with all of the predictors in it!

- Many functions with a `.transform()` method

- Models and CV functions have a `.fit()` method (once fitted, a `.transform()` method too!)
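
---

# Recap: Minimal Fitting Sketch

Putting the recap together, a minimal sketch of fitting a linear regression in `MLlib`. The `bike` DataFrame and the columns `year` and `log_km_driven` echo the earlier `SQLTransformer` example; the response column `price` is a hypothetical addition:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Pack the predictors into a single vector-valued `features` column
assembler = VectorAssembler(inputCols = ["year", "log_km_driven"],
                            outputCol = "features")
bike_ml = assembler.transform(bike).withColumnRenamed("price", "label")

# Fit the model; the fitted model is itself a transformer
lr_fit = LinearRegression(regParam = 0, elasticNetParam = 0).fit(bike_ml)
lr_fit.transform(bike_ml).select("label", "prediction").show(5)
```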