layout: false
class: title-slide-section-red, middle

# Common Uses for Data
Justin Post

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Big Picture

.left45[
- 5 V's of Big Data
    + Volume
    + Variety
    + Velocity
    + Veracity (Variability)
    + Value

- Will look at the Big Data pipeline later
    + Databases/Data Lakes/Data Warehouses/etc.
    + SQL basics
    + Hadoop & Spark
]

.right50[
<img src="data:image/png;base64,#img/big-data-characteristics.png" width="370px" style="display: block; margin: auto;" />
]

--

- **What to do with the data?**

---

# Statistical Learning

**Statistical learning** - Inference, prediction/classification, and pattern finding

- Supervised learning - a variable (or variables) represents an **output** or **response** of interest

--

    + May model the response and
        - Make **inference** on the model parameters
        - **predict** a value or **classify** an observation

--

- Unsupervised learning - **No output or response variable** to shoot for
    + Goal - learn about patterns and relationships in the data

---

# Standard Rectangular Data

<img src="data:image/png;base64,#img/rectangular_data.png" width="850px" style="display: block; margin: auto;" />

---

# Data Driven Goals

Four major goals when using data:

1. Description

<div style="float: left; width: 45%;">
<img src="data:image/png;base64,#img/summary_stats.png" width="270px" style="display: block; margin: auto;" />
</div>
<div style="float: right; width: 45%;">
<img src="data:image/png;base64,#img/graph.png" width="500px" style="display: block; margin: auto;" />
</div>
<!--comment-->

---

# Data Driven Goals

Four major goals when using data:

<ol start = "2">
<li> Prediction/Classification</li>
</ol>

<div style="float: left; width: 45%;">
<img src="data:image/png;base64,#img/slr.png" width="600px" style="display: block; margin: auto;" />
</div>
<div style="float: left; width: 45%;">
<img src="data:image/png;base64,#img/tree.png" width="600px" style="display: block; margin: auto;" />
</div>

---

# Data Driven Goals

Four major goals when using data:

<ol start = "3">
<li> Inference</li>
<ul>
<li> Confidence Intervals</li>
<li> Hypothesis Testing</li>
</ul>
</ol>

---

# Data Driven Goals

Four major goals when using data:

<ol start = "4">
<li> Pattern Finding</li>
</ol>

<img src="data:image/png;base64,#img/clustering.png" width="650px" style="display: block; margin: auto;" />

---

# 1. Describing Data

Goal: Describe the **distribution** of the variable

- Distribution = pattern and frequency with which you observe a variable
- Numeric variable - entries are numerical values on which math can be performed

--

For a single numeric variable,

+ Shape: Histogram, Density plot, ...
+ Measures of center: Mean, Median, ...
+ Measures of spread: Variance, Standard Deviation, Quartiles, IQR, ...

--

For two numeric variables,

+ Shape: Scatter plot
+ Measures of Dependence: Correlation

---

# Quick Example

Read in some data

```python
import pandas as pd
wine_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/winequality-full.csv")
wine_data.head()
```

```
##    fixed acidity  volatile acidity  citric acid  ...  alcohol  quality  type
## 0            7.4              0.70         0.00  ...      9.4        5   Red
## 1            7.8              0.88         0.00  ...      9.8        5   Red
## 2            7.8              0.76         0.04  ...      9.8        5   Red
## 3           11.2              0.28         0.56  ...      9.8        6   Red
## 4            7.4              0.70         0.00  ...      9.4        5   Red
## 
## [5 rows x 13 columns]
```

---

# Lots of Summaries!

- Use the `describe()` method on a `pandas` data frame

```python
wine_data.describe()
```

```
##        fixed acidity  volatile acidity  ...      alcohol      quality
## count    6497.000000       6497.000000  ...  6497.000000  6497.000000
## mean        7.215307          0.339666  ...    10.491801     5.818378
## std         1.296434          0.164636  ...     1.192712     0.873255
## min         3.800000          0.080000  ...     8.000000     3.000000
## 25%         6.400000          0.230000  ...     9.500000     5.000000
## 50%         7.000000          0.290000  ...    10.300000     6.000000
## 75%         7.700000          0.400000  ...    11.300000     6.000000
## max        15.900000          1.580000  ...    14.900000     9.000000
## 
## [8 rows x 12 columns]
```

---

# Graphs

- Many standard graphs are available for summarizing data as well

.left45[
```python
wine_data.alcohol.plot.density()
```

<img src="data:image/png;base64,#07-Common_Uses_For_Data_files/figure-html/unnamed-chunk-11-1.svg" width="350px" style="display: block; margin: auto;" />
]

.right45[
```python
wine_data.plot.scatter(x = "alcohol", y = "residual sugar")
```

<img src="data:image/png;base64,#07-Common_Uses_For_Data_files/figure-html/unnamed-chunk-12-3.svg" width="350px" style="display: block; margin: auto;" />
]

---

# 2. Prediction/Classification

- A model is a mathematical representation of some phenomenon on which you've observed data
- Form of the model can vary greatly!

--

**Simple Linear Regression Model**

`$$\mbox{response = intercept + slope*predictor + Error}$$`

`$$Y_i = \beta_0+\beta_1x_i+E_i$$`

--

- Assumptions about the data-generating process are often made in order to do inference (they are not required to fit the model)

---

# Simple Linear Regression Model

- We'll learn how to 'fit' this model later

```python
from sklearn import linear_model
reg = linear_model.LinearRegression() # Create a LinearRegression object
# sklearn expects a 2D array of predictors, hence the reshape(-1, 1)
reg.fit(X = wine_data['alcohol'].values.reshape(-1,1), y = wine_data['residual sugar'].values)
```

```python
print(round(reg.intercept_, 3), round(reg.coef_[0], 3))
```

```
## 20.486 -1.434
```

---

# Simple Linear Regression Model

```python
import seaborn as sns
sns.regplot(x = wine_data["alcohol"], y = wine_data["residual sugar"], scatter_kws={'s':2})
```

<img src="data:image/png;base64,#07-Common_Uses_For_Data_files/figure-html/unnamed-chunk-14-5.svg" width="400px" style="display: block; margin: auto;" />

---

# 2. Prediction/Classification

- A model is a mathematical representation of some phenomenon on which you've observed data
- Form of the model can vary greatly!


**Regression Tree**

- This model can be used for prediction

```python
from sklearn.tree import DecisionTreeRegressor
reg_tree = DecisionTreeRegressor(max_depth=2) # Limit the tree to two levels of splits
reg_tree.fit(X = wine_data['alcohol'].values.reshape(-1,1), y = wine_data['residual sugar'].values)
```

---

# Regression Tree

- This model can be used for prediction

```python
from sklearn.tree import plot_tree
plot_tree(reg_tree)
```

<img src="data:image/png;base64,#07-Common_Uses_For_Data_files/figure-html/unnamed-chunk-16-7.svg" width="450px" style="display: block; margin: auto;" />

---

# 2. Prediction/Classification

- A model is a mathematical representation of some phenomenon on which you've observed data
- Form of the model can vary greatly!

**Logistic Regression**

- Consider a binary response, with classification as the task

`$$P(\mbox{success}|\mbox{predictor}) = \frac{e^{\mbox{intercept+slope*predictor}}}{1+e^{\mbox{intercept+slope*predictor}}}$$`

`$$P(\mbox{success}|\mbox{predictor}) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}$$`

--

We'll investigate a number of different models later in the course!

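
---

# Logistic Regression: Quick Sketch

A minimal sketch of the logistic regression formula from the previous slide, using only the Python standard library. The function names and the coefficient values are hypothetical, chosen for illustration; they are not fitted to the wine data. An observation is labeled a 'success' when its probability is larger than 0.5.

```python
import math

def success_probability(x, intercept, slope):
    """P(success | x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    linear = intercept + slope * x
    return math.exp(linear) / (1 + math.exp(linear))

def classify(x, intercept, slope, cutoff=0.5):
    """Label an observation 'success' when its probability exceeds the cutoff."""
    if success_probability(x, intercept, slope) > cutoff:
        return "success"
    return "failure"

# Hypothetical coefficients (assumed for illustration, not estimated from data)
b0, b1 = -10.0, 1.0

for x in [8.0, 10.0, 12.0]:
    print(x, round(success_probability(x, b0, b1), 3), classify(x, b0, b1))
```

- With a positive slope, larger values of the predictor give a larger probability of success
- The probability equals 0.5 exactly where intercept + slope*predictor is 0
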
<!--- Classify result as a 'success' for values of the predictor where this probability is larger than 0.5 (otherwise classify as a 'failure')-->

---

# Recap

Four major goals with data:

1. Description
2. Prediction/Classification
3. Inference
4. Pattern Finding

- Descriptive statistics summarize the distribution of a variable
- Supervised learning methods try to relate predictors to a response variable through a model
    - Some models are used for both inference and prediction/classification
    - Some are used just for prediction/classification
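
---

# Recap: Predicting with a Fitted Line

A minimal sketch of what 'prediction' means for the simple linear regression model, assuming the rounded intercept (20.486) and slope (-1.434) printed earlier for residual sugar versus alcohol. The helper name `predict_residual_sugar` is hypothetical; with the fitted `sklearn` object one would more likely call `reg.predict()`.

```python
# Rounded estimates reported earlier in these slides
INTERCEPT = 20.486
SLOPE = -1.434

def predict_residual_sugar(alcohol):
    """Predicted response = intercept + slope * predictor."""
    return INTERCEPT + SLOPE * alcohol

# Predictions at a few alcohol values (chosen arbitrarily for illustration)
for alcohol in [9.0, 10.0, 12.0]:
    print(alcohol, round(predict_residual_sugar(alcohol), 3))
```

- The negative slope means the predicted residual sugar decreases as alcohol increases
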