layout: false
class: title-slide-section-red, middle

# Common Uses for Data

Justin Post

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" alt = "" style="height: 60px;"/></div>

---

# Big Picture

.left45[
- 5 V's of Big Data
    + Volume
    + Variety
    + Velocity
    + Veracity (Variability)
    + Value

- Will look at the Big Data pipeline later
    + Databases/Data Lakes/Data Warehouses/etc.
    + SQL basics
    + Hadoop & Spark
]

.right50[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#img/big-data-characteristics.png" alt="Image displays the five V's of big data in a circle. These are Volume: Huge amount of data, Variety: Different formats of data from various sources, Value: Extract useful data, Velocity: High speed of accumulation of data, Veracity: Inconsistencies and uncertainty in data" width="370px" />
<p class="caption">Source: https://bit.ly/five_vs</p>
</div>
]

--

- **What to do with the data?**

---

# Statistical Learning

**Statistical learning** - Inference, prediction/classification, and pattern finding

- Supervised learning - a variable (or variables) represents an **output** or **response** of interest

--

    + May model response and
        - Make **inference** on the model parameters
        - **predict** a value or **classify** an observation

--

- Unsupervised learning - **No output or response variable** to shoot for
    + Goal - learn about patterns and relationships in the data

---

# Standard Rectangular Data

<img src="data:image/png;base64,#img/rectangular_data.png" alt="A rectangular data set is shown as a grid of values. The rows represent observations and the columns represent variables." width="850px" style="display: block; margin: auto;" />

---

# Data Driven Goals

Four major goals when using data:

1. Description

<div style="float: left; width: 45%;">
<img src="data:image/png;base64,#img/summary_stats.png" alt="A table of summary statistics is shown. 
The rows represent different groups (seasons here) and the columns represent different summary statistics." width="270px" style="display: block; margin: auto;" />
</div>
<div style="float: right; width: 45%;">
<img src="data:image/png;base64,#img/graph.png" alt="A graph with season on the x-axis and yardage on the y-axis is shown. Six lines are shown representing different summary statistics over time." width="500px" style="display: block; margin: auto;" />
</div>

<!--comment-->

---

# Data Driven Goals

Four major goals when using data:

<ol start = "2">
<li> Prediction/Classification</li>
</ol>

<div style="float: left; width: 45%;">
<img src="data:image/png;base64,#img/slr.png" alt="A scatter plot with a simple linear regression line overlaid is shown." width="600px" style="display: block; margin: auto;" />
</div>
<div style="float: left; width: 45%;">
<img src="data:image/png;base64,#img/tree.png" alt="A tree diagram is shown." width="600px" style="display: block; margin: auto;" />
</div>

---

# Data Driven Goals

Four major goals when using data:

<ol start = "3">
<li> Inference</li>
<ul>
<li> Confidence Intervals</li>
<li> Hypothesis Testing</li>
</ul>
</ol>

---

# Data Driven Goals

Four major goals when using data:

<ol start = "4">
<li> Pattern Finding</li>
</ol>

<img src="data:image/png;base64,#img/clustering.png" alt="A scatter plot is shown with points colored by three separate groups." width="650px" style="display: block; margin: auto;" />

---

# 1. Describing Data

Goal: Describe the **distribution** of the variable

- Distribution = pattern and frequency with which you observe a variable
- Numeric variable - entries are numerical values on which math can be performed

--

For a single numeric variable,

+ Shape: Histogram, Density plot, ...
+ Measures of center: Mean, Median, ...
+ Measures of spread: Variance, Standard Deviation, Quartiles, IQR, ...
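
A minimal sketch of the center and spread measures listed above, using Python's standard `statistics` module on a small made-up list of values (the numbers are hypothetical, not taken from any data set in these slides). With a `pandas` Series, the `.mean()`, `.median()`, `.var()`, `.std()`, and `.quantile()` methods give the same kinds of summaries.

```python
import statistics

# Hypothetical values, made up purely for illustration
values = [9.4, 9.8, 9.8, 10.0, 10.5, 11.2, 12.8]

# Measures of center
center = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
}

# Quartiles: cut the sorted data into 4 parts; 'inclusive' interpolates
# linearly between the observed minimum and maximum
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Measures of spread (sample versions, dividing by n - 1)
spread = {
    "variance": statistics.variance(values),
    "std": statistics.stdev(values),
    "IQR": q3 - q1,
}

print(center)
print(spread)
```

The IQR is simply the distance between the third and first quartiles, so it is less affected by a single extreme value than the variance or standard deviation.
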
--

For two numeric variables,

+ Shape: Scatter plot
+ Measures of Dependence: Correlation

---

# Quick Example

Read in some data

```python
import pandas as pd
wine_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/winequality-full.csv")
wine_data.head()
```

```
##    fixed acidity  volatile acidity  citric acid  ...  alcohol  quality  type
## 0            7.4              0.70         0.00  ...      9.4        5   Red
## 1            7.8              0.88         0.00  ...      9.8        5   Red
## 2            7.8              0.76         0.04  ...      9.8        5   Red
## 3           11.2              0.28         0.56  ...      9.8        6   Red
## 4            7.4              0.70         0.00  ...      9.4        5   Red
## 
## [5 rows x 13 columns]
```

---

# Lots of Summaries!

- Use the `describe()` method on a `pandas` data frame

```python
wine_data.describe()
```

```
##        fixed acidity  volatile acidity  ...      alcohol      quality
## count    6497.000000       6497.000000  ...  6497.000000  6497.000000
## mean        7.215307          0.339666  ...    10.491801     5.818378
## std         1.296434          0.164636  ...     1.192712     0.873255
## min         3.800000          0.080000  ...     8.000000     3.000000
## 25%         6.400000          0.230000  ...     9.500000     5.000000
## 50%         7.000000          0.290000  ...    10.300000     6.000000
## 75%         7.700000          0.400000  ...    11.300000     6.000000
## max        15.900000          1.580000  ...    14.900000     9.000000
## 
## [8 rows x 12 columns]
```

---

# Graphs

- Many standard graphs to summarize with as well

.left45[

```python
wine_data.alcohol.plot.density()
```

<img src="data:image/png;base64,#07-Common_Uses_For_Data_files/figure-html/unnamed-chunk-11-1.svg" alt="A smoothed histogram is shown. The x-axis represents the alcohol content and the y-axis the fitted density. The graph goes up sharply at 8 and then steps back down towards 0 at 14." width="350px" style="display: block; margin: auto;" />
]

.right45[

```python
wine_data.plot.scatter(x = "alcohol", y = "residual sugar")
```

<img src="data:image/png;base64,#07-Common_Uses_For_Data_files/figure-html/unnamed-chunk-12-3.svg" alt="A scatter plot between alcohol (x) and residual sugar (y) is shown. A general, weak, negative trend exists." width="350px" style="display: block; margin: auto;" />
]

---

# 2. Prediction/Classification

- Model: a mathematical representation of some phenomenon on which you've observed data
- Form of the model can vary greatly!

--

**Simple Linear Regression Model**

`$$\mbox{response = intercept + slope*predictor + Error}$$`

`$$Y_i = \beta_0+\beta_1x_i+E_i$$`

--

- Assumptions are often made about the data generating process to make inference (not required)

---

# Simple Linear Regression Model

- We'll learn how to 'fit' this model later

```python
from sklearn import linear_model
reg = linear_model.LinearRegression() #Create a reg object
# sklearn expects a 2D array for X, hence the reshape to a single column
reg.fit(X = wine_data['alcohol'].values.reshape(-1,1), y = wine_data['residual sugar'].values)
```

```python
print(round(reg.intercept_, 3), round(reg.coef_[0], 3))
```

```
## 20.486 -1.434
```

---

# Simple Linear Regression Model

```python
import seaborn as sns
sns.regplot(x = wine_data["alcohol"], y = wine_data["residual sugar"], scatter_kws={'s':2})
```

<img src="data:image/png;base64,#07-Common_Uses_For_Data_files/figure-html/unnamed-chunk-16-5.svg" alt="A scatter plot between alcohol (x) and residual sugar (y) is shown. A general, weak, negative trend exists. The simple linear regression line is overlaid and has a negative slope." width="400px" style="display: block; margin: auto;" />

---

# 2. Prediction/Classification

- Model: a mathematical representation of some phenomenon on which you've observed data
- Form of the model can vary greatly!

**Regression Tree**

- This model can be used for prediction

```python
from sklearn.tree import DecisionTreeRegressor
reg_tree = DecisionTreeRegressor(max_depth=2)
reg_tree.fit(X = wine_data['alcohol'].values.reshape(-1,1), y = wine_data['residual sugar'].values)
```

---

# Regression Tree

- This model can be used for prediction

```python
from sklearn.tree import plot_tree
plot_tree(reg_tree)
```

<img src="data:image/png;base64,#07-Common_Uses_For_Data_files/figure-html/unnamed-chunk-19-7.svg" alt="A tree diagram is shown. 
The first node splits into two branches and each of those nodes splits into two more branches. Observations are predicted by following the flow of the tree to a terminal node." width="450px" style="display: block; margin: auto;" />

---

# 2. Prediction/Classification

- Model: a mathematical representation of some phenomenon on which you've observed data
- Form of the model can vary greatly!

**Logistic Regression**

- Consider a binary response with classification as the task

`$$P(\mbox{success}|\mbox{predictor}) = \frac{e^{\mbox{intercept+slope*predictor}}}{1+e^{\mbox{intercept+slope*predictor}}}$$`

`$$P(\mbox{success}|\mbox{predictor}) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}$$`

--

We'll investigate a number of different models later in the course!

<!--- Classify result as a 'success' for values of the predictor where this probability is larger than 0.5 (otherwise classify as a 'failure')-->

---

# Recap

Four major goals with data:

1. Description
2. Prediction/Classification
3. Inference
4. Pattern Finding

- Descriptive Statistics try to summarize the distribution of the variable
- Supervised Learning methods try to relate predictors to a response variable through a model
    - Some models are used for both inference and prediction/classification
    - Some are used just for prediction/classification
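
---

# Logistic Regression: Quick Sketch

The logistic regression formula from the earlier slide can be evaluated directly. The sketch below is only an illustration: the coefficient values `beta_0` and `beta_1` are hypothetical (not fitted to any data), the helper name `success_probability` is made up for this example, and the 0.5 cutoff for classifying a 'success' is a common convention assumed here.

```python
import math

def success_probability(x, intercept, slope):
    """P(success | x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    linear = intercept + slope * x
    return math.exp(linear) / (1 + math.exp(linear))

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -10.0, 1.0

for x in [8.0, 10.0, 12.0]:
    p = success_probability(x, beta_0, beta_1)
    # Assumed rule: classify as 'success' when the probability exceeds 0.5
    label = "success" if p > 0.5 else "failure"
    print(x, round(p, 3), label)
```

With a positive slope the probability of success rises with the predictor, and it equals exactly 0.5 where `intercept + slope * x` is zero.
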