class: center, middle, inverse, title-slide

.title[
# Exploratory Data Analysis (EDA) Concepts
]
.author[
### Justin Post
]

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Recap!

- Data Science!!
- R Projects/Quarto/Git/GitHub for reproducibility/communication
- R Data Structures
    + Vectors, Matrices, Data Frames, Lists
- R Control Flow
    + if/then/else, loops, function writing
- Reading & Manipulating data with the tidyverse!
- Next: Gain meaningful insights from data through EDA
- Later: Dashboards, Predictive Modeling, & More

---

# EDA Basics

- Get to know your data!
- EDA generally consists of a few steps:
    + Understand how your data is stored
    + Do basic data validation
    + Determine rate of missing values
    + Clean up data as needed
    + Investigate distributions
        - Univariate measures/graphs
        - Multivariate measures/graphs
    + Apply transformations and repeat previous step

---

# Understand How Data is Stored

Let's read in some data!

- [Appendicitis Data](https://www4.stat.ncsu.edu/~online/datasets/app_data.xlsx)

> This dataset was acquired in a retrospective study from a cohort of pediatric patients admitted with abdominal pain to Children’s Hospital St. Hedwig in Regensburg, Germany. ... Alongside multiple US images for each subject, the dataset includes information encompassing laboratory tests, physical examination results, clinical scores, such as Alvarado and pediatric appendicitis scores, and expert-produced ultrasonographic findings. Lastly, the subjects were labeled w.r.t. three target variables: diagnosis (appendicitis vs. no appendicitis), management (surgical vs. conservative) and severity (complicated vs. uncomplicated or no appendicitis). ...

---

# Understand How Data is Stored

```r
# download data to a local folder first
library(tidyverse)
library(readxl)
app_data <- read_excel("data/app_data.xlsx", sheet = 1)
```

- Column data types should make sense for what you expect!
```r
app_data
```

```
## # A tibble: 782 x 58
##     Age BMI               Sex   Height Weight Length_of_Stay Management Severity
##   <dbl> <chr>             <chr>  <dbl>  <dbl>          <dbl> <chr>      <chr>
## 1  12.7 16.8999999999999~ fema~    148   37                3 conservat~ uncompl~
## 2  14.1 31.9              male     147   69.5              2 conservat~ uncompl~
## 3  14.1 23.3              fema~    163   62                4 conservat~ uncompl~
## 4  16.4 20.6              fema~    165   56                3 conservat~ uncompl~
## 5  11.1 16.8999999999999~ fema~    163   45                3 conservat~ uncompl~
## # i 777 more rows
## # i 50 more variables: Diagnosis_Presumptive <chr>, Diagnosis <chr>,
## #   Alvarado_Score <dbl>, Paedriatic_Appendicitis_Score <dbl>,
## #   Appendix_on_US <chr>, Appendix_Diameter <dbl>, Migratory_Pain <chr>,
## #   Lower_Right_Abd_Pain <chr>, Contralateral_Rebound_Tenderness <chr>,
## #   Coughing_Pain <chr>, Nausea <chr>, Loss_of_Appetite <chr>,
## #   Body_Temperature <dbl>, WBC_Count <dbl>, Neutrophil_Percentage <dbl>, ...
```

---

# Understand How Data is Stored

- Check the structure of the data!

```r
str(app_data)
```

```
## tibble [782 x 58] (S3: tbl_df/tbl/data.frame)
##  $ Age                             : num [1:782] 12.7 14.1 14.1 16.4 11.1 ...
##  $ BMI                             : chr [1:782] "16.899999999999999" "31.9" "23.3" "20.6" ...
##  $ Sex                             : chr [1:782] "female" "male" "female" "female" ...
##  $ Height                          : num [1:782] 148 147 163 165 163 121 140 NA 131 174 ...
##  $ Weight                          : num [1:782] 37 69.5 62 56 45 45 38.5 21.5 26.7 45.5 ...
##  $ Length_of_Stay                  : num [1:782] 3 2 4 3 3 3 3 2 3 3 ...
##  $ Management                      : chr [1:782] "conservative" "conservative" "conservative" "conservative" ...
##  $ Severity                        : chr [1:782] "uncomplicated" "uncomplicated" "uncomplicated" "uncomplicated" ...
##  $ Diagnosis_Presumptive           : chr [1:782] "appendicitis" "appendicitis" "appendicitis" "appendicitis" ...
##  $ Diagnosis                       : chr [1:782] "appendicitis" "no appendicitis" "no appendicitis" "no appendicitis" ...
##  $ Alvarado_Score                  : num [1:782] 4 5 5 7 5 6 5 3 7 4 ...
##  $ Paedriatic_Appendicitis_Score   : num [1:782] 3 4 3 6 6 7 6 3 6 4 ...
##  $ Appendix_on_US                  : chr [1:782] "yes" "no" "no" "no" ...
##  $ Appendix_Diameter               : num [1:782] 7.1 NA NA NA 7 NA NA NA 3.7 8 ...
##  $ Migratory_Pain                  : chr [1:782] "no" "yes" "no" "yes" ...
##  $ Lower_Right_Abd_Pain            : chr [1:782] "yes" "yes" "yes" "yes" ...
##  $ Contralateral_Rebound_Tenderness: chr [1:782] "yes" "yes" "yes" "no" ...
##  $ Coughing_Pain                   : chr [1:782] "no" "no" "no" "no" ...
##  $ Nausea                          : chr [1:782] "no" "no" "no" "yes" ...
##  $ Loss_of_Appetite                : chr [1:782] "yes" "yes" "no" "yes" ...
##  $ Body_Temperature                : num [1:782] 37 36.9 36.6 36 36.9 36.9 36.7 36.8 37.3 37.1 ...
##  $ WBC_Count                       : num [1:782] 7.7 8.1 13.2 11.4 8.1 9.5 10 8 20.9 5.8 ...
##  $ Neutrophil_Percentage           : num [1:782] 68.2 64.8 74.8 63 44 71.4 69.1 79.6 76 47.2 ...
##  $ Segmented_Neutrophils           : num [1:782] NA NA NA NA NA NA NA NA NA NA ...
##  $ Neutrophilia                    : chr [1:782] "no" "no" "no" "no" ...
##  $ RBC_Count                       : num [1:782] 5.27 5.26 3.98 4.64 4.44 4.96 4.77 4.89 4.61 4.78 ...
##  $ Hemoglobin                      : num [1:782] 14.8 15.7 11.4 13.6 12.6 12.5 12.7 12 13.4 12.9 ...
##  $ RDW                             : num [1:782] 12.2 12.7 12.2 13.2 13.6 13.3 12.6 13.9 12 12.6 ...
##  $ Thrombocyte_Count               : num [1:782] 254 151 300 258 311 249 337 412 350 220 ...
##  $ Ketones_in_Urine                : chr [1:782] "++" "no" "no" "no" ...
##  $ RBC_in_Urine                    : chr [1:782] "+" "no" "no" "no" ...
##  $ WBC_in_Urine                    : chr [1:782] "no" "no" "no" "no" ...
##  $ CRP                             : num [1:782] 0 3 3 0 0 63 9 0 20 0 ...
##  $ Dysuria                         : chr [1:782] "no" "yes" "no" "yes" ...
##  $ Stool                           : chr [1:782] "normal" "normal" "constipation" "normal" ...
##  $ Peritonitis                     : chr [1:782] "no" "no" "no" "no" ...
##  $ Psoas_Sign                      : chr [1:782] "yes" "yes" "yes" "yes" ...
##  $ Ipsilateral_Rebound_Tenderness  : chr [1:782] "no" "no" "no" "no" ...
##  $ US_Performed                    : chr [1:782] "yes" "yes" "yes" "yes" ...
##  $ US_Number                       : num [1:782] 882 883 884 886 887 888 889 890 891 893 ...
##  $ Free_Fluids                     : chr [1:782] "no" "no" "no" "no" ...
##  $ Appendix_Wall_Layers            : chr [1:782] "intact" NA NA NA ...
##  $ Target_Sign                     : chr [1:782] NA NA NA NA ...
##  $ Appendicolith                   : chr [1:782] "suspected" NA NA NA ...
##  $ Perfusion                       : chr [1:782] NA NA NA NA ...
##  $ Perforation                     : chr [1:782] "no" NA NA NA ...
##  $ Surrounding_Tissue_Reaction     : chr [1:782] "yes" NA NA NA ...
##  $ Appendicular_Abscess            : chr [1:782] "no" NA NA NA ...
##  $ Abscess_Location                : chr [1:782] NA NA NA NA ...
##  $ Pathological_Lymph_Nodes        : chr [1:782] "yes" NA NA "yes" ...
##  $ Lymph_Nodes_Location            : chr [1:782] "reUB" NA NA "reUB" ...
##  $ Bowel_Wall_Thickening           : chr [1:782] NA NA NA NA ...
##  $ Conglomerate_of_Bowel_Loops     : chr [1:782] NA NA NA NA ...
##  $ Ileus                           : chr [1:782] NA NA NA NA ...
##  $ Coprostasis                     : chr [1:782] NA NA NA NA ...
##  $ Meteorism                       : chr [1:782] NA "yes" "yes" NA ...
##  $ Enteritis                       : chr [1:782] NA NA "yes" "yes" ...
##  $ Gynecological_Findings          : chr [1:782] NA NA NA NA ...
```

---

# Convert Columns Explicitly

- `as.*()` family of functions can help coerce columns to the correct type

```r
app_data <- app_data |>
  mutate(BMI = as.numeric(BMI),
         US_Number = as.character(US_Number))
app_data
```

```
## # A tibble: 782 x 58
##     Age   BMI Sex    Height Weight Length_of_Stay Management   Severity
##   <dbl> <dbl> <chr>   <dbl>  <dbl>          <dbl> <chr>        <chr>
## 1  12.7  16.9 female    148   37                3 conservative uncomplicated
## 2  14.1  31.9 male      147   69.5              2 conservative uncomplicated
## 3  14.1  23.3 female    163   62                4 conservative uncomplicated
## 4  16.4  20.6 female    165   56                3 conservative uncomplicated
## 5  11.1  16.9 female    163   45                3 conservative uncomplicated
## # i 777 more rows
## # i 50 more variables: Diagnosis_Presumptive <chr>, Diagnosis <chr>,
## #   Alvarado_Score <dbl>, Paedriatic_Appendicitis_Score <dbl>,
## #   Appendix_on_US <chr>, Appendix_Diameter <dbl>, Migratory_Pain <chr>,
## #   Lower_Right_Abd_Pain <chr>, Contralateral_Rebound_Tenderness <chr>,
## #   Coughing_Pain <chr>, Nausea <chr>, Loss_of_Appetite <chr>,
## #   Body_Temperature <dbl>, WBC_Count <dbl>, Neutrophil_Percentage <dbl>, ...
``` --- # Do Basic Data Validation - Can use the `psych::describe()` function - Check that the min's, max's, etc. all make sense! ```r psych::describe(app_data) ``` ``` ## vars n mean sd median trimmed mad ## Age 1 781 11.35 3.53 11.44 11.53 3.59 ## BMI 2 755 18.91 4.39 18.06 18.43 3.91 ## Sex* 3 780 1.52 0.50 2.00 1.52 0.00 ## Height 4 756 148.02 19.73 149.65 149.33 19.50 ## Weight 5 779 43.17 17.39 41.40 42.18 18.68 ## Length_of_Stay 6 778 4.28 2.57 3.00 3.85 1.48 ## Management* 7 781 1.42 0.57 1.00 1.35 0.00 ## Severity* 8 781 1.85 0.36 2.00 1.93 0.00 ## Diagnosis_Presumptive* 9 780 4.04 2.86 3.00 3.17 0.00 ## Diagnosis* 10 780 1.41 0.49 1.00 1.38 0.00 ## Alvarado_Score 11 730 5.92 2.16 6.00 5.96 2.97 ## Paedriatic_Appendicitis_Score 12 730 5.25 1.96 5.00 5.21 1.48 ## Appendix_on_US* 13 777 1.65 0.48 2.00 1.69 0.00 ## Appendix_Diameter 14 498 7.76 2.54 7.50 7.63 2.22 ## Migratory_Pain* 15 773 1.27 0.45 1.00 1.22 0.00 ## Lower_Right_Abd_Pain* 16 774 1.95 0.22 2.00 2.00 0.00 ## Contralateral_Rebound_Tenderness* 17 767 1.39 0.49 1.00 1.36 0.00 ## Coughing_Pain* 18 766 1.28 0.45 1.00 1.23 0.00 ## Nausea* 19 774 1.59 0.49 2.00 1.61 0.00 ## Loss_of_Appetite* 20 772 1.51 0.50 2.00 1.51 0.00 ## Body_Temperature 21 775 37.40 0.90 37.20 37.36 0.74 ## WBC_Count 22 776 12.67 5.37 12.00 12.26 5.78 ## Neutrophil_Percentage 23 679 71.79 14.46 75.50 72.94 14.08 ## Segmented_Neutrophils 24 54 64.93 15.09 64.50 65.66 15.57 ## Neutrophilia* 25 732 1.49 0.50 1.00 1.49 0.00 ## RBC_Count 26 764 4.80 0.50 4.78 4.78 0.36 ## Hemoglobin 27 764 13.38 1.39 13.30 13.34 1.04 ## RDW 28 756 13.18 4.54 12.70 12.81 0.74 ## Thrombocyte_Count 29 764 285.25 72.49 276.00 281.32 65.23 ## Ketones_in_Urine* 30 582 3.22 1.07 4.00 3.40 0.00 ## RBC_in_Urine* 31 576 3.43 1.10 4.00 3.66 0.00 ## WBC_in_Urine* 32 583 3.65 0.90 4.00 3.92 0.00 ## CRP 33 771 31.39 57.43 7.00 16.79 10.38 ## Dysuria* 34 753 1.06 0.23 1.00 1.00 0.00 ## Stool* 35 765 3.49 0.97 4.00 3.73 0.00 ## Peritonitis* 36 773 2.65 0.58 3.00 
2.75 0.00 ## Psoas_Sign* 37 745 1.31 0.46 1.00 1.27 0.00 ## Ipsilateral_Rebound_Tenderness* 38 619 1.06 0.24 1.00 1.00 0.00 ## US_Performed* 39 778 1.98 0.14 2.00 2.00 0.00 ## US_Number* 40 760 380.50 219.54 380.50 380.50 281.69 ## Free_Fluids* 41 719 1.43 0.50 1.00 1.41 0.00 ## Appendix_Wall_Layers* 42 218 1.75 0.96 1.00 1.69 0.00 ## Target_Sign* 43 138 1.63 0.48 2.00 1.66 0.00 ## Appendicolith* 44 69 2.00 0.99 2.00 2.00 1.48 ## Perfusion* 45 63 1.59 0.66 2.00 1.51 1.48 ## Perforation* 46 81 2.33 1.34 2.00 2.29 1.48 ## Surrounding_Tissue_Reaction* 47 252 1.83 0.38 2.00 1.91 0.00 ## Appendicular_Abscess* 48 85 1.46 0.84 1.00 1.33 0.00 ## Abscess_Location* 49 13 3.38 1.98 2.00 3.27 1.48 ## Pathological_Lymph_Nodes* 50 203 1.76 0.43 2.00 1.82 0.00 ## Lymph_Nodes_Location* 51 121 14.62 6.41 17.00 14.88 8.90 ## Bowel_Wall_Thickening* 52 99 1.56 0.50 2.00 1.57 0.00 ## Conglomerate_of_Bowel_Loops* 53 43 1.49 0.51 1.00 1.49 0.00 ## Ileus* 54 60 1.38 0.49 1.00 1.35 0.00 ## Coprostasis* 55 71 1.65 0.48 2.00 1.68 0.00 ## Meteorism* 56 140 1.92 0.27 2.00 2.00 0.00 ## Enteritis* 57 66 1.77 0.42 2.00 1.83 0.00 ## Gynecological_Findings* 58 26 6.96 2.85 6.00 6.95 2.97 ## min max range skew kurtosis se ## Age 0.00 18.36 18.36 -0.44 -0.18 0.13 ## BMI 7.83 38.16 30.33 1.13 1.66 0.16 ## Sex* 1.00 2.00 1.00 -0.07 -2.00 0.02 ## Height 53.00 192.00 139.00 -0.68 0.67 0.72 ## Weight 3.96 103.00 99.04 0.52 0.01 0.62 ## Length_of_Stay 1.00 28.00 27.00 3.23 17.36 0.09 ## Management* 1.00 4.00 3.00 1.00 0.22 0.02 ## Severity* 1.00 2.00 1.00 -1.93 1.73 0.01 ## Diagnosis_Presumptive* 1.00 16.00 15.00 2.43 4.11 0.10 ## Diagnosis* 1.00 2.00 1.00 0.38 -1.86 0.02 ## Alvarado_Score 0.00 10.00 10.00 -0.12 -0.81 0.08 ## Paedriatic_Appendicitis_Score 0.00 10.00 10.00 0.19 -0.45 0.07 ## Appendix_on_US* 1.00 2.00 1.00 -0.62 -1.62 0.02 ## Appendix_Diameter 2.70 17.00 14.30 0.50 -0.07 0.11 ## Migratory_Pain* 1.00 2.00 1.00 1.02 -0.97 0.02 ## Lower_Right_Abd_Pain* 1.00 2.00 1.00 -3.98 13.89 0.01 ## 
Contralateral_Rebound_Tenderness* 1.00 2.00 1.00 0.46 -1.79 0.02 ## Coughing_Pain* 1.00 2.00 1.00 0.95 -1.09 0.02 ## Nausea* 1.00 2.00 1.00 -0.35 -1.88 0.02 ## Loss_of_Appetite* 1.00 2.00 1.00 -0.03 -2.00 0.02 ## Body_Temperature 26.90 40.20 13.30 -1.47 22.77 0.03 ## WBC_Count 2.60 37.70 35.10 0.76 0.62 0.19 ## Neutrophil_Percentage 27.20 97.70 70.50 -0.65 -0.51 0.56 ## Segmented_Neutrophils 32.00 91.00 59.00 -0.33 -0.72 2.05 ## Neutrophilia* 1.00 2.00 1.00 0.03 -2.00 0.02 ## RBC_Count 3.62 14.00 10.38 8.28 149.36 0.02 ## Hemoglobin 8.20 36.00 27.80 5.56 89.57 0.05 ## RDW 11.20 86.90 75.70 14.90 229.39 0.17 ## Thrombocyte_Count 91.00 708.00 617.00 0.70 1.57 2.62 ## Ketones_in_Urine* 1.00 4.00 3.00 -1.10 -0.20 0.04 ## RBC_in_Urine* 1.00 4.00 3.00 -1.60 0.74 0.05 ## WBC_in_Urine* 1.00 4.00 3.00 -2.36 3.84 0.04 ## CRP 0.00 365.00 365.00 2.92 9.47 2.07 ## Dysuria* 1.00 2.00 1.00 3.76 12.14 0.01 ## Stool* 1.00 4.00 3.00 -1.86 2.05 0.03 ## Peritonitis* 1.00 3.00 2.00 -1.40 0.94 0.02 ## Psoas_Sign* 1.00 2.00 1.00 0.80 -1.36 0.02 ## Ipsilateral_Rebound_Tenderness* 1.00 2.00 1.00 3.65 11.31 0.01 ## US_Performed* 1.00 2.00 1.00 -6.98 46.76 0.00 ## US_Number* 1.00 760.00 759.00 0.00 -1.20 7.96 ## Free_Fluids* 1.00 2.00 1.00 0.28 -1.93 0.02 ## Appendix_Wall_Layers* 1.00 4.00 3.00 0.54 -1.62 0.06 ## Target_Sign* 1.00 2.00 1.00 -0.53 -1.73 0.04 ## Appendicolith* 1.00 3.00 2.00 0.00 -1.98 0.12 ## Perfusion* 1.00 4.00 3.00 0.99 1.13 0.08 ## Perforation* 1.00 4.00 3.00 0.28 -1.73 0.15 ## Surrounding_Tissue_Reaction* 1.00 2.00 1.00 -1.70 0.91 0.02 ## Appendicular_Abscess* 1.00 3.00 2.00 1.26 -0.38 0.09 ## Abscess_Location* 1.00 7.00 6.00 0.57 -1.37 0.55 ## Pathological_Lymph_Nodes* 1.00 2.00 1.00 -1.20 -0.56 0.03 ## Lymph_Nodes_Location* 1.00 24.00 23.00 -0.19 -1.21 0.58 ## Bowel_Wall_Thickening* 1.00 2.00 1.00 -0.22 -1.97 0.05 ## Conglomerate_of_Bowel_Loops* 1.00 2.00 1.00 0.04 -2.04 0.08 ## Ileus* 1.00 2.00 1.00 0.47 -1.81 0.06 ## Coprostasis* 1.00 2.00 1.00 -0.61 -1.66 0.06 ## 
Meteorism*                         1.00   2.00   1.00  -3.10     7.66 0.02
## Enteritis*                         1.00   2.00   1.00  -1.27    -0.39 0.05
## Gynecological_Findings*            1.00  13.00  12.00   0.08    -0.29 0.56
```

---

# Determine Rate of Missing Values

- Use `is.na()`

```r
colSums(is.na(app_data))
```

```
##                              Age                              BMI
##                                1                               27
##                              Sex                           Height
##                                2                               26
##                           Weight                   Length_of_Stay
##                                3                                4
##                       Management                         Severity
##                                1                                1
##            Diagnosis_Presumptive                        Diagnosis
##                                2                                2
##                   Alvarado_Score    Paedriatic_Appendicitis_Score
##                               52                               52
##                   Appendix_on_US                Appendix_Diameter
##                                5                              284
##                   Migratory_Pain             Lower_Right_Abd_Pain
##                                9                                8
## Contralateral_Rebound_Tenderness                    Coughing_Pain
##                               15                               16
##                           Nausea                 Loss_of_Appetite
##                                8                               10
##                 Body_Temperature                        WBC_Count
##                                7                                6
##            Neutrophil_Percentage            Segmented_Neutrophils
##                              103                              728
##                     Neutrophilia                        RBC_Count
##                               50                               18
##                       Hemoglobin                              RDW
##                               18                               26
##                Thrombocyte_Count                 Ketones_in_Urine
##                               18                              200
##                     RBC_in_Urine                     WBC_in_Urine
##                              206                              199
##                              CRP                          Dysuria
##                               11                               29
##                            Stool                      Peritonitis
##                               17                                9
##                       Psoas_Sign   Ipsilateral_Rebound_Tenderness
##                               37                              163
##                     US_Performed                        US_Number
##                                4                               22
##                      Free_Fluids             Appendix_Wall_Layers
##                               63                              564
##                      Target_Sign                    Appendicolith
##                              644                              713
##                        Perfusion                      Perforation
##                              719                              701
##      Surrounding_Tissue_Reaction             Appendicular_Abscess
##                              530                              697
##                 Abscess_Location         Pathological_Lymph_Nodes
##                              769                              579
##             Lymph_Nodes_Location            Bowel_Wall_Thickening
##                              661                              683
##      Conglomerate_of_Bowel_Loops                            Ileus
##                              739                              722
##                      Coprostasis                        Meteorism
##                              711                              642
##                        Enteritis           Gynecological_Findings
##                              716                              756
```

---

# Determine Rate of Missing Values

- Stay in the `tidyverse`

```r
sum_na <- function(column){
  sum(is.na(column))
}
na_counts <- app_data |>
  summarize(across(everything(), sum_na))
na_counts
```

```
## # A tibble: 1 x 58
##     Age   BMI   Sex Height Weight Length_of_Stay Management Severity
##   <int> <int> <int>  <int>  <int>          <int>      <int>    <int>
## 1     1    27     2     26      3              4          1        1
## # i 50 more variables: Diagnosis_Presumptive <int>, Diagnosis <int>,
## #   Alvarado_Score <int>, Paedriatic_Appendicitis_Score <int>,
## #   Appendix_on_US <int>, Appendix_Diameter
<int>, Migratory_Pain <int>,
## #   Lower_Right_Abd_Pain <int>, Contralateral_Rebound_Tenderness <int>,
## #   Coughing_Pain <int>, Nausea <int>, Loss_of_Appetite <int>,
## #   Body_Temperature <int>, WBC_Count <int>, Neutrophil_Percentage <int>,
## #   Segmented_Neutrophils <int>, Neutrophilia <int>, RBC_Count <int>, ...
```

---

# Clean Up Data As Needed

- Can remove rows with missing values using the `tidyr::drop_na()` function
- First, find the columns with fewer than 30 missing values

```r
names(app_data)[na_counts < 30]
```

```
##  [1] "Age"                              "BMI"
##  [3] "Sex"                              "Height"
##  [5] "Weight"                           "Length_of_Stay"
##  [7] "Management"                       "Severity"
##  [9] "Diagnosis_Presumptive"            "Diagnosis"
## [11] "Appendix_on_US"                   "Migratory_Pain"
## [13] "Lower_Right_Abd_Pain"             "Contralateral_Rebound_Tenderness"
## [15] "Coughing_Pain"                    "Nausea"
## [17] "Loss_of_Appetite"                 "Body_Temperature"
## [19] "WBC_Count"                        "RBC_Count"
## [21] "Hemoglobin"                       "RDW"
## [23] "Thrombocyte_Count"                "CRP"
## [25] "Dysuria"                          "Stool"
## [27] "Peritonitis"                      "US_Performed"
## [29] "US_Number"
```

---

# Clean Up Data As Needed

- Can remove rows with missing values using the `tidyr::drop_na()` function

```r
app_data |>
  drop_na(names(app_data)[na_counts < 30])
```

```
## # A tibble: 674 x 58
##     Age   BMI Sex    Height Weight Length_of_Stay Management   Severity
##   <dbl> <dbl> <chr>   <dbl>  <dbl>          <dbl> <chr>        <chr>
## 1  12.7  16.9 female    148   37                3 conservative uncomplicated
## 2  14.1  31.9 male      147   69.5              2 conservative uncomplicated
## 3  14.1  23.3 female    163   62                4 conservative uncomplicated
## 4  16.4  20.6 female    165   56                3 conservative uncomplicated
## 5  11.1  16.9 female    163   45                3 conservative uncomplicated
## # i 669 more rows
## # i 50 more variables: Diagnosis_Presumptive <chr>, Diagnosis <chr>,
## #   Alvarado_Score <dbl>, Paedriatic_Appendicitis_Score <dbl>,
## #   Appendix_on_US <chr>, Appendix_Diameter <dbl>, Migratory_Pain <chr>,
## #   Lower_Right_Abd_Pain <chr>, Contralateral_Rebound_Tenderness <chr>,
## #   Coughing_Pain <chr>, Nausea <chr>, Loss_of_Appetite <chr>,
## #   Body_Temperature <dbl>, WBC_Count <dbl>,
Neutrophil_Percentage <dbl>, ...
```

---

# May Want to Impute Values

- We lose information when removing rows!
- Can **impute** missing values with `tidyr::replace_na()`

```r
app_data <- app_data |>
  replace_na(list(BMI = mean(app_data$BMI, na.rm = TRUE),
                  Height = mean(app_data$Height, na.rm = TRUE)))
app_data
```

```
## # A tibble: 782 x 58
##     Age   BMI Sex    Height Weight Length_of_Stay Management   Severity
##   <dbl> <dbl> <chr>   <dbl>  <dbl>          <dbl> <chr>        <chr>
## 1  12.7  16.9 female    148   37                3 conservative uncomplicated
## 2  14.1  31.9 male      147   69.5              2 conservative uncomplicated
## 3  14.1  23.3 female    163   62                4 conservative uncomplicated
## 4  16.4  20.6 female    165   56                3 conservative uncomplicated
## 5  11.1  16.9 female    163   45                3 conservative uncomplicated
## # i 777 more rows
## # i 50 more variables: Diagnosis_Presumptive <chr>, Diagnosis <chr>,
## #   Alvarado_Score <dbl>, Paedriatic_Appendicitis_Score <dbl>,
## #   Appendix_on_US <chr>, Appendix_Diameter <dbl>, Migratory_Pain <chr>,
## #   Lower_Right_Abd_Pain <chr>, Contralateral_Rebound_Tenderness <chr>,
## #   Coughing_Pain <chr>, Nausea <chr>, Loss_of_Appetite <chr>,
## #   Body_Temperature <dbl>, WBC_Count <dbl>, Neutrophil_Percentage <dbl>, ...
```

---

# EDA Basics

- Get to know your data!
- EDA generally consists of a few steps:
    + Understand how your data is stored
    + Do basic data validation
    + Determine rate of missing values
    + Clean up data as needed
    + Investigate distributions
        - Univariate measures/graphs
        - Multivariate measures/graphs
    + Apply transformations and repeat previous step

---

# Investigate distributions

- How to summarize data depends on the type of data
    + Categorical (Qualitative) variable - entries are a label or attribute
    + Numeric (Quantitative) variable - entries are a numerical value where math can be performed

---
layout: false

# Investigate distributions

- How to summarize data depends on the type of data
    + Categorical (Qualitative) variable - entries are a label or attribute
    + Numeric (Quantitative) variable - entries are a numerical value where math can be performed
- Numerical summaries (across subgroups)
    + Contingency Tables (for categorical data)
    + Mean/Median
    + Standard Deviation/Variance/IQR
    + Quantiles/Percentiles
- Graphical summaries (across subgroups)
    + Bar plots (for categorical data)
    + Histograms
    + Box plots
    + Scatter plots

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Categorical Data

Goal: Describe the **distribution** of the variable

- Distribution = pattern and frequency with which you observe a variable
- Categorical variable - entries are a label or attribute
    + Describe the relative frequency (or count) for each category

Variables of interest for this section:

- `Sex`, `Diagnosis`, `Severity`

---

# Factors

A factor variable is really useful for certain categorical variables!

**Factor** - special class of vector with a `levels` attribute

- Can have more descriptive labels, ordering of categories, etc.
- Levels define **all** possible values for that variable
    + Great for a variable like `Day` (Monday, Tuesday, ..., Sunday)
    + Not great for a variable like `Name` where new values may come up
- Great for plotting as you can order the levels and give nicer labels

---

# Factors

- Let's create factor versions of our three variables

```r
unique(app_data$Sex)
```

```
## [1] "female" "male"   NA
```

```r
unique(app_data$Diagnosis)
```

```
## [1] "appendicitis"    "no appendicitis" NA
```

```r
unique(app_data$Severity)
```

```
## [1] "uncomplicated" NA              "complicated"
```

- Now we can use `factor()` or `as.factor()` to coerce the character variables

---

# Factors

- Let's create factor versions of our three variables

```r
# save the new factor columns so later slides can use them
app_data <- app_data |>
  mutate(SexF = factor(Sex, levels = c("female", "male"), labels = c("Female", "Male")),
         DiagnosisF = as.factor(Diagnosis),
         SeverityF = as.factor(Severity))
app_data |>
  select(SexF, DiagnosisF, SeverityF)
```

```
## # A tibble: 782 x 3
##   SexF   DiagnosisF      SeverityF
##   <fct>  <fct>           <fct>
## 1 Female appendicitis    uncomplicated
## 2 Male   no appendicitis uncomplicated
## 3 Female no appendicitis uncomplicated
## 4 Female no appendicitis uncomplicated
## 5 Female appendicitis    uncomplicated
## # i 777 more rows
```

---
layout: false

# Contingency Tables

- Summarize categorical data by looking at counts!

```r
app_data |>
  group_by(SexF) |>
  drop_na(SexF) |>
  summarize(count = n())
```

```
## # A tibble: 2 x 2
##   SexF   count
##   <fct>  <int>
## 1 Female   377
## 2 Male     403
```

```r
app_data |>
  group_by(DiagnosisF) |>
  drop_na(DiagnosisF) |>
  summarize(count = n())
```

```
## # A tibble: 2 x 2
##   DiagnosisF      count
##   <fct>           <int>
## 1 appendicitis      463
## 2 no appendicitis   317
```

---

# Contingency Tables

- Summarize categorical data by looking at counts across combinations of variables!
```r
app_data |>
  group_by(SexF, DiagnosisF) |>
  drop_na(SexF, DiagnosisF) |>
  summarize(count = n()) |>
  pivot_wider(names_from = DiagnosisF, values_from = count)
```

```
## # A tibble: 2 x 3
## # Groups:   SexF [2]
##   SexF   appendicitis `no appendicitis`
##   <fct>         <int>             <int>
## 1 Female          200               176
## 2 Male            262               141
```

---

# Contingency Tables

- Summarize categorical data by looking at counts across combinations of variables!

```r
app_data |>
  group_by(SexF, DiagnosisF, SeverityF) |>
  drop_na(SexF, DiagnosisF, SeverityF) |>
  summarize(count = n()) |>
  pivot_wider(names_from = DiagnosisF, values_from = count)
```

```
## # A tibble: 4 x 4
## # Groups:   SexF [2]
##   SexF   SeverityF     appendicitis `no appendicitis`
##   <fct>  <fct>                <int>             <int>
## 1 Female complicated             55                 1
## 2 Female uncomplicated          145               175
## 3 Male   complicated             63                NA
## 4 Male   uncomplicated          199               141
```

---

# Bar Charts

- Main visual used is a bar plot! Simply displays our counts with bars.

<img src="data:image/png;base64,#24-EDA_Concepts_files/figure-html/unnamed-chunk-19-1.svg" width="700px" height="400px" style="display: block; margin: auto;" />

---

# Bar Charts

- Main visual used is a bar plot! Simply displays our counts with bars.

<img src="data:image/png;base64,#24-EDA_Concepts_files/figure-html/unnamed-chunk-20-1.svg" width="700px" height="400px" style="display: block; margin: auto;" />

---

# Bar Charts

- Main visual used is a bar plot! Simply displays our counts with bars.

<img src="data:image/png;base64,#24-EDA_Concepts_files/figure-html/unnamed-chunk-21-1.svg" width="700px" height="400px" style="display: block; margin: auto;" />

---

# Numeric Data

Goal: Describe the **distribution** of the variable

- Distribution = pattern and frequency with which you observe a variable
- Numeric variable - entries are a numerical value where math can be performed

For a single numeric variable, describe the distribution via

+ Shape: Histogram, Density plot, ...
+ Measures of center: Mean, Median, ...
+ Measures of spread: Variance, Standard Deviation, Quartiles, IQR, ...

For two numeric variables, describe the distribution via

+ Shape: Scatter plot, ...
+ Measures of linear relationship: Covariance, Correlation

---

# Summarizing Center and Spread

- We summarize center and spread for a numeric variable because it is difficult to compare entire distributions!
    + Consider the distributions of `Weight` for those with appendicitis and those without

<img src="data:image/png;base64,#24-EDA_Concepts_files/figure-html/unnamed-chunk-22-1.svg" width="700px" height="400px" style="display: block; margin: auto;" />

---

# Summarizing Center and Spread

- The mean and median are good measures of a 'typical' (middle) observation

```r
app_data |>
  group_by(Diagnosis) |>
  drop_na(Diagnosis, Weight) |>
  summarize(mean_weight = mean(Weight), median_weight = median(Weight))
```

```
## # A tibble: 2 x 3
##   Diagnosis       mean_weight median_weight
##   <chr>                 <dbl>         <dbl>
## 1 appendicitis           41.7          39.5
## 2 no appendicitis        45.3          46.3
```

---

# Summarizing Center and Spread

- Of course we need to understand the variability we see as well! Variance, standard deviation, and IQR are good measures of that.

```r
app_data |>
  group_by(Diagnosis) |>
  drop_na(Diagnosis, Weight) |>
  summarize(across(Weight,
                   .fns = list("mean" = mean,
                               "median" = median,
                               "var" = var,
                               "sd" = sd,
                               "IQR" = IQR),
                   .names = "{.fn}_{.col}"))
```

```
## # A tibble: 2 x 6
##   Diagnosis       mean_Weight median_Weight var_Weight sd_Weight IQR_Weight
##   <chr>                 <dbl>         <dbl>      <dbl>     <dbl>      <dbl>
## 1 appendicitis           41.7          39.5       305.      17.5       23.4
## 2 no appendicitis        45.3          46.3       293.      17.1       23.5
```

---

# Summarizing Shape

- Most easily done via histograms and density plots
    + Histograms are more variable (their look depends on the bins chosen), which can be bad!
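
A minimal sketch of how a histogram and a density plot could be drawn with `ggplot2`. This is not the code behind the figures in these slides: `toy` is a simulated, hypothetical stand-in for `app_data`, and the `bins` and `alpha` values are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in for app_data: simulated weights for two groups
set.seed(1)
toy <- data.frame(
  Weight = c(rnorm(100, mean = 42, sd = 17), rnorm(100, mean = 45, sd = 17)),
  Diagnosis = rep(c("appendicitis", "no appendicitis"), each = 100)
)

# Histogram: the picture changes with the number of bins
hist_plot <- ggplot(toy, aes(x = Weight, fill = Diagnosis)) +
  geom_histogram(bins = 20, alpha = 0.5, position = "identity")

# Density plot: a smoothed view of the same distribution
dens_plot <- ggplot(toy, aes(x = Weight, fill = Diagnosis)) +
  geom_density(alpha = 0.5)

hist_plot
dens_plot
```

Try a few values of `bins` to see how much the histogram's shape moves around compared to the density plot.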
<div style = "float:left; width = '40%'">
<img src="data:image/png;base64,#24-EDA_Concepts_files/figure-html/unnamed-chunk-25-1.svg" width="400px" style="display: block; margin: auto;" />
</div>
<div style = "float:right; width = '40%'">
<img src="data:image/png;base64,#24-EDA_Concepts_files/figure-html/unnamed-chunk-26-1.svg" width="400px" style="display: block; margin: auto;" />
</div>

---

# Summarizing Two Numeric Variables

- To look at the distribution of two numeric variables together, we usually look at a scatter plot!

<img src="data:image/png;base64,#24-EDA_Concepts_files/figure-html/unnamed-chunk-27-1.svg" width="700px" height="400px" style="display: block; margin: auto;" />

---

# Summarizing Two Numeric Variables

- Again, difficult to describe the relationship generally!
    + Numerically we commonly describe the 'linear-ness' of the relationship
    + Done through covariance and correlation

```r
app_data |>
  drop_na(Weight, Age) |>
  summarize(cov = cov(Weight, Age), corr = cor(Weight, Age))
```

```
## # A tibble: 1 x 2
##     cov  corr
##   <dbl> <dbl>
## 1  47.0 0.766
```

---

# Summarizing Two Numeric Variables

- Of course we want to bring in subgroups to compare them!

<img src="data:image/png;base64,#24-EDA_Concepts_files/figure-html/unnamed-chunk-30-1.svg" width="700px" height="400px" style="display: block; margin: auto;" />

---

# Summarizing Two Numeric Variables

- Summarize based on groups!
```r
app_data |>
  drop_na(Weight, Age, Diagnosis) |>
  group_by(Diagnosis) |>
  summarize(cov = cov(Weight, Age), corr = cor(Weight, Age))
```

```
## # A tibble: 2 x 3
##   Diagnosis         cov  corr
##   <chr>           <dbl> <dbl>
## 1 appendicitis     48.0 0.775
## 2 no appendicitis  44.3 0.748
```

---

# Summarizing Two Numeric Variables

- We can do really interesting stuff to add in additional variables (like a third numeric variable)

<img src="data:image/png;base64,#24-EDA_Concepts_files/figure-html/unnamed-chunk-32-1.svg" width="700px" height="400px" style="display: block; margin: auto;" />

---

# Recap

- EDA is often the first step to an analysis:
    + Understand how your data is stored
    + Do basic data validation
    + Determine rate of missing values
    + Clean up data as needed
    + Investigate distributions
        - Univariate measures/graphs
        - Multivariate measures/graphs
    + Apply transformations and repeat previous step
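
The graphs in this deck (bar plots, scatter plots) appear only as rendered figures. A rough sketch of how similar plots could be built with `ggplot2` follows. It is not the original plotting code: `toy` is a small made-up data frame that borrows the column names `Sex`, `Diagnosis`, `Age`, and `Weight` for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for app_data (values are made up)
toy <- data.frame(
  Sex = c("female", "male", "male", "female", "male", "female"),
  Diagnosis = c("appendicitis", "appendicitis", "no appendicitis",
                "no appendicitis", "appendicitis", "appendicitis"),
  Age = c(12.7, 14.1, 9.3, 16.4, 11.1, 7.8),
  Weight = c(37, 69.5, 30, 56, 45, 24)
)

# Categorical data: side-by-side bar plot of counts by Sex and Diagnosis
bar_plot <- ggplot(toy, aes(x = Sex, fill = Diagnosis)) +
  geom_bar(position = "dodge")

# Two numeric variables: scatter plot, colored by subgroup
scatter_plot <- ggplot(toy, aes(x = Age, y = Weight, color = Diagnosis)) +
  geom_point()

bar_plot
scatter_plot
```

Mapping a variable to `fill` or `color` is one way to bring subgroups into a plot; a third numeric variable could be mapped to an aesthetic such as `size` in the same way.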