Exploratory Data Analysis (EDA) Concepts

---

# Recap!

- Data Science!!
- R Projects/Quarto/Git/GitHub for reproducibility/communication
- R Data Structures
    + Vectors, Matrices, Data Frames, Lists
- R Control Flow
    + if/then/else, loops, function writing
- Reading & Manipulating data with the tidyverse!
- Next: Gain meaningful insights from data through EDA
- Later: Dashboards, Predictive Modeling, & More

---

# EDA Basics

- Get to know your data!

- EDA generally consists of a few steps:
    + Understand how your data is stored
    + Do basic data validation
    + Determine rate of missing values
    + Clean up data as needed
    + Investigate distributions
        - Univariate measures/graphs
        - Multivariate measures/graphs
    + Apply transformations and repeat previous step
    
---

# Understand How Data is Stored

Let's read in some data!

- [Appendicitis Data](https://www4.stat.ncsu.edu/~online/datasets/app_data.xlsx)

> This dataset was acquired in a retrospective study from a cohort of pediatric patients admitted with abdominal pain to Children’s Hospital St. Hedwig in Regensburg, Germany. ... Alongside multiple US images for each subject, the dataset includes information encompassing laboratory tests, physical examination results, clinical scores, such as Alvarado and pediatric appendicitis scores, and expert-produced ultrasonographic findings. Lastly, the subjects were labeled w.r.t. three target variables: diagnosis (appendicitis vs. no appendicitis), management (surgical vs. conservative) and severity (complicated vs. uncomplicated or no appendicitis). ...

---

# Understand How Data is Stored

```r
# Download the data to a local data/ folder first
library(tidyverse)
library(readxl)
app_data <- read_excel("data/app_data.xlsx", sheet = 1)
```

- Column data types should match what you expect! (Below, `BMI` is read in as `<chr>`, which looks wrong for a numeric measurement.)

```r
app_data
```

```
## # A tibble: 782 x 58
##     Age BMI               Sex   Height Weight Length_of_Stay Management Severity
##   <dbl> <chr>             <chr>  <dbl>  <dbl>          <dbl> <chr>      <chr>   
## 1  12.7 16.8999999999999~ fema~    148   37                3 conservat~ uncompl~
## 2  14.1 31.9              male     147   69.5              2 conservat~ uncompl~
## 3  14.1 23.3              fema~    163   62                4 conservat~ uncompl~
## 4  16.4 20.6              fema~    165   56                3 conservat~ uncompl~
## 5  11.1 16.8999999999999~ fema~    163   45                3 conservat~ uncompl~
## # i 777 more rows
## # i 50 more variables: Diagnosis_Presumptive <chr>, Diagnosis <chr>,
## #   Alvarado_Score <dbl>, Paedriatic_Appendicitis_Score <dbl>,
## #   Appendix_on_US <chr>, Appendix_Diameter <dbl>, Migratory_Pain <chr>,
## #   Lower_Right_Abd_Pain <chr>, Contralateral_Rebound_Tenderness <chr>,
## #   Coughing_Pain <chr>, Nausea <chr>, Loss_of_Appetite <chr>,
## #   Body_Temperature <dbl>, WBC_Count <dbl>, Neutrophil_Percentage <dbl>, ...
```

---

# Understand How Data is Stored

- Check the structure of the data!

```r
str(app_data)
```

```
## tibble [782 x 58] (S3: tbl_df/tbl/data.frame)
##  $ Age                             : num [1:782] 12.7 14.1 14.1 16.4 11.1 ...
##  $ BMI                             : chr [1:782] "16.899999999999999" "31.9" "23.3" "20.6" ...
##  $ Sex                             : chr [1:782] "female" "male" "female" "female" ...
##  $ Height                          : num [1:782] 148 147 163 165 163 121 140 NA 131 174 ...
##  $ Weight                          : num [1:782] 37 69.5 62 56 45 45 38.5 21.5 26.7 45.5 ...
##  $ Length_of_Stay                  : num [1:782] 3 2 4 3 3 3 3 2 3 3 ...
##  $ Management                      : chr [1:782] "conservative" "conservative" "conservative" "conservative" ...
##  $ Severity                        : chr [1:782] "uncomplicated" "uncomplicated" "uncomplicated" "uncomplicated" ...
##  $ Diagnosis_Presumptive           : chr [1:782] "appendicitis" "appendicitis" "appendicitis" "appendicitis" ...
##  $ Diagnosis                       : chr [1:782] "appendicitis" "no appendicitis" "no appendicitis" "no appendicitis" ...
##  $ Alvarado_Score                  : num [1:782] 4 5 5 7 5 6 5 3 7 4 ...
##  $ Paedriatic_Appendicitis_Score   : num [1:782] 3 4 3 6 6 7 6 3 6 4 ...
##  $ Appendix_on_US                  : chr [1:782] "yes" "no" "no" "no" ...
##  $ Appendix_Diameter               : num [1:782] 7.1 NA NA NA 7 NA NA NA 3.7 8 ...
##  $ Migratory_Pain                  : chr [1:782] "no" "yes" "no" "yes" ...
##  $ Lower_Right_Abd_Pain            : chr [1:782] "yes" "yes" "yes" "yes" ...
##  $ Contralateral_Rebound_Tenderness: chr [1:782] "yes" "yes" "yes" "no" ...
##  $ Coughing_Pain                   : chr [1:782] "no" "no" "no" "no" ...
##  $ Nausea                          : chr [1:782] "no" "no" "no" "yes" ...
##  $ Loss_of_Appetite                : chr [1:782] "yes" "yes" "no" "yes" ...
##  $ Body_Temperature                : num [1:782] 37 36.9 36.6 36 36.9 36.9 36.7 36.8 37.3 37.1 ...
##  $ WBC_Count                       : num [1:782] 7.7 8.1 13.2 11.4 8.1 9.5 10 8 20.9 5.8 ...
##  $ Neutrophil_Percentage           : num [1:782] 68.2 64.8 74.8 63 44 71.4 69.1 79.6 76 47.2 ...
##  $ Segmented_Neutrophils           : num [1:782] NA NA NA NA NA NA NA NA NA NA ...
##  $ Neutrophilia                    : chr [1:782] "no" "no" "no" "no" ...
##  $ RBC_Count                       : num [1:782] 5.27 5.26 3.98 4.64 4.44 4.96 4.77 4.89 4.61 4.78 ...
##  $ Hemoglobin                      : num [1:782] 14.8 15.7 11.4 13.6 12.6 12.5 12.7 12 13.4 12.9 ...
##  $ RDW                             : num [1:782] 12.2 12.7 12.2 13.2 13.6 13.3 12.6 13.9 12 12.6 ...
##  $ Thrombocyte_Count               : num [1:782] 254 151 300 258 311 249 337 412 350 220 ...
##  $ Ketones_in_Urine                : chr [1:782] "++" "no" "no" "no" ...
##  $ RBC_in_Urine                    : chr [1:782] "+" "no" "no" "no" ...
##  $ WBC_in_Urine                    : chr [1:782] "no" "no" "no" "no" ...
##  $ CRP                             : num [1:782] 0 3 3 0 0 63 9 0 20 0 ...
##  $ Dysuria                         : chr [1:782] "no" "yes" "no" "yes" ...
##  $ Stool                           : chr [1:782] "normal" "normal" "constipation" "normal" ...
##  $ Peritonitis                     : chr [1:782] "no" "no" "no" "no" ...
##  $ Psoas_Sign                      : chr [1:782] "yes" "yes" "yes" "yes" ...
##  $ Ipsilateral_Rebound_Tenderness  : chr [1:782] "no" "no" "no" "no" ...
##  $ US_Performed                    : chr [1:782] "yes" "yes" "yes" "yes" ...
##  $ US_Number                       : num [1:782] 882 883 884 886 887 888 889 890 891 893 ...
##  $ Free_Fluids                     : chr [1:782] "no" "no" "no" "no" ...
##  $ Appendix_Wall_Layers            : chr [1:782] "intact" NA NA NA ...
##  $ Target_Sign                     : chr [1:782] NA NA NA NA ...
##  $ Appendicolith                   : chr [1:782] "suspected" NA NA NA ...
##  $ Perfusion                       : chr [1:782] NA NA NA NA ...
##  $ Perforation                     : chr [1:782] "no" NA NA NA ...
##  $ Surrounding_Tissue_Reaction     : chr [1:782] "yes" NA NA NA ...
##  $ Appendicular_Abscess            : chr [1:782] "no" NA NA NA ...
##  $ Abscess_Location                : chr [1:782] NA NA NA NA ...
##  $ Pathological_Lymph_Nodes        : chr [1:782] "yes" NA NA "yes" ...
##  $ Lymph_Nodes_Location            : chr [1:782] "reUB" NA NA "reUB" ...
##  $ Bowel_Wall_Thickening           : chr [1:782] NA NA NA NA ...
##  $ Conglomerate_of_Bowel_Loops     : chr [1:782] NA NA NA NA ...
##  $ Ileus                           : chr [1:782] NA NA NA NA ...
##  $ Coprostasis                     : chr [1:782] NA NA NA NA ...
##  $ Meteorism                       : chr [1:782] NA "yes" "yes" NA ...
##  $ Enteritis                       : chr [1:782] NA NA "yes" "yes" ...
##  $ Gynecological_Findings          : chr [1:782] NA NA NA NA ...
```

---

# Convert Columns Explicitly

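- The `as.*()` family of functions can help coerce columns to the correct type
- Here `BMI` becomes numeric, and `US_Number` (which appears to be an identifier rather than a quantity) becomes character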

```r
app_data <- app_data |>
  mutate(BMI = as.numeric(BMI),
         US_Number = as.character(US_Number))
app_data
```

```
## # A tibble: 782 x 58
##     Age   BMI Sex    Height Weight Length_of_Stay Management   Severity     
##   <dbl> <dbl> <chr>   <dbl>  <dbl>          <dbl> <chr>        <chr>        
## 1  12.7  16.9 female    148   37                3 conservative uncomplicated
## 2  14.1  31.9 male      147   69.5              2 conservative uncomplicated
## 3  14.1  23.3 female    163   62                4 conservative uncomplicated
## 4  16.4  20.6 female    165   56                3 conservative uncomplicated
## 5  11.1  16.9 female    163   45                3 conservative uncomplicated
## # i 777 more rows
## # i 50 more variables: Diagnosis_Presumptive <chr>, Diagnosis <chr>,
## #   Alvarado_Score <dbl>, Paedriatic_Appendicitis_Score <dbl>,
## #   Appendix_on_US <chr>, Appendix_Diameter <dbl>, Migratory_Pain <chr>,
## #   Lower_Right_Abd_Pain <chr>, Contralateral_Rebound_Tenderness <chr>,
## #   Coughing_Pain <chr>, Nausea <chr>, Loss_of_Appetite <chr>,
## #   Body_Temperature <dbl>, WBC_Count <dbl>, Neutrophil_Percentage <dbl>, ...
```
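
Coercion can silently create missing values: `as.numeric()` returns `NA` (with a warning) for any string it cannot parse. A minimal sketch of checking for that before overwriting a column, using a made-up character vector rather than the real `BMI` column:

```r
# Hypothetical character column: numbers stored as text, plus one bad entry
bmi_chr <- c("16.899999999999999", "31.9", "unknown", NA)

# suppressWarnings() hides the "NAs introduced by coercion" warning
bmi_num <- suppressWarnings(as.numeric(bmi_chr))

# Entries that were present as text but could not be parsed as numbers
newly_missing <- is.na(bmi_num) & !is.na(bmi_chr)
bmi_chr[newly_missing]
sum(newly_missing)
```

If `sum(newly_missing)` is greater than zero, it is worth inspecting those entries before replacing the original column.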

---

# Do Basic Data Validation

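- Can use the `psych::describe()` function to get a quick numeric summary of every column
- Check that the minimums, maximums, etc. all make sense!
- Rows marked with `*` are non-numeric columns that `describe()` converted to numeric codes, so their means and other statistics are not meaningful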

```r
psych::describe(app_data)
```

```
##                                   vars   n   mean     sd median trimmed    mad
## Age                                  1 781  11.35   3.53  11.44   11.53   3.59
## BMI                                  2 755  18.91   4.39  18.06   18.43   3.91
## Sex*                                 3 780   1.52   0.50   2.00    1.52   0.00
## Height                               4 756 148.02  19.73 149.65  149.33  19.50
## Weight                               5 779  43.17  17.39  41.40   42.18  18.68
## Length_of_Stay                       6 778   4.28   2.57   3.00    3.85   1.48
## Management*                          7 781   1.42   0.57   1.00    1.35   0.00
## Severity*                            8 781   1.85   0.36   2.00    1.93   0.00
## Diagnosis_Presumptive*               9 780   4.04   2.86   3.00    3.17   0.00
## Diagnosis*                          10 780   1.41   0.49   1.00    1.38   0.00
## Alvarado_Score                      11 730   5.92   2.16   6.00    5.96   2.97
## Paedriatic_Appendicitis_Score       12 730   5.25   1.96   5.00    5.21   1.48
## Appendix_on_US*                     13 777   1.65   0.48   2.00    1.69   0.00
## Appendix_Diameter                   14 498   7.76   2.54   7.50    7.63   2.22
## Migratory_Pain*                     15 773   1.27   0.45   1.00    1.22   0.00
## Lower_Right_Abd_Pain*               16 774   1.95   0.22   2.00    2.00   0.00
## Contralateral_Rebound_Tenderness*   17 767   1.39   0.49   1.00    1.36   0.00
## Coughing_Pain*                      18 766   1.28   0.45   1.00    1.23   0.00
## Nausea*                             19 774   1.59   0.49   2.00    1.61   0.00
## Loss_of_Appetite*                   20 772   1.51   0.50   2.00    1.51   0.00
## Body_Temperature                    21 775  37.40   0.90  37.20   37.36   0.74
## WBC_Count                           22 776  12.67   5.37  12.00   12.26   5.78
## Neutrophil_Percentage               23 679  71.79  14.46  75.50   72.94  14.08
## Segmented_Neutrophils               24  54  64.93  15.09  64.50   65.66  15.57
## Neutrophilia*                       25 732   1.49   0.50   1.00    1.49   0.00
## RBC_Count                           26 764   4.80   0.50   4.78    4.78   0.36
## Hemoglobin                          27 764  13.38   1.39  13.30   13.34   1.04
## RDW                                 28 756  13.18   4.54  12.70   12.81   0.74
## Thrombocyte_Count                   29 764 285.25  72.49 276.00  281.32  65.23
## Ketones_in_Urine*                   30 582   3.22   1.07   4.00    3.40   0.00
## RBC_in_Urine*                       31 576   3.43   1.10   4.00    3.66   0.00
## WBC_in_Urine*                       32 583   3.65   0.90   4.00    3.92   0.00
## CRP                                 33 771  31.39  57.43   7.00   16.79  10.38
## Dysuria*                            34 753   1.06   0.23   1.00    1.00   0.00
## Stool*                              35 765   3.49   0.97   4.00    3.73   0.00
## Peritonitis*                        36 773   2.65   0.58   3.00    2.75   0.00
## Psoas_Sign*                         37 745   1.31   0.46   1.00    1.27   0.00
## Ipsilateral_Rebound_Tenderness*     38 619   1.06   0.24   1.00    1.00   0.00
## US_Performed*                       39 778   1.98   0.14   2.00    2.00   0.00
## US_Number*                          40 760 380.50 219.54 380.50  380.50 281.69
## Free_Fluids*                        41 719   1.43   0.50   1.00    1.41   0.00
## Appendix_Wall_Layers*               42 218   1.75   0.96   1.00    1.69   0.00
## Target_Sign*                        43 138   1.63   0.48   2.00    1.66   0.00
## Appendicolith*                      44  69   2.00   0.99   2.00    2.00   1.48
## Perfusion*                          45  63   1.59   0.66   2.00    1.51   1.48
## Perforation*                        46  81   2.33   1.34   2.00    2.29   1.48
## Surrounding_Tissue_Reaction*        47 252   1.83   0.38   2.00    1.91   0.00
## Appendicular_Abscess*               48  85   1.46   0.84   1.00    1.33   0.00
## Abscess_Location*                   49  13   3.38   1.98   2.00    3.27   1.48
## Pathological_Lymph_Nodes*           50 203   1.76   0.43   2.00    1.82   0.00
## Lymph_Nodes_Location*               51 121  14.62   6.41  17.00   14.88   8.90
## Bowel_Wall_Thickening*              52  99   1.56   0.50   2.00    1.57   0.00
## Conglomerate_of_Bowel_Loops*        53  43   1.49   0.51   1.00    1.49   0.00
## Ileus*                              54  60   1.38   0.49   1.00    1.35   0.00
## Coprostasis*                        55  71   1.65   0.48   2.00    1.68   0.00
## Meteorism*                          56 140   1.92   0.27   2.00    2.00   0.00
## Enteritis*                          57  66   1.77   0.42   2.00    1.83   0.00
## Gynecological_Findings*             58  26   6.96   2.85   6.00    6.95   2.97
##                                     min    max  range  skew kurtosis   se
## Age                                0.00  18.36  18.36 -0.44    -0.18 0.13
## BMI                                7.83  38.16  30.33  1.13     1.66 0.16
## Sex*                               1.00   2.00   1.00 -0.07    -2.00 0.02
## Height                            53.00 192.00 139.00 -0.68     0.67 0.72
## Weight                             3.96 103.00  99.04  0.52     0.01 0.62
## Length_of_Stay                     1.00  28.00  27.00  3.23    17.36 0.09
## Management*                        1.00   4.00   3.00  1.00     0.22 0.02
## Severity*                          1.00   2.00   1.00 -1.93     1.73 0.01
## Diagnosis_Presumptive*             1.00  16.00  15.00  2.43     4.11 0.10
## Diagnosis*                         1.00   2.00   1.00  0.38    -1.86 0.02
## Alvarado_Score                     0.00  10.00  10.00 -0.12    -0.81 0.08
## Paedriatic_Appendicitis_Score      0.00  10.00  10.00  0.19    -0.45 0.07
## Appendix_on_US*                    1.00   2.00   1.00 -0.62    -1.62 0.02
## Appendix_Diameter                  2.70  17.00  14.30  0.50    -0.07 0.11
## Migratory_Pain*                    1.00   2.00   1.00  1.02    -0.97 0.02
## Lower_Right_Abd_Pain*              1.00   2.00   1.00 -3.98    13.89 0.01
## Contralateral_Rebound_Tenderness*  1.00   2.00   1.00  0.46    -1.79 0.02
## Coughing_Pain*                     1.00   2.00   1.00  0.95    -1.09 0.02
## Nausea*                            1.00   2.00   1.00 -0.35    -1.88 0.02
## Loss_of_Appetite*                  1.00   2.00   1.00 -0.03    -2.00 0.02
## Body_Temperature                  26.90  40.20  13.30 -1.47    22.77 0.03
## WBC_Count                          2.60  37.70  35.10  0.76     0.62 0.19
## Neutrophil_Percentage             27.20  97.70  70.50 -0.65    -0.51 0.56
## Segmented_Neutrophils             32.00  91.00  59.00 -0.33    -0.72 2.05
## Neutrophilia*                      1.00   2.00   1.00  0.03    -2.00 0.02
## RBC_Count                          3.62  14.00  10.38  8.28   149.36 0.02
## Hemoglobin                         8.20  36.00  27.80  5.56    89.57 0.05
## RDW                               11.20  86.90  75.70 14.90   229.39 0.17
## Thrombocyte_Count                 91.00 708.00 617.00  0.70     1.57 2.62
## Ketones_in_Urine*                  1.00   4.00   3.00 -1.10    -0.20 0.04
## RBC_in_Urine*                      1.00   4.00   3.00 -1.60     0.74 0.05
## WBC_in_Urine*                      1.00   4.00   3.00 -2.36     3.84 0.04
## CRP                                0.00 365.00 365.00  2.92     9.47 2.07
## Dysuria*                           1.00   2.00   1.00  3.76    12.14 0.01
## Stool*                             1.00   4.00   3.00 -1.86     2.05 0.03
## Peritonitis*                       1.00   3.00   2.00 -1.40     0.94 0.02
## Psoas_Sign*                        1.00   2.00   1.00  0.80    -1.36 0.02
## Ipsilateral_Rebound_Tenderness*    1.00   2.00   1.00  3.65    11.31 0.01
## US_Performed*                      1.00   2.00   1.00 -6.98    46.76 0.00
## US_Number*                         1.00 760.00 759.00  0.00    -1.20 7.96
## Free_Fluids*                       1.00   2.00   1.00  0.28    -1.93 0.02
## Appendix_Wall_Layers*              1.00   4.00   3.00  0.54    -1.62 0.06
## Target_Sign*                       1.00   2.00   1.00 -0.53    -1.73 0.04
## Appendicolith*                     1.00   3.00   2.00  0.00    -1.98 0.12
## Perfusion*                         1.00   4.00   3.00  0.99     1.13 0.08
## Perforation*                       1.00   4.00   3.00  0.28    -1.73 0.15
## Surrounding_Tissue_Reaction*       1.00   2.00   1.00 -1.70     0.91 0.02
## Appendicular_Abscess*              1.00   3.00   2.00  1.26    -0.38 0.09
## Abscess_Location*                  1.00   7.00   6.00  0.57    -1.37 0.55
## Pathological_Lymph_Nodes*          1.00   2.00   1.00 -1.20    -0.56 0.03
## Lymph_Nodes_Location*              1.00  24.00  23.00 -0.19    -1.21 0.58
## Bowel_Wall_Thickening*             1.00   2.00   1.00 -0.22    -1.97 0.05
## Conglomerate_of_Bowel_Loops*       1.00   2.00   1.00  0.04    -2.04 0.08
## Ileus*                             1.00   2.00   1.00  0.47    -1.81 0.06
## Coprostasis*                       1.00   2.00   1.00 -0.61    -1.66 0.06
## Meteorism*                         1.00   2.00   1.00 -3.10     7.66 0.02
## Enteritis*                         1.00   2.00   1.00 -1.27    -0.39 0.05
## Gynecological_Findings*            1.00  13.00  12.00  0.08    -0.29 0.56
```
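
Once the summary flags a suspicious minimum or maximum (for example, the `Body_Temperature` minimum of 26.9 above), the offending rows can be pulled out for a closer look. A sketch on a small made-up tibble; the plausible range of 30 to 43 is an assumption chosen for illustration, not a clinical rule:

```r
library(dplyr)

# Toy data standing in for app_data (values are invented)
toy <- tibble(
  id = 1:5,
  Body_Temperature = c(37.0, 26.9, 38.4, NA, 40.2)
)

# Assumed plausible range in degrees Celsius (illustrative only)
lower <- 30
upper <- 43

# Rows whose recorded value falls outside the assumed range
# (filter() drops rows where the condition is NA)
suspicious <- toy |>
  filter(Body_Temperature < lower | Body_Temperature > upper)
suspicious
```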

---

# Determine Rate of Missing Values

- Use `is.na()`

```r
colSums(is.na(app_data))
```

```
##                              Age                              BMI 
##                                1                               27 
##                              Sex                           Height 
##                                2                               26 
##                           Weight                   Length_of_Stay 
##                                3                                4 
##                       Management                         Severity 
##                                1                                1 
##            Diagnosis_Presumptive                        Diagnosis 
##                                2                                2 
##                   Alvarado_Score    Paedriatic_Appendicitis_Score 
##                               52                               52 
##                   Appendix_on_US                Appendix_Diameter 
##                                5                              284 
##                   Migratory_Pain             Lower_Right_Abd_Pain 
##                                9                                8 
## Contralateral_Rebound_Tenderness                    Coughing_Pain 
##                               15                               16 
##                           Nausea                 Loss_of_Appetite 
##                                8                               10 
##                 Body_Temperature                        WBC_Count 
##                                7                                6 
##            Neutrophil_Percentage            Segmented_Neutrophils 
##                              103                              728 
##                     Neutrophilia                        RBC_Count 
##                               50                               18 
##                       Hemoglobin                              RDW 
##                               18                               26 
##                Thrombocyte_Count                 Ketones_in_Urine 
##                               18                              200 
##                     RBC_in_Urine                     WBC_in_Urine 
##                              206                              199 
##                              CRP                          Dysuria 
##                               11                               29 
##                            Stool                      Peritonitis 
##                               17                                9 
##                       Psoas_Sign   Ipsilateral_Rebound_Tenderness 
##                               37                              163 
##                     US_Performed                        US_Number 
##                                4                               22 
##                      Free_Fluids             Appendix_Wall_Layers 
##                               63                              564 
##                      Target_Sign                    Appendicolith 
##                              644                              713 
##                        Perfusion                      Perforation 
##                              719                              701 
##      Surrounding_Tissue_Reaction             Appendicular_Abscess 
##                              530                              697 
##                 Abscess_Location         Pathological_Lymph_Nodes 
##                              769                              579 
##             Lymph_Nodes_Location            Bowel_Wall_Thickening 
##                              661                              683 
##      Conglomerate_of_Bowel_Loops                            Ileus 
##                              739                              722 
##                      Coprostasis                        Meteorism 
##                              711                              642 
##                        Enteritis           Gynecological_Findings 
##                              716                              756
```
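
Counts are easier to compare as rates. Since `TRUE` counts as 1, taking the column means of `is.na()` gives the proportion missing in each column. A sketch on a made-up data frame:

```r
# Toy data frame (values are invented)
toy <- data.frame(
  Age    = c(12.7, 14.1, NA, 16.4),
  Height = c(148, NA, NA, 165),
  Sex    = c("female", "male", "female", "male")
)

# Proportion of missing values per column, largest first
na_rate <- sort(colMeans(is.na(toy)), decreasing = TRUE)
na_rate
```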

---

# Determine Rate of Missing Values

- Stay in the `tidyverse`

```r
sum_na <- function(column){
  sum(is.na(column))
}
na_counts <- app_data |>
  summarize(across(everything(), sum_na))
na_counts
```

```
## # A tibble: 1 x 58
##     Age   BMI   Sex Height Weight Length_of_Stay Management Severity
##   <int> <int> <int>  <int>  <int>          <int>      <int>    <int>
## 1     1    27     2     26      3              4          1        1
## # i 50 more variables: Diagnosis_Presumptive <int>, Diagnosis <int>,
## #   Alvarado_Score <int>, Paedriatic_Appendicitis_Score <int>,
## #   Appendix_on_US <int>, Appendix_Diameter <int>, Migratory_Pain <int>,
## #   Lower_Right_Abd_Pain <int>, Contralateral_Rebound_Tenderness <int>,
## #   Coughing_Pain <int>, Nausea <int>, Loss_of_Appetite <int>,
## #   Body_Temperature <int>, WBC_Count <int>, Neutrophil_Percentage <int>,
## #   Segmented_Neutrophils <int>, Neutrophilia <int>, RBC_Count <int>, ...
```
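
A one-row tibble with one column per variable is hard to scan. One option is to pivot the counts to long format and sort them. A sketch, assuming a small made-up tibble in place of `app_data`:

```r
library(dplyr)
library(tidyr)

# Toy data (values are invented)
toy <- tibble(
  Age    = c(12.7, 14.1, NA, 16.4),
  Height = c(148, NA, NA, 165),
  Sex    = c("female", "male", "female", "male")
)

# Count missing values per column, then reshape to one row per variable
na_long <- toy |>
  summarize(across(everything(), \(x) sum(is.na(x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  arrange(desc(n_missing))
na_long
```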

---

# Clean Up Data As Needed

- Can remove rows with missing values using the `tidyr::drop_na()` function
- First, find the columns with fewer than 30 missing values

```r
names(app_data)[na_counts < 30]
```

```
##  [1] "Age"                              "BMI"                             
##  [3] "Sex"                              "Height"                          
##  [5] "Weight"                           "Length_of_Stay"                  
##  [7] "Management"                       "Severity"                        
##  [9] "Diagnosis_Presumptive"            "Diagnosis"                       
## [11] "Appendix_on_US"                   "Migratory_Pain"                  
## [13] "Lower_Right_Abd_Pain"             "Contralateral_Rebound_Tenderness"
## [15] "Coughing_Pain"                    "Nausea"                          
## [17] "Loss_of_Appetite"                 "Body_Temperature"                
## [19] "WBC_Count"                        "RBC_Count"                       
## [21] "Hemoglobin"                       "RDW"                             
## [23] "Thrombocyte_Count"                "CRP"                             
## [25] "Dysuria"                          "Stool"                           
## [27] "Peritonitis"                      "US_Performed"                    
## [29] "US_Number"
```

---

# Clean Up Data As Needed

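- Can remove rows with missing values using the `tidyr::drop_na()` function
- Here we drop any row that is missing a value in one of the columns with fewer than 30 missing values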

```r
app_data |> 
  drop_na(names(app_data)[na_counts < 30])
```

```
## # A tibble: 674 x 58
##     Age   BMI Sex    Height Weight Length_of_Stay Management   Severity     
##   <dbl> <dbl> <chr>   <dbl>  <dbl>          <dbl> <chr>        <chr>        
## 1  12.7  16.9 female    148   37                3 conservative uncomplicated
## 2  14.1  31.9 male      147   69.5              2 conservative uncomplicated
## 3  14.1  23.3 female    163   62                4 conservative uncomplicated
## 4  16.4  20.6 female    165   56                3 conservative uncomplicated
## 5  11.1  16.9 female    163   45                3 conservative uncomplicated
## # i 669 more rows
## # i 50 more variables: Diagnosis_Presumptive <chr>, Diagnosis <chr>,
## #   Alvarado_Score <dbl>, Paedriatic_Appendicitis_Score <dbl>,
## #   Appendix_on_US <chr>, Appendix_Diameter <dbl>, Migratory_Pain <chr>,
## #   Lower_Right_Abd_Pain <chr>, Contralateral_Rebound_Tenderness <chr>,
## #   Coughing_Pain <chr>, Nausea <chr>, Loss_of_Appetite <chr>,
## #   Body_Temperature <dbl>, WBC_Count <dbl>, Neutrophil_Percentage <dbl>, ...
```
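
With no columns named, `drop_na()` keeps only rows that are complete in every column, which can discard far more data than intended. Naming columns restricts the check to those columns. A sketch on made-up data:

```r
library(dplyr)
library(tidyr)

# Toy data (values are invented)
toy <- tibble(
  Age       = c(12.7, NA, 14.1, 16.4),
  Weight    = c(37, 69.5, NA, 56),
  Perfusion = c(NA, NA, NA, "present")  # mostly missing, like many ultrasound columns
)

# Only rows complete in every column survive
all_cols <- toy |> drop_na()

# Only rows missing Age or Weight are dropped
some_cols <- toy |> drop_na(Age, Weight)

nrow(all_cols)
nrow(some_cols)
```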

---

# May Want to Impute Values

- We lose information when removing rows!

- Can **impute** missing values with `tidyr::replace_na()`

```r
app_data <- app_data |> 
  replace_na(list(BMI = mean(app_data$BMI, na.rm = TRUE),
                  Height = mean(app_data$Height, na.rm = TRUE)))
app_data
```
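
Mean imputation is simple but not free: it leaves the mean unchanged while shrinking the spread, because every filled-in value sits exactly at the center. A sketch with a made-up vector:

```r
library(tidyr)

# Invented values with two missing entries
bmi <- c(16.9, 31.9, NA, 20.6, NA, 23.3)

# replace_na() also works directly on a vector
bmi_imputed <- replace_na(bmi, mean(bmi, na.rm = TRUE))

# The mean is unchanged ...
mean(bmi, na.rm = TRUE)
mean(bmi_imputed)

# ... but the standard deviation is smaller after imputation
sd(bmi, na.rm = TRUE)
sd(bmi_imputed)
```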

---

# Investigate Distributions

- How to summarize data depends on the type of data
    + Categorical (Qualitative) variable - entries are a label or attribute
    + Numeric (Quantitative) variable - entries are a numerical value where math can be performed
- Numerical summaries (across subgroups)
    + Contingency Tables (for categorical data)
    + Mean/Median
    + Standard Deviation/Variance/IQR
    + Quantiles/Percentiles
- Graphical summaries (across subgroups)
    + Bar plots (for categorical data)
    + Histograms
    + Box plots
    + Scatter plots

---

# Categorical Data

Goal: Describe the **distribution** of the variable

- Distribution = pattern and frequency with which you observe a variable
- Categorical variable - entries are a label or attribute
    + Describe the relative frequency (or count) for each category

Variables of interest for this section:

- `Sex`, `Diagnosis`, `Severity`

---

# Factors

A factor variable is really useful for certain categorical variables!

**Factor** - special class of vector with a `levels` attribute

- Can have more descriptive labels, ordering of categories, etc.
- Levels define **all** possible values for that variable
    + Great for a variable like `Day` (Monday, Tuesday, ..., Sunday)
    + Not great for a variable like `Name` where new values may come up
- Great for plotting as you can order the levels and give nicer labels
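
A small illustration of these points with a made-up `day` vector: the levels fix both the set of allowed values and their order, and categories with no observations still appear in a table.

```r
# Invented observations; note that no "Tuesday" was observed
day_chr <- c("Wednesday", "Monday", "Monday", "Friday")

day_fct <- factor(
  day_chr,
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
)

levels(day_fct)   # all possible values, in the order given
day_counts <- table(day_fct)
day_counts        # unobserved levels show up with a count of 0

# A value outside the levels becomes NA
factor("Sunday", levels = levels(day_fct))
```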

---

# Factors

- Let's create factor versions of our three variables

```r
unique(app_data$Sex)
```

```
## [1] "female" "male"   NA
```

```r
unique(app_data$Diagnosis)
```

```
## [1] "appendicitis"    "no appendicitis" NA
```

```r
unique(app_data$Severity)
```

```
## [1] "uncomplicated" NA              "complicated"
```

- Now we can use `factor()` or `as.factor()` to coerce the character variables

---

# Factors

- Let's create factor versions of our three variables

```r
# Save the factor versions so that later slides can use them
app_data <- app_data |>
  mutate(SexF = factor(Sex, levels = c("female", "male"), labels = c("Female", "Male")),
         DiagnosisF = as.factor(Diagnosis),
         SeverityF = as.factor(Severity))
app_data |>
  select(SexF, DiagnosisF, SeverityF)
```

```
## # A tibble: 782 x 3
##   SexF   DiagnosisF      SeverityF    
##   <fct>  <fct>           <fct>        
## 1 Female appendicitis    uncomplicated
## 2 Male   no appendicitis uncomplicated
## 3 Female no appendicitis uncomplicated
## 4 Female no appendicitis uncomplicated
## 5 Female appendicitis    uncomplicated
## # i 777 more rows
```

---

# Contingency Tables

- Summarize categorical data by looking at counts!

```r
app_data |>
  group_by(SexF) |>
  drop_na(SexF) |>
  summarize(count = n())
```

```
## # A tibble: 2 x 2
##   SexF   count
##   <fct>  <int>
## 1 Female   377
## 2 Male     403
```

```r
app_data |>
  group_by(DiagnosisF) |>
  drop_na(DiagnosisF) |>
  summarize(count = n())
```

```
## # A tibble: 2 x 2
##   DiagnosisF      count
##   <fct>           <int>
## 1 appendicitis      463
## 2 no appendicitis   317
```

---

# Contingency Tables

- Summarize categorical data by looking at counts across combinations of variables!

```r
app_data |>
  group_by(SexF, DiagnosisF) |>
  drop_na(SexF, DiagnosisF) |>
  summarize(count = n()) |>
  pivot_wider(names_from = DiagnosisF, values_from = count)
```

```
## # A tibble: 2 x 3
## # Groups:   SexF [2]
##   SexF   appendicitis `no appendicitis`
##   <fct>         <int>             <int>
## 1 Female          200               176
## 2 Male            262               141
```

---

# Contingency Tables

- Summarize categorical data by looking at counts across combinations of variables!

```r
app_data |>
  group_by(SexF, DiagnosisF, SeverityF) |>
  drop_na(SexF, DiagnosisF, SeverityF) |>
  summarize(count = n()) |>
  pivot_wider(names_from = DiagnosisF, values_from = count)
```

```
## # A tibble: 4 x 4
## # Groups:   SexF [2]
##   SexF   SeverityF     appendicitis `no appendicitis`
##   <fct>  <fct>                <int>             <int>
## 1 Female complicated             55                 1
## 2 Female uncomplicated          145               175
## 3 Male   complicated             63                NA
## 4 Male   uncomplicated          199               141
```
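
The `NA` in the last table is a combination that never occurred, so a count of 0 is arguably more natural. A sketch using `dplyr::count()` as shorthand for `group_by()` plus `summarize(n())`, and the `values_fill` argument of `pivot_wider()`, on made-up data:

```r
library(dplyr)
library(tidyr)

# Toy data (values are invented); no male has "no appendicitis"
toy <- tibble(
  Sex       = c("female", "female", "male", "male", "male"),
  Diagnosis = c("appendicitis", "no appendicitis",
                "appendicitis", "appendicitis", "appendicitis")
)

# Count each combination, then spread Diagnosis across columns,
# filling combinations that never occur with 0 instead of NA
two_way <- toy |>
  count(Sex, Diagnosis, name = "count") |>
  pivot_wider(names_from = Diagnosis, values_from = count, values_fill = 0)
two_way
```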

---

# Bar Charts

- Main visual used is a bar plot! Simply displays our counts with bars.
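
A minimal sketch of such a plot with `ggplot2`, using a made-up data frame in place of `app_data`. `geom_bar()` counts the rows in each category, and mapping a second variable to `fill` with `position = "dodge"` gives side-by-side bars:

```r
library(ggplot2)

# Toy data (values are invented)
toy <- data.frame(
  Sex       = c("Female", "Female", "Male", "Male", "Male"),
  Diagnosis = c("appendicitis", "no appendicitis",
                "appendicitis", "appendicitis", "no appendicitis")
)

# One bar per Sex/Diagnosis combination; bar height is the row count
p <- ggplot(toy, aes(x = Sex, fill = Diagnosis)) +
  geom_bar(position = "dodge") +
  labs(x = "Sex", y = "Count")
p
```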

---

# Numeric Data

Goal: Describe the **distribution** of the variable

- Distribution = pattern and frequency with which you observe a variable  
- Numeric variable - entries are a numerical value where math can be performed

For a single numeric variable, describe the distribution via

+ Shape: Histogram, Density plot, ...
+ Measures of center: Mean, Median, ...
+ Measures of spread: Variance, Standard Deviation, Quartiles, IQR, ...

For two numeric variables, describe the distribution via

+ Shape: Scatter plot, ...
+ Measures of linear relationship: Covariance, Correlation
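
Correlation is covariance rescaled by the two standard deviations, which makes it unit-free and bounded between -1 and 1. A quick check on made-up height and weight values:

```r
# Invented measurements (cm and kg)
height <- c(148, 147, 163, 165, 121, 140)
weight <- c(37, 69.5, 62, 56, 45, 38.5)

cov_hw <- cov(height, weight)
cor_hw <- cor(height, weight)

# Correlation equals covariance divided by the product of the SDs
all.equal(cor_hw, cov_hw / (sd(height) * sd(weight)))
```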

---

# Summarizing Center and Spread

- We summarize center and spread for a numeric variable because it is difficult to compare entire distributions!
    + Consider the distributions of `Weight` for those with appendicitis and those without

---

# Summarizing Center and Spread

- The mean and median are good measures of a typical, or 'middle', observation

```r
app_data |>
  group_by(Diagnosis) |>
  drop_na(Diagnosis, Weight) |>
  summarize(mean_weight = mean(Weight), 
            median_weight = median(Weight))
```

```
## # A tibble: 2 x 3
##   Diagnosis       mean_weight median_weight
##   <chr>                 <dbl>         <dbl>
## 1 appendicitis           41.7          39.5
## 2 no appendicitis        45.3          46.3
```

---

# Summarizing Center and Spread

- Of course we need to understand the variability we see as well! Variance, standard deviation, and IQR are good measures of that.

```r
app_data |>
  group_by(Diagnosis) |>
  drop_na(Diagnosis, Weight) |>
  summarize(across(Weight, .fns = list("mean" = mean, 
                                       "median" = median, 
                                       "var" = var, 
                                       "sd" = sd, 
                                       "IQR" = IQR), .names = "{.fn}_{.col}"))
```

```
## # A tibble: 2 x 6
##   Diagnosis       mean_Weight median_Weight var_Weight sd_Weight IQR_Weight
##   <chr>                 <dbl>         <dbl>      <dbl>     <dbl>      <dbl>
## 1 appendicitis           41.7          39.5       305.      17.5       23.4
## 2 no appendicitis        45.3          46.3       293.      17.1       23.5
```
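
The IQR is the distance between the first and third quartiles, so it can be recovered from `quantile()`. A sketch on made-up weights, which also shows how one extreme value moves the standard deviation much more than the IQR:

```r
# Invented weights in kg
weight <- c(21.5, 26.7, 37, 38.5, 45, 45.5, 56, 62, 69.5)

q <- quantile(weight, probs = c(0.25, 0.5, 0.75))
q

# IQR() matches the difference between the third and first quartiles
iqr_check <- all.equal(IQR(weight), unname(q[3] - q[1]))
iqr_check

# Add one extreme value and compare how much each measure changes
weight_out <- c(weight, 250)
c(sd = sd(weight), sd_with_outlier = sd(weight_out))
c(IQR = IQR(weight), IQR_with_outlier = IQR(weight_out))
```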

---

# Summarizing Shape

- Most easily done via histograms and density plots
    + A histogram's shape depends on the choice of bins, so it can look quite different from one binning to the next, which can be misleading!
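
To see how much a histogram depends on its bins, the same data can be binned two ways. A sketch using base R's `hist()` with `plot = FALSE` on made-up values (the break points are arbitrary choices for illustration); `density()` gives the smoothed alternative:

```r
# Invented weights in kg
weight <- c(21.5, 26.7, 37, 38.5, 45, 45.5, 56, 62, 69.5)

# Same data, two different sets of bins
wide   <- hist(weight, breaks = seq(20, 70, by = 25), plot = FALSE)
narrow <- hist(weight, breaks = seq(20, 70, by = 10), plot = FALSE)

wide$counts    # two bins
narrow$counts  # five bins, a different-looking shape

# Kernel density estimate: a smoothed view that does not use bins
dens <- density(weight)
plot(dens, main = "Density of weight (toy data)")
```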

---