Published

2025-03-31


Numerical Summaries

Justin Post

  • The usual first step in an analysis is to get to know your data (an Exploratory Data Analysis (EDA))

  • EDA generally consists of a few steps:

    • Understand how your data is stored
    • Do basic data validation
    • Determine rate of missing values
    • Clean up data as needed
    • Investigate distributions
      • Univariate measures/graphs
      • Multivariate measures/graphs
    • Apply transformations and repeat previous step

Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!


Understand How Data is Stored

First, let’s read in some data. Recall that .csv files (comma-separated value files) can be read in using the pandas read_csv() function.

We’ll read in the classic titanic data set.

import pandas as pd
titanic_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/titanic.csv")
  • The .info() method allows us to see how our variables are stored (among other things)
  • Column data types should make sense for what you expect!
titanic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   float64
 1   survived   1309 non-null   float64
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   float64
 6   parch      1309 non-null   float64
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(7), object(7)
memory usage: 143.4+ KB
  • .head() and .tail() help to see what we have as well
titanic_data.head() #clearly some missing values with NaNs
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1.0 1.0 Allen, Miss. Elisabeth Walton female 29.0000 0.0 0.0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1.0 1.0 Allison, Master. Hudson Trevor male 0.9167 1.0 2.0 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1.0 0.0 Allison, Miss. Helen Loraine female 2.0000 1.0 2.0 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1.0 0.0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1.0 2.0 113781 151.5500 C22 C26 S NaN 135.0 Montreal, PQ / Chesterville, ON
4 1.0 0.0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1.0 2.0 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
titanic_data.tail() #note the last row of NaN (not a number)
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
1305 3.0 0.0 Zabour, Miss. Thamine female NaN 1.0 0.0 2665 14.4542 NaN C NaN NaN NaN
1306 3.0 0.0 Zakarian, Mr. Mapriededer male 26.5 0.0 0.0 2656 7.2250 NaN C NaN 304.0 NaN
1307 3.0 0.0 Zakarian, Mr. Ortin male 27.0 0.0 0.0 2670 7.2250 NaN C NaN NaN NaN
1308 3.0 0.0 Zimmerman, Mr. Leo male 29.0 0.0 0.0 315082 7.8750 NaN S NaN NaN NaN
1309 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Do Basic Data Validation

  • Use the describe() method on a data frame
  • Check that the min’s, max’s, etc. all make sense!
titanic_data.describe()
pclass survived age sibsp parch fare body
count 1309.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000 121.000000
mean 2.294882 0.381971 29.881135 0.498854 0.385027 33.295479 160.809917
std 0.837836 0.486055 14.413500 1.041658 0.865560 51.758668 97.696922
min 1.000000 0.000000 0.166700 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 21.000000 0.000000 0.000000 7.895800 72.000000
50% 3.000000 0.000000 28.000000 0.000000 0.000000 14.454200 155.000000
75% 3.000000 1.000000 39.000000 1.000000 0.000000 31.275000 256.000000
max 3.000000 1.000000 80.000000 8.000000 9.000000 512.329200 328.000000
  • Recall we can subset our columns with []
titanic_data.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')
  • We can determine which percentiles of selected columns to return by combining the column subsetting via selection brackets [] and the .describe() method
titanic_data[["age", "sibsp", "parch", "fare"]].describe(percentiles = [0.05, 0.25, 0.99])
age sibsp parch fare
count 1046.000000 1309.000000 1309.000000 1308.000000
mean 29.881135 0.498854 0.385027 33.295479
std 14.413500 1.041658 0.865560 51.758668
min 0.166700 0.000000 0.000000 0.000000
5% 5.000000 0.000000 0.000000 7.225000
25% 21.000000 0.000000 0.000000 7.895800
50% 28.000000 0.000000 0.000000 14.454200
99% 65.000000 5.000000 4.000000 262.375000
max 80.000000 8.000000 9.000000 512.329200

Determine Rate of Missing Values

  • Use the .isnull() method to determine which values are missing
titanic_data.isnull()
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 False False False False False False False False False False False False True False
1 False False False False False False False False False False False False True False
2 False False False False False False False False False False False True True False
3 False False False False False False False False False False False True False False
4 False False False False False False False False False False False True True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1305 False False False False True False False False False True False True True True
1306 False False False False False False False False False True False True False True
1307 False False False False False False False False False True False True True True
1308 False False False False False False False False False True False True True True
1309 True True True True True True True True True True True True True True

1310 rows × 14 columns

  • Yikes! Can’t make heads or tails of that.
  • This is a DataFrame of booleans!
  • Use the .sum() method to see how many null values we have for each column
titanic_data.isnull().sum()
0
pclass 1
survived 1
name 1
sex 1
age 264
sibsp 1
parch 1
ticket 1
fare 2
cabin 1015
embarked 3
boat 824
body 1189
home.dest 565

  • Chaining multiple methods like this gives us a good chance to use the \ line-continuation character to create more readable, multi-line code
titanic_data.isnull() \
  .sum()
0
pclass 1
survived 1
name 1
sex 1
age 264
sibsp 1
parch 1
ticket 1
fare 2
cabin 1015
embarked 3
boat 824
body 1189
home.dest 565


Clean Up Data As Needed

  • We can remove rows with missing values using the .dropna() method
  • First, remove the cabin, boat, and body variables since they have so many missing values
    • If we just want to remove some columns, we can use the .drop() method
sub_titanic_data = titanic_data.drop(columns = ["body", "cabin", "boat"])
sub_titanic_data.shape
(1310, 11)
  • Check on the missingness now
sub_titanic_data.isnull().sum()
0
pclass 1
survived 1
name 1
sex 1
age 264
sibsp 1
parch 1
ticket 1
fare 2
embarked 3
home.dest 565

  • Now we are ready to use the .dropna() method to remove any rows with missing data
temp = sub_titanic_data.dropna()
temp.shape #notice the reduction in rows
(684, 11)
temp.isnull().sum() #no more missing values
0
pclass 0
survived 0
name 0
sex 0
age 0
sibsp 0
parch 0
ticket 0
fare 0
embarked 0
home.dest 0

  • Usually, you don’t want to drop all the rows with any missing data as you are throwing out useful info.
  • One option is to impute the missing values… this can be dangerous but can be done with the .fillna() method
sub_titanic_data.fillna(value = 0) #note, for instance, some values of age are 0 now and the last row is all 0 values
pclass survived name sex age sibsp parch ticket fare embarked home.dest
0 1.0 1.0 Allen, Miss. Elisabeth Walton female 29.0000 0.0 0.0 24160 211.3375 S St Louis, MO
1 1.0 1.0 Allison, Master. Hudson Trevor male 0.9167 1.0 2.0 113781 151.5500 S Montreal, PQ / Chesterville, ON
2 1.0 0.0 Allison, Miss. Helen Loraine female 2.0000 1.0 2.0 113781 151.5500 S Montreal, PQ / Chesterville, ON
3 1.0 0.0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1.0 2.0 113781 151.5500 S Montreal, PQ / Chesterville, ON
4 1.0 0.0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1.0 2.0 113781 151.5500 S Montreal, PQ / Chesterville, ON
... ... ... ... ... ... ... ... ... ... ... ...
1305 3.0 0.0 Zabour, Miss. Thamine female 0.0000 1.0 0.0 2665 14.4542 C 0
1306 3.0 0.0 Zakarian, Mr. Mapriededer male 26.5000 0.0 0.0 2656 7.2250 C 0
1307 3.0 0.0 Zakarian, Mr. Ortin male 27.0000 0.0 0.0 2670 7.2250 C 0
1308 3.0 0.0 Zimmerman, Mr. Leo male 29.0000 0.0 0.0 315082 7.8750 S 0
1309 0.0 0.0 0 0 0.0000 0.0 0.0 0 0.0000 0 0

1310 rows × 11 columns

  • Can set the value to impute for each column by passing a dictionary of column/value pairs
sub_titanic_data.fillna(value = {"home.dest": "Unknown", "age": 200})
pclass survived name sex age sibsp parch ticket fare embarked home.dest
0 1.0 1.0 Allen, Miss. Elisabeth Walton female 29.0000 0.0 0.0 24160 211.3375 S St Louis, MO
1 1.0 1.0 Allison, Master. Hudson Trevor male 0.9167 1.0 2.0 113781 151.5500 S Montreal, PQ / Chesterville, ON
2 1.0 0.0 Allison, Miss. Helen Loraine female 2.0000 1.0 2.0 113781 151.5500 S Montreal, PQ / Chesterville, ON
3 1.0 0.0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1.0 2.0 113781 151.5500 S Montreal, PQ / Chesterville, ON
4 1.0 0.0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1.0 2.0 113781 151.5500 S Montreal, PQ / Chesterville, ON
... ... ... ... ... ... ... ... ... ... ... ...
1305 3.0 0.0 Zabour, Miss. Thamine female 200.0000 1.0 0.0 2665 14.4542 C Unknown
1306 3.0 0.0 Zakarian, Mr. Mapriededer male 26.5000 0.0 0.0 2656 7.2250 C Unknown
1307 3.0 0.0 Zakarian, Mr. Ortin male 27.0000 0.0 0.0 2670 7.2250 C Unknown
1308 3.0 0.0 Zimmerman, Mr. Leo male 29.0000 0.0 0.0 315082 7.8750 S Unknown
1309 NaN NaN NaN NaN 200.0000 NaN NaN NaN NaN NaN Unknown

1310 rows × 11 columns
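
A common alternative to an arbitrary value like 0, shown here only as an illustrative sketch (we won’t use the result later), is to impute a numeric column with one of its own summary statistics, such as the median:

#illustrative sketch: fill missing ages with the median age (not used in the rest of this lesson)
sub_titanic_data.age.fillna(value = sub_titanic_data.age.median())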


Investigate distributions

  • How to summarize data depends on the type of data

    • Categorical (Qualitative) variable - entries are a label or attribute
    • Numeric (Quantitative) variable - entries are a numerical value where math can be performed
  • Numerical summaries (across subgroups)

    • Contingency Tables (for categorical data)
    • Mean/Median
    • Standard Deviation/Variance/IQR
    • Quantiles/Percentiles
  • Graphical summaries (across subgroups)

    • Bar plots (for categorical data)
    • Histograms
    • Box plots
    • Scatter plots

Categorical Data

Goal: Describe the distribution of the variable

  • Distribution = pattern and frequency with which you observe a variable
  • Categorical variable - entries are a label or attribute
    • Describe the relative frequency (or count) for each category
    • Using pandas .value_counts() method and crosstab() function

Variable of interest for this section:

  • embarked (where the journey started)

sub_titanic_data.embarked[0:2]
embarked
0 S
1 S

type(sub_titanic_data.embarked[0])
str

The str type isn’t ideal for summarizations. A different data type is better!

Category Type Variables

A category type variable is really useful for categorical variables.

  • Akin to a factor variable in R (if you know those)
  • Can have more descriptive labels, ordering of categories, etc. (a quick sketch of an ordered category is shown below)
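
For instance, here is a quick sketch (separate from the code we use below) of storing pclass as an ordered category; the "1st"/"2nd"/"3rd" labels are just illustrative choices:

#illustrative sketch: treat pclass as an ordered category (1st < 2nd < 3rd)
ordered_class = pd.CategoricalDtype(categories = [1.0, 2.0, 3.0], ordered = True)
pclass_cat = sub_titanic_data.pclass.astype(ordered_class)
pclass_cat = pclass_cat.cat.rename_categories(["1st", "2nd", "3rd"])
pclass_cat.head()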

Let’s give the embarked variable more descriptive values by converting it to a category type and manipulating it that way.

sub_titanic_data["embarkedC"] = sub_titanic_data.embarked.astype("category")
sub_titanic_data.embarkedC[0:2]
embarkedC
0 S
1 S

  • Now we can use the .cat.rename_categories() method on this category variable
sub_titanic_data.embarkedC = sub_titanic_data.embarkedC.cat.rename_categories(["Cherbourg", "Queenstown", "Southampton"])
sub_titanic_data.embarkedC[0:2]
embarkedC
0 Southampton
1 Southampton

Way better! Now let’s grab two more categorical variables and do similar things:

  • sex (Male or Female)
  • survived (survived or died)
#convert sex variable
sub_titanic_data["sexC"] = sub_titanic_data.sex.astype("category")
sub_titanic_data.sexC = sub_titanic_data.sexC.cat.rename_categories(["Female", "Male"])
#convert survived variable
sub_titanic_data["survivedC"] = sub_titanic_data.survived.astype("category")
sub_titanic_data.survivedC = sub_titanic_data.survivedC.cat.rename_categories(["Died", "Survived"])

Contingency tables

  • Tables of counts are the main numerical summary for categorical data
  • Create one-way contingency tables (.value_counts() method) (one-way because we are looking at one variable at a time)
sub_titanic_data.embarkedC.value_counts(dropna = False)
count
embarkedC
Southampton 914
Cherbourg 270
Queenstown 123
NaN 3

sub_titanic_data.survivedC.value_counts()
count
survivedC
Died 809
Survived 500

sub_titanic_data.sexC.value_counts()
count
sexC
Male 843
Female 466
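
Since the goal is to describe the relative frequency (or count) of each category, note as a quick aside that .value_counts() can also return proportions rather than counts by passing normalize = True:

#proportions rather than counts
sub_titanic_data.sexC.value_counts(normalize = True)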

  • Alternatively, we can find a one-way table using the pd.crosstab() function
    • This function is meant to take two (or more) columns and return tabulations between them
    • We can define a dummy variable to cross with
    • index argument is the row variable and columns argument is the column variable
sub_titanic_data["dummy"] = 0
pd.crosstab(index = sub_titanic_data.embarkedC, columns = sub_titanic_data.dummy)
dummy 0
embarkedC
Cherbourg 270
Queenstown 123
Southampton 914
pd.crosstab(index = sub_titanic_data.sexC, columns = sub_titanic_data.dummy)
dummy 0
sexC
Female 466
Male 843
  • To summarize two categorical variables together, we use a two-way contingency table
  • Now the crosstab() function can be used more naturally
pd.crosstab(
  sub_titanic_data.embarkedC, #index variable
  sub_titanic_data.survivedC) #column variable
survivedC Died Survived
embarkedC
Cherbourg 120 150
Queenstown 79 44
Southampton 610 304
pd.crosstab(
  sub_titanic_data.sexC,
  sub_titanic_data.survivedC)
survivedC Died Survived
sexC
Female 127 339
Male 682 161
  • Add marginal totals with margins = True argument
pd.crosstab(
  sub_titanic_data.embarkedC,
  sub_titanic_data.survivedC,
  margins = True)
survivedC Died Survived All
embarkedC
Cherbourg 120 150 270
Queenstown 79 44 123
Southampton 610 304 914
All 809 498 1307
  • Add row and columns names for clarity
    • Use rownames and colnames arguments
pd.crosstab(
  sub_titanic_data.embarkedC,
  sub_titanic_data.survivedC,
  margins = True,
  rownames = ["Embarked Port"],
  colnames = ["Survival Status"]
  )
Survival Status Died Survived All
Embarked Port
Cherbourg 120 150 270
Queenstown 79 44 123
Southampton 610 304 914
All 809 498 1307

That looks great!
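
As an aside (we won’t rely on this later), pd.crosstab() also accepts a normalize argument if proportions are easier to interpret than counts; for example, normalize = "index" gives proportions within each row:

#row proportions instead of counts (illustrative)
pd.crosstab(
  sub_titanic_data.embarkedC,
  sub_titanic_data.survivedC,
  normalize = "index")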

For more than two variables we can create tables but they get harder to read. For instance, we can look at a three-way contingency table:

pd.crosstab(
  [sub_titanic_data.embarkedC, sub_titanic_data.survivedC], #pass a list of columns for the rows
  sub_titanic_data.sexC,
  margins = True)
sexC Female Male All
embarkedC survivedC
Cherbourg Died 11 109 120
Survived 102 48 150
Queenstown Died 23 56 79
Survived 37 7 44
Southampton Died 93 517 610
Survived 198 106 304
All 464 843 1307
  • We can add in names for more clarity
my_tab = pd.crosstab(
  [sub_titanic_data.embarkedC, sub_titanic_data.survivedC],
  sub_titanic_data.sexC,
  margins = True,
  rownames = ['Embarked Port', 'Survival Status'], #a list similar to how the rows were passed
  colnames = ['Sex'])
my_tab
Sex Female Male All
Embarked Port Survival Status
Cherbourg Died 11 109 120
Survived 102 48 150
Queenstown Died 23 56 79
Survived 37 7 44
Southampton Died 93 517 610
Survived 198 106 304
All 464 843 1307

We might want to subset the returned table to get certain values…

  • Note that the crosstab() function returns a data frame!
type(my_tab)
pandas.core.frame.DataFrame
my_tab.columns # columns of the data frame
Index(['Female', 'Male', 'All'], dtype='object', name='Sex')
my_tab.index #rows of the data frame, these are tuples!
MultiIndex([(  'Cherbourg',     'Died'),
            (  'Cherbourg', 'Survived'),
            ( 'Queenstown',     'Died'),
            ( 'Queenstown', 'Survived'),
            ('Southampton',     'Died'),
            ('Southampton', 'Survived'),
            (        'All',         '')],
           names=['Embarked Port', 'Survival Status'])
  • Can obtain conditional bivariate info via subsetting!

  • The MultiIndex can be tough but let’s look at some examples

  • Below returns the embarked vs survived table for females

my_tab["Female"]
Female
Embarked Port Survival Status
Cherbourg Died 11
Survived 102
Queenstown Died 23
Survived 37
Southampton Died 93
Survived 198
All 464

my_tab.loc[:, "Female"] #.loc way of doing this, : gives all of that index
Female
Embarked Port Survival Status
Cherbourg Died 11
Survived 102
Queenstown Died 23
Survived 37
Southampton Died 93
Survived 198
All 464

  • Below returns the sex vs embarked table for those that died
my_tab.iloc[0:5:2, :] #0:5:2 is shorthand for rows 0, 2, and 4 (start at 0, stop before 5, step by 2)
Sex Female Male All
Embarked Port Survival Status
Cherbourg Died 11 109 120
Queenstown Died 23 56 79
Southampton Died 93 517 610
  • Using .loc[] is better
  • Must understand our MultiIndex
my_tab.index
MultiIndex([(  'Cherbourg',     'Died'),
            (  'Cherbourg', 'Survived'),
            ( 'Queenstown',     'Died'),
            ( 'Queenstown', 'Survived'),
            ('Southampton',     'Died'),
            ('Southampton', 'Survived'),
            (        'All',         '')],
           names=['Embarked Port', 'Survival Status'])
  • Below uses this index to return the sex vs embarked table for those that died
my_tab.loc[(("Cherbourg", "Queenstown", "Southampton"), "Died"), :]
Sex Female Male All
Embarked Port Survival Status
Cherbourg Died 11 109 120
Queenstown Died 23 56 79
Southampton Died 93 517 610
  • Below returns the sex vs survived table for embarked of Cherbourg
my_tab.loc[('Cherbourg', ("Died", "Survived")), :]
Sex Female Male All
Embarked Port Survival Status
Cherbourg Died 11 109 120
Survived 102 48 150
  • Return the sex table for those that died and embarked at Cherbourg
    • First with .iloc[] then with .loc[]
my_tab.iloc[0, :]
Cherbourg
Died
Sex
Female 11
Male 109
All 120

my_tab.loc[('Cherbourg', 'Died')]
Cherbourg
Died
Sex
Female 11
Male 109
All 120


Numeric Data

Goal: Describe the distribution of the variable

  • Distribution = pattern and frequency with which you observe a variable
  • Numeric variable - entries are a numerical value where math can be performed

For a single numeric variable, describe the distribution via

  • Shape: Histogram, Density plot, … (covered later)
  • Measures of center: Mean, Median, …
  • Measures of spread: Variance, Standard Deviation, Quartiles, IQR, …

For two numeric variables, describe the distribution via

  • Shape: Scatter plot, …
  • Measures of linear relationship: Covariance, Correlation, …

Measures of Center

  • Find mean and median with methods on a Series
type(sub_titanic_data['fare'])
pandas.core.series.Series
  • Corresponding methods exist for the common numerical summaries
sub_titanic_data['fare'].mean()
33.29547928134557
sub_titanic_data['fare'].median()
14.4542
sub_titanic_data.age.mean() #same thing with a different way to get a column
29.8811345124283
sub_titanic_data.age.median()
28.0

Measures of Spread

  • Standard Deviation, Quartiles, & IQR found with Series methods as well
sub_titanic_data.age.std()
14.413499699923594
sub_titanic_data.age.quantile(q = [0.2, 0.25, 0.5, 0.95])
age
0.20 19.0
0.25 21.0
0.50 28.0
0.95 57.0

q1 = sub_titanic_data.age.quantile(q = [0.25])
q1
age
0.25 21.0

q3 = sub_titanic_data.age.quantile(q = [0.75])
q3
age
0.75 39.0

type(q1)
pandas.core.series.Series
  • As both q1 and q3 are Series, they have indices
  • This makes them a little more difficult than you might like to subtract (to find the IQR)
q3-q1 #doesn't work as intended because the index values (0.25 vs 0.75) don't align
age
0.25 NaN
0.75 NaN

q3[0.75] - q1[0.25] #grab the values by index names and subtract those
18.0
  • Alternatively, remember that the .values attribute returns a numpy array. We can subtract those.
q3.values - q1.values
array([18.])
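
As a side note, passing a single quantile (a scalar rather than a list) to .quantile() returns a plain number, which avoids the alignment issue entirely:

sub_titanic_data.age.quantile(0.75) - sub_titanic_data.age.quantile(0.25)
18.0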

Measures of Linear Relationship

  • Correlation via the .corr() method on a data frame
  • This gives the correlations between any numerically stored variables that are passed
    • Just because it is stored numerically doesn’t mean we should treat it numerically!
sub_titanic_data[["age", "fare", "sibsp", "parch"]].corr()
age fare sibsp parch
age 1.000000 0.178739 -0.243699 -0.150917
fare 0.178739 1.000000 0.160238 0.221539
sibsp -0.243699 0.160238 1.000000 0.373587
parch -0.150917 0.221539 0.373587 1.000000
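
Covariance was listed alongside correlation earlier; as a quick sketch, the corresponding .cov() method is used the same way (just remember covariances depend on each variable's scale, so they're harder to compare across pairs):

#covariance matrix for the same columns (scale-dependent, unlike correlation)
sub_titanic_data[["age", "fare", "sibsp", "parch"]].cov()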

Summaries Across Groups

Usually we want summaries for different subgroups of the data

Two approaches we’ll cover:

  • Use the .groupby() method and then use a summarization method
  • Use the pd.crosstab() function with the aggfunc argument

.groupby() Examples

Example: Get similar fare summaries for each survival status

sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()
age fare sibsp parch
survivedC
Died 30.545369 23.353831 0.521632 0.328801
Survived 28.918228 49.361184 0.462000 0.476000
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].std()
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].std()
age fare sibsp parch
survivedC
Died 13.922539 34.145096 1.210449 0.912332
Survived 15.061481 68.648795 0.685197 0.776292
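
The FutureWarning above appears because we are grouping by a category type column; as a quick aside (not part of the original code), passing observed = True to .groupby() silences it and only reports groups that actually occur in the data:

#observed = True avoids the FutureWarning when grouping by a category column
sub_titanic_data.groupby("survivedC", observed = True)[["age", "fare", "sibsp", "parch"]].mean()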
  • .unstack() method on the result can sometimes make the output clearer
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack()
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack()
0
survivedC
age Died 30.545369
Survived 28.918228
fare Died 23.353831
Survived 49.361184
sibsp Died 0.521632
Survived 0.462000
parch Died 0.328801
Survived 0.476000

  • Multiple grouping variables can be given as a list

    • Example: Get summary for numeric type variables for each survival status and embarked port
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean()
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean()
age fare sibsp parch
survivedC embarkedC
Died Cherbourg 34.468750 40.255592 0.316667 0.225000
Queenstown 30.202703 11.615349 0.379747 0.177215
Southampton 29.945385 21.546160 0.580328 0.368852
Survived Cherbourg 31.037248 80.000807 0.466667 0.486667
Queenstown 24.153846 13.833998 0.272727 0.000000
Southampton 27.989881 39.183470 0.490132 0.542763
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std()
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std()
age fare sibsp parch
survivedC embarkedC
Died Cherbourg 14.655181 56.553704 0.518293 0.557040
Queenstown 16.785187 10.922240 1.016578 0.655538
Southampton 13.496871 28.786020 1.320897 0.990934
Survived Cherbourg 15.523752 97.642219 0.575410 0.730327
Queenstown 7.057457 17.503850 0.585230 0.000000
Southampton 14.926867 47.656409 0.744552 0.831405
  • As our code gets longer, this is a good place to use \ to extend our code down a line
sub_titanic_data \
  .groupby(["survivedC", "embarkedC"]) \
   [["age", "fare", "sibsp", "parch"]] \
   .mean()
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  .groupby(["survivedC", "embarkedC"]) \
age fare sibsp parch
survivedC embarkedC
Died Cherbourg 34.468750 40.255592 0.316667 0.225000
Queenstown 30.202703 11.615349 0.379747 0.177215
Southampton 29.945385 21.546160 0.580328 0.368852
Survived Cherbourg 31.037248 80.000807 0.466667 0.486667
Queenstown 24.153846 13.833998 0.272727 0.000000
Southampton 27.989881 39.183470 0.490132 0.542763

pd.crosstab() Examples

  • Alternatively, we can use the pd.crosstab() function with the aggfunc argument to define the summary to produce

Example: Get summary for numeric type variables for each survival status

  • A bit awkward in this case as we don’t really have a ‘column’ variable
  • Make a dummy variable for that
pd.crosstab(
  sub_titanic_data.survivedC,
  columns = ["mean" for _ in range(sub_titanic_data.shape[0])], #create variable with only the value 'mean'
  values = sub_titanic_data.fare,
  aggfunc = 'mean')
col_0 mean
survivedC
Died 23.353831
Survived 49.361184
  • Can return multiple summaries at once by passing them as a list
pd.crosstab(
  sub_titanic_data.survivedC,
  columns = ["stat" for _ in range(sub_titanic_data.shape[0])],
  values = sub_titanic_data.fare,
  aggfunc = ['mean', 'median', 'std', 'count'])
mean median std count
col_0 stat stat stat stat
survivedC
Died 23.353831 10.5 34.145096 808
Survived 49.361184 26.0 68.648795 500
  • More natural with two grouping variables

    • Example: Get summary for numeric type variables for each survival status and embarked port
pd.crosstab(
  sub_titanic_data.embarkedC,
  sub_titanic_data.survivedC,
  values = sub_titanic_data.fare,
  aggfunc = ['mean', 'count'])
mean count
survivedC Died Survived Died Survived
embarkedC
Cherbourg 40.255592 80.000807 120 150
Queenstown 11.615349 13.833998 79 44
Southampton 21.546160 39.183470 609 304

Quick Video

This video shows an example of reading in some data and finding numeric summaries!

We’ll look at the .pivot_table() and .agg() methods.

Remember to pop the video out into the full player.

The notebook written in the video is available here.

from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=197e5481-40be-4488-8271-b0ff000f5fd2&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")
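
If you’d like a preview before watching, here is a minimal sketch (not the video’s notebook) of what .agg() and .pivot_table() calls can look like on this data:

#several summaries of fare for each survival status via .agg()
sub_titanic_data.groupby("survivedC", observed = True)["fare"].agg(["mean", "median", "std"])

#mean fare for each embarked port and survival status via .pivot_table()
sub_titanic_data.pivot_table(
  values = "fare",
  index = "embarkedC",
  columns = "survivedC",
  aggfunc = "mean",
  observed = True)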

Recap

EDA is often the first step to an analysis:

  • Must understand the type of data you have/missingness/data validation

  • Then describe the distributions of the variables

  • Numerical summaries

    • Contingency Tables: pd.crosstab()
    • Mean/Median: .mean(), .median() methods on a data frame
    • Standard Deviation/quantiles: .std(), .quantile() methods
  • Across subgroups with the .groupby() method or pd.crosstab() with the values and aggfunc arguments

  • You can fancy up output too!

That wraps up the material for week 3! Head to the Moodle site to work on your next homework assignment.

If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!

If you are on Google Colab, head back to our course website for our next lesson!