Numerical Summaries
Justin Post
The usual first step in an analysis is to get to know your data (an Exploratory Data Analysis (EDA))
EDA generally consists of a few steps:
- Understand how your data is stored
- Do basic data validation
- Determine rate of missing values
- Clean up data as needed
- Investigate distributions
- Univariate measures/graphs
- Multivariate measures/graphs
- Apply transformations and repeat previous step
Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!
Understand How Data is Stored
First, let’s read in some data. Recall that .csv files (comma separated value files) can be read in using the pandas read_csv() function.
We’ll read in the classic titanic data set.
import pandas as pd
titanic_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/titanic.csv")
- The .info() method allows us to see how our variables are stored (among other things)
- Column data types should make sense for what you expect!
titanic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null float64
1 survived 1309 non-null float64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null float64
6 parch 1309 non-null float64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
dtypes: float64(7), object(7)
memory usage: 143.4+ KB
- .head() and .tail() help to see what we have as well
titanic_data.head() #clearly some missing values with NaNs
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
titanic_data.tail() #note the last row of NaN (not a number)
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1305 | 3.0 | 0.0 | Zabour, Miss. Thamine | female | NaN | 1.0 | 0.0 | 2665 | 14.4542 | NaN | C | NaN | NaN | NaN |
| 1306 | 3.0 | 0.0 | Zakarian, Mr. Mapriededer | male | 26.5 | 0.0 | 0.0 | 2656 | 7.2250 | NaN | C | NaN | 304.0 | NaN |
| 1307 | 3.0 | 0.0 | Zakarian, Mr. Ortin | male | 27.0 | 0.0 | 0.0 | 2670 | 7.2250 | NaN | C | NaN | NaN | NaN |
| 1308 | 3.0 | 0.0 | Zimmerman, Mr. Leo | male | 29.0 | 0.0 | 0.0 | 315082 | 7.8750 | NaN | S | NaN | NaN | NaN |
| 1309 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Do Basic Data Validation
- Use the .describe() method on a data frame
- Check that the min’s, max’s, etc. all make sense!
titanic_data.describe()
| | pclass | survived | age | sibsp | parch | fare | body |
|---|---|---|---|---|---|---|---|
| count | 1309.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 121.000000 |
| mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 0.486055 | 14.413500 | 1.041658 | 0.865560 | 51.758668 | 97.696922 |
| min | 1.000000 | 0.000000 | 0.166700 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 72.000000 |
| 50% | 3.000000 | 0.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 155.000000 |
| 75% | 3.000000 | 1.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 256.000000 |
| max | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 328.000000 |
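If you want to go beyond eyeballing this output, a few quick boolean checks can validate ranges programmatically. A minimal sketch, where the plausible ranges are our own assumptions rather than part of the original analysis:

(titanic_data.age.dropna() >= 0).all() #ages should be non-negative
(titanic_data.fare.dropna() >= 0).all() #fares should be non-negative
titanic_data.pclass.dropna().isin([1.0, 2.0, 3.0]).all() #passenger class should be 1, 2, or 3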
- Recall we can subset our columns with []
titanic_data.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
dtype='object')
- We can determine which percentiles of selected columns to return by combining column subsetting via the selection brackets [] with the .describe() method
titanic_data[["age", "sibsp", "parch", "fare"]].describe(percentiles = [0.05, 0.25, 0.99])| age | sibsp | parch | fare | |
|---|---|---|---|---|
| count | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 |
| mean | 29.881135 | 0.498854 | 0.385027 | 33.295479 |
| std | 14.413500 | 1.041658 | 0.865560 | 51.758668 |
| min | 0.166700 | 0.000000 | 0.000000 | 0.000000 |
| 5% | 5.000000 | 0.000000 | 0.000000 | 7.225000 |
| 25% | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
| 50% | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 99% | 65.000000 | 5.000000 | 4.000000 | 262.375000 |
| max | 80.000000 | 8.000000 | 9.000000 | 512.329200 |
Determine Rate of Missing Values
- Use the .isnull() method to determine the missing values
titanic_data.isnull()
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | True | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | True | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | True | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | False | False | False | False | True | False | False | False | False | True | False | True | True | True |
| 1306 | False | False | False | False | False | False | False | False | False | True | False | True | False | True |
| 1307 | False | False | False | False | False | False | False | False | False | True | False | True | True | True |
| 1308 | False | False | False | False | False | False | False | False | False | True | False | True | True | True |
| 1309 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
1310 rows × 14 columns
- Yikes! Can’t make heads or tails of that.
- This is a DataFrame of booleans!
- Use the .sum() method to see how many null values we have for each column
titanic_data.isnull().sum()
| | 0 |
|---|---|
| pclass | 1 |
| survived | 1 |
| name | 1 |
| sex | 1 |
| age | 264 |
| sibsp | 1 |
| parch | 1 |
| ticket | 1 |
| fare | 2 |
| cabin | 1015 |
| embarked | 3 |
| boat | 824 |
| body | 1189 |
| home.dest | 565 |
- This kind of chained method use gives us a good chance to use the \ operator to create more readable code by making it multi-line
titanic_data.isnull() \
.sum()
| | 0 |
|---|---|
| pclass | 1 |
| survived | 1 |
| name | 1 |
| sex | 1 |
| age | 264 |
| sibsp | 1 |
| parch | 1 |
| ticket | 1 |
| fare | 2 |
| cabin | 1015 |
| embarked | 3 |
| boat | 824 |
| body | 1189 |
| home.dest | 565 |
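Since this section is about the rate of missing values, note that calling .mean() instead of .sum() on the boolean DataFrame returns the proportion of missing values in each column (for instance, cabin is missing in roughly 77% of rows):

titanic_data.isnull().mean() #proportion of missing values in each column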
Clean Up Data As Needed
- We can remove rows with missing values using the .dropna() method
- First, remove the cabin, boat, and body variables since they have so many missing values
  - If we want to just remove some columns, we can use the .drop() method
sub_titanic_data = titanic_data.drop(columns = ["body", "cabin", "boat"])
sub_titanic_data.shape
(1310, 11)
- Check on the missingness now
sub_titanic_data.isnull().sum()
| | 0 |
|---|---|
| pclass | 1 |
| survived | 1 |
| name | 1 |
| sex | 1 |
| age | 264 |
| sibsp | 1 |
| parch | 1 |
| ticket | 1 |
| fare | 2 |
| embarked | 3 |
| home.dest | 565 |
- Now we are ready to use the .dropna() method to remove any rows with missing data
temp = sub_titanic_data.dropna()
temp.shape #notice the reduction in rows
(684, 11)
temp.isnull().sum() #no more missing values
| | 0 |
|---|---|
| pclass | 0 |
| survived | 0 |
| name | 0 |
| sex | 0 |
| age | 0 |
| sibsp | 0 |
| parch | 0 |
| ticket | 0 |
| fare | 0 |
| embarked | 0 |
| home.dest | 0 |
- Usually, you don’t want to drop all the rows with any missing data as you are throwing out useful info.
- One option is to impute the missing values… this can be dangerous but can be done with the .fillna() method
sub_titanic_data.fillna(value = 0) #note, for instance, some values of age are 0 now and the last row is all 0 values
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | embarked | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | S | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | 3.0 | 0.0 | Zabour, Miss. Thamine | female | 0.0000 | 1.0 | 0.0 | 2665 | 14.4542 | C | 0 |
| 1306 | 3.0 | 0.0 | Zakarian, Mr. Mapriededer | male | 26.5000 | 0.0 | 0.0 | 2656 | 7.2250 | C | 0 |
| 1307 | 3.0 | 0.0 | Zakarian, Mr. Ortin | male | 27.0000 | 0.0 | 0.0 | 2670 | 7.2250 | C | 0 |
| 1308 | 3.0 | 0.0 | Zimmerman, Mr. Leo | male | 29.0000 | 0.0 | 0.0 | 315082 | 7.8750 | S | 0 |
| 1309 | 0.0 | 0.0 | 0 | 0 | 0.0000 | 0.0 | 0.0 | 0 | 0.0000 | 0 | 0 |
1310 rows × 11 columns
- Can set the value you want to impute by passing a dictionary of key/value pairs
sub_titanic_data.fillna(value = {"home.dest": "Unknown", "age": 200})| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | embarked | home.dest | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | S | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | 3.0 | 0.0 | Zabour, Miss. Thamine | female | 200.0000 | 1.0 | 0.0 | 2665 | 14.4542 | C | Unknown |
| 1306 | 3.0 | 0.0 | Zakarian, Mr. Mapriededer | male | 26.5000 | 0.0 | 0.0 | 2656 | 7.2250 | C | Unknown |
| 1307 | 3.0 | 0.0 | Zakarian, Mr. Ortin | male | 27.0000 | 0.0 | 0.0 | 2670 | 7.2250 | C | Unknown |
| 1308 | 3.0 | 0.0 | Zimmerman, Mr. Leo | male | 29.0000 | 0.0 | 0.0 | 315082 | 7.8750 | S | Unknown |
| 1309 | NaN | NaN | NaN | NaN | 200.0000 | NaN | NaN | NaN | NaN | NaN | Unknown |
1310 rows × 11 columns
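A common, often less distortionary alternative is to impute a column with one of its own summaries, such as the median. A sketch (whether this is sensible depends on your analysis):

sub_titanic_data.fillna(value = {"age": sub_titanic_data.age.median()}) #impute age with its median (28.0)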
Investigate distributions
How to summarize data depends on the type of data
- Categorical (Qualitative) variable - entries are a label or attribute
- Numeric (Quantitative) variable - entries are a numerical value where math can be performed
Numerical summaries (across subgroups)
- Contingency Tables (for categorical data)
- Mean/Median
- Standard Deviation/Variance/IQR
- Quantiles/Percentiles
Graphical summaries (across subgroups)
- Bar plots (for categorical data)
- Histograms
- Box plots
- Scatter plots
Categorical Data
Goal: Describe the distribution of the variable
- Distribution = pattern and frequency with which you observe a variable
- Categorical variable - entries are a label or attribute
- Describe the relative frequency (or count) for each category
- Using the pandas .value_counts() method and crosstab() function
Variables of interest for this section:
- embarked (where the journey started)
sub_titanic_data.embarked[0:2]
| | embarked |
|---|---|
| 0 | S |
| 1 | S |
type(sub_titanic_data.embarked[0])
str
The str type isn’t ideal for summarizations. A different data type is better!
Category Type Variables
A category type variable is really useful for categorical variables.
- Akin to a factor variable in R (if you know those)
- Can have more descriptive labels, ordering of categories, etc.
Let’s give the embarked variable more descriptive values by converting it to a category type and manipulating it that way.
sub_titanic_data["embarkedC"] = sub_titanic_data.embarked.astype("category")
sub_titanic_data.embarkedC[0:2]
| | embarkedC |
|---|---|
| 0 | S |
| 1 | S |
- Now we can use the .cat.rename_categories() method on this category variable
sub_titanic_data.embarkedC = sub_titanic_data.embarkedC.cat.rename_categories(["Cherbourg", "Queenstown", "Southampton"])
sub_titanic_data.embarkedC[0:2]
| | embarkedC |
|---|---|
| 0 | Southampton |
| 1 | Southampton |
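Note: the list form above relies on the alphabetical ordering of the underlying categories (C, Q, S). Passing a dict-like mapping of old to new labels avoids that reliance; had we wanted, the renaming could have been done as:

sub_titanic_data.embarkedC.cat.rename_categories({"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}) #explicit old-to-new mapping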
Way better! Now let’s grab two more categorical variables and do similar things:
- sex (Male or Female)
- survived (survived or died)
#convert sex variable
sub_titanic_data["sexC"] = sub_titanic_data.sex.astype("category")
sub_titanic_data.sexC = sub_titanic_data.sexC.cat.rename_categories(["Female", "Male"])
#convert survived variable
sub_titanic_data["survivedC"] = sub_titanic_data.survived.astype("category")
sub_titanic_data.survivedC = sub_titanic_data.survivedC.cat.rename_categories(["Died", "Survived"])
Contingency tables
- Tables of counts are the main numerical summary for categorical data
- Create one-way contingency tables with the .value_counts() method (one-way because we are looking at one variable at a time)
sub_titanic_data.embarkedC.value_counts(dropna = False)
| | count |
|---|---|
| embarkedC | |
| Southampton | 914 |
| Cherbourg | 270 |
| Queenstown | 123 |
| NaN | 3 |
sub_titanic_data.survivedC.value_counts()
| | count |
|---|---|
| survivedC | |
| Died | 809 |
| Survived | 500 |
sub_titanic_data.sexC.value_counts()
| | count |
|---|---|
| sexC | |
| Male | 843 |
| Female | 466 |
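The .value_counts() method also accepts a normalize argument if you want relative frequencies rather than raw counts:

sub_titanic_data.embarkedC.value_counts(normalize = True) #proportions instead of counts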
- Alternatively, we can find a one-way table using the pd.crosstab() function
  - This function is meant to take two (or more) columns and return tabulations between those variables
  - We can define a dummy variable to cross with
  - The index argument is the row variable and the columns argument is the column variable
sub_titanic_data["dummy"] = 0
pd.crosstab(index = sub_titanic_data.embarkedC, columns = sub_titanic_data.dummy)
| dummy | 0 |
|---|---|
| embarkedC | |
| Cherbourg | 270 |
| Queenstown | 123 |
| Southampton | 914 |
pd.crosstab(index = sub_titanic_data.sexC, columns = sub_titanic_data.dummy)
| dummy | 0 |
|---|---|
| sexC | |
| Female | 466 |
| Male | 843 |
- To summarize two categorical variables together, we use a two-way contingency table
- Now the crosstab() function can be used more naturally
pd.crosstab(
sub_titanic_data.embarkedC, #index variable
sub_titanic_data.survivedC) #column variable
| survivedC | Died | Survived |
|---|---|---|
| embarkedC | ||
| Cherbourg | 120 | 150 |
| Queenstown | 79 | 44 |
| Southampton | 610 | 304 |
pd.crosstab(
sub_titanic_data.sexC,
sub_titanic_data.survivedC)
| survivedC | Died | Survived |
|---|---|---|
| sexC | ||
| Female | 127 | 339 |
| Male | 682 | 161 |
- Add marginal totals with the margins = True argument
pd.crosstab(
sub_titanic_data.embarkedC,
sub_titanic_data.survivedC,
margins = True)
| survivedC | Died | Survived | All |
|---|---|---|---|
| embarkedC | |||
| Cherbourg | 120 | 150 | 270 |
| Queenstown | 79 | 44 | 123 |
| Southampton | 610 | 304 | 914 |
| All | 809 | 498 | 1307 |
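pd.crosstab() also has a normalize argument that converts the counts to proportions: 'index' gives row proportions, 'columns' gives column proportions, and 'all' normalizes over the whole table. For example, row proportions give the survival rate within each port:

pd.crosstab(
    sub_titanic_data.embarkedC,
    sub_titanic_data.survivedC,
    normalize = 'index') #each row sums to 1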
- Add row and column names for clarity
  - Use the rownames and colnames arguments
pd.crosstab(
sub_titanic_data.embarkedC,
sub_titanic_data.survivedC,
margins = True,
rownames = ["Embarked Port"],
colnames = ["Survival Status"]
)
| Survival Status | Died | Survived | All |
|---|---|---|---|
| Embarked Port | |||
| Cherbourg | 120 | 150 | 270 |
| Queenstown | 79 | 44 | 123 |
| Southampton | 610 | 304 | 914 |
| All | 809 | 498 | 1307 |
That looks great!
For more than two variables we can create tables but they get harder to read. For instance, we can look at a three-way contingency table:
pd.crosstab(
[sub_titanic_data.embarkedC, sub_titanic_data.survivedC], #pass a list of columns for the rows
sub_titanic_data.sexC,
margins = True)
| sexC | | Female | Male | All |
|---|---|---|---|---|
| embarkedC | survivedC | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Survived | 102 | 48 | 150 | |
| Queenstown | Died | 23 | 56 | 79 |
| Survived | 37 | 7 | 44 | |
| Southampton | Died | 93 | 517 | 610 |
| Survived | 198 | 106 | 304 | |
| All | 464 | 843 | 1307 |
- We can add in names for more clarity
my_tab = pd.crosstab(
[sub_titanic_data.embarkedC, sub_titanic_data.survivedC],
sub_titanic_data.sexC,
margins = True,
rownames = ['Embarked Port', 'Survival Status'], #a list similar to how the rows were passed
colnames = ['Sex'])
my_tab
| Sex | | Female | Male | All |
|---|---|---|---|---|
| Embarked Port | Survival Status | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Survived | 102 | 48 | 150 | |
| Queenstown | Died | 23 | 56 | 79 |
| Survived | 37 | 7 | 44 | |
| Southampton | Died | 93 | 517 | 610 |
| Survived | 198 | 106 | 304 | |
| All | 464 | 843 | 1307 |
We might want to subset the returned table to get certain values…
- Note that the crosstab() function returns a data frame!
type(my_tab)
pandas.core.frame.DataFrame
my_tab.columns # columns of the data frame
Index(['Female', 'Male', 'All'], dtype='object', name='Sex')
my_tab.index #rows of the data frame, these are tuples!
MultiIndex([( 'Cherbourg', 'Died'),
( 'Cherbourg', 'Survived'),
( 'Queenstown', 'Died'),
( 'Queenstown', 'Survived'),
('Southampton', 'Died'),
('Southampton', 'Survived'),
( 'All', '')],
names=['Embarked Port', 'Survival Status'])
Can obtain conditional bivariate info via subsetting!
- The MultiIndex can be tough, but let’s look at some examples
- Below returns the embarked vs survived table for females
my_tab["Female"]| Female | ||
|---|---|---|
| Embarked Port | Survival Status | |
| Cherbourg | Died | 11 |
| Survived | 102 | |
| Queenstown | Died | 23 |
| Survived | 37 | |
| Southampton | Died | 93 |
| Survived | 198 | |
| All | 464 |
my_tab.loc[:, "Female"] #.loc way of doing this, : gives all of that index| Female | ||
|---|---|---|
| Embarked Port | Survival Status | |
| Cherbourg | Died | 11 |
| Survived | 102 | |
| Queenstown | Died | 23 |
| Survived | 37 | |
| Southampton | Died | 93 |
| Survived | 198 | |
| All | 464 |
- Below returns the sex vs embarked table for those that died
my_tab.iloc[0:5:2, :] #0:5:2 is shorthand for a sequence from 0 up to (not including) 5 in steps of 2
| Sex | | Female | Male | All |
|---|---|---|---|---|
| Embarked Port | Survival Status | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Queenstown | Died | 23 | 56 | 79 |
| Southampton | Died | 93 | 517 | 610 |
- Using .loc[] is better
- Must understand our MultiIndex
my_tab.index
MultiIndex([( 'Cherbourg', 'Died'),
( 'Cherbourg', 'Survived'),
( 'Queenstown', 'Died'),
( 'Queenstown', 'Survived'),
('Southampton', 'Died'),
('Southampton', 'Survived'),
( 'All', '')],
names=['Embarked Port', 'Survival Status'])
- Below uses this index to return the sex vs embarked table for those that died
my_tab.loc[(("Cherbourg", "Queenstown", "Southampton"), "Died"), :]| Sex | Female | Male | All | |
|---|---|---|---|---|
| Embarked Port | Survival Status | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Queenstown | Died | 23 | 56 | 79 |
| Southampton | Died | 93 | 517 | 610 |
- Below returns the sex vs survived table for embarked of Cherbourg
my_tab.loc[('Cherbourg', ("Died", "Survived")), :]| Sex | Female | Male | All | |
|---|---|---|---|---|
| Embarked Port | Survival Status | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Survived | 102 | 48 | 150 |
- Return the sex table for those that died and embarked at Cherbourg
  - First with .iloc[], then with .loc[]
my_tab.iloc[0, :]
| | Cherbourg |
|---|---|
| | Died |
| Sex | |
| Female | 11 |
| Male | 109 |
| All | 120 |
my_tab.loc[('Cherbourg', 'Died')]
| | Cherbourg |
|---|---|
| | Died |
| Sex | |
| Female | 11 |
| Male | 109 |
| All | 120 |
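A tidier way to grab one level of a MultiIndex is the .xs() (cross-section) method. For instance, the sex vs embarked table for those that died can also be obtained with:

my_tab.xs("Died", level = "Survival Status") #cross-section of the Survival Status level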
Numeric Data
Goal: Describe the distribution of the variable
- Distribution = pattern and frequency with which you observe a variable
- Numeric variable - entries are a numerical value where math can be performed
For a single numeric variable, describe the distribution via
- Shape: Histogram, Density plot, … (covered later)
- Measures of center: Mean, Median, …
- Measures of spread: Variance, Standard Deviation, Quartiles, IQR, …
For two numeric variables, describe the distribution via
- Shape: Scatter plot, …
- Measures of linear relationship: Covariance, Correlation, …
Measures of Center
- Find the mean and median with methods on a Series
type(sub_titanic_data['fare'])
pandas.core.series.Series
- Corresponding methods exist for the common numerical summaries
sub_titanic_data['fare'].mean()
33.29547928134557
sub_titanic_data['fare'].median()
14.4542
sub_titanic_data.age.mean() #same thing with a different way to get a column
29.8811345124283
sub_titanic_data.age.median()
28.0
Measures of Spread
- Standard Deviation, Quartiles, & IQR found with Series methods as well
sub_titanic_data.age.std()
14.413499699923594
sub_titanic_data.age.quantile(q = [0.2, 0.25, 0.5, 0.95])
| | age |
|---|---|
| 0.20 | 19.0 |
| 0.25 | 21.0 |
| 0.50 | 28.0 |
| 0.95 | 57.0 |
q1 = sub_titanic_data.age.quantile(q = [0.25])
q1
| | age |
|---|---|
| 0.25 | 21.0 |
q3 = sub_titanic_data.age.quantile(q = [0.75])
q3
| | age |
|---|---|
| 0.75 | 39.0 |
type(q1)
pandas.core.series.Series
- As both q1 and q3 are Series, they have indices
- This makes them a little more difficult than you might like to subtract (to find the IQR)
q3 - q1 #doesn't work due to the differing index values (0.75 vs 0.25)
| | age |
|---|---|
| 0.25 | NaN |
| 0.75 | NaN |
q3[0.75] - q1[0.25] #grab the values by their index labels and subtract those
18.0
- Alternatively, remember that the .values attribute returns a numpy array. We can subtract these.
q3.values - q1.values
array([18.])
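Simpler still, passing a single scalar to .quantile() returns a plain float, so the IQR can be computed directly:

sub_titanic_data.age.quantile(0.75) - sub_titanic_data.age.quantile(0.25) #returns 18.0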
Measures of Linear Relationship
- Correlation via the .corr() method on a data frame
- This gives the correlation between any numerically (stored) variables that are passed
- Just because it is stored numerically doesn’t mean we should treat it numerically!
sub_titanic_data[["age", "fare", "sibsp", "parch"]].corr()| age | fare | sibsp | parch | |
|---|---|---|---|---|
| age | 1.000000 | 0.178739 | -0.243699 | -0.150917 |
| fare | 0.178739 | 1.000000 | 0.160238 | 0.221539 |
| sibsp | -0.243699 | 0.160238 | 1.000000 | 0.373587 |
| parch | -0.150917 | 0.221539 | 0.373587 | 1.000000 |
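Covariance, the unscaled counterpart of correlation mentioned above, is available through the analogous .cov() method:

sub_titanic_data[["age", "fare", "sibsp", "parch"]].cov() #covariances instead of correlations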
Summaries Across Groups
Usually want summaries for different subgroups of data
Two approaches we’ll cover:
- Use the .groupby() method and then a summarization method
- Use the pd.crosstab() function with the aggfunc argument
.groupby() Examples
Example: Get similar fare summaries for each survival status
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()
| | age | fare | sibsp | parch |
|---|---|---|---|---|
| survivedC | ||||
| Died | 30.545369 | 23.353831 | 0.521632 | 0.328801 |
| Survived | 28.918228 | 49.361184 | 0.462000 | 0.476000 |
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].std()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].std()
| | age | fare | sibsp | parch |
|---|---|---|---|---|
| survivedC | ||||
| Died | 13.922539 | 34.145096 | 1.210449 | 0.912332 |
| Survived | 15.061481 | 68.648795 | 0.685197 | 0.776292 |
- The .unstack() method on the result can sometimes make the output clearer
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack()
| | | 0 |
|---|---|---|
| | survivedC | |
| age | Died | 30.545369 |
| Survived | 28.918228 | |
| fare | Died | 23.353831 |
| Survived | 49.361184 | |
| sibsp | Died | 0.521632 |
| Survived | 0.462000 | |
| parch | Died | 0.328801 |
| Survived | 0.476000 |
Multiple grouping variables can be given as a list
- Example: Get summary for numeric type variables for each survival status and embarked port
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean()
| | | age | fare | sibsp | parch |
|---|---|---|---|---|---|
| survivedC | embarkedC | ||||
| Died | Cherbourg | 34.468750 | 40.255592 | 0.316667 | 0.225000 |
| Queenstown | 30.202703 | 11.615349 | 0.379747 | 0.177215 | |
| Southampton | 29.945385 | 21.546160 | 0.580328 | 0.368852 | |
| Survived | Cherbourg | 31.037248 | 80.000807 | 0.466667 | 0.486667 |
| Queenstown | 24.153846 | 13.833998 | 0.272727 | 0.000000 | |
| Southampton | 27.989881 | 39.183470 | 0.490132 | 0.542763 |
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std()
| | | age | fare | sibsp | parch |
|---|---|---|---|---|---|
| survivedC | embarkedC | ||||
| Died | Cherbourg | 14.655181 | 56.553704 | 0.518293 | 0.557040 |
| Queenstown | 16.785187 | 10.922240 | 1.016578 | 0.655538 | |
| Southampton | 13.496871 | 28.786020 | 1.320897 | 0.990934 | |
| Survived | Cherbourg | 15.523752 | 97.642219 | 0.575410 | 0.730327 |
| Queenstown | 7.057457 | 17.503850 | 0.585230 | 0.000000 | |
| Southampton | 14.926867 | 47.656409 | 0.744552 | 0.831405 |
- As our code gets longer, this is a good place to use \ to extend our code down a line
sub_titanic_data \
.groupby(["survivedC", "embarkedC"]) \
[["age", "fare", "sibsp", "parch"]] \
.mean()
| | | age | fare | sibsp | parch |
|---|---|---|---|---|---|
| survivedC | embarkedC | ||||
| Died | Cherbourg | 34.468750 | 40.255592 | 0.316667 | 0.225000 |
| Queenstown | 30.202703 | 11.615349 | 0.379747 | 0.177215 | |
| Southampton | 29.945385 | 21.546160 | 0.580328 | 0.368852 | |
| Survived | Cherbourg | 31.037248 | 80.000807 | 0.466667 | 0.486667 |
| Queenstown | 24.153846 | 13.833998 | 0.272727 | 0.000000 | |
| Southampton | 27.989881 | 39.183470 | 0.490132 | 0.542763 |
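The .agg() method (covered in the video below) lets us request several summaries from one groupby. It is also a natural place to pass observed explicitly and silence the FutureWarning noted earlier. A sketch:

sub_titanic_data \
    .groupby(["survivedC", "embarkedC"], observed = False) \
    ["fare"] \
    .agg(['mean', 'median', 'std'])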
pd.crosstab() Examples
- Alternatively, we can use the pd.crosstab() function with an aggfunc argument to define the summarization to produce
Example: Get summary for numeric type variables for each survival status
- A bit awkward in this case as we don’t really have a ‘column’ variable
- Make a dummy variable for that
pd.crosstab(
sub_titanic_data.survivedC,
columns = ["mean" for _ in range(sub_titanic_data.shape[0])], #create variable with only the value 'mean'
values = sub_titanic_data.fare,
aggfunc = 'mean')
| col_0 | mean |
|---|---|
| survivedC | |
| Died | 23.353831 |
| Survived | 49.361184 |
- Can return multiple summaries at once by passing them as a list
pd.crosstab(
sub_titanic_data.survivedC,
columns = ["stat" for _ in range(sub_titanic_data.shape[0])],
values = sub_titanic_data.fare,
aggfunc = ['mean', 'median', 'std', 'count'])
| | mean | median | std | count |
|---|---|---|---|---|
| col_0 | stat | stat | stat | stat |
| survivedC | ||||
| Died | 23.353831 | 10.5 | 34.145096 | 808 |
| Survived | 49.361184 | 26.0 | 68.648795 | 500 |
More natural with two grouping variables
- Example: Get summary for numeric type variables for each survival status and embarked port
pd.crosstab(
sub_titanic_data.embarkedC,
sub_titanic_data.survivedC,
values = sub_titanic_data.fare,
aggfunc = ['mean', 'count'])
| | mean | | count | |
|---|---|---|---|---|
| survivedC | Died | Survived | Died | Survived |
| embarkedC | ||||
| Cherbourg | 40.255592 | 80.000807 | 120 | 150 |
| Queenstown | 11.615349 | 13.833998 | 79 | 44 |
| Southampton | 21.546160 | 39.183470 | 609 | 304 |
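The video below also demonstrates the .pivot_table() method, a more general tool for grouped summaries. A roughly equivalent call to the crosstab above might look like:

sub_titanic_data.pivot_table(
    index = "embarkedC",
    columns = "survivedC",
    values = "fare",
    aggfunc = ['mean', 'count'])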
Quick Video
This video shows an example of reading in some data and finding numeric summaries!
We’ll look at the .pivot_table() and .agg() methods.
Remember to pop the video out into the full player.
The notebook written in the video is available here.
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=197e5481-40be-4488-8271-b0ff000f5fd2&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")Recap
EDA is often the first step to an analysis:
Must understand the type of data you have/missingness/data validation
Then describe the distributions of the variables
Numerical summaries
- Contingency Tables: pd.crosstab()
- Mean/Median: .mean(), .median() methods on a data frame
- Standard Deviation/Quantiles: .std(), .quantile() methods
Across subgroups with the .groupby() method or pd.crosstab() with the values and aggfunc arguments
You can fancy up output too!
That wraps up the material for week 3! Head to the Moodle site to work on your next homework assignment.
If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!
If you are on Google Colab, head back to our course website for our next lesson!