import pandas as pd
titanic_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/titanic.csv")
Numerical Summaries
Justin Post
The usual first step in an analysis is to get to know your data via an Exploratory Data Analysis (EDA)
EDA generally consists of a few steps:
- Understand how your data is stored
- Do basic data validation
- Determine rate of missing values
- Clean up data as needed
- Investigate distributions
  - Univariate measures/graphs
  - Multivariate measures/graphs
- Apply transformations and repeat previous step
Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you'd like!
Understand How Data is Stored
First, let's read in some data. Recall that .csv files (comma-separated value files) can be read in using pandas' read_csv() function.
We’ll read in the classic titanic data set.
- The .info() method allows us to see how our variables are stored (among other things)
- Column data types should make sense for what you expect!
titanic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null float64
1 survived 1309 non-null float64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null float64
6 parch 1309 non-null float64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
dtypes: float64(7), object(7)
memory usage: 143.4+ KB
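Note that pclass and survived show up as float64 even though they are really integer codes; the all-missing last row forces a float storage since the plain int64 type can't hold NaN. If that bothers you, one option (a quick sketch, not something we need for this lesson) is pandas' nullable integer type:

titanic_data["survived"].astype("Int64") #capital-I Int64 is pandas' nullable integer type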
- The .head() and .tail() methods help to see what we have as well
titanic_data.head() #clearly some missing values with NaNs
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
titanic_data.tail() #note the last row of NaN (not a number)
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1305 | 3.0 | 0.0 | Zabour, Miss. Thamine | female | NaN | 1.0 | 0.0 | 2665 | 14.4542 | NaN | C | NaN | NaN | NaN |
1306 | 3.0 | 0.0 | Zakarian, Mr. Mapriededer | male | 26.5 | 0.0 | 0.0 | 2656 | 7.2250 | NaN | C | NaN | 304.0 | NaN |
1307 | 3.0 | 0.0 | Zakarian, Mr. Ortin | male | 27.0 | 0.0 | 0.0 | 2670 | 7.2250 | NaN | C | NaN | NaN | NaN |
1308 | 3.0 | 0.0 | Zimmerman, Mr. Leo | male | 29.0 | 0.0 | 0.0 | 315082 | 7.8750 | NaN | S | NaN | NaN | NaN |
1309 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Do Basic Data Validation
- Use the .describe() method on a data frame
- Check that the min's, max's, etc. all make sense!
titanic_data.describe()
pclass | survived | age | sibsp | parch | fare | body | |
---|---|---|---|---|---|---|---|
count | 1309.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 121.000000 |
mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
std | 0.837836 | 0.486055 | 14.413500 | 1.041658 | 0.865560 | 51.758668 | 97.696922 |
min | 1.000000 | 0.000000 | 0.166700 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 2.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 72.000000 |
50% | 3.000000 | 0.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 155.000000 |
75% | 3.000000 | 1.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 256.000000 |
max | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 328.000000 |
- Recall we can subset our columns with []
titanic_data.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
dtype='object')
- We can determine which percentiles of selected columns to return by combining column subsetting via the selection brackets [] with the .describe() method
"age", "sibsp", "parch", "fare"]].describe(percentiles = [0.05, 0.25, 0.99]) titanic_data[[
age | sibsp | parch | fare | |
---|---|---|---|---|
count | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 |
mean | 29.881135 | 0.498854 | 0.385027 | 33.295479 |
std | 14.413500 | 1.041658 | 0.865560 | 51.758668 |
min | 0.166700 | 0.000000 | 0.000000 | 0.000000 |
5% | 5.000000 | 0.000000 | 0.000000 | 7.225000 |
25% | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
50% | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
99% | 65.000000 | 5.000000 | 4.000000 | 262.375000 |
max | 80.000000 | 8.000000 | 9.000000 | 512.329200 |
Determine Rate of Missing Values
- Use the .isnull() method to determine the missing values
titanic_data.isnull()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
2 | False | False | False | False | False | False | False | False | False | False | False | True | True | False |
3 | False | False | False | False | False | False | False | False | False | False | False | True | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | True | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1305 | False | False | False | False | True | False | False | False | False | True | False | True | True | True |
1306 | False | False | False | False | False | False | False | False | False | True | False | True | False | True |
1307 | False | False | False | False | False | False | False | False | False | True | False | True | True | True |
1308 | False | False | False | False | False | False | False | False | False | True | False | True | True | True |
1309 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
1310 rows × 14 columns
- Yikes! Can't make heads or tails of that.
- This is a DataFrame of booleans!
- Use the .sum() method to see how many null values we have for each column
titanic_data.isnull().sum()
0 | |
---|---|
pclass | 1 |
survived | 1 |
name | 1 |
sex | 1 |
age | 264 |
sibsp | 1 |
parch | 1 |
ticket | 1 |
fare | 2 |
cabin | 1015 |
embarked | 3 |
boat | 824 |
body | 1189 |
home.dest | 565 |
- This type of multiple method use gives us a good chance to use our \ operator to create more readable code by making it multi-line
titanic_data.isnull() \
    .sum()
0 | |
---|---|
pclass | 1 |
survived | 1 |
name | 1 |
sex | 1 |
age | 264 |
sibsp | 1 |
parch | 1 |
ticket | 1 |
fare | 2 |
cabin | 1015 |
embarked | 3 |
boat | 824 |
body | 1189 |
home.dest | 565 |
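Counts are useful, but the rate of missingness is often what we actually care about. Since these booleans act as 0/1 values, taking the mean gives the proportion missing. A quick sketch:

titanic_data.isnull().mean() #True counts as 1, so this is the proportion missing per column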
Clean Up Data As Needed
- We can remove rows with missing values using the .dropna() method
- First, remove the cabin, boat, and body variables since they have so many missing values
  - If we want to just remove some columns, we can use the .drop() method
sub_titanic_data = titanic_data.drop(columns = ["body", "cabin", "boat"])
sub_titanic_data.shape
(1310, 11)
- Check on the missingness now
sub_titanic_data.isnull().sum()
0 | |
---|---|
pclass | 1 |
survived | 1 |
name | 1 |
sex | 1 |
age | 264 |
sibsp | 1 |
parch | 1 |
ticket | 1 |
fare | 2 |
embarked | 3 |
home.dest | 565 |
- Now we are ready to use the .dropna() method to remove any rows with missing data
temp = sub_titanic_data.dropna()
temp.shape #notice the reduction in rows
(684, 11)
temp.isnull().sum() #no more missing values
0 | |
---|---|
pclass | 0 |
survived | 0 |
name | 0 |
sex | 0 |
age | 0 |
sibsp | 0 |
parch | 0 |
ticket | 0 |
fare | 0 |
embarked | 0 |
home.dest | 0 |
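Before moving on, note that .dropna() also has a subset argument if you only want to require non-missingness in certain columns. A quick sketch, dropping only the rows with a missing age:

sub_titanic_data.dropna(subset = ["age"]).shape #other columns may still have NaNs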
- Usually, you don't want to drop all the rows with any missing data as you are throwing out useful info.
- One option is to impute the missing values… this can be dangerous but can be done with the .fillna() method
sub_titanic_data.fillna(value = 0) #note, for instance, some values of age are 0 now and the last row is all 0 values
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | embarked | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | S | St Louis, MO |
1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1305 | 3.0 | 0.0 | Zabour, Miss. Thamine | female | 0.0000 | 1.0 | 0.0 | 2665 | 14.4542 | C | 0 |
1306 | 3.0 | 0.0 | Zakarian, Mr. Mapriededer | male | 26.5000 | 0.0 | 0.0 | 2656 | 7.2250 | C | 0 |
1307 | 3.0 | 0.0 | Zakarian, Mr. Ortin | male | 27.0000 | 0.0 | 0.0 | 2670 | 7.2250 | C | 0 |
1308 | 3.0 | 0.0 | Zimmerman, Mr. Leo | male | 29.0000 | 0.0 | 0.0 | 315082 | 7.8750 | S | 0 |
1309 | 0.0 | 0.0 | 0 | 0 | 0.0000 | 0.0 | 0.0 | 0 | 0.0000 | 0 | 0 |
1310 rows × 11 columns
- Can set the value you want to impute by passing a dictionary of key/value pairs
= {"home.dest": "Unknown", "age": 200}) sub_titanic_data.fillna(value
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | embarked | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | S | St Louis, MO |
1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1305 | 3.0 | 0.0 | Zabour, Miss. Thamine | female | 200.0000 | 1.0 | 0.0 | 2665 | 14.4542 | C | Unknown |
1306 | 3.0 | 0.0 | Zakarian, Mr. Mapriededer | male | 26.5000 | 0.0 | 0.0 | 2656 | 7.2250 | C | Unknown |
1307 | 3.0 | 0.0 | Zakarian, Mr. Ortin | male | 27.0000 | 0.0 | 0.0 | 2670 | 7.2250 | C | Unknown |
1308 | 3.0 | 0.0 | Zimmerman, Mr. Leo | male | 29.0000 | 0.0 | 0.0 | 315082 | 7.8750 | S | Unknown |
1309 | NaN | NaN | NaN | NaN | 200.0000 | NaN | NaN | NaN | NaN | NaN | Unknown |
1310 rows × 11 columns
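A gentler option than an arbitrary constant (a quick sketch, not a recommendation for every situation) is to impute with a summary of the observed values, like the median age:

sub_titanic_data.fillna(value = {"age": sub_titanic_data.age.median()}) #median of the non-missing ages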
Investigate distributions
How to summarize data depends on the type of data
- Categorical (Qualitative) variable - entries are a label or attribute
- Numeric (Quantitative) variable - entries are a numerical value where math can be performed
Numerical summaries (across subgroups)
- Contingency Tables (for categorical data)
- Mean/Median
- Standard Deviation/Variance/IQR
- Quantiles/Percentiles
Graphical summaries (across subgroups)
- Bar plots (for categorical data)
- Histograms
- Box plots
- Scatter plots
Categorical Data
Goal: Describe the distribution of the variable
- Distribution = pattern and frequency with which you observe a variable
- Categorical variable - entries are a label or attribute
- Describe the relative frequency (or count) for each category
- Using the pandas .value_counts() method and pd.crosstab() function
Variables of interest for this section:
- embarked (where journey started)
sub_titanic_data.embarked[0:2]
embarked | |
---|---|
0 | S |
1 | S |
type(sub_titanic_data.embarked[0])
str
The str type isn't ideal for summarizations. A different data type is better!
Category Type Variables
A category type variable is really useful for categorical variables.
- Akin to a factor variable in R (if you know those)
- Can have more descriptive labels, ordering of categories, etc.
Let's give the embarked variable more descriptive values by converting it to a category type and manipulating it that way.
"embarkedC"] = sub_titanic_data.embarked.astype("category")
sub_titanic_data[0:2] sub_titanic_data.embarkedC[
embarkedC | |
---|---|
0 | S |
1 | S |
- Now we can use the .cat.rename_categories() method on this category variable
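One caution worth adding: the new labels below are matched to the existing categories by position, and string categories are stored in sorted order ('C', 'Q', 'S' here). It's worth confirming that order before renaming:

sub_titanic_data.embarkedC.cat.categories #sorted order: 'C', 'Q', 'S'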
sub_titanic_data.embarkedC = sub_titanic_data.embarkedC.cat.rename_categories(["Cherbourg", "Queenstown", "Southampton"])
sub_titanic_data.embarkedC[0:2]
embarkedC | |
---|---|
0 | Southampton |
1 | Southampton |
Way better! Now let’s grab two more categorical variables and do similar things:
- sex (Male or Female)
- survived (survived or died)
#convert sex variable
sub_titanic_data["sexC"] = sub_titanic_data.sex.astype("category")
sub_titanic_data.sexC = sub_titanic_data.sexC.cat.rename_categories(["Female", "Male"])
#convert survived variable
sub_titanic_data["survivedC"] = sub_titanic_data.survived.astype("category")
sub_titanic_data.survivedC = sub_titanic_data.survivedC.cat.rename_categories(["Died", "Survived"])
Contingency tables
- Tables of counts are the main numerical summary for categorical data
- Create one-way contingency tables with the .value_counts() method (one-way because we are looking at one variable at a time)
sub_titanic_data.embarkedC.value_counts(dropna = False)
count | |
---|---|
embarkedC | |
Southampton | 914 |
Cherbourg | 270 |
Queenstown | 123 |
NaN | 3 |
sub_titanic_data.survivedC.value_counts()
count | |
---|---|
survivedC | |
Died | 809 |
Survived | 500 |
sub_titanic_data.sexC.value_counts()
count | |
---|---|
sexC | |
Male | 843 |
Female | 466 |
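Since we want to describe relative frequencies too, note that .value_counts() has a normalize argument. A quick example:

sub_titanic_data.embarkedC.value_counts(normalize = True) #proportions rather than counts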
- Alternatively, we can find a one-way table using the pd.crosstab() function
  - This function is meant to take two columns (or more) and return tabulations between those variables
  - We can define a dummy variable to cross with
- The index argument is the row variable and the columns argument is the column variable
"dummy"] = 0
sub_titanic_data[= sub_titanic_data.embarkedC, columns = sub_titanic_data.dummy) pd.crosstab(index
dummy | 0 |
---|---|
embarkedC | |
Cherbourg | 270 |
Queenstown | 123 |
Southampton | 914 |
pd.crosstab(index = sub_titanic_data.sexC, columns = sub_titanic_data.dummy)
dummy | 0 |
---|---|
sexC | |
Female | 466 |
Male | 843 |
- To summarize two categorical variables together, we use a two-way contingency table
- Now the pd.crosstab() function can be used more naturally
pd.crosstab(
    sub_titanic_data.embarkedC, #index variable
    sub_titanic_data.survivedC) #column variable
survivedC | Died | Survived |
---|---|---|
embarkedC | ||
Cherbourg | 120 | 150 |
Queenstown | 79 | 44 |
Southampton | 610 | 304 |
pd.crosstab(
    sub_titanic_data.sexC,
    sub_titanic_data.survivedC)
survivedC | Died | Survived |
---|---|---|
sexC | ||
Female | 127 | 339 |
Male | 682 | 161 |
- Add marginal totals with the margins = True argument
pd.crosstab(
    sub_titanic_data.embarkedC,
    sub_titanic_data.survivedC,
    margins = True)
survivedC | Died | Survived | All |
---|---|---|---|
embarkedC | |||
Cherbourg | 120 | 150 | 270 |
Queenstown | 79 | 44 | 123 |
Southampton | 610 | 304 | 914 |
All | 809 | 498 | 1307 |
- Add row and column names for clarity
  - Use the rownames and colnames arguments
pd.crosstab(
    sub_titanic_data.embarkedC,
    sub_titanic_data.survivedC,
    margins = True,
    rownames = ["Embarked Port"],
    colnames = ["Survival Status"]
)
Survival Status | Died | Survived | All |
---|---|---|---|
Embarked Port | |||
Cherbourg | 120 | 150 | 270 |
Queenstown | 79 | 44 | 123 |
Southampton | 610 | 304 | 914 |
All | 809 | 498 | 1307 |
That looks great!
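pd.crosstab() also has a normalize argument if you'd rather see proportions than counts. A quick sketch giving the survival proportion within each port (each row sums to 1):

pd.crosstab(
    sub_titanic_data.embarkedC,
    sub_titanic_data.survivedC,
    normalize = 'index') #row proportions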
For more than two variables we can create tables but they get harder to read. For instance, we can look at a three-way contingency table:
pd.crosstab(
    #pass a list of columns for the rows
    [sub_titanic_data.embarkedC, sub_titanic_data.survivedC],
    sub_titanic_data.sexC,
    margins = True)
sexC | Female | Male | All | |
---|---|---|---|---|
embarkedC | survivedC | |||
Cherbourg | Died | 11 | 109 | 120 |
Survived | 102 | 48 | 150 | |
Queenstown | Died | 23 | 56 | 79 |
Survived | 37 | 7 | 44 | |
Southampton | Died | 93 | 517 | 610 |
Survived | 198 | 106 | 304 | |
All | 464 | 843 | 1307 |
- We can add in names for more clarity
my_tab = pd.crosstab(
    [sub_titanic_data.embarkedC, sub_titanic_data.survivedC],
    sub_titanic_data.sexC,
    margins = True,
    rownames = ['Embarked Port', 'Survival Status'], #a list similar to how the rows were passed
    colnames = ['Sex'])
my_tab
Sex | Female | Male | All | |
---|---|---|---|---|
Embarked Port | Survival Status | |||
Cherbourg | Died | 11 | 109 | 120 |
Survived | 102 | 48 | 150 | |
Queenstown | Died | 23 | 56 | 79 |
Survived | 37 | 7 | 44 | |
Southampton | Died | 93 | 517 | 610 |
Survived | 198 | 106 | 304 | |
All | 464 | 843 | 1307 |
We might want to subset the returned table to get certain values…
- Note that the crosstab() function returns a data frame!
type(my_tab)
pandas.core.frame.DataFrame
my_tab.columns # columns of the data frame
Index(['Female', 'Male', 'All'], dtype='object', name='Sex')
my_tab.index #rows of the data frame, these are tuples!
MultiIndex([( 'Cherbourg', 'Died'),
( 'Cherbourg', 'Survived'),
( 'Queenstown', 'Died'),
( 'Queenstown', 'Survived'),
('Southampton', 'Died'),
('Southampton', 'Survived'),
( 'All', '')],
names=['Embarked Port', 'Survival Status'])
Can obtain conditional bivariate info via subsetting!
- The MultiIndex can be tough but let's look at some examples
- Below returns the embarked vs survived table for females
"Female"] my_tab[
Female | ||
---|---|---|
Embarked Port | Survival Status | |
Cherbourg | Died | 11 |
Survived | 102 | |
Queenstown | Died | 23 |
Survived | 37 | |
Southampton | Died | 93 |
Survived | 198 | |
All | 464 |
"Female"] #.loc way of doing this, : gives all of that index my_tab.loc[:,
Female | ||
---|---|---|
Embarked Port | Survival Status | |
Cherbourg | Died | 11 |
Survived | 102 | |
Queenstown | Died | 23 |
Survived | 37 | |
Southampton | Died | 93 |
Survived | 198 | |
All | 464 |
- Below returns the sex vs embarked table for those that died
my_tab.iloc[0:5:2, :] #0:5:2 gives a shorthand for a sequence with steps of 2
Sex | Female | Male | All | |
---|---|---|---|---|
Embarked Port | Survival Status | |||
Cherbourg | Died | 11 | 109 | 120 |
Queenstown | Died | 23 | 56 | 79 |
Southampton | Died | 93 | 517 | 610 |
- Using .loc[] is better
- Must understand our MultiIndex
my_tab.index
MultiIndex([( 'Cherbourg', 'Died'),
( 'Cherbourg', 'Survived'),
( 'Queenstown', 'Died'),
( 'Queenstown', 'Survived'),
('Southampton', 'Died'),
('Southampton', 'Survived'),
( 'All', '')],
names=['Embarked Port', 'Survival Status'])
- Below uses this index to return the sex vs embarked table for those that died
"Cherbourg", "Queenstown", "Southampton"), "Died"), :] my_tab.loc[((
Sex | Female | Male | All | |
---|---|---|---|---|
Embarked Port | Survival Status | |||
Cherbourg | Died | 11 | 109 | 120 |
Queenstown | Died | 23 | 56 | 79 |
Southampton | Died | 93 | 517 | 610 |
- Below returns the sex vs survived table for embarked of Cherbourg
'Cherbourg', ("Died", "Survived")), :] my_tab.loc[(
Sex | Female | Male | All | |
---|---|---|---|---|
Embarked Port | Survival Status | |||
Cherbourg | Died | 11 | 109 | 120 |
Survived | 102 | 48 | 150 |
- Return the sex table for those that died and embarked at Cherbourg
  - First with .iloc[] then with .loc[]
my_tab.iloc[0, :]
Cherbourg | |
---|---|
Died | |
Sex | |
Female | 11 |
Male | 109 |
All | 120 |
my_tab.loc[('Cherbourg', 'Died')]
Cherbourg | |
---|---|
Died | |
Sex | |
Female | 11 |
Male | 109 |
All | 120 |
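As an aside, the .xs() method offers another route through a MultiIndex: it grabs a cross-section by level name, so we don't have to spell out the full tuples. A quick sketch returning all the "Died" rows:

my_tab.xs('Died', level = 'Survival Status') #rows whose Survival Status level is 'Died'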
Numeric Data
Goal: Describe the distribution of the variable
- Distribution = pattern and frequency with which you observe a variable
- Numeric variable - entries are a numerical value where math can be performed
For a single numeric variable, describe the distribution via
- Shape: Histogram, Density plot, … (covered later)
- Measures of center: Mean, Median, …
- Measures of spread: Variance, Standard Deviation, Quartiles, IQR, …
For two numeric variables, describe the distribution via
- Shape: Scatter plot, …
- Measures of linear relationship: Covariance, Correlation, …
Measures of Center
- Find mean and median with methods on a Series
type(sub_titanic_data['fare'])
pandas.core.series.Series
- Corresponding methods exist for the common numerical summaries
sub_titanic_data['fare'].mean()
33.29547928134557
sub_titanic_data['fare'].median()
14.4542
sub_titanic_data.age.mean() #same thing with a different way to get a column
29.8811345124283
sub_titanic_data.age.median()
28.0
Measures of Spread
- Standard Deviation, Quartiles, & IQR found with Series methods as well
sub_titanic_data.age.std()
14.413499699923594
sub_titanic_data.age.quantile(q = [0.2, 0.25, 0.5, 0.95])
age | |
---|---|
0.20 | 19.0 |
0.25 | 21.0 |
0.50 | 28.0 |
0.95 | 57.0 |
q1 = sub_titanic_data.age.quantile(q = [0.25])
q1
age | |
---|---|
0.25 | 21.0 |
q3 = sub_titanic_data.age.quantile(q = [0.75])
q3
age | |
---|---|
0.75 | 39.0 |
type(q1)
pandas.core.series.Series
- As both q1 and q3 are Series, they have indices
- This makes them a little more difficult than you might like to subtract (to find the IQR)
q3 - q1 #doesn't work due to the differing index names
age | |
---|---|
0.25 | NaN |
0.75 | NaN |
q3[0.75] - q1[0.25] #grab the values by index names and subtract those
18.0
- Alternatively, remember that the .values attribute returns a numpy array. We can subtract these.
q3.values - q1.values
array([18.])
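Simpler still: passing a scalar (rather than a list) to .quantile() returns a plain number, which sidesteps the index issue entirely. A quick sketch:

sub_titanic_data.age.quantile(0.75) - sub_titanic_data.age.quantile(0.25) #scalar in, scalar out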
Measures of Linear Relationship
- Correlation via the .corr() method on a data frame
- This gives the correlation between any numerically (stored) variables that are passed
  - Just because it is stored numerically doesn't mean we should treat it numerically!
"age", "fare", "sibsp", "parch"]].corr() sub_titanic_data[[
age | fare | sibsp | parch | |
---|---|---|---|---|
age | 1.000000 | 0.178739 | -0.243699 | -0.150917 |
fare | 0.178739 | 1.000000 | 0.160238 | 0.221539 |
sibsp | -0.243699 | 0.160238 | 1.000000 | 0.373587 |
parch | -0.150917 | 0.221539 | 0.373587 | 1.000000 |
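As an aside, .corr() accepts a method argument; a rank-based choice like 'spearman' is less sensitive to extreme values (such as the very large fares). A quick sketch:

sub_titanic_data[["age", "fare", "sibsp", "parch"]].corr(method = 'spearman')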
Summaries Across Groups
Usually want summaries for different subgroups of data
Two approaches we'll cover:
- Use the .groupby() method and then a summarization method
- Use the pd.crosstab() function with the aggfunc argument
.groupby() Examples
Example: Get similar fare summaries for each survival status
"survivedC")[["age", "fare", "sibsp", "parch"]].mean() sub_titanic_data.groupby(
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()
age | fare | sibsp | parch | |
---|---|---|---|---|
survivedC | ||||
Died | 30.545369 | 23.353831 | 0.521632 | 0.328801 |
Survived | 28.918228 | 49.361184 | 0.462000 | 0.476000 |
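The FutureWarning printed above comes from grouping on a category type column; passing the observed argument explicitly silences it. A quick sketch (observed = True keeps only category levels that actually appear in the data):

sub_titanic_data.groupby("survivedC", observed = True)[["age", "fare", "sibsp", "parch"]].mean()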
"survivedC")[["age", "fare", "sibsp", "parch"]].std() sub_titanic_data.groupby(
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].std()
age | fare | sibsp | parch | |
---|---|---|---|---|
survivedC | ||||
Died | 13.922539 | 34.145096 | 1.210449 | 0.912332 |
Survived | 15.061481 | 68.648795 | 0.685197 | 0.776292 |
- The .unstack() method on the result can sometimes make the output clearer
"survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack() sub_titanic_data.groupby(
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack()
0 | ||
---|---|---|
survivedC | ||
age | Died | 30.545369 |
Survived | 28.918228 | |
fare | Died | 23.353831 |
Survived | 49.361184 | |
sibsp | Died | 0.521632 |
Survived | 0.462000 | |
parch | Died | 0.328801 |
Survived | 0.476000 |
Multiple grouping variables can be given as a list
- Example: Get summary for numeric type variables for each survival status and embarked port
"survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean() sub_titanic_data.groupby([
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean()
age | fare | sibsp | parch | ||
---|---|---|---|---|---|
survivedC | embarkedC | ||||
Died | Cherbourg | 34.468750 | 40.255592 | 0.316667 | 0.225000 |
Queenstown | 30.202703 | 11.615349 | 0.379747 | 0.177215 | |
Southampton | 29.945385 | 21.546160 | 0.580328 | 0.368852 | |
Survived | Cherbourg | 31.037248 | 80.000807 | 0.466667 | 0.486667 |
Queenstown | 24.153846 | 13.833998 | 0.272727 | 0.000000 | |
Southampton | 27.989881 | 39.183470 | 0.490132 | 0.542763 |
"survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std() sub_titanic_data.groupby([
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std()
age | fare | sibsp | parch | ||
---|---|---|---|---|---|
survivedC | embarkedC | ||||
Died | Cherbourg | 14.655181 | 56.553704 | 0.518293 | 0.557040 |
Queenstown | 16.785187 | 10.922240 | 1.016578 | 0.655538 | |
Southampton | 13.496871 | 28.786020 | 1.320897 | 0.990934 | |
Survived | Cherbourg | 15.523752 | 97.642219 | 0.575410 | 0.730327 |
Queenstown | 7.057457 | 17.503850 | 0.585230 | 0.000000 | |
Southampton | 14.926867 | 47.656409 | 0.744552 | 0.831405 |
- As our code gets longer, this is a good place to use \ to extend our code down a line
sub_titanic_data \
    .groupby(["survivedC", "embarkedC"]) \
    [["age", "fare", "sibsp", "parch"]] \
    .mean()
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
.groupby(["survivedC", "embarkedC"]) \
age | fare | sibsp | parch | ||
---|---|---|---|---|---|
survivedC | embarkedC | ||||
Died | Cherbourg | 34.468750 | 40.255592 | 0.316667 | 0.225000 |
Queenstown | 30.202703 | 11.615349 | 0.379747 | 0.177215 | |
Southampton | 29.945385 | 21.546160 | 0.580328 | 0.368852 | |
Survived | Cherbourg | 31.037248 | 80.000807 | 0.466667 | 0.486667 |
Queenstown | 24.153846 | 13.833998 | 0.272727 | 0.000000 | |
Southampton | 27.989881 | 39.183470 | 0.490132 | 0.542763 |
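The .agg() method (one of the methods covered in the video below) lets us request several summaries in one call. A quick sketch using the same chaining style:

sub_titanic_data \
    .groupby("survivedC", observed = True) \
    [["age", "fare"]] \
    .agg(["mean", "median", "std"]) #one column of output per summary requested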
pd.crosstab() Examples
- Alternatively, we can use the pd.crosstab() function with an aggfunc argument to define the summarization to produce
Example: Get summary for numeric type variables for each survival status
- A bit awkward in this case as we don’t really have a ‘column’ variable
- Make a dummy variable for that
pd.crosstab(
    sub_titanic_data.survivedC,
    columns = ["mean" for _ in range(sub_titanic_data.shape[0])], #create variable with only the value 'mean'
    values = sub_titanic_data.fare,
    aggfunc = 'mean')
col_0 | mean |
---|---|
survivedC | |
Died | 23.353831 |
Survived | 49.361184 |
- Can return multiple summaries at once by passing them as a list
pd.crosstab(
    sub_titanic_data.survivedC,
    columns = ["stat" for _ in range(sub_titanic_data.shape[0])],
    values = sub_titanic_data.fare,
    aggfunc = ['mean', 'median', 'std', 'count'])
mean | median | std | count | |
---|---|---|---|---|
col_0 | stat | stat | stat | stat |
survivedC | ||||
Died | 23.353831 | 10.5 | 34.145096 | 808 |
Survived | 49.361184 | 26.0 | 68.648795 | 500 |
More natural with two grouping variables
- Example: Get summary for numeric type variables for each survival status and embarked port
pd.crosstab(
    sub_titanic_data.embarkedC,
    sub_titanic_data.survivedC,
    values = sub_titanic_data.fare,
    aggfunc = ['mean', 'count'])
mean | count | |||
---|---|---|---|---|
survivedC | Died | Survived | Died | Survived |
embarkedC | ||||
Cherbourg | 40.255592 | 80.000807 | 120 | 150 |
Queenstown | 11.615349 | 13.833998 | 79 | 44 |
Southampton | 21.546160 | 39.183470 | 609 | 304 |
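Relatedly, the pd.pivot_table() function (also covered in the video below) can produce similar grouped summaries. A minimal sketch reproducing the mean fare portion of the table above:

pd.pivot_table(
    sub_titanic_data,
    values = "fare",
    index = "embarkedC",
    columns = "survivedC",
    aggfunc = "mean",
    observed = True) #observed = True avoids the category FutureWarning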
Quick Video
This video shows an example of reading in some data and finding numeric summaries!
We'll look at the .pivot_table() and .agg() methods.
Remember to pop the video out into the full player.
The notebook written in the video is available here.
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=197e5481-40be-4488-8271-b0ff000f5fd2&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")
Recap
EDA is often the first step to an analysis:
Must understand the type of data you have/missingness/data validation
Then describe the distributions of the variables
Numerical summaries
- Contingency Tables: pd.crosstab()
- Mean/Median: .mean(), .median() methods on a data frame
- Standard Deviation/Quantiles: .std(), .quantile() methods
Across subgroups with the .groupby() method or pd.crosstab() with its values and aggfunc arguments
You can fancy up output too!
That wraps up the material for week 3! Head to the Moodle site to work on your next homework assignment.
If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!
If you are on Google Colab, head back to our course website for our next lesson!