Numerical Summaries
Justin Post
The usual first step in an analysis is to get to know your data (an Exploratory Data Analysis (EDA))
EDA generally consists of a few steps:
- Understand how your data is stored
- Do basic data validation
- Determine rate of missing values
- Clean up data as needed
- Investigate distributions
- Univariate measures/graphs
- Multivariate measures/graphs
- Apply transformations and repeat previous step
Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!
Understand How Data is Stored
First, let’s read in some data. Recall that .csv files (comma separated value files) can be read in using the pandas read_csv() function.
We’ll read in the classic titanic data set.
import pandas as pd
titanic_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/titanic.csv")
- The .info() method allows us to see how our variables are stored (among other things)
- Column data types should make sense for what you expect!
titanic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null float64
1 survived 1309 non-null float64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null float64
6 parch 1309 non-null float64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
dtypes: float64(7), object(7)
memory usage: 143.4+ KB
- .head() and .tail() help to see what we have as well
titanic_data.head() #clearly some missing values with NaNs
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
titanic_data.tail() #note the last row of NaN (not a number)
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1305 | 3.0 | 0.0 | Zabour, Miss. Thamine | female | NaN | 1.0 | 0.0 | 2665 | 14.4542 | NaN | C | NaN | NaN | NaN |
| 1306 | 3.0 | 0.0 | Zakarian, Mr. Mapriededer | male | 26.5 | 0.0 | 0.0 | 2656 | 7.2250 | NaN | C | NaN | 304.0 | NaN |
| 1307 | 3.0 | 0.0 | Zakarian, Mr. Ortin | male | 27.0 | 0.0 | 0.0 | 2670 | 7.2250 | NaN | C | NaN | NaN | NaN |
| 1308 | 3.0 | 0.0 | Zimmerman, Mr. Leo | male | 29.0 | 0.0 | 0.0 | 315082 | 7.8750 | NaN | S | NaN | NaN | NaN |
| 1309 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Do Basic Data Validation
- Use the .describe() method on a data frame
- Check that the min’s, max’s, etc. all make sense!
titanic_data.describe()
| | pclass | survived | age | sibsp | parch | fare | body |
|---|---|---|---|---|---|---|---|
| count | 1309.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 121.000000 |
| mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 0.486055 | 14.413500 | 1.041658 | 0.865560 | 51.758668 | 97.696922 |
| min | 1.000000 | 0.000000 | 0.166700 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 72.000000 |
| 50% | 3.000000 | 0.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 155.000000 |
| 75% | 3.000000 | 1.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 256.000000 |
| max | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 328.000000 |
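If you want to go beyond eyeballing this output, a few quick boolean checks can validate ranges programmatically. A minimal sketch, where the plausible ranges are our own assumptions rather than part of the original analysis:

(titanic_data.age.dropna() >= 0).all() #ages should be non-negative
(titanic_data.fare.dropna() >= 0).all() #fares should be non-negative
titanic_data.pclass.dropna().isin([1.0, 2.0, 3.0]).all() #passenger class should be 1, 2, or 3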
- Recall we can subset our columns with []
titanic_data.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
dtype='object')
- We can determine which percentiles of selected columns to return by combining column subsetting via the selection brackets [] with the .describe() method
titanic_data[["age", "sibsp", "parch", "fare"]].describe(percentiles = [0.05, 0.25, 0.99])| age | sibsp | parch | fare | |
|---|---|---|---|---|
| count | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 |
| mean | 29.881135 | 0.498854 | 0.385027 | 33.295479 |
| std | 14.413500 | 1.041658 | 0.865560 | 51.758668 |
| min | 0.166700 | 0.000000 | 0.000000 | 0.000000 |
| 5% | 5.000000 | 0.000000 | 0.000000 | 7.225000 |
| 25% | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
| 50% | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 99% | 65.000000 | 5.000000 | 4.000000 | 262.375000 |
| max | 80.000000 | 8.000000 | 9.000000 | 512.329200 |
Determine Rate of Missing Values
- Use the .isnull() method to determine the missing values
titanic_data.isnull()
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | True | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | True | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | True | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | False | False | False | False | True | False | False | False | False | True | False | True | True | True |
| 1306 | False | False | False | False | False | False | False | False | False | True | False | True | False | True |
| 1307 | False | False | False | False | False | False | False | False | False | True | False | True | True | True |
| 1308 | False | False | False | False | False | False | False | False | False | True | False | True | True | True |
| 1309 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
1310 rows × 14 columns
- Yikes! Can’t make heads or tails of that.
- This is a DataFrame of booleans!
- Use the .sum() method to see how many null values we have for each column
titanic_data.isnull().sum()
| | 0 |
|---|---|
| pclass | 1 |
| survived | 1 |
| name | 1 |
| sex | 1 |
| age | 264 |
| sibsp | 1 |
| parch | 1 |
| ticket | 1 |
| fare | 2 |
| cabin | 1015 |
| embarked | 3 |
| boat | 824 |
| body | 1189 |
| home.dest | 565 |
- This kind of chained method use gives us a good chance to use the \ operator to create more readable code by making it multi-line
titanic_data.isnull() \
.sum()
| | 0 |
|---|---|
| pclass | 1 |
| survived | 1 |
| name | 1 |
| sex | 1 |
| age | 264 |
| sibsp | 1 |
| parch | 1 |
| ticket | 1 |
| fare | 2 |
| cabin | 1015 |
| embarked | 3 |
| boat | 824 |
| body | 1189 |
| home.dest | 565 |
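Since this section is about the rate of missing values, note that calling .mean() instead of .sum() on the boolean DataFrame returns the proportion of missing values in each column (for instance, cabin is missing in roughly 77% of rows):

titanic_data.isnull().mean() #proportion of missing values in each column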
Clean Up Data As Needed
- We can remove rows with missing values using the .dropna() method
- First, remove the cabin, boat, and body variables since they have so many missing values
  - If we want to just remove some columns, we can use the .drop() method
sub_titanic_data = titanic_data.drop(columns = ["body", "cabin", "boat"])
sub_titanic_data.shape
(1310, 11)
- Check on the missingness now
sub_titanic_data.isnull().sum()
| | 0 |
|---|---|
| pclass | 1 |
| survived | 1 |
| name | 1 |
| sex | 1 |
| age | 264 |
| sibsp | 1 |
| parch | 1 |
| ticket | 1 |
| fare | 2 |
| embarked | 3 |
| home.dest | 565 |
- Now we are ready to use the .dropna() method to remove any rows with missing data
temp = sub_titanic_data.dropna()
temp.shape #notice the reduction in rows
(684, 11)
temp.isnull().sum() #no more missing values
| | 0 |
|---|---|
| pclass | 0 |
| survived | 0 |
| name | 0 |
| sex | 0 |
| age | 0 |
| sibsp | 0 |
| parch | 0 |
| ticket | 0 |
| fare | 0 |
| embarked | 0 |
| home.dest | 0 |
- Usually, you don’t want to drop all the rows with any missing data as you are throwing out useful info.
- One option is to impute the missing values… this can be dangerous but can be done with the .fillna() method
sub_titanic_data.fillna(value = 0) #note, for instance, some values of age are 0 now and the last row is all 0 values
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | embarked | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | S | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | 3.0 | 0.0 | Zabour, Miss. Thamine | female | 0.0000 | 1.0 | 0.0 | 2665 | 14.4542 | C | 0 |
| 1306 | 3.0 | 0.0 | Zakarian, Mr. Mapriededer | male | 26.5000 | 0.0 | 0.0 | 2656 | 7.2250 | C | 0 |
| 1307 | 3.0 | 0.0 | Zakarian, Mr. Ortin | male | 27.0000 | 0.0 | 0.0 | 2670 | 7.2250 | C | 0 |
| 1308 | 3.0 | 0.0 | Zimmerman, Mr. Leo | male | 29.0000 | 0.0 | 0.0 | 315082 | 7.8750 | S | 0 |
| 1309 | 0.0 | 0.0 | 0 | 0 | 0.0000 | 0.0 | 0.0 | 0 | 0.0000 | 0 | 0 |
1310 rows × 11 columns
- Can set the value you want to impute by passing a dictionary of key/value pairs
sub_titanic_data.fillna(value = {"home.dest": "Unknown", "age": 200})| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | embarked | home.dest | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | S | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | S | Montreal, PQ / Chesterville, ON |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | 3.0 | 0.0 | Zabour, Miss. Thamine | female | 200.0000 | 1.0 | 0.0 | 2665 | 14.4542 | C | Unknown |
| 1306 | 3.0 | 0.0 | Zakarian, Mr. Mapriededer | male | 26.5000 | 0.0 | 0.0 | 2656 | 7.2250 | C | Unknown |
| 1307 | 3.0 | 0.0 | Zakarian, Mr. Ortin | male | 27.0000 | 0.0 | 0.0 | 2670 | 7.2250 | C | Unknown |
| 1308 | 3.0 | 0.0 | Zimmerman, Mr. Leo | male | 29.0000 | 0.0 | 0.0 | 315082 | 7.8750 | S | Unknown |
| 1309 | NaN | NaN | NaN | NaN | 200.0000 | NaN | NaN | NaN | NaN | NaN | Unknown |
1310 rows × 11 columns
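A common, often less distortionary alternative is to impute a column with one of its own summaries, such as the median. A sketch (whether this is sensible depends on your analysis):

sub_titanic_data.fillna(value = {"age": sub_titanic_data.age.median()}) #impute age with its median (28.0)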
Investigate distributions
How to summarize data depends on the type of data
- Categorical (Qualitative) variable - entries are a label or attribute
- Numeric (Quantitative) variable - entries are a numerical value where math can be performed
Numerical summaries (across subgroups)
- Contingency Tables (for categorical data)
- Mean/Median
- Standard Deviation/Variance/IQR
- Quantiles/Percentiles
Graphical summaries (across subgroups)
- Bar plots (for categorical data)
- Histograms
- Box plots
- Scatter plots
Categorical Data
Goal: Describe the distribution of the variable
- Distribution = pattern and frequency with which you observe a variable
- Categorical variable - entries are a label or attribute
- Describe the relative frequency (or count) for each category
- Using the pandas .value_counts() method and crosstab() function
Variables of interest for this section:
- embarked (where the journey started)
sub_titanic_data.embarked[0:2]
| | embarked |
|---|---|
| 0 | S |
| 1 | S |
type(sub_titanic_data.embarked[0])
str
The str type isn’t ideal for summarizations. A different data type is better!
Category Type Variables
A category type variable is really useful for categorical variables.
- Akin to a factor variable in R (if you know those)
- Can have more descriptive labels, ordering of categories, etc.
Let’s give the embarked variable more descriptive values by converting it to a category type and manipulating it that way.
sub_titanic_data["embarkedC"] = sub_titanic_data.embarked.astype("category")
sub_titanic_data.embarkedC[0:2]
| | embarkedC |
|---|---|
| 0 | S |
| 1 | S |
- Now we can use the .cat.rename_categories() method on this category variable
sub_titanic_data.embarkedC = sub_titanic_data.embarkedC.cat.rename_categories(["Cherbourg", "Queenstown", "Southampton"])
sub_titanic_data.embarkedC[0:2]
| | embarkedC |
|---|---|
| 0 | Southampton |
| 1 | Southampton |
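Note: the list form above relies on the alphabetical ordering of the underlying categories (C, Q, S). Passing a dict-like mapping of old to new labels avoids that reliance; had we wanted, the renaming could have been done as:

sub_titanic_data.embarkedC.cat.rename_categories({"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}) #explicit old-to-new mapping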
Way better! Now let’s grab two more categorical variables and do similar things:
- sex (Male or Female)
- survived (survived or died)
#convert sex variable
sub_titanic_data["sexC"] = sub_titanic_data.sex.astype("category")
sub_titanic_data.sexC = sub_titanic_data.sexC.cat.rename_categories(["Female", "Male"])
#convert survived variable
sub_titanic_data["survivedC"] = sub_titanic_data.survived.astype("category")
sub_titanic_data.survivedC = sub_titanic_data.survivedC.cat.rename_categories(["Died", "Survived"])
Contingency tables
- Tables of counts are the main numerical summary for categorical data
- Create one-way contingency tables with the .value_counts() method (one-way because we are looking at one variable at a time)
sub_titanic_data.embarkedC.value_counts(dropna = False)
| | count |
|---|---|
| embarkedC | |
| Southampton | 914 |
| Cherbourg | 270 |
| Queenstown | 123 |
| NaN | 3 |
sub_titanic_data.survivedC.value_counts()
| | count |
|---|---|
| survivedC | |
| Died | 809 |
| Survived | 500 |
sub_titanic_data.sexC.value_counts()
| | count |
|---|---|
| sexC | |
| Male | 843 |
| Female | 466 |
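The .value_counts() method also accepts a normalize argument if you want relative frequencies rather than raw counts:

sub_titanic_data.embarkedC.value_counts(normalize = True) #proportions instead of counts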
- Alternatively, we can find a one-way table using the pd.crosstab() function
  - This function is meant to take two (or more) columns and return tabulations between those variables
  - We can define a dummy variable to cross with
  - The index argument is the row variable and the columns argument is the column variable
sub_titanic_data["dummy"] = 0
pd.crosstab(index = sub_titanic_data.embarkedC, columns = sub_titanic_data.dummy)
| dummy | 0 |
|---|---|
| embarkedC | |
| Cherbourg | 270 |
| Queenstown | 123 |
| Southampton | 914 |
pd.crosstab(index = sub_titanic_data.sexC, columns = sub_titanic_data.dummy)
| dummy | 0 |
|---|---|
| sexC | |
| Female | 466 |
| Male | 843 |
- To summarize two categorical variables together, we use a two-way contingency table
- Now the crosstab() function can be used more naturally
pd.crosstab(
sub_titanic_data.embarkedC, #index variable
sub_titanic_data.survivedC) #column variable
| survivedC | Died | Survived |
|---|---|---|
| embarkedC | ||
| Cherbourg | 120 | 150 |
| Queenstown | 79 | 44 |
| Southampton | 610 | 304 |
pd.crosstab(
sub_titanic_data.sexC,
sub_titanic_data.survivedC)
| survivedC | Died | Survived |
|---|---|---|
| sexC | ||
| Female | 127 | 339 |
| Male | 682 | 161 |
- Add marginal totals with the margins = True argument
pd.crosstab(
sub_titanic_data.embarkedC,
sub_titanic_data.survivedC,
margins = True)
| survivedC | Died | Survived | All |
|---|---|---|---|
| embarkedC | |||
| Cherbourg | 120 | 150 | 270 |
| Queenstown | 79 | 44 | 123 |
| Southampton | 610 | 304 | 914 |
| All | 809 | 498 | 1307 |
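pd.crosstab() also has a normalize argument that converts the counts to proportions: 'index' gives row proportions, 'columns' gives column proportions, and 'all' normalizes over the whole table. For example, row proportions give the survival rate within each port:

pd.crosstab(
    sub_titanic_data.embarkedC,
    sub_titanic_data.survivedC,
    normalize = 'index') #each row sums to 1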
- Add row and column names for clarity
  - Use the rownames and colnames arguments
pd.crosstab(
sub_titanic_data.embarkedC,
sub_titanic_data.survivedC,
margins = True,
rownames = ["Embarked Port"],
colnames = ["Survival Status"]
)
| Survival Status | Died | Survived | All |
|---|---|---|---|
| Embarked Port | |||
| Cherbourg | 120 | 150 | 270 |
| Queenstown | 79 | 44 | 123 |
| Southampton | 610 | 304 | 914 |
| All | 809 | 498 | 1307 |
That looks great!
For more than two variables we can create tables but they get harder to read. For instance, we can look at a three-way contingency table:
pd.crosstab(
[sub_titanic_data.embarkedC, sub_titanic_data.survivedC], #pass a list of columns for the rows
sub_titanic_data.sexC,
margins = True)
| sexC | | Female | Male | All |
|---|---|---|---|---|
| embarkedC | survivedC | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Survived | 102 | 48 | 150 | |
| Queenstown | Died | 23 | 56 | 79 |
| Survived | 37 | 7 | 44 | |
| Southampton | Died | 93 | 517 | 610 |
| Survived | 198 | 106 | 304 | |
| All | 464 | 843 | 1307 |
- We can add in names for more clarity
my_tab = pd.crosstab(
[sub_titanic_data.embarkedC, sub_titanic_data.survivedC],
sub_titanic_data.sexC,
margins = True,
rownames = ['Embarked Port', 'Survival Status'], #a list similar to how the rows were passed
colnames = ['Sex'])
my_tab
| Sex | | Female | Male | All |
|---|---|---|---|---|
| Embarked Port | Survival Status | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Survived | 102 | 48 | 150 | |
| Queenstown | Died | 23 | 56 | 79 |
| Survived | 37 | 7 | 44 | |
| Southampton | Died | 93 | 517 | 610 |
| Survived | 198 | 106 | 304 | |
| All | 464 | 843 | 1307 |
We might want to subset the returned table to get certain values…
- Note that the crosstab() function returns a data frame!
type(my_tab)
pandas.core.frame.DataFrame
my_tab.columns # columns of the data frame
Index(['Female', 'Male', 'All'], dtype='object', name='Sex')
my_tab.index #rows of the data frame, these are tuples!
MultiIndex([( 'Cherbourg', 'Died'),
( 'Cherbourg', 'Survived'),
( 'Queenstown', 'Died'),
( 'Queenstown', 'Survived'),
('Southampton', 'Died'),
('Southampton', 'Survived'),
( 'All', '')],
names=['Embarked Port', 'Survival Status'])
Can obtain conditional bivariate info via subsetting!
- The MultiIndex can be tough, but let’s look at some examples
- Below returns the embarked vs survived table for females
my_tab["Female"]| Female | ||
|---|---|---|
| Embarked Port | Survival Status | |
| Cherbourg | Died | 11 |
| Survived | 102 | |
| Queenstown | Died | 23 |
| Survived | 37 | |
| Southampton | Died | 93 |
| Survived | 198 | |
| All | 464 |
my_tab.loc[:, "Female"] #.loc way of doing this, : gives all of that index| Female | ||
|---|---|---|
| Embarked Port | Survival Status | |
| Cherbourg | Died | 11 |
| Survived | 102 | |
| Queenstown | Died | 23 |
| Survived | 37 | |
| Southampton | Died | 93 |
| Survived | 198 | |
| All | 464 |
- Below returns the sex vs embarked table for those that died
my_tab.iloc[0:5:2, :] #0:5:2 is shorthand for a sequence from 0 up to (not including) 5 in steps of 2
| Sex | | Female | Male | All |
|---|---|---|---|---|
| Embarked Port | Survival Status | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Queenstown | Died | 23 | 56 | 79 |
| Southampton | Died | 93 | 517 | 610 |
- Using .loc[] is better
- Must understand our MultiIndex
my_tab.index
MultiIndex([( 'Cherbourg', 'Died'),
( 'Cherbourg', 'Survived'),
( 'Queenstown', 'Died'),
( 'Queenstown', 'Survived'),
('Southampton', 'Died'),
('Southampton', 'Survived'),
( 'All', '')],
names=['Embarked Port', 'Survival Status'])
- Below uses this index to return the sex vs embarked table for those that died
my_tab.loc[(("Cherbourg", "Queenstown", "Southampton"), "Died"), :]| Sex | Female | Male | All | |
|---|---|---|---|---|
| Embarked Port | Survival Status | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Queenstown | Died | 23 | 56 | 79 |
| Southampton | Died | 93 | 517 | 610 |
- Below returns the sex vs survived table for embarked of Cherbourg
my_tab.loc[('Cherbourg', ("Died", "Survived")), :]| Sex | Female | Male | All | |
|---|---|---|---|---|
| Embarked Port | Survival Status | |||
| Cherbourg | Died | 11 | 109 | 120 |
| Survived | 102 | 48 | 150 |
- Return the sex table for those that died and embarked at Cherbourg
  - First with .iloc[], then with .loc[]
my_tab.iloc[0, :]
| | Cherbourg |
|---|---|
| | Died |
| Sex | |
| Female | 11 |
| Male | 109 |
| All | 120 |
my_tab.loc[('Cherbourg', 'Died')]
| | Cherbourg |
|---|---|
| | Died |
| Sex | |
| Female | 11 |
| Male | 109 |
| All | 120 |
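A tidier way to grab one level of a MultiIndex is the .xs() (cross-section) method. For instance, the sex vs embarked table for those that died can also be obtained with:

my_tab.xs("Died", level = "Survival Status") #cross-section of the Survival Status level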
Numeric Data
Goal: Describe the distribution of the variable
- Distribution = pattern and frequency with which you observe a variable
- Numeric variable - entries are a numerical value where math can be performed
For a single numeric variable, describe the distribution via
- Shape: Histogram, Density plot, … (covered later)
- Measures of center: Mean, Median, …
- Measures of spread: Variance, Standard Deviation, Quartiles, IQR, …
For two numeric variables, describe the distribution via
- Shape: Scatter plot, …
- Measures of linear relationship: Covariance, Correlation, …
Measures of Center
- Find the mean and median with methods on a Series
type(sub_titanic_data['fare'])
pandas.core.series.Series
- Corresponding methods exist for the common numerical summaries
sub_titanic_data['fare'].mean()
33.29547928134557
sub_titanic_data['fare'].median()
14.4542
sub_titanic_data.age.mean() #same thing with a different way to get a column
29.8811345124283
sub_titanic_data.age.median()
28.0
Measures of Spread
- Standard Deviation, Quartiles, & IQR found with Series methods as well
sub_titanic_data.age.std()
14.413499699923594
sub_titanic_data.age.quantile(q = [0.2, 0.25, 0.5, 0.95])
| | age |
|---|---|
| 0.20 | 19.0 |
| 0.25 | 21.0 |
| 0.50 | 28.0 |
| 0.95 | 57.0 |
q1 = sub_titanic_data.age.quantile(q = [0.25])
q1
| | age |
|---|---|
| 0.25 | 21.0 |
q3 = sub_titanic_data.age.quantile(q = [0.75])
q3
| | age |
|---|---|
| 0.75 | 39.0 |
type(q1)
pandas.core.series.Series
- As both q1 and q3 are Series, they have indices
- This makes them a little more difficult than you might like to subtract (to find the IQR)
q3 - q1 #doesn't work due to the differing index values (0.75 vs 0.25)
| | age |
|---|---|
| 0.25 | NaN |
| 0.75 | NaN |
q3[0.75] - q1[0.25] #grab the values by their index labels and subtract those
18.0
- Alternatively, remember that the .values attribute returns a numpy array. We can subtract these.
q3.values - q1.values
array([18.])
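Simpler still, passing a single scalar to .quantile() returns a plain float, so the IQR can be computed directly:

sub_titanic_data.age.quantile(0.75) - sub_titanic_data.age.quantile(0.25) #returns 18.0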
Measures of Linear Relationship
- Correlation via the .corr() method on a data frame
- This gives the correlation between any numerically (stored) variables that are passed
- Just because it is stored numerically doesn’t mean we should treat it numerically!
sub_titanic_data[["age", "fare", "sibsp", "parch"]].corr()| age | fare | sibsp | parch | |
|---|---|---|---|---|
| age | 1.000000 | 0.178739 | -0.243699 | -0.150917 |
| fare | 0.178739 | 1.000000 | 0.160238 | 0.221539 |
| sibsp | -0.243699 | 0.160238 | 1.000000 | 0.373587 |
| parch | -0.150917 | 0.221539 | 0.373587 | 1.000000 |
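Covariance, the unscaled counterpart of correlation mentioned above, is available through the analogous .cov() method:

sub_titanic_data[["age", "fare", "sibsp", "parch"]].cov() #covariances instead of correlations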
Summaries Across Groups
Usually want summaries for different subgroups of data
Two approaches we’ll cover:
- Use the .groupby() method and then a summarization method
- Use the pd.crosstab() function with the aggfunc argument
.groupby() Examples
Example: Get similar fare summaries for each survival status
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()
| | age | fare | sibsp | parch |
|---|---|---|---|---|
| survivedC | ||||
| Died | 30.545369 | 23.353831 | 0.521632 | 0.328801 |
| Survived | 28.918228 | 49.361184 | 0.462000 | 0.476000 |
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].std()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].std()
| | age | fare | sibsp | parch |
|---|---|---|---|---|
| survivedC | ||||
| Died | 13.922539 | 34.145096 | 1.210449 | 0.912332 |
| Survived | 15.061481 | 68.648795 | 0.685197 | 0.776292 |
- The .unstack() method on the result can sometimes make the output clearer
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack()
| | | 0 |
|---|---|---|
| | survivedC | |
| age | Died | 30.545369 |
| Survived | 28.918228 | |
| fare | Died | 23.353831 |
| Survived | 49.361184 | |
| sibsp | Died | 0.521632 |
| Survived | 0.462000 | |
| parch | Died | 0.328801 |
| Survived | 0.476000 |
Multiple grouping variables can be given as a list
- Example: Get summary for numeric type variables for each survival status and embarked port
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean()
| | | age | fare | sibsp | parch |
|---|---|---|---|---|---|
| survivedC | embarkedC | ||||
| Died | Cherbourg | 34.468750 | 40.255592 | 0.316667 | 0.225000 |
| Queenstown | 30.202703 | 11.615349 | 0.379747 | 0.177215 | |
| Southampton | 29.945385 | 21.546160 | 0.580328 | 0.368852 | |
| Survived | Cherbourg | 31.037248 | 80.000807 | 0.466667 | 0.486667 |
| Queenstown | 24.153846 | 13.833998 | 0.272727 | 0.000000 | |
| Southampton | 27.989881 | 39.183470 | 0.490132 | 0.542763 |
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std()FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std()
| | | age | fare | sibsp | parch |
|---|---|---|---|---|---|
| survivedC | embarkedC | ||||
| Died | Cherbourg | 14.655181 | 56.553704 | 0.518293 | 0.557040 |
| Queenstown | 16.785187 | 10.922240 | 1.016578 | 0.655538 | |
| Southampton | 13.496871 | 28.786020 | 1.320897 | 0.990934 | |
| Survived | Cherbourg | 15.523752 | 97.642219 | 0.575410 | 0.730327 |
| Queenstown | 7.057457 | 17.503850 | 0.585230 | 0.000000 | |
| Southampton | 14.926867 | 47.656409 | 0.744552 | 0.831405 |
- As our code gets longer, this is a good place to use \ to extend our code down a line
sub_titanic_data \
.groupby(["survivedC", "embarkedC"]) \
[["age", "fare", "sibsp", "parch"]] \
.mean()
| | | age | fare | sibsp | parch |
|---|---|---|---|---|---|
| survivedC | embarkedC | ||||
| Died | Cherbourg | 34.468750 | 40.255592 | 0.316667 | 0.225000 |
| Queenstown | 30.202703 | 11.615349 | 0.379747 | 0.177215 | |
| Southampton | 29.945385 | 21.546160 | 0.580328 | 0.368852 | |
| Survived | Cherbourg | 31.037248 | 80.000807 | 0.466667 | 0.486667 |
| Queenstown | 24.153846 | 13.833998 | 0.272727 | 0.000000 | |
| Southampton | 27.989881 | 39.183470 | 0.490132 | 0.542763 |
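The .agg() method (covered in the video below) lets us request several summaries from one groupby. It is also a natural place to pass observed explicitly and silence the FutureWarning noted earlier. A sketch:

sub_titanic_data \
    .groupby(["survivedC", "embarkedC"], observed = False) \
    ["fare"] \
    .agg(['mean', 'median', 'std'])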
pd.crosstab() Examples
- Alternatively, we can use the pd.crosstab() function with an aggfunc argument to define the summarization to produce
Example: Get summary for numeric type variables for each survival status
- A bit awkward in this case as we don’t really have a ‘column’ variable
- Make a dummy variable for that
pd.crosstab(
sub_titanic_data.survivedC,
columns = ["mean" for _ in range(sub_titanic_data.shape[0])], #create variable with only the value 'mean'
values = sub_titanic_data.fare,
aggfunc = 'mean')
| col_0 | mean |
|---|---|
| survivedC | |
| Died | 23.353831 |
| Survived | 49.361184 |
- Can return multiple summaries at once by passing them as a list
pd.crosstab(
sub_titanic_data.survivedC,
columns = ["stat" for _ in range(sub_titanic_data.shape[0])],
values = sub_titanic_data.fare,
aggfunc = ['mean', 'median', 'std', 'count'])
| | mean | median | std | count |
|---|---|---|---|---|
| col_0 | stat | stat | stat | stat |
| survivedC | ||||
| Died | 23.353831 | 10.5 | 34.145096 | 808 |
| Survived | 49.361184 | 26.0 | 68.648795 | 500 |
More natural with two grouping variables
- Example: Get summary for numeric type variables for each survival status and embarked port
pd.crosstab(
sub_titanic_data.embarkedC,
sub_titanic_data.survivedC,
values = sub_titanic_data.fare,
aggfunc = ['mean', 'count'])
| | mean | | count | |
|---|---|---|---|---|
| survivedC | Died | Survived | Died | Survived |
| embarkedC | ||||
| Cherbourg | 40.255592 | 80.000807 | 120 | 150 |
| Queenstown | 11.615349 | 13.833998 | 79 | 44 |
| Southampton | 21.546160 | 39.183470 | 609 | 304 |
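The video below also demonstrates the .pivot_table() method, a more general tool for grouped summaries. A roughly equivalent call to the crosstab above might look like:

sub_titanic_data.pivot_table(
    index = "embarkedC",
    columns = "survivedC",
    values = "fare",
    aggfunc = ['mean', 'count'])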
Quick Video
This video shows an example of reading in some data and finding numeric summaries!
We’ll look at the .pivot_table() and .agg() methods.
Remember to pop the video out into the full player.
The notebook written in the video is available here.
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=197e5481-40be-4488-8271-b0ff000f5fd2&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")Recap
EDA is often the first step to an analysis:
Must understand the type of data you have/missingness/data validation
Then describe the distributions of the variables
Numerical summaries
- Contingency Tables: pd.crosstab()
- Mean/Median: .mean(), .median() methods on a data frame
- Standard Deviation/Quantiles: .std(), .quantile() methods
Across subgroups with the .groupby() method or pd.crosstab() with the values and aggfunc arguments
You can fancy up output too!
That wraps up the material for week 3! Head to the Moodle site to work on your next homework assignment.
If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!
If you are on Google Colab, head back to our course website for our next lesson!