import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#readin data
= pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/titanic.csv")
titanic_data
#remove some columns and a bad row
= titanic_data.drop(columns = ["body", "cabin", "boat"], axis = 1) \
sub_titanic_data 0]-1)]
.iloc[:(titanic_data.shape[#create category versions of the variables (some code omitted)
"embarkedC"] = sub_titanic_data.embarked.astype("category")
sub_titanic_data[= sub_titanic_data.embarkedC.cat.rename_categories(
sub_titanic_data.embarkedC "Cherbourg", "Queenstown", "Southampton"])
["sexC"] = sub_titanic_data.sex.astype("category")
sub_titanic_data[= sub_titanic_data.sexC.cat.rename_categories(["Female", "Male"])
sub_titanic_data.sexC "survivedC"] = sub_titanic_data.survived.astype("category")
sub_titanic_data[= sub_titanic_data.survivedC.cat.rename_categories(["Died", "Survived"]) sub_titanic_data.survivedC
Plotting with pandas
Justin Post
Let’s see the basic functionality that pandas
provides for plotting Series
(columns essentially) and DataFrames
Looking through the help files is really useful. This link is for the .plot()
method but you can search for other methods pretty easily! The documentation will provide you all of the options you can pass. We’ll only discuss a few and the big picture ideas.
- Let’s process our titanic data from the previous notebook.
Note: These types of webpages are built from Jupyter notebooks (.ipynb
files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!
Barplots with pandas
We saw the barplot for summarizing categorical data. We’ll cover two different methods to create bar plots in pandas:
.plot.bar()
method on aseries
ordataframe
.plot()
method withkind = 'bar'
specified
= sub_titanic_data.embarkedC.value_counts()
table print(type(table))
table
<class 'pandas.core.series.Series'>
count | |
---|---|
embarkedC | |
Southampton | 914 |
Cherbourg | 270 |
Queenstown | 123 |
Note that this is a pandas
series
. We can use the .plot.bar()
method on this series to get a bar plot.
table.plot.bar()
We can then apply the matplotlib
functionality to update/modify the plot (notice we already read in matplotlib.pyplot
as plt
.
table.plot.bar()= 0) plt.xticks(rotation
(array([0, 1, 2]),
[Text(0, 0, 'Southampton'),
Text(1, 0, 'Cherbourg'),
Text(2, 0, 'Queenstown')])
Alternatively, we can use the slightly more flexible .plot()
method on a series where we specify the kind=
of the plot to create.
= "bar", rot = 0) #can use additional arg rather than additional function call table.plot(kind
Where we really gain is when trying to bring in a multivariate relationship
- For instance, we can color the bars by another categorical variable in the
DataFrame
pretty easily!
First, create the contingency table for two variables (remember this returns a DataFrame
)
= pd.crosstab(sub_titanic_data["embarkedC"], sub_titanic_data["survivedC"])
table print(type(table))
table
<class 'pandas.core.frame.DataFrame'>
survivedC | Died | Survived |
---|---|---|
embarkedC | ||
Cherbourg | 120 | 150 |
Queenstown | 79 | 44 |
Southampton | 610 | 304 |
Now let’s use the .plot.bar()
method on this DataFrame
with the stacked = True
argument.
= True, rot = 0) table.plot.bar(stacked
We can do this with the .plot()
method as well.
= True, kind = "bar", rot = 0) table.plot(stacked
If we want side-by-side bar plots, we can just remove the stacked = True
argument.
= 0) table.plot.bar(rot
Plotting Numeric Variables
Recall: Numeric variable have entries that are a numerical value where math can be performed
Goal: describe the shape, center, and spread of the distribution
- Shape can be described well via a histogram or density plot
- Boxplots provide a good summary of the distribution as well
Histogram with pandas
Histogram - Bin data to show distribution of observations - Done via .plot.hist()
or .plot(kind = "hist")
method on a series or data frame - A .hist()
method also exists!
First, the .plot.hist()
method on a series (we’ll also fix it up a bit using matplotlib.pyplot
functionality.
"age"].plot.hist()
sub_titanic_data["Age")
plt.xlabel("Histogram of Age for Titanic Passengers") plt.title(
Text(0.5, 1.0, 'Histogram of Age for Titanic Passengers')
- Specify # of bins with
bins =
- Note we also return the series in a different way here (just to show you can use either)
#can add label/title here (xlabel doesn't seem to work as intended...)
#instead we'll use the .set() method on the histogram to set the xlabel
= 20, title = "Histogram of Age for Titanic Passengers") \
sub_titanic_data.age.plot.hist(bins set(xlabel = "Age") .
Overlaying Two Histograms
- To overlay two histograms on the same graph, create two histograms and use
alpha = 0-1 value
. This sets the transparency.- alpha = 1 is not transparent at all
- alpha = 0 is completely transparent
Let’s create histograms of age for those that Survived and those that Died. - We should also set up the bins manually so they are the same bin widths and locations (for better comparison) - bins can be specified via the bins =
argument
= 10
bin_ends = [i*max(sub_titanic_data.age)/bin_ends for i in range(0, bin_ends + 1)]
bins print(bins)
[0.0, 8.0, 16.0, 24.0, 32.0, 40.0, 48.0, 56.0, 64.0, 72.0, 80.0]
- Obtain subsets of data needed
= sub_titanic_data.loc[sub_titanic_data.survivedC == "Died", "age"] #series for died
age_died = sub_titanic_data.loc[sub_titanic_data.survivedC == "Survived", "age"] #series for survived age_survived
Create the plot using the .plot.hist()
method. By creating two plots in the same cell, they will be overlayed. - Notice the use of label()
to automatically create a legend (similar to what we did with matplotlib
= bins, alpha = 0.5, label = "Died",
age_died.plot.hist(bins = "Ages for those that survived vs those that died") \
title set(xlabel = "Age")
.= bins, alpha = 0.5, label = "Survived")
age_survived.plot.hist(bins plt.legend()
pandas
will automatically overlay data from different columns of the same data frame- That is, if we use the
plot.hist()
method on a data frame with two numeric variables, it will plot both of those on the same plot- To use that here we need to make that kind of data frame…
- Need two columns, one representing ages for those that survived and one for those that died
= sub_titanic_data.loc[sub_titanic_data.survivedC == "Died", "age"] #809 values
age_died = sub_titanic_data.loc[sub_titanic_data.survivedC == "Survived", "age"] #500 values age_survived
- Note the difference in the number of observations! This means that putting them together into a data frame isn’t super seamless.
= pd.DataFrame(zip(age_died, age_survived), columns = ["Died", "Survived"])
temp print(temp.shape)
#only has 500 rows instead of 809!
= 0.5) temp.plot.hist(alpha
(500, 2)
How do we fix that?
- We can fill in
NaN
values for the shorter series so they end up the same length.
- We can fill in
= pd.concat([age_survived, pd.Series([np.nan for _ in range(308)])])
age_survived age_survived
0 | |
---|---|
0 | 29.0000 |
1 | 0.9167 |
5 | 48.0000 |
6 | 63.0000 |
8 | 53.0000 |
... | ... |
303 | NaN |
304 | NaN |
305 | NaN |
306 | NaN |
307 | NaN |
808 rows × 1 columns
- Now we can zip these together into a data frame and plot as we’d like!
= pd.DataFrame(zip(age_died, age_survived),
plotting_df = ["Died", "Survived"])
columns print(plotting_df.shape)
= 0.5, title = "Ages for those that survived vs those that died") \
plotting_df.plot.hist(alpha set(xlabel = "Age") .
(808, 2)
Side-by-side Histograms
- Can place two graphs next to each other with
.hist()
method (notice this is a different method!)- Specify a
column
variable and aby
variable
- Specify a
- These don’t have the same bin widths
= "age", by = "survivedC") sub_titanic_data.hist(column
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data.hist(column = "age", by = "survivedC")
array([<Axes: title={'center': 'Died'}>,
<Axes: title={'center': 'Survived'}>], dtype=object)
- We could also use the
.groupby()
functionality but the result is a bit subpar as it doesn’t label the graphs.
"age", "survivedC"]].groupby("survivedC").hist() sub_titanic_data[[
FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
sub_titanic_data[["age", "survivedC"]].groupby("survivedC").hist()
0 | |
---|---|
survivedC | |
Died | [[Axes(0.125,0.11;0.775x0.77)]] |
Survived | [[Axes(0.125,0.11;0.775x0.77)]] |
Kernel smoother with pandas
- Kernel Smoother - Smoothed version of a histogram
- ‘Kernel’ determines weight given to nearby points
- Use
.plot.density()
orplot(kind = "density")
method bw_method = #
specifies how ‘smooth’ you want the graph to be- smaller values imply using a smaller bandwidth (more variability)
- larger values imply using a larger bandwidth (more smooth)
- Use
= 0.1, label = "bw = 0.1",
sub_titanic_data.age.plot.density(bw_method = "Density Plots of Age for Titanic Passengers")
title = 0.25, label = "bw = 0.25")
sub_titanic_data.age.plot.density(bw_method = 0.5, label = "bw = 0.5")
sub_titanic_data.age.plot.density(bw_method plt.legend()
Boxplots with pandas
- Boxplot - Provides the five number summary in a graph
- Min, Q1, Median, Q3, Max
- Often show possible outliers as well
- Use
.plot.box()
orplot(kind = "box")
method - A
.boxplot()
method also exists!
- Min, Q1, Median, Q3, Max
First the .plot.box()
method on a series
sub_titanic_data.age.plot.box()
- Fine.. but usually we want to compare these boxplots across another variable. To do this the
.boxplot()
method on a data frame is very useful! - Similar to the
.hist()
method we specify acolumn
andby
variable
= ["age"], by = "survivedC") sub_titanic_data.boxplot(column
Scatter Plots with pandas
- Scatter Plot - graphs points corresponding to each observation
- Use
.plot.scatter()
orplot(kind = "scatter")
method on a data frame withx =
, andy =
- Use
= "age", y = "fare", title = "Scatter plots rule!") sub_titanic_data.plot.scatter(x
- Easy to modify! Check the help for arguments (specifically the keyword arguments that get passed to
.plot()
) but we can specify differentmarker
values, atitle
, and more!
#c = color, marker is a matplotlib option
= "age", y = "fare", c = "Red", marker = "v", title = "Oh, V's!") sub_titanic_data.plot.scatter(x
- We can easily modify aspects of the plot based on a variable as well!
- This is great as it allows us to bring a third varaible in
- Here we color by a category variable
#s for size (should be a numeric column), cmap can be used with c for specifying color scales
= "age", y = "fare", c = "survivedC", cmap = "viridis", s = 10) sub_titanic_data.plot.scatter(x
Matrix of Scatter Plots
.plotting.scatter_matrix()
function will produce basic graphs showing relationships!- Here we grab the numeric variables from the data frame
"age", "fare", "survived", "sibsp"]]) pd.plotting.scatter_matrix(sub_titanic_data[[
array([[<Axes: xlabel='age', ylabel='age'>,
<Axes: xlabel='fare', ylabel='age'>,
<Axes: xlabel='survived', ylabel='age'>,
<Axes: xlabel='sibsp', ylabel='age'>],
[<Axes: xlabel='age', ylabel='fare'>,
<Axes: xlabel='fare', ylabel='fare'>,
<Axes: xlabel='survived', ylabel='fare'>,
<Axes: xlabel='sibsp', ylabel='fare'>],
[<Axes: xlabel='age', ylabel='survived'>,
<Axes: xlabel='fare', ylabel='survived'>,
<Axes: xlabel='survived', ylabel='survived'>,
<Axes: xlabel='sibsp', ylabel='survived'>],
[<Axes: xlabel='age', ylabel='sibsp'>,
<Axes: xlabel='fare', ylabel='sibsp'>,
<Axes: xlabel='survived', ylabel='sibsp'>,
<Axes: xlabel='sibsp', ylabel='sibsp'>]], dtype=object)
Quick Video
This video shows an example of using pandas for plotting. Remember to pop the video out into the full player.
The notebook written in the video is available here.
from IPython.display import IFrame
="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=f4e2bc16-1757-4f1e-8df9-b103016c97d2&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720") IFrame(src
Recap
Creating visualizations is an important part of an EDA
Goal: Describe the distribution
pandas
has nice functionality for creating common plots.plot()
method
We only covered a few plots but the concepts are similar for others
May want to check out seaborn for quick ways to do fancier plots!
If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!
If you are on Google Colab, head back to our course website for our next lesson!