Pandas DataFrames are super useful 2D data structures!
Each column is a Series object
Each column can be of differing types (just like most common data sets!)
Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!
Creating a DataFrame
Most of the time we’ll read data from a raw file directly into a DataFrame
However, you can create one with the pd.DataFrame() function
import pandas as pd
import numpy as np
Creating a Data Frame from Lists
zip() lists of the same length together
specify columns via columns = list of appropriate length
specify row names via index = list of appropriate length (if you want!)
#populate some lists, each of equal length
name = ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve', 'Francesca', 'Greg']
age = [20, 21, 22, 23, 22, 21, 22]
major = ['Statistics', 'History', 'Chemistry', 'English', 'Math', 'Civil Engineering', 'Statistics']
#create the data frame using zip()
my_df = pd.DataFrame(zip(name, age, major), columns = ["name", "age", "major"])
my_df
        name  age              major
0      Alice   20         Statistics
1        Bob   21            History
2    Charlie   22          Chemistry
3       Dave   23            English
4        Eve   22               Math
5  Francesca   21  Civil Engineering
6       Greg   22         Statistics
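The index = argument mentioned above is not used in the example, so here is a minimal sketch of it on a shortened version of the same lists. The row labels ('s1', 's2', 's3') and the name labeled_df are made up for illustration.

```python
import pandas as pd

# Three equal-length lists (a shortened version of the lists above)
name = ['Alice', 'Bob', 'Charlie']
age = [20, 21, 22]
major = ['Statistics', 'History', 'Chemistry']

# Hypothetical row labels: any list with one entry per row would do
row_labels = ['s1', 's2', 's3']

labeled_df = pd.DataFrame(
    zip(name, age, major),
    columns=["name", "age", "major"],
    index=row_labels,
)

print(labeled_df.index.tolist())     # the row labels we supplied
print(labeled_df.loc['s2', 'name'])  # look up one value by row label and column name
```

Without index, pandas falls back to the default 0, 1, 2, ... row labels seen in the output above.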
Creating a Data Frame from a Dictionary
The pd.DataFrame() function can create DataFrames from many objects
For a dictionary (dict object), the keys become the column names (values must be of the same length)
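As a sketch of the dictionary approach, the block below rebuilds the student data with one key per column. The capitalized keys ('Name', 'Age', 'Major') are an assumption, chosen to match the column names used in the indexing examples later on; the name student_dict is made up for illustration.

```python
import pandas as pd

# Each key becomes a column name; each value is a list of equal length
student_dict = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve', 'Francesca', 'Greg'],
    'Age': [20, 21, 22, 23, 22, 21, 22],
    'Major': ['Statistics', 'History', 'Chemistry', 'English', 'Math',
              'Civil Engineering', 'Statistics'],
}

my_df = pd.DataFrame(student_dict)

print(my_df.columns.tolist())  # column names come from the dictionary keys
print(my_df.shape)             # (number of rows, number of columns)
```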
We can also return a column using the attribute syntax with the column name (a period at the end of the object followed by the column name)
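A minimal sketch comparing the two ways of pulling out one column, using a small stand-in data frame (the values are a shortened copy of the student data):

```python
import pandas as pd

my_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [20, 21, 22],
})

by_brackets = my_df['Age']   # selection brackets with the column name
by_attribute = my_df.Age     # attribute syntax: object, period, column name

print(type(by_attribute))                 # both forms return a Series
print(by_brackets.equals(by_attribute))   # the two results hold the same data
```

The attribute form only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer default.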
Returning more than one column is easy
You can give a list of the column names you want to the selection brackets
my_df[['Name', 'Age']]
        Name  Age
0      Alice   20
1        Bob   21
2    Charlie   22
3       Dave   23
4        Eve   22
5  Francesca   21
6       Greg   22
Note you can’t use slicing for columns using just [] (we’ll need to use .iloc[] or .loc[], which we cover in a moment)
If you try to index with slicing you get back appropriate rows (see below)
Indexing Rows by Slicing with []
Similarly, you can index the rows using [] if you use a slice or a boolean array of appropriate length
my_df
        Name  Age              Major
0      Alice   20         Statistics
1        Bob   21            History
2    Charlie   22          Chemistry
3       Dave   23            English
4        Eve   22               Math
5  Francesca   21  Civil Engineering
6       Greg   22         Statistics
my_df[3:5] #get the rows at positions 3 and 4 (the 4th and 5th rows, since counting starts at 0)
   Name  Age    Major
3  Dave   23  English
4   Eve   22     Math
my_df2
        1st       2nd       3rd
a  0.446015  0.157260  0.632567
b  0.748716  0.350061  0.585704
c  0.143467  0.227066  0.022533
d  0.579832  0.607433  0.882421
e  0.208771  0.751327  0.044475
my_df2[1:5] #get the 2nd through 5th rows (counting starts at 0!)
        1st       2nd       3rd
b  0.748716  0.350061  0.585704
c  0.143467  0.227066  0.022533
d  0.579832  0.607433  0.882421
e  0.208771  0.751327  0.044475
Oddly, you can’t return a single row with just a number, because [] treats a single value as a column name
You can return it using slicing (recall that a slice with : doesn’t include its end position)
my_df2[1] #throws an error
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key)
   3804         try:
-> 3805             return self._engine.get_loc(casted_key)
   3806         except KeyError as err:

index.pyx in pandas._libs.index.IndexEngine.get_loc()
index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-23-90bcbc95bda4> in <cell line: 1>()
----> 1 my_df2[1] #throws an error

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   4100             if self.columns.nlevels > 1:
   4101                 return self._getitem_multilevel(key)
-> 4102             indexer = self.columns.get_loc(key)
   4103             if is_integer(indexer):
   4104                 indexer = [indexer]

/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key)
   3810             ):
   3811                 raise InvalidIndexError(key)
-> 3812             raise KeyError(key) from err
   3813         except TypeError:
   3814             # If we have a listlike key, _check_indexing_error will raise

KeyError: 1
my_df2[1:2] #return just one row
        1st       2nd       3rd
b  0.748716  0.350061  0.585704
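The difference between a single number and a slice inside [] can be reproduced with a small stand-in for my_df2. The construction below (random values from a seeded NumPy generator, with the same row and column labels as above) is an assumption for illustration, so the numbers will not match the output shown earlier.

```python
import numpy as np
import pandas as pd

# Stand-in for my_df2: 5 rows labeled a-e, 3 columns of random floats
rng = np.random.default_rng(0)
my_df2 = pd.DataFrame(
    rng.random((5, 3)),
    columns=['1st', '2nd', '3rd'],
    index=['a', 'b', 'c', 'd', 'e'],
)

# A single value inside [] is looked up among the column names, so 1 is not found
try:
    my_df2[1]
except KeyError as err:
    print("KeyError:", err)

# A slice inside [] selects rows by position; 1:2 keeps only position 1
one_row = my_df2[1:2]
print(one_row.index.tolist())
```

The result of the slice is still a DataFrame with one row, not a Series.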
Indexing Rows Using a Boolean Array with []
Often we use a Boolean object to subset the rows (rows with a True get returned, False do not)
This comes up when we use a condition found using a variable from our data frame to do the subsetting
my_df['Name'] == 'Alice' #create a boolean array
    Name
0   True
1  False
2  False
3  False
4  False
5  False
6  False
my_df[my_df['Name'] == 'Alice'] #return just the True rows
    Name  Age       Major
0  Alice   20  Statistics
my_df[my_df['Age'] > 21] #return only rows that match
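A minimal sketch of the same kind of filter, split into two steps so the boolean array is visible. The data frame is rebuilt here from the student names and ages so the block runs on its own; the names older_than_21 and subset are made up for illustration.

```python
import pandas as pd

my_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve', 'Francesca', 'Greg'],
    'Age': [20, 21, 22, 23, 22, 21, 22],
})

# Step 1: the condition gives one True/False value per row
older_than_21 = my_df['Age'] > 21

# Step 2: passing that boolean Series to [] keeps only the True rows
subset = my_df[older_than_21]

print(older_than_21.tolist())
print(subset['Name'].tolist())
```

The kept rows retain their original index labels, so the result's index is not renumbered from 0.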
The row is returned as a Series whose data type is as broad as it needs to be to hold every value in the row. Here it is returned with a data type of object (used for storing mixed data types)
my_df.iloc[1]
             1
Name       Bob
Age         21
Major  History
With our other data object, all elements in a row are floats, so that is the data type of the Series that is returned
my_df2.iloc[1].dtype
dtype('float64')
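The contrast between the two data types can be checked side by side. This is a small sketch on two made-up data frames (mixed and floats), not the notebook's own objects.

```python
import pandas as pd

# One text column and one integer column: a row mixes types
mixed = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [20, 21]})

# Every column holds floats: a row is all one type
floats = pd.DataFrame({'1st': [0.1, 0.2], '2nd': [0.3, 0.4]})

mixed_row = mixed.iloc[1]
float_row = floats.iloc[1]

print(mixed_row.dtype)  # object: broad enough for text and numbers
print(float_row.dtype)  # float64: no broadening needed
```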
You can return more than one row by passing a list (or similar type object, such as a range() call) of the numeric indices you want
my_df.iloc[[0,1]]
    Name  Age       Major
0  Alice   20  Statistics
1    Bob   21     History
my_df.iloc[2:5] #note this doesn't include the last value!
      Name  Age      Major
2  Charlie   22  Chemistry
3     Dave   23    English
4      Eve   22       Math
my_df.iloc[range(0,3)] #range doesn't include the last value either!
      Name  Age       Major
0    Alice   20  Statistics
1      Bob   21     History
2  Charlie   22   Chemistry
.iloc[] for Returning Rows and Columns
.iloc[] allows for subsetting of columns by location too!
Simply add a , to get the 2nd dimension (similar to subsetting a numpy array)
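A minimal sketch of two-dimensional selection with .iloc[], using a shortened copy of the student data. The row selection goes before the comma and the column selection after it, both by position.

```python
import pandas as pd

my_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'Age': [20, 21, 22, 23],
    'Major': ['Statistics', 'History', 'Chemistry', 'English'],
})

# Row position 1, column position 0: a single value
single_value = my_df.iloc[1, 0]

# Row positions 0 and 1, column positions 0 and 2: a smaller DataFrame
block = my_df.iloc[0:2, [0, 2]]

# A bare : keeps every row; column position 1 comes back as a Series
all_rows_one_col = my_df.iloc[:, 1]

print(single_value)
print(block.columns.tolist())
print(all_rows_one_col.tolist())
```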
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=2dcc67df-5465-4570-83b7-b0ff0008e9a7&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")
Recap
Data Frames are great for storing a data set (2D)
Rows = observations, Columns = variables
Many ways to create them (from a dictionary, list, array, etc.)
Many ways to subset them!
.info(), .head() and other useful methods!
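As a quick illustration of the two methods named in the recap, here is a sketch on a copy of the student data. The variable first_rows is made up for illustration.

```python
import pandas as pd

my_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve', 'Francesca', 'Greg'],
    'Age': [20, 21, 22, 23, 22, 21, 22],
    'Major': ['Statistics', 'History', 'Chemistry', 'English', 'Math',
              'Civil Engineering', 'Statistics'],
})

# .head(n) returns the first n rows as a new DataFrame (5 rows if n is omitted)
first_rows = my_df.head(3)
print(first_rows)

# .info() prints a summary: row count, column names, non-null counts, and dtypes
my_df.info()
```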
If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!
If you are on Google Colab, head back to our course website for our next lesson!