import pandas as pd
Pandas for Reading Raw Data
Justin Post
The pandas library has functionality for reading delimited data, Excel data, SAS data, JSON data, and more! Remember we’ll need to import the pandas library in order to use it!
Note: These types of webpages are built from Jupyter notebooks (.ipynb
files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!
Data Formats
Raw data comes in many different formats. Understanding the raw data format is essential for reading that data into python. Some raw data types include:
‘Delimited’ data: data values separated by a character or characters (such as ‘,’, ‘>’, or a space)
Fixed field data
Excel data
From other statistical software, Ex: SPSS formatted data or SAS data sets
From an Application Programming Interface (API), often returned as a JSON file (key/value pairs, similar to a dictionary)
From a database
Delimited Data
Let’s start with delimited data.
One common format for raw data is delimited data
Data that has a character or characters that separates the data values
Character(s) is (are) called delimiter(s)
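To see what ‘delimited’ means concretely, here is a minimal sketch that parses a small made-up comma-delimited string with read_csv(), using io.StringIO to stand in for a file:

```python
from io import StringIO

import pandas as pd

# hypothetical comma-delimited raw data: first line is the header,
# each later line is one observation, commas separate the values
raw = "name,age,city\nAnn,23,Raleigh\nBob,31,Durham"

# read_csv() accepts any file-like object, so StringIO lets us parse a string
df = pd.read_csv(StringIO(raw))
print(df.shape)  # (2, 3): two rows, three columns
```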
Using pandas, the read_csv() function can read in this kind of data (although csv stands for ‘comma separated value’, this function is used for reading most delimited data via pandas)
- If the raw data is well-formatted, we just need to tell python where to find it!
Locating a File
- How does python locate the file?
- Not in colab
  - You can give the file's full path name
    - ex: ‘S:/Documents/repos/ST-554/datasets/data.csv’
    - ex: ‘S:\\Documents\\repos\\ST-554\\datasets\\data.csv’ (in a python string, backslashes must be doubled or the string written as a raw string)
  - Or use relative paths!
    - Determine your working directory
    - Use a path relative to that
    - If your working directory is ‘S:/Documents/repos/ST-554’ you can get to ‘data.csv’ via ‘datasets/data.csv’
- The os module gives you access to functions for finding and setting your working directory
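The relative-path idea can be sketched with the os module (the directory and file names here are made up for illustration):

```python
import os

# hypothetical working directory and relative path (names are made up)
working_dir = "S:/Documents/repos/ST-554"
relative = "datasets/data.csv"

# joining the working directory and the relative path gives the full path
# that python would resolve when you pass just the relative path
full_path = os.path.join(working_dir, relative)
print(full_path)
```

On most systems this prints ‘S:/Documents/repos/ST-554/datasets/data.csv’ (the path separator may differ on Windows).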
Using a cloud-based platform complicates things a bit. In colab you can:
- Mount your google drive
- Read files from URLs
- Upload files via the menu on the left (folder icon, then upload a file via the icons there)
import os
#getcwd() stands for get current working directory
#shows the directory you can get to via the folder icon on the left
os.getcwd()
#chdir() stands for change current directory
#os.chdir("S:/Documents/repos/ST-554") #won't work in colab but would work on a local python session
'/content'
This /content refers to the main folder on the left-hand side of Colab!
Reading Files ‘Locally’ in Colab
Nicely formatted .csv files can be read in with the read_csv() function from pandas.

neuralgia.csv has been loaded into the folder on colab in my session. Therefore, it exists in my working directory. This won’t be the case for you unless you upload the data during your session! You can click on the folder icon on the left, then click the upload button to upload this data set.
neuralgia_data = pd.read_csv("neuralgia.csv") #neuralgia.csv file was uploaded to colab for my session
#this code block won't work unless you upload the data in your session
neuralgia_data.head()
Treatment | Sex | Age | Duration | Pain | |
---|---|---|---|---|---|
0 | P | F | 68 | 1 | No |
1 | B | M | 74 | 16 | No |
2 | P | F | 67 | 30 | No |
3 | P | M | 66 | 26 | Yes |
4 | B | F | 67 | 28 | No |
neuralgia_data.shape
(60, 5)
Reading From a URL
Nicely formatted .csv files can be read in with the read_csv() function from pandas.

scoresFull.csv file at a URL given by ‘https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv’
scores_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv")
scores_data.head()
week | date | day | season | awayTeam | AQ1 | AQ2 | AQ3 | AQ4 | AOT | ... | homeFumLost | homeNumPen | homePenYds | home3rdConv | home3rdAtt | home4thConv | home4thAtt | homeTOP | HminusAScore | homeSpread | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 5-Sep | Thu | 2002 | San Francisco 49ers | 3 | 0 | 7 | 6 | -1 | ... | 0 | 10 | 80 | 4 | 8 | 0 | 1 | 32.47 | -3 | -4.0 |
1 | 1 | 8-Sep | Sun | 2002 | Minnesota Vikings | 3 | 17 | 0 | 3 | -1 | ... | 1 | 4 | 33 | 2 | 6 | 0 | 0 | 28.48 | 4 | 4.5 |
2 | 1 | 8-Sep | Sun | 2002 | New Orleans Saints | 6 | 7 | 7 | 0 | 6 | ... | 0 | 8 | 85 | 1 | 6 | 0 | 1 | 31.48 | -6 | 6.0 |
3 | 1 | 8-Sep | Sun | 2002 | New York Jets | 0 | 17 | 3 | 11 | 6 | ... | 1 | 10 | 82 | 4 | 8 | 2 | 2 | 39.13 | -6 | -3.0 |
4 | 1 | 8-Sep | Sun | 2002 | Arizona Cardinals | 10 | 3 | 3 | 7 | -1 | ... | 0 | 7 | 56 | 6 | 10 | 1 | 2 | 34.40 | 8 | 6.0 |
5 rows × 82 columns
scores_data.shape
(3471, 82)
- Oddly, to read other types of delimited data, we also use read_csv()!
  - Specify the sep = argument

chemical.txt file (space delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/chemical.txt”
chem_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/chemical.txt", sep=" ")
chem_data.head()
temp | conc | time | percent | |
---|---|---|---|---|
0 | -1.0 | -1.0 | -1.0 | 45.9 |
1 | 1.0 | -1.0 | -1.0 | 60.6 |
2 | -1.0 | 1.0 | -1.0 | 57.5 |
3 | 1.0 | 1.0 | -1.0 | 58.6 |
4 | -1.0 | -1.0 | 1.0 | 53.3 |
crabs.txt file (tab delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/crabs.txt”
- Tab is \t
crabs_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/crabs.txt", sep="\t")
crabs_data.head()
color | spine | width | satell | weight | y | |
---|---|---|---|---|---|---|
0 | 3 | 3 | 28.3 | 8 | 3050 | 1 |
1 | 4 | 3 | 22.5 | 0 | 1550 | 0 |
2 | 2 | 1 | 26.0 | 9 | 2300 | 1 |
3 | 4 | 3 | 24.8 | 0 | 2100 | 0 |
4 | 4 | 3 | 26.0 | 4 | 2600 | 1 |
umps2012.txt file (> delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt”
- No column names in raw file
- Can specify header = None and give column names when reading (via names = [list of names])
ump_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt",
                       sep=">",
                       header=None,
                       names=["Year", "Month", "Day", "Home", "Away", "HPUmpire"])
ump_data.head()
Year | Month | Day | Home | Away | HPUmpire | |
---|---|---|---|---|---|---|
0 | 2012 | 4 | 12 | MIN | LAA | D.J. Reyburn |
1 | 2012 | 4 | 12 | SD | ARI | Marty Foster |
2 | 2012 | 4 | 12 | WSH | CIN | Mike Everitt |
3 | 2012 | 4 | 12 | PHI | MIA | Jeff Nelson |
4 | 2012 | 4 | 12 | CHC | MIL | Fieldin Culbreth |
Reading Excel Data
Use the ExcelFile() function from pandas.

censusEd.xlsx file located at “https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx”
ed_data = pd.ExcelFile("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx")
ed_data
<pandas.io.excel._base.ExcelFile at 0x7d6ef1627880>
- Unfortunately, this object is not a DataFrame, so it has different attributes and methods!
#ed_data.head(), ed_data.info() won't work!
type(ed_data)
pandas.io.excel._base.ExcelFile
ed_data.sheet_names
['EDU01A',
'EDU01B',
'EDU01C',
'EDU01D',
'EDU01E',
'EDU01F',
'EDU01G',
'EDU01H',
'EDU01I',
'EDU01J']
- Use the .parse() method with a sheet name to obtain a usual DataFrame
ed_data.parse('EDU01A').head()
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
warn("""Cannot parse header or footer so it will be ignored""")
Area_name | STCOU | EDU010187F | EDU010187D | EDU010187N1 | EDU010187N2 | EDU010188F | EDU010188D | EDU010188N1 | EDU010188N2 | ... | EDU010194N1 | EDU010194N2 | EDU010195F | EDU010195D | EDU010195N1 | EDU010195N2 | EDU010196F | EDU010196D | EDU010196N1 | EDU010196N2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | UNITED STATES | 0 | 0 | 40024299 | 0 | 0 | 0 | 39967624 | 0 | 0 | ... | 0 | 0 | 0 | 43993459 | 0 | 0 | 0 | 44715737 | 0 | 0 |
1 | ALABAMA | 1000 | 0 | 733735 | 0 | 0 | 0 | 728234 | 0 | 0 | ... | 0 | 0 | 0 | 727989 | 0 | 0 | 0 | 736825 | 0 | 0 |
2 | Autauga, AL | 1001 | 0 | 6829 | 0 | 0 | 0 | 6900 | 0 | 0 | ... | 0 | 0 | 0 | 7568 | 0 | 0 | 0 | 7834 | 0 | 0 |
3 | Baldwin, AL | 1003 | 0 | 16417 | 0 | 0 | 0 | 16465 | 0 | 0 | ... | 0 | 0 | 0 | 19961 | 0 | 0 | 0 | 20699 | 0 | 0 |
4 | Barbour, AL | 1005 | 0 | 5071 | 0 | 0 | 0 | 5098 | 0 | 0 | ... | 0 | 0 | 0 | 5017 | 0 | 0 | 0 | 5053 | 0 | 0 |
5 rows × 42 columns
Alternatively, use the read_excel() function from pandas.

This reads things in to a standard DataFrame, but you have to specify a sheet to read in (or it defaults to the 1st).

censusEd.xlsx file located at “https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx”
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = 0) #or "EDU01A"
ed_data.head()
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
warn("""Cannot parse header or footer so it will be ignored""")
Area_name | STCOU | EDU010187F | EDU010187D | EDU010187N1 | EDU010187N2 | EDU010188F | EDU010188D | EDU010188N1 | EDU010188N2 | ... | EDU010194N1 | EDU010194N2 | EDU010195F | EDU010195D | EDU010195N1 | EDU010195N2 | EDU010196F | EDU010196D | EDU010196N1 | EDU010196N2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | UNITED STATES | 0 | 0 | 40024299 | 0 | 0 | 0 | 39967624 | 0 | 0 | ... | 0 | 0 | 0 | 43993459 | 0 | 0 | 0 | 44715737 | 0 | 0 |
1 | ALABAMA | 1000 | 0 | 733735 | 0 | 0 | 0 | 728234 | 0 | 0 | ... | 0 | 0 | 0 | 727989 | 0 | 0 | 0 | 736825 | 0 | 0 |
2 | Autauga, AL | 1001 | 0 | 6829 | 0 | 0 | 0 | 6900 | 0 | 0 | ... | 0 | 0 | 0 | 7568 | 0 | 0 | 0 | 7834 | 0 | 0 |
3 | Baldwin, AL | 1003 | 0 | 16417 | 0 | 0 | 0 | 16465 | 0 | 0 | ... | 0 | 0 | 0 | 19961 | 0 | 0 | 0 | 20699 | 0 | 0 |
4 | Barbour, AL | 1005 | 0 | 5071 | 0 | 0 | 0 | 5098 | 0 | 0 | ... | 0 | 0 | 0 | 5017 | 0 | 0 | 0 | 5053 | 0 | 0 |
5 rows × 42 columns
- You can read all sheets with sheet_name = None
  - This gets read into a dictionary!
  - Keys are the sheet names
  - Values are the DataFrames from each sheet
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = None)
type(ed_data)
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
warn("""Cannot parse header or footer so it will be ignored""")
dict
ed_data.keys()
dict_keys(['EDU01A', 'EDU01B', 'EDU01C', 'EDU01D', 'EDU01E', 'EDU01F', 'EDU01G', 'EDU01H', 'EDU01I', 'EDU01J'])
ed_data.get("EDU01A").head() #get one DataFrame using its key!
Area_name | STCOU | EDU010187F | EDU010187D | EDU010187N1 | EDU010187N2 | EDU010188F | EDU010188D | EDU010188N1 | EDU010188N2 | ... | EDU010194N1 | EDU010194N2 | EDU010195F | EDU010195D | EDU010195N1 | EDU010195N2 | EDU010196F | EDU010196D | EDU010196N1 | EDU010196N2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | UNITED STATES | 0 | 0 | 40024299 | 0 | 0 | 0 | 39967624 | 0 | 0 | ... | 0 | 0 | 0 | 43993459 | 0 | 0 | 0 | 44715737 | 0 | 0 |
1 | ALABAMA | 1000 | 0 | 733735 | 0 | 0 | 0 | 728234 | 0 | 0 | ... | 0 | 0 | 0 | 727989 | 0 | 0 | 0 | 736825 | 0 | 0 |
2 | Autauga, AL | 1001 | 0 | 6829 | 0 | 0 | 0 | 6900 | 0 | 0 | ... | 0 | 0 | 0 | 7568 | 0 | 0 | 0 | 7834 | 0 | 0 |
3 | Baldwin, AL | 1003 | 0 | 16417 | 0 | 0 | 0 | 16465 | 0 | 0 | ... | 0 | 0 | 0 | 19961 | 0 | 0 | 0 | 20699 | 0 | 0 |
4 | Barbour, AL | 1005 | 0 | 5071 | 0 | 0 | 0 | 5098 | 0 | 0 | ... | 0 | 0 | 0 | 5017 | 0 | 0 | 0 | 5053 | 0 | 0 |
5 rows × 42 columns
Reading JSON Data
JSON data has a structure similar to a dictionary
- Key-value pairs
[
  {
    "name": "Barry Sanders",
    "games": 153,
    "position": "RB"
  },
  {
    "name": "Joe Montana",
    "games": 192,
    "position": "QB"
  }
]
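Records like the snippet above can be read straight into a DataFrame. Here is a minimal sketch (the player records are just the illustration above, written as a valid JSON array, with io.StringIO standing in for a file):

```python
from io import StringIO

import pandas as pd

# the example records, written as a JSON array of key/value objects
raw_json = """[
  {"name": "Barry Sanders", "games": 153, "position": "RB"},
  {"name": "Joe Montana", "games": 192, "position": "QB"}
]"""

# read_json() accepts a file-like object, so StringIO works for a string
players = pd.read_json(StringIO(raw_json))
print(players.shape)  # (2, 3): two records, three keys
```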
The read_json() function from pandas will work!
- Read in data from URL: “https://api.exchangerate-api.com/v4/latest/USD”
url = "https://api.exchangerate-api.com/v4/latest/USD"
usd_data = pd.read_json(url)
usd_data.head()
provider | WARNING_UPGRADE_TO_V6 | terms | base | date | time_last_updated | rates | |
---|---|---|---|---|---|---|---|
USD | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 1.00 |
AED | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 3.67 |
AFN | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 70.58 |
ALL | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 94.12 |
AMD | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 396.72 |
Every API is different, so how you contact one and how the data is returned varies by website. This is an important way we obtain data these days!
Writing Out a DataFrame
- We can write out data using the
.to_csv()
method
usd_data.to_csv('usd_data.csv', index = False) #this goes to the folder on the left in colab!
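As a quick check of the write/read round trip, here is a sketch that writes a small made-up DataFrame to a temporary folder with .to_csv() and reads it back with read_csv():

```python
import os
import tempfile

import pandas as pd

# a small made-up DataFrame to demonstrate the round trip
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# write to a temporary folder, then read the file back in
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    df.to_csv(path, index=False)  # index=False avoids writing the row labels as a column
    back = pd.read_csv(path)

print(back.equals(df))  # True: the round trip preserves the data
```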
Quick Video
This video shows some examples of reading in some SAS data and data from an API.
Remember to pop the video out into the full player.
The notebook written in the video is available here.
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=30db871f-1750-4e5d-9de8-b0ff000c225c&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")
Recap
Read data in with pandas:
- read_csv() for delimited data
- ExcelFile() or read_excel() for excel data
- read_json() for JSON

Write to a .csv with .to_csv()
If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!
If you are on Google Colab, head back to our course website for our next lesson!