# Pandas for Reading Raw Data

Justin Post

The `pandas` library has functionality for reading delimited data, Excel data, SAS data, JSON data, and more! Remember we’ll need to import the `pandas` library in order to use it!

```python
import pandas as pd
```
Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!
## Data Formats
Raw data comes in many different formats. Understanding the raw data format is essential for reading that data into python. Some raw data types include:

- ‘Delimited’ data: data separated by a character or characters (such as ‘,’, ‘>’, or ‘ ’)
- Fixed field data
- Excel data
- Data from other statistical software, e.g., SPSS formatted data or SAS data sets
- Data from an Application Programming Interface (API) (often returned as a `JSON` file - key/value pairs, similar to a dictionary)
- Data from a database
## Delimited Data
Let’s start with delimited data.

- One common format for raw data is delimited data
  - Data that has a character or characters that separate the data values
  - The character(s) is (are) called the delimiter(s)
- Using `pandas`, the `read_csv()` function can read in this kind of data (although `csv` stands for ‘comma separated value’, this function is used for reading most delimited data via `pandas`)
  - If the raw data is well-formatted, we just need to tell python where to find it!
## Locating a File
- How does python locate the file?
  - Not in colab:
    - You can give the file’s full path name
      - ex: ‘S:/Documents/repos/ST-554/datasets/data.csv’
      - ex: ‘S:\Documents\repos\ST-554\datasets\data.csv’
    - Or use local paths!
      - Determine your working directory
      - Use a path relative to that
      - If your working directory is ‘S:/Documents/repos/ST-554’ you can get to ‘data.csv’ via ‘datasets/data.csv’
- The `os` module gives you access to functions for finding and setting your working directory (a small sketch follows below)
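As a small illustration of relative paths (assuming the hypothetical folder layout from the bullets above), the `os.path` submodule can build them in a portable way:

```python
import os

# suppose the working directory is 'S:/Documents/repos/ST-554'
# this builds 'datasets/data.csv' (or 'datasets\\data.csv' on Windows),
# which python resolves relative to the working directory
relative_path = os.path.join("datasets", "data.csv")
print(relative_path)
```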
Using a cloud-based platform complicates things a bit. In colab you can:

- Mount your google drive (see the sketch below)
- Read files from URLs
- Upload files via the menu on the left (folder icon, then upload a file via the icons there)
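For instance, mounting your google drive uses the `drive` module from the `google.colab` package (this only runs inside a colab session and will prompt you to authorize access). The file path in the comment is just a hypothetical example:

```python
# only works inside Google Colab; prompts you to authorize access
from google.colab import drive
drive.mount('/content/drive')

# after mounting, your Drive files appear under /content/drive/MyDrive
# e.g., pd.read_csv('/content/drive/MyDrive/data.csv') for a hypothetical file
```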
```python
import os
#getcwd() stands for get current working directory
os.getcwd() #shows the directory you can get to via the folder icon on the left
#chdir() stands for change current directory
#os.chdir("S:/Documents/repos/ST-554") #won't work in colab but would work in a local python session
```

```
'/content'
```

This `/content` refers to the main folder on the left hand side of Colab!
## Reading Files ‘Locally’ in Colab
- Nicely formatted `.csv` files can be read in with the `read_csv()` function from `pandas`
- `neuralgia.csv` has been loaded into the folder on colab in my session. Therefore, it exists in my working directory. This won’t be the case for you unless you upload the data during your session! You can click on the folder icon on the left, then click the upload button to upload this data set.
```python
neuralgia_data = pd.read_csv("neuralgia.csv") #neuralgia.csv file was uploaded to colab for my session
neuralgia_data.head() #this code block won't work unless you upload the data in your session
```

| | Treatment | Sex | Age | Duration | Pain |
|---|---|---|---|---|---|
| 0 | P | F | 68 | 1 | No |
| 1 | B | M | 74 | 16 | No |
| 2 | P | F | 67 | 30 | No |
| 3 | P | M | 66 | 26 | Yes |
| 4 | B | F | 67 | 28 | No |
```python
neuralgia_data.shape
```

```
(60, 5)
```
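If you want to guard against the file not having been uploaded yet, one option is to check for it with the `os` module first. A minimal sketch:

```python
import os

# only attempt the read if the file is actually in the working directory
if os.path.exists("neuralgia.csv"):
    neuralgia_data = pd.read_csv("neuralgia.csv")
else:
    print("neuralgia.csv not found - upload it via the folder icon on the left")
```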
## Reading From a URL
- Nicely formatted `.csv` files can be read in with the `read_csv()` function from `pandas`
- `scoresFull.csv` file at a URL given by ‘https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv’
```python
scores_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv")
scores_data.head()
```

| | week | date | day | season | awayTeam | AQ1 | AQ2 | AQ3 | AQ4 | AOT | ... | homeFumLost | homeNumPen | homePenYds | home3rdConv | home3rdAtt | home4thConv | home4thAtt | homeTOP | HminusAScore | homeSpread |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5-Sep | Thu | 2002 | San Francisco 49ers | 3 | 0 | 7 | 6 | -1 | ... | 0 | 10 | 80 | 4 | 8 | 0 | 1 | 32.47 | -3 | -4.0 |
| 1 | 1 | 8-Sep | Sun | 2002 | Minnesota Vikings | 3 | 17 | 0 | 3 | -1 | ... | 1 | 4 | 33 | 2 | 6 | 0 | 0 | 28.48 | 4 | 4.5 |
| 2 | 1 | 8-Sep | Sun | 2002 | New Orleans Saints | 6 | 7 | 7 | 0 | 6 | ... | 0 | 8 | 85 | 1 | 6 | 0 | 1 | 31.48 | -6 | 6.0 |
| 3 | 1 | 8-Sep | Sun | 2002 | New York Jets | 0 | 17 | 3 | 11 | 6 | ... | 1 | 10 | 82 | 4 | 8 | 2 | 2 | 39.13 | -6 | -3.0 |
| 4 | 1 | 8-Sep | Sun | 2002 | Arizona Cardinals | 10 | 3 | 3 | 7 | -1 | ... | 0 | 7 | 56 | 6 | 10 | 1 | 2 | 34.40 | 8 | 6.0 |
5 rows × 82 columns
```python
scores_data.shape
```

```
(3471, 82)
```
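`read_csv()` also has many optional arguments that come in handy with larger files like this one. As a quick sketch, `usecols` keeps a subset of columns and `nrows` limits how many data rows are read (the column names here come from the scores data above):

```python
# read only four columns and the first 100 data rows
scores_small = pd.read_csv(
    "https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv",
    usecols = ["week", "date", "awayTeam", "HminusAScore"],
    nrows = 100)
scores_small.shape #(100, 4)
```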
- Oddly, to read other types of delimited data, we also use `read_csv()`!
  - Specify the `sep = ` argument
- `chemical.txt` file (space delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/chemical.txt”
```python
chem_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/chemical.txt", sep=" ")
chem_data.head()
```

| | temp | conc | time | percent |
|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | 45.9 |
| 1 | 1.0 | -1.0 | -1.0 | 60.6 |
| 2 | -1.0 | 1.0 | -1.0 | 57.5 |
| 3 | 1.0 | 1.0 | -1.0 | 58.6 |
| 4 | -1.0 | -1.0 | 1.0 | 53.3 |
- `crabs.txt` file (tab delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/crabs.txt”
  - Tab is `\t`
```python
crabs_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/crabs.txt", sep="\t")
crabs_data.head()
```

| | color | spine | width | satell | weight | y |
|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 28.3 | 8 | 3050 | 1 |
| 1 | 4 | 3 | 22.5 | 0 | 1550 | 0 |
| 2 | 2 | 1 | 26.0 | 9 | 2300 | 1 |
| 3 | 4 | 3 | 24.8 | 0 | 2100 | 0 |
| 4 | 4 | 3 | 26.0 | 4 | 2600 | 1 |
- `umps2012.txt` file (`>` delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt”
  - No column names in raw file
  - Can specify `header = None` and give column names when reading (via `names = [list of names]`)
```python
ump_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt",
                       sep=">",
                       header=None,
                       names=["Year", "Month", "Day", "Home", "Away", "HPUmpire"])
ump_data.head()
```

| | Year | Month | Day | Home | Away | HPUmpire |
|---|---|---|---|---|---|---|
| 0 | 2012 | 4 | 12 | MIN | LAA | D.J. Reyburn |
| 1 | 2012 | 4 | 12 | SD | ARI | Marty Foster |
| 2 | 2012 | 4 | 12 | WSH | CIN | Mike Everitt |
| 3 | 2012 | 4 | 12 | PHI | MIA | Jeff Nelson |
| 4 | 2012 | 4 | 12 | CHC | MIL | Fieldin Culbreth |
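Note: had we used `header = None` without the `names = ` argument, `pandas` would fall back to default integer column labels. A quick sketch showing that:

```python
# without names=, the columns are just labeled 0 through 5
ump_default = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt",
                          sep=">",
                          header=None)
ump_default.columns.tolist() #[0, 1, 2, 3, 4, 5]
```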
## Reading Excel Data
- Use the `ExcelFile()` function from `pandas`
- `censusEd.xlsx` file located at “https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx”
```python
ed_data = pd.ExcelFile("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx")
ed_data
```

```
<pandas.io.excel._base.ExcelFile at 0x7d6ef1627880>
```
- Unfortunately, this data object has different attributes and methods than a `DataFrame`!
```python
#ed_data.head(), ed_data.info() won't work!
type(ed_data)
```

```
pandas.io.excel._base.ExcelFile
```
(Colab also displays the documentation for the `ExcelFile` class here: a “Class for parsing tabular Excel sheets into DataFrame objects” - see `read_excel` for more documentation.)
```python
ed_data.sheet_names
```

```
['EDU01A',
 'EDU01B',
 'EDU01C',
 'EDU01D',
 'EDU01E',
 'EDU01F',
 'EDU01G',
 'EDU01H',
 'EDU01I',
 'EDU01J']
```
- Use the `.parse()` method with a sheet name to obtain a usual `DataFrame`
```python
ed_data.parse('EDU01A').head()
```

```
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
```

| | Area_name | STCOU | EDU010187F | EDU010187D | EDU010187N1 | EDU010187N2 | EDU010188F | EDU010188D | EDU010188N1 | EDU010188N2 | ... | EDU010194N1 | EDU010194N2 | EDU010195F | EDU010195D | EDU010195N1 | EDU010195N2 | EDU010196F | EDU010196D | EDU010196N1 | EDU010196N2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UNITED STATES | 0 | 0 | 40024299 | 0 | 0 | 0 | 39967624 | 0 | 0 | ... | 0 | 0 | 0 | 43993459 | 0 | 0 | 0 | 44715737 | 0 | 0 |
| 1 | ALABAMA | 1000 | 0 | 733735 | 0 | 0 | 0 | 728234 | 0 | 0 | ... | 0 | 0 | 0 | 727989 | 0 | 0 | 0 | 736825 | 0 | 0 |
| 2 | Autauga, AL | 1001 | 0 | 6829 | 0 | 0 | 0 | 6900 | 0 | 0 | ... | 0 | 0 | 0 | 7568 | 0 | 0 | 0 | 7834 | 0 | 0 |
| 3 | Baldwin, AL | 1003 | 0 | 16417 | 0 | 0 | 0 | 16465 | 0 | 0 | ... | 0 | 0 | 0 | 19961 | 0 | 0 | 0 | 20699 | 0 | 0 |
| 4 | Barbour, AL | 1005 | 0 | 5071 | 0 | 0 | 0 | 5098 | 0 | 0 | ... | 0 | 0 | 0 | 5017 | 0 | 0 | 0 | 5053 | 0 | 0 |
5 rows × 42 columns
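The `ExcelFile` documentation also notes it can be used as a context manager, which closes the file when you’re done, and that `read_excel()` accepts an open `ExcelFile` object. A small sketch of that pattern:

```python
# ExcelFile works as a context manager; read_excel can take the open object
with pd.ExcelFile("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx") as xls:
    edu_a = pd.read_excel(xls, "EDU01A")
edu_a.shape
```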
- Alternatively, use the `read_excel()` function from `pandas`
- This reads things into a standard `DataFrame`, but you have to specify a sheet to read in (or it defaults to the 1st)
- `censusEd.xlsx` file located at “https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx”
```python
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = 0) #or "EDU01A"
ed_data.head()
```

```
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
```

| | Area_name | STCOU | EDU010187F | EDU010187D | EDU010187N1 | EDU010187N2 | EDU010188F | EDU010188D | EDU010188N1 | EDU010188N2 | ... | EDU010194N1 | EDU010194N2 | EDU010195F | EDU010195D | EDU010195N1 | EDU010195N2 | EDU010196F | EDU010196D | EDU010196N1 | EDU010196N2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UNITED STATES | 0 | 0 | 40024299 | 0 | 0 | 0 | 39967624 | 0 | 0 | ... | 0 | 0 | 0 | 43993459 | 0 | 0 | 0 | 44715737 | 0 | 0 |
| 1 | ALABAMA | 1000 | 0 | 733735 | 0 | 0 | 0 | 728234 | 0 | 0 | ... | 0 | 0 | 0 | 727989 | 0 | 0 | 0 | 736825 | 0 | 0 |
| 2 | Autauga, AL | 1001 | 0 | 6829 | 0 | 0 | 0 | 6900 | 0 | 0 | ... | 0 | 0 | 0 | 7568 | 0 | 0 | 0 | 7834 | 0 | 0 |
| 3 | Baldwin, AL | 1003 | 0 | 16417 | 0 | 0 | 0 | 16465 | 0 | 0 | ... | 0 | 0 | 0 | 19961 | 0 | 0 | 0 | 20699 | 0 | 0 |
| 4 | Barbour, AL | 1005 | 0 | 5071 | 0 | 0 | 0 | 5098 | 0 | 0 | ... | 0 | 0 | 0 | 5017 | 0 | 0 | 0 | 5053 | 0 | 0 |
5 rows × 42 columns
- You can read all sheets with `sheet_name = None`
- This gets read into a dictionary!
  - Keys are the sheet names
  - Values are the `DataFrame`s from each sheet
```python
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = None)
type(ed_data)
```

```
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
```

```
dict
```
```python
ed_data.keys()
```

```
dict_keys(['EDU01A', 'EDU01B', 'EDU01C', 'EDU01D', 'EDU01E', 'EDU01F', 'EDU01G', 'EDU01H', 'EDU01I', 'EDU01J'])
```
ed_data.get("EDU01A").head() #get one DataFrame using its key!| Area_name | STCOU | EDU010187F | EDU010187D | EDU010187N1 | EDU010187N2 | EDU010188F | EDU010188D | EDU010188N1 | EDU010188N2 | ... | EDU010194N1 | EDU010194N2 | EDU010195F | EDU010195D | EDU010195N1 | EDU010195N2 | EDU010196F | EDU010196D | EDU010196N1 | EDU010196N2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UNITED STATES | 0 | 0 | 40024299 | 0 | 0 | 0 | 39967624 | 0 | 0 | ... | 0 | 0 | 0 | 43993459 | 0 | 0 | 0 | 44715737 | 0 | 0 |
| 1 | ALABAMA | 1000 | 0 | 733735 | 0 | 0 | 0 | 728234 | 0 | 0 | ... | 0 | 0 | 0 | 727989 | 0 | 0 | 0 | 736825 | 0 | 0 |
| 2 | Autauga, AL | 1001 | 0 | 6829 | 0 | 0 | 0 | 6900 | 0 | 0 | ... | 0 | 0 | 0 | 7568 | 0 | 0 | 0 | 7834 | 0 | 0 |
| 3 | Baldwin, AL | 1003 | 0 | 16417 | 0 | 0 | 0 | 16465 | 0 | 0 | ... | 0 | 0 | 0 | 19961 | 0 | 0 | 0 | 20699 | 0 | 0 |
| 4 | Barbour, AL | 1005 | 0 | 5071 | 0 | 0 | 0 | 5098 | 0 | 0 | ... | 0 | 0 | 0 | 5017 | 0 | 0 | 0 | 5053 | 0 | 0 |
5 rows × 42 columns
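Since `ed_data` is just a dictionary, we can also loop over all of the sheets at once. For example, printing each sheet name with the shape of its `DataFrame`:

```python
# loop over the (sheet name, DataFrame) pairs in the dictionary
for sheet_name, sheet_df in ed_data.items():
    print(sheet_name, sheet_df.shape)
```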
## Reading JSON Data
JSON data has a structure similar to a dictionary
- Key-value pairs
```json
[
  {
    "name": "Barry Sanders",
    "games": 153,
    "position": "RB"
  },
  {
    "name": "Joe Montana",
    "games": 192,
    "position": "QB"
  }
]
```
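To see how this structure maps to a `DataFrame`, we can read a small JSON string directly (recent versions of `pandas` want literal string input wrapped in `StringIO`). This gives one row per record and one column per key:

```python
from io import StringIO

# a JSON array of records, like the example above
json_str = '''[
  {"name": "Barry Sanders", "games": 153, "position": "RB"},
  {"name": "Joe Montana", "games": 192, "position": "QB"}
]'''
pd.read_json(StringIO(json_str))
```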
- The `read_json()` function from `pandas` will work!
- Read in data from the URL: “https://api.exchangerate-api.com/v4/latest/USD”
```python
url = "https://api.exchangerate-api.com/v4/latest/USD"
usd_data = pd.read_json(url)
usd_data.head()
```

| | provider | WARNING_UPGRADE_TO_V6 | terms | base | date | time_last_updated | rates |
|---|---|---|---|---|---|---|---|
| USD | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 1.00 |
| AED | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 3.67 |
| AFN | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 70.58 |
| ALL | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 94.12 |
| AMD | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 396.72 |
Every API is different, so how you contact it and the data that gets returned differ depending on the website you use. This is an important way we obtain data these days! The sketch below shows one common pattern.
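As a sketch of one common pattern (not the only way to do it), many APIs are contacted with the `requests` library and the returned JSON is turned into a `DataFrame` by hand. Here the ‘rates’ key is assumed from the output above:

```python
import requests

# contact the API and parse the JSON body into a python dictionary
response = requests.get("https://api.exchangerate-api.com/v4/latest/USD")
info = response.json()

# the 'rates' key maps each currency to its exchange rate (per the output above)
rates = pd.DataFrame(list(info["rates"].items()), columns = ["currency", "rate"])
rates.head()
```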
## Writing Out a DataFrame
- We can write out data using the `.to_csv()` method
```python
usd_data.to_csv('usd_data.csv', index = False) #this goes to the folder on the left in colab!
```
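One quick way to check that the write worked is to read the file right back in:

```python
# read the file we just wrote back into a DataFrame
check = pd.read_csv('usd_data.csv')
check.head()
```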
## Quick Video

This video shows some examples of reading in some SAS data and data from an API.
Remember to pop the video out into the full player.
The notebook written in the video is available here.
```python
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=30db871f-1750-4e5d-9de8-b0ff000c225c&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")
```

## Recap
- Read data in with `pandas`
  - `read_csv()` for delimited data
  - `ExcelFile()` or `read_excel()` for Excel data
  - `read_json()` for JSON data
- Write to a `.csv` with the `.to_csv()` method
If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!
If you are on Google Colab, head back to our course website for our next lesson!