# Pandas for Reading Raw Data

Justin Post

The `pandas` library has functionality for reading delimited data, Excel data, SAS data, JSON data, and more! Remember we’ll need to import the `pandas` library in order to use it!

```python
import pandas as pd
```
Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!
## Data Formats
Raw data comes in many different formats. Understanding the raw data format is essential for reading that data into python. Some raw data types include:

- ‘Delimited’ data: data separated by a character or characters (such as ‘,’, ‘>’, or ‘ ’)
- Fixed field data
- Excel data
- Data from other statistical software, e.g., SPSS formatted data or SAS data sets
- Data from an Application Programming Interface (API) (often returned as a `JSON` file - key/value pairs, similar to a dictionary)
- Data from a database
## Delimited Data
Let’s start with delimited data.

- One common format for raw data is delimited data
  - Data that has a character or characters that separate the data values
  - The character(s) is (are) called the delimiter(s)
- Using `pandas`, the `read_csv()` function can read in this kind of data (although `csv` stands for ‘comma separated value’, this function is used for reading most delimited data via `pandas`)
  - If the raw data is well-formatted, we just need to tell python where to find it!
## Locating a File
- How does python locate the file?
  - Not in colab:
    - You can give the file’s full path name
      - ex: ‘S:/Documents/repos/ST-554/datasets/data.csv’
      - ex: ‘S:\Documents\repos\ST-554\datasets\data.csv’
    - Or use local paths!
      - Determine your working directory
      - Use a path relative to that
      - If your working directory is ‘S:/Documents/repos/ST-554’ you can get to ‘data.csv’ via ‘datasets/data.csv’
- The `os` module gives you access to functions for finding and setting your working directory (a small sketch follows below)
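As a small illustration of relative paths (assuming the hypothetical folder layout from the bullets above), the `os.path` submodule can build them in a portable way:

```python
import os

# suppose the working directory is 'S:/Documents/repos/ST-554'
# this builds 'datasets/data.csv' (or 'datasets\\data.csv' on Windows),
# which python resolves relative to the working directory
relative_path = os.path.join("datasets", "data.csv")
print(relative_path)
```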
Using a cloud-based platform complicates things a bit. In colab you can:

- Mount your google drive (see the sketch below)
- Read files from URLs
- Upload files via the menu on the left (folder icon, then upload a file via the icons there)
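For instance, mounting your google drive uses the `drive` module from the `google.colab` package (this only runs inside a colab session and will prompt you to authorize access). The file path in the comment is just a hypothetical example:

```python
# only works inside Google Colab; prompts you to authorize access
from google.colab import drive
drive.mount('/content/drive')

# after mounting, your Drive files appear under /content/drive/MyDrive
# e.g., pd.read_csv('/content/drive/MyDrive/data.csv') for a hypothetical file
```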
```python
import os
#getcwd() stands for get current working directory
os.getcwd() #shows the directory you can get to via the folder icon on the left
#chdir() stands for change current directory
#os.chdir("S:/Documents/repos/ST-554") #won't work in colab but would work in a local python session
```

```
'/content'
```

This `/content` refers to the main folder on the left hand side of Colab!
## Reading Files ‘Locally’ in Colab
- Nicely formatted `.csv` files can be read in with the `read_csv()` function from `pandas`
- `neuralgia.csv` has been loaded into the folder on colab in my session. Therefore, it exists in my working directory. This won’t be the case for you unless you upload the data during your session! You can click on the folder icon on the left, then click the upload button to upload this data set.
```python
neuralgia_data = pd.read_csv("neuralgia.csv") #neuralgia.csv file was uploaded to colab for my session
neuralgia_data.head() #this code block won't work unless you upload the data in your session
```

| | Treatment | Sex | Age | Duration | Pain |
|---|---|---|---|---|---|
| 0 | P | F | 68 | 1 | No |
| 1 | B | M | 74 | 16 | No |
| 2 | P | F | 67 | 30 | No |
| 3 | P | M | 66 | 26 | Yes |
| 4 | B | F | 67 | 28 | No |
```python
neuralgia_data.shape
```

```
(60, 5)
```
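If you want to guard against the file not having been uploaded yet, one option is to check for it with the `os` module first. A minimal sketch:

```python
import os

# only attempt the read if the file is actually in the working directory
if os.path.exists("neuralgia.csv"):
    neuralgia_data = pd.read_csv("neuralgia.csv")
else:
    print("neuralgia.csv not found - upload it via the folder icon on the left")
```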
## Reading From a URL
- Nicely formatted `.csv` files can be read in with the `read_csv()` function from `pandas`
- `scoresFull.csv` file at a URL given by ‘https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv’
```python
scores_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv")
scores_data.head()
```

| | week | date | day | season | awayTeam | AQ1 | AQ2 | AQ3 | AQ4 | AOT | ... | homeFumLost | homeNumPen | homePenYds | home3rdConv | home3rdAtt | home4thConv | home4thAtt | homeTOP | HminusAScore | homeSpread |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5-Sep | Thu | 2002 | San Francisco 49ers | 3 | 0 | 7 | 6 | -1 | ... | 0 | 10 | 80 | 4 | 8 | 0 | 1 | 32.47 | -3 | -4.0 |
| 1 | 1 | 8-Sep | Sun | 2002 | Minnesota Vikings | 3 | 17 | 0 | 3 | -1 | ... | 1 | 4 | 33 | 2 | 6 | 0 | 0 | 28.48 | 4 | 4.5 |
| 2 | 1 | 8-Sep | Sun | 2002 | New Orleans Saints | 6 | 7 | 7 | 0 | 6 | ... | 0 | 8 | 85 | 1 | 6 | 0 | 1 | 31.48 | -6 | 6.0 |
| 3 | 1 | 8-Sep | Sun | 2002 | New York Jets | 0 | 17 | 3 | 11 | 6 | ... | 1 | 10 | 82 | 4 | 8 | 2 | 2 | 39.13 | -6 | -3.0 |
| 4 | 1 | 8-Sep | Sun | 2002 | Arizona Cardinals | 10 | 3 | 3 | 7 | -1 | ... | 0 | 7 | 56 | 6 | 10 | 1 | 2 | 34.40 | 8 | 6.0 |
5 rows × 82 columns
```python
scores_data.shape
```

```
(3471, 82)
```
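`read_csv()` also has many optional arguments that come in handy with larger files like this one. As a quick sketch, `usecols` keeps a subset of columns and `nrows` limits how many data rows are read (the column names here come from the scores data above):

```python
# read only four columns and the first 100 data rows
scores_small = pd.read_csv(
    "https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv",
    usecols = ["week", "date", "awayTeam", "HminusAScore"],
    nrows = 100)
scores_small.shape #(100, 4)
```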
- Oddly, to read other types of delimited data, we also use `read_csv()`!
  - Specify the `sep = ` argument
- `chemical.txt` file (space delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/chemical.txt”
```python
chem_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/chemical.txt", sep=" ")
chem_data.head()
```

| | temp | conc | time | percent |
|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | 45.9 |
| 1 | 1.0 | -1.0 | -1.0 | 60.6 |
| 2 | -1.0 | 1.0 | -1.0 | 57.5 |
| 3 | 1.0 | 1.0 | -1.0 | 58.6 |
| 4 | -1.0 | -1.0 | 1.0 | 53.3 |
- `crabs.txt` file (tab delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/crabs.txt”
  - Tab is `\t`
```python
crabs_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/crabs.txt", sep="\t")
crabs_data.head()
```

| | color | spine | width | satell | weight | y |
|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 28.3 | 8 | 3050 | 1 |
| 1 | 4 | 3 | 22.5 | 0 | 1550 | 0 |
| 2 | 2 | 1 | 26.0 | 9 | 2300 | 1 |
| 3 | 4 | 3 | 24.8 | 0 | 2100 | 0 |
| 4 | 4 | 3 | 26.0 | 4 | 2600 | 1 |
- `umps2012.txt` file (`>` delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt”
  - No column names in raw file
  - Can specify `header = None` and give column names when reading (via `names = [list of names]`)
```python
ump_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt",
                       sep=">",
                       header=None,
                       names=["Year", "Month", "Day", "Home", "Away", "HPUmpire"])
ump_data.head()
```

| | Year | Month | Day | Home | Away | HPUmpire |
|---|---|---|---|---|---|---|
| 0 | 2012 | 4 | 12 | MIN | LAA | D.J. Reyburn |
| 1 | 2012 | 4 | 12 | SD | ARI | Marty Foster |
| 2 | 2012 | 4 | 12 | WSH | CIN | Mike Everitt |
| 3 | 2012 | 4 | 12 | PHI | MIA | Jeff Nelson |
| 4 | 2012 | 4 | 12 | CHC | MIL | Fieldin Culbreth |
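Note: had we used `header = None` without the `names = ` argument, `pandas` would fall back to default integer column labels. A quick sketch showing that:

```python
# without names=, the columns are just labeled 0 through 5
ump_default = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt",
                          sep=">",
                          header=None)
ump_default.columns.tolist() #[0, 1, 2, 3, 4, 5]
```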
## Reading Excel Data
- Use the `ExcelFile()` function from `pandas`
- `censusEd.xlsx` file located at “https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx”
```python
ed_data = pd.ExcelFile("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx")
ed_data
```

```
<pandas.io.excel._base.ExcelFile at 0x7d6ef1627880>
```
- Unfortunately, this data object has different attributes and methods than a `DataFrame`!
```python
#ed_data.head(), ed_data.info() won't work!
type(ed_data)
```

```
pandas.io.excel._base.ExcelFile
```
(Colab also displays the documentation for the `ExcelFile` class here: a “Class for parsing tabular Excel sheets into DataFrame objects” - see `read_excel` for more documentation.)
```python
ed_data.sheet_names
```

```
['EDU01A',
 'EDU01B',
 'EDU01C',
 'EDU01D',
 'EDU01E',
 'EDU01F',
 'EDU01G',
 'EDU01H',
 'EDU01I',
 'EDU01J']
```
- Use the `.parse()` method with a sheet name to obtain a usual `DataFrame`
```python
ed_data.parse('EDU01A').head()
```

```
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
```

| | Area_name | STCOU | EDU010187F | EDU010187D | EDU010187N1 | EDU010187N2 | EDU010188F | EDU010188D | EDU010188N1 | EDU010188N2 | ... | EDU010194N1 | EDU010194N2 | EDU010195F | EDU010195D | EDU010195N1 | EDU010195N2 | EDU010196F | EDU010196D | EDU010196N1 | EDU010196N2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UNITED STATES | 0 | 0 | 40024299 | 0 | 0 | 0 | 39967624 | 0 | 0 | ... | 0 | 0 | 0 | 43993459 | 0 | 0 | 0 | 44715737 | 0 | 0 |
| 1 | ALABAMA | 1000 | 0 | 733735 | 0 | 0 | 0 | 728234 | 0 | 0 | ... | 0 | 0 | 0 | 727989 | 0 | 0 | 0 | 736825 | 0 | 0 |
| 2 | Autauga, AL | 1001 | 0 | 6829 | 0 | 0 | 0 | 6900 | 0 | 0 | ... | 0 | 0 | 0 | 7568 | 0 | 0 | 0 | 7834 | 0 | 0 |
| 3 | Baldwin, AL | 1003 | 0 | 16417 | 0 | 0 | 0 | 16465 | 0 | 0 | ... | 0 | 0 | 0 | 19961 | 0 | 0 | 0 | 20699 | 0 | 0 |
| 4 | Barbour, AL | 1005 | 0 | 5071 | 0 | 0 | 0 | 5098 | 0 | 0 | ... | 0 | 0 | 0 | 5017 | 0 | 0 | 0 | 5053 | 0 | 0 |
5 rows × 42 columns
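The `ExcelFile` documentation also notes it can be used as a context manager, which closes the file when you’re done, and that `read_excel()` accepts an open `ExcelFile` object. A small sketch of that pattern:

```python
# ExcelFile works as a context manager; read_excel can take the open object
with pd.ExcelFile("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx") as xls:
    edu_a = pd.read_excel(xls, "EDU01A")
edu_a.shape
```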
- Alternatively, use the `read_excel()` function from `pandas`
- This reads things into a standard `DataFrame`, but you have to specify a sheet to read in (or it defaults to the 1st)
- `censusEd.xlsx` file located at “https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx”
```python
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = 0) #or "EDU01A"
ed_data.head()
```

```
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
```

| | Area_name | STCOU | EDU010187F | EDU010187D | EDU010187N1 | EDU010187N2 | EDU010188F | EDU010188D | EDU010188N1 | EDU010188N2 | ... | EDU010194N1 | EDU010194N2 | EDU010195F | EDU010195D | EDU010195N1 | EDU010195N2 | EDU010196F | EDU010196D | EDU010196N1 | EDU010196N2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UNITED STATES | 0 | 0 | 40024299 | 0 | 0 | 0 | 39967624 | 0 | 0 | ... | 0 | 0 | 0 | 43993459 | 0 | 0 | 0 | 44715737 | 0 | 0 |
| 1 | ALABAMA | 1000 | 0 | 733735 | 0 | 0 | 0 | 728234 | 0 | 0 | ... | 0 | 0 | 0 | 727989 | 0 | 0 | 0 | 736825 | 0 | 0 |
| 2 | Autauga, AL | 1001 | 0 | 6829 | 0 | 0 | 0 | 6900 | 0 | 0 | ... | 0 | 0 | 0 | 7568 | 0 | 0 | 0 | 7834 | 0 | 0 |
| 3 | Baldwin, AL | 1003 | 0 | 16417 | 0 | 0 | 0 | 16465 | 0 | 0 | ... | 0 | 0 | 0 | 19961 | 0 | 0 | 0 | 20699 | 0 | 0 |
| 4 | Barbour, AL | 1005 | 0 | 5071 | 0 | 0 | 0 | 5098 | 0 | 0 | ... | 0 | 0 | 0 | 5017 | 0 | 0 | 0 | 5053 | 0 | 0 |
5 rows × 42 columns
- You can read all sheets with `sheet_name = None`
- This gets read into a dictionary!
  - Keys are the sheet names
  - Values are the `DataFrame`s from each sheet
```python
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = None)
type(ed_data)
```

```
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
```

```
dict
```
```python
ed_data.keys()
```

```
dict_keys(['EDU01A', 'EDU01B', 'EDU01C', 'EDU01D', 'EDU01E', 'EDU01F', 'EDU01G', 'EDU01H', 'EDU01I', 'EDU01J'])
```
ed_data.get("EDU01A").head() #get one DataFrame using its key!| Area_name | STCOU | EDU010187F | EDU010187D | EDU010187N1 | EDU010187N2 | EDU010188F | EDU010188D | EDU010188N1 | EDU010188N2 | ... | EDU010194N1 | EDU010194N2 | EDU010195F | EDU010195D | EDU010195N1 | EDU010195N2 | EDU010196F | EDU010196D | EDU010196N1 | EDU010196N2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UNITED STATES | 0 | 0 | 40024299 | 0 | 0 | 0 | 39967624 | 0 | 0 | ... | 0 | 0 | 0 | 43993459 | 0 | 0 | 0 | 44715737 | 0 | 0 |
| 1 | ALABAMA | 1000 | 0 | 733735 | 0 | 0 | 0 | 728234 | 0 | 0 | ... | 0 | 0 | 0 | 727989 | 0 | 0 | 0 | 736825 | 0 | 0 |
| 2 | Autauga, AL | 1001 | 0 | 6829 | 0 | 0 | 0 | 6900 | 0 | 0 | ... | 0 | 0 | 0 | 7568 | 0 | 0 | 0 | 7834 | 0 | 0 |
| 3 | Baldwin, AL | 1003 | 0 | 16417 | 0 | 0 | 0 | 16465 | 0 | 0 | ... | 0 | 0 | 0 | 19961 | 0 | 0 | 0 | 20699 | 0 | 0 |
| 4 | Barbour, AL | 1005 | 0 | 5071 | 0 | 0 | 0 | 5098 | 0 | 0 | ... | 0 | 0 | 0 | 5017 | 0 | 0 | 0 | 5053 | 0 | 0 |
5 rows × 42 columns
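Since `ed_data` is just a dictionary, we can also loop over all of the sheets at once. For example, printing each sheet name with the shape of its `DataFrame`:

```python
# loop over the (sheet name, DataFrame) pairs in the dictionary
for sheet_name, sheet_df in ed_data.items():
    print(sheet_name, sheet_df.shape)
```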
## Reading JSON Data
JSON data has a structure similar to a dictionary
- Key-value pairs
```json
[
  {
    "name": "Barry Sanders",
    "games": 153,
    "position": "RB"
  },
  {
    "name": "Joe Montana",
    "games": 192,
    "position": "QB"
  }
]
```
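To see how this structure maps to a `DataFrame`, we can read a small JSON string directly (recent versions of `pandas` want literal string input wrapped in `StringIO`). This gives one row per record and one column per key:

```python
from io import StringIO

# a JSON array of records, like the example above
json_str = '''[
  {"name": "Barry Sanders", "games": 153, "position": "RB"},
  {"name": "Joe Montana", "games": 192, "position": "QB"}
]'''
pd.read_json(StringIO(json_str))
```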
- The `read_json()` function from `pandas` will work!
- Read in data from the URL: “https://api.exchangerate-api.com/v4/latest/USD”
```python
url = "https://api.exchangerate-api.com/v4/latest/USD"
usd_data = pd.read_json(url)
usd_data.head()
```

| | provider | WARNING_UPGRADE_TO_V6 | terms | base | date | time_last_updated | rates |
|---|---|---|---|---|---|---|---|
| USD | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 1.00 |
| AED | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 3.67 |
| AFN | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 70.58 |
| ALL | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 94.12 |
| AMD | https://www.exchangerate-api.com | https://www.exchangerate-api.com/docs/free | https://www.exchangerate-api.com/terms | USD | 2025-01-03 | 1735862402 | 396.72 |
Every API is different, so how you contact it and the data that gets returned differ depending on the website you use. This is an important way we obtain data these days! The sketch below shows one common pattern.
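As a sketch of one common pattern (not the only way to do it), many APIs are contacted with the `requests` library and the returned JSON is turned into a `DataFrame` by hand. Here the ‘rates’ key is assumed from the output above:

```python
import requests

# contact the API and parse the JSON body into a python dictionary
response = requests.get("https://api.exchangerate-api.com/v4/latest/USD")
info = response.json()

# the 'rates' key maps each currency to its exchange rate (per the output above)
rates = pd.DataFrame(list(info["rates"].items()), columns = ["currency", "rate"])
rates.head()
```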
## Writing Out a DataFrame
- We can write out data using the `.to_csv()` method
```python
usd_data.to_csv('usd_data.csv', index = False) #this goes to the folder on the left in colab!
```
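One quick way to check that the write worked is to read the file right back in:

```python
# read the file we just wrote back into a DataFrame
check = pd.read_csv('usd_data.csv')
check.head()
```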
## Quick Video

This video shows some examples of reading in some SAS data and data from an API.
Remember to pop the video out into the full player.
The notebook written in the video is available here.
```python
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=30db871f-1750-4e5d-9de8-b0ff000c225c&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")
```

## Recap
- Read data in with `pandas`
  - `read_csv()` for delimited data
  - `ExcelFile()` or `read_excel()` for Excel data
  - `read_json()` for JSON data
- Write to a `.csv` with the `.to_csv()` method
If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!
If you are on Google Colab, head back to our course website for our next lesson!