Published

2025-03-31


Pandas for Reading Raw Data

Justin Post

  • pandas library has functionality for reading delimited data, Excel data, SAS data, JSON data, and more!

  • Remember we’ll need to import the pandas library in order to use it!

import pandas as pd

Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!


Data Formats

Raw data comes in many different formats. Understanding the raw data format is essential for reading that data into python. Some raw data types include:

  • ‘Delimited’ data: data values separated by a character (such as ‘,’, ‘>’, or a space)

  • Fixed-width field data

  • Excel data

  • From other statistical software, Ex: SPSS formatted data or SAS data sets

  • From an Application Programming Interface (API) (often returned as a JSON file - key/value pairs, similar to a dictionary)

  • From a database


Delimited Data

Let’s start with delimited data.

  • One common format for raw data is delimited data

    • Data in which a character (or characters) separates the data values

    • The separating character(s) are called the delimiter(s)

  • Using pandas, the read_csv() function can read in this kind of data (although ‘csv’ stands for ‘comma-separated values’, this function is used for reading most delimited data via pandas)

    • If the raw data is well-formatted, we just need to tell python where to find it!
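As a minimal sketch (the data values here are made up), a well-formatted CSV can be read straight from an in-memory buffer the same way you would read it from a file path:

```python
import io

import pandas as pd

# A tiny comma-delimited data set held in memory (made-up values)
csv_text = "Treatment,Age\nP,68\nB,74\n"

# io.StringIO stands in for a file path or URL
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```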

Locating a File

  • How does python locate the file?
  • Not in colab
    • You can give file full path name
      • ex: ‘S:/Documents/repos/ST-554/datasets/data.csv’
      • ex: ‘S:\\Documents\\repos\\ST-554\\datasets\\data.csv’ (in a Python string, backslashes must be escaped, or use a raw string such as r‘S:\Documents\repos\ST-554\datasets\data.csv’)
  • Or use local paths!
    • Determine your working directory
    • Use a path relative to that
    • If your working directory is ‘S:/Documents/repos/ST-554’ you can get to ‘data.csv’ via ‘datasets/data.csv’
  • The os module gives you access to functions for finding and setting your working directory
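A short sketch of building a relative path with the os module (the folder and file names below are hypothetical):

```python
import os

# Join path pieces with the correct separator for your OS
rel = os.path.join("datasets", "data.csv")

# Combine with the current working directory to get a full path
full = os.path.join(os.getcwd(), rel)
print(rel)
```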

Using a cloud-based platform complicates things a bit. In colab you can:

  • Mount your google drive
  • Read files from URLs
  • Upload files via the menu on the left (folder icon, then upload a file via the icons there)

import os
#getcwd() stands for get current working directory
os.getcwd() #shows the directory you can get to via the folder icon on the left
#chdir() stands for change current directory
#os.chdir("S:/Documents/repos/ST-554") #won't work in colab but would work on a local python session
'/content'

This /content refers to the main folder on the left hand side of Colab!


Reading Files ‘Locally’ in Colab

  • Nicely formatted .csv files can be read in with the read_csv() function from pandas

  • neuralgia.csv has been loaded into the folder on colab in my session. Therefore, it exists in my working directory. This won’t be the case for you unless you upload the data during your session! You can click on the folder icon on the left, then click the upload button to upload this data set.

neuralgia_data = pd.read_csv("neuralgia.csv") #neuralgia.csv file was uploaded to colab for my session
neuralgia_data.head() #this code block won't work unless you upload the data in your session
Treatment Sex Age Duration Pain
0 P F 68 1 No
1 B M 74 16 No
2 P F 67 30 No
3 P M 66 26 Yes
4 B F 67 28 No
neuralgia_data.shape
(60, 5)

Reading From a URL

  • Nicely formatted .csv files can be read in with the read_csv() function from pandas

  • scoresFull.csv file at a URL given by ‘https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv’

scores_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv")
scores_data.head()
week date day season awayTeam AQ1 AQ2 AQ3 AQ4 AOT ... homeFumLost homeNumPen homePenYds home3rdConv home3rdAtt home4thConv home4thAtt homeTOP HminusAScore homeSpread
0 1 5-Sep Thu 2002 San Francisco 49ers 3 0 7 6 -1 ... 0 10 80 4 8 0 1 32.47 -3 -4.0
1 1 8-Sep Sun 2002 Minnesota Vikings 3 17 0 3 -1 ... 1 4 33 2 6 0 0 28.48 4 4.5
2 1 8-Sep Sun 2002 New Orleans Saints 6 7 7 0 6 ... 0 8 85 1 6 0 1 31.48 -6 6.0
3 1 8-Sep Sun 2002 New York Jets 0 17 3 11 6 ... 1 10 82 4 8 2 2 39.13 -6 -3.0
4 1 8-Sep Sun 2002 Arizona Cardinals 10 3 3 7 -1 ... 0 7 56 6 10 1 2 34.40 8 6.0

5 rows × 82 columns

scores_data.shape
(3471, 82)
  • Oddly, to read other types of delimited data, we also use read_csv()!
    • Specify the sep = argument
  • chemical.txt file (space delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/chemical.txt”
chem_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/chemical.txt", sep=" ")
chem_data.head()
temp conc time percent
0 -1.0 -1.0 -1.0 45.9
1 1.0 -1.0 -1.0 60.6
2 -1.0 1.0 -1.0 57.5
3 1.0 1.0 -1.0 58.6
4 -1.0 -1.0 1.0 53.3
  • crabs.txt file (tab delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/crabs.txt”
    • Tab is \t
crabs_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/crabs.txt", sep="\t")
crabs_data.head()
color spine width satell weight y
0 3 3 28.3 8 3050 1
1 4 3 22.5 0 1550 0
2 2 1 26.0 9 2300 1
3 4 3 24.8 0 2100 0
4 4 3 26.0 4 2600 1
  • umps2012.txt file (> delimiter) stored at “https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt”
    • No column names in raw file
    • Can specify header = None and give column names when reading (via names = [list of names])
ump_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt",
                      sep=">",
                      header=None,
                      names=["Year", "Month", "Day", "Home", "Away", "HPUmpire"])
ump_data.head()
Year Month Day Home Away HPUmpire
0 2012 4 12 MIN LAA D.J. Reyburn
1 2012 4 12 SD ARI Marty Foster
2 2012 4 12 WSH CIN Mike Everitt
3 2012 4 12 PHI MIA Jeff Nelson
4 2012 4 12 CHC MIL Fieldin Culbreth
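The same sep=/header=/names= pattern can be checked offline with an in-memory string (the two rows below are copied from the output above):

```python
import io

import pandas as pd

# Two '>'-delimited rows with no header line
raw = "2012>4>12>MIN>LAA>D.J. Reyburn\n2012>4>12>SD>ARI>Marty Foster\n"

# header=None says the first row is data, not column names;
# names= supplies the column names instead
df = pd.read_csv(io.StringIO(raw), sep=">", header=None,
                 names=["Year", "Month", "Day", "Home", "Away", "HPUmpire"])
print(df.shape)  # (2, 6)
```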

Reading Excel Data

  • Use the ExcelFile() function from pandas

  • censusEd.xlsx file located at “https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx”

ed_data = pd.ExcelFile("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx")
ed_data
<pandas.io.excel._base.ExcelFile at 0x7d6ef1627880>
  • Unfortunately, this object has different attributes and methods than a DataFrame!
#ed_data.head(), ed_data.info() won't work!
type(ed_data)
pandas.io.excel._base.ExcelFile
ed_data.sheet_names
['EDU01A',
 'EDU01B',
 'EDU01C',
 'EDU01D',
 'EDU01E',
 'EDU01F',
 'EDU01G',
 'EDU01H',
 'EDU01I',
 'EDU01J']
  • Use the .parse() method with a sheet name to obtain a usual DataFrame
ed_data.parse('EDU01A').head()
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
Area_name STCOU EDU010187F EDU010187D EDU010187N1 EDU010187N2 EDU010188F EDU010188D EDU010188N1 EDU010188N2 ... EDU010194N1 EDU010194N2 EDU010195F EDU010195D EDU010195N1 EDU010195N2 EDU010196F EDU010196D EDU010196N1 EDU010196N2
0 UNITED STATES 0 0 40024299 0 0 0 39967624 0 0 ... 0 0 0 43993459 0 0 0 44715737 0 0
1 ALABAMA 1000 0 733735 0 0 0 728234 0 0 ... 0 0 0 727989 0 0 0 736825 0 0
2 Autauga, AL 1001 0 6829 0 0 0 6900 0 0 ... 0 0 0 7568 0 0 0 7834 0 0
3 Baldwin, AL 1003 0 16417 0 0 0 16465 0 0 ... 0 0 0 19961 0 0 0 20699 0 0
4 Barbour, AL 1005 0 5071 0 0 0 5098 0 0 ... 0 0 0 5017 0 0 0 5053 0 0

5 rows × 42 columns

  • Alternatively, use the read_excel() function from pandas

  • This reads things into a standard DataFrame, but you have to specify a sheet to read in (or it defaults to the first)

  • censusEd.xlsx file located at “https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx”

ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = 0) #or "EDU01A"
ed_data.head()
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
Area_name STCOU EDU010187F EDU010187D EDU010187N1 EDU010187N2 EDU010188F EDU010188D EDU010188N1 EDU010188N2 ... EDU010194N1 EDU010194N2 EDU010195F EDU010195D EDU010195N1 EDU010195N2 EDU010196F EDU010196D EDU010196N1 EDU010196N2
0 UNITED STATES 0 0 40024299 0 0 0 39967624 0 0 ... 0 0 0 43993459 0 0 0 44715737 0 0
1 ALABAMA 1000 0 733735 0 0 0 728234 0 0 ... 0 0 0 727989 0 0 0 736825 0 0
2 Autauga, AL 1001 0 6829 0 0 0 6900 0 0 ... 0 0 0 7568 0 0 0 7834 0 0
3 Baldwin, AL 1003 0 16417 0 0 0 16465 0 0 ... 0 0 0 19961 0 0 0 20699 0 0
4 Barbour, AL 1005 0 5071 0 0 0 5098 0 0 ... 0 0 0 5017 0 0 0 5053 0 0

5 rows × 42 columns

  • You can read all sheets with sheet_name = None
  • This gets read into a dictionary!
    • Keys are the sheet names
    • Values are the DataFrames from each sheet
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = None)
type(ed_data)
/usr/local/lib/python3.10/dist-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
dict
ed_data.keys()
dict_keys(['EDU01A', 'EDU01B', 'EDU01C', 'EDU01D', 'EDU01E', 'EDU01F', 'EDU01G', 'EDU01H', 'EDU01I', 'EDU01J'])
ed_data.get("EDU01A").head() #get one DataFrame using its key!
Area_name STCOU EDU010187F EDU010187D EDU010187N1 EDU010187N2 EDU010188F EDU010188D EDU010188N1 EDU010188N2 ... EDU010194N1 EDU010194N2 EDU010195F EDU010195D EDU010195N1 EDU010195N2 EDU010196F EDU010196D EDU010196N1 EDU010196N2
0 UNITED STATES 0 0 40024299 0 0 0 39967624 0 0 ... 0 0 0 43993459 0 0 0 44715737 0 0
1 ALABAMA 1000 0 733735 0 0 0 728234 0 0 ... 0 0 0 727989 0 0 0 736825 0 0
2 Autauga, AL 1001 0 6829 0 0 0 6900 0 0 ... 0 0 0 7568 0 0 0 7834 0 0
3 Baldwin, AL 1003 0 16417 0 0 0 16465 0 0 ... 0 0 0 19961 0 0 0 20699 0 0
4 Barbour, AL 1005 0 5071 0 0 0 5098 0 0 ... 0 0 0 5017 0 0 0 5053 0 0

5 rows × 42 columns
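The dict-of-DataFrames pattern can be sketched without the Excel file itself (the sheet names and values below are small stand-ins for what pd.read_excel(..., sheet_name = None) returns):

```python
import pandas as pd

# Stand-in for the dict returned by pd.read_excel(..., sheet_name=None):
# keys are sheet names, values are DataFrames
sheets = {
    "EDU01A": pd.DataFrame({"STCOU": [0, 1000]}),
    "EDU01B": pd.DataFrame({"STCOU": [0]}),
}

# Loop over every sheet's DataFrame
for name, df in sheets.items():
    print(name, df.shape)
```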


Reading JSON Data

  • JSON data has a structure similar to a dictionary

    • Key-value pairs
[  
  {  
    "name": "Barry Sanders",  
    "games": 153,  
    "position": "RB"  
  },  
  {  
    "name": "Joe Montana",  
    "games": 192,  
    "position": "QB"  
  }  
]
  • read_json() function from pandas will work!
  • Read in data from URL: “https://api.exchangerate-api.com/v4/latest/USD”
url = "https://api.exchangerate-api.com/v4/latest/USD"
usd_data = pd.read_json(url)
usd_data.head()
provider WARNING_UPGRADE_TO_V6 terms base date time_last_updated rates
USD https://www.exchangerate-api.com https://www.exchangerate-api.com/docs/free https://www.exchangerate-api.com/terms USD 2025-01-03 1735862402 1.00
AED https://www.exchangerate-api.com https://www.exchangerate-api.com/docs/free https://www.exchangerate-api.com/terms USD 2025-01-03 1735862402 3.67
AFN https://www.exchangerate-api.com https://www.exchangerate-api.com/docs/free https://www.exchangerate-api.com/terms USD 2025-01-03 1735862402 70.58
ALL https://www.exchangerate-api.com https://www.exchangerate-api.com/docs/free https://www.exchangerate-api.com/terms USD 2025-01-03 1735862402 94.12
AMD https://www.exchangerate-api.com https://www.exchangerate-api.com/docs/free https://www.exchangerate-api.com/terms USD 2025-01-03 1735862402 396.72
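The same idea works offline: a list-of-records JSON string (like the football example above) can be read from an in-memory buffer:

```python
import io

import pandas as pd

# A small list-of-records JSON string (made-up values from the example)
json_text = """[
  {"name": "Barry Sanders", "games": 153, "position": "RB"},
  {"name": "Joe Montana", "games": 192, "position": "QB"}
]"""

# Wrapping the string in StringIO mirrors reading from a file or URL
df = pd.read_json(io.StringIO(json_text))
print(df.shape)  # (2, 3)
```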

Every API is different, so how you contact it and how it returns data will vary depending on the website you use. This is an important way we obtain data these days!


Writing Out a DataFrame

  • We can write out data using the .to_csv() method
usd_data.to_csv('usd_data.csv', index = False) #this goes to the folder on the left in colab!
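A small round-trip sketch using an in-memory buffer (made-up data), showing what index = False does:

```python
import io

import pandas as pd

df = pd.DataFrame({"code": ["USD", "AED"], "rate": [1.00, 3.67]})

# Writing to a StringIO buffer instead of a file lets us inspect the output
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False keeps the row labels out of the file

print(buf.getvalue())  # header row 'code,rate' followed by one line per row
```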

Quick Video

This video shows some examples of reading in some SAS data and data from an API.

Remember to pop the video out into the full player.

The notebook written in the video is available here.

from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=30db871f-1750-4e5d-9de8-b0ff000c225c&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")

Recap

  • Read data in with pandas

    • read_csv() for delimited data

    • ExcelFile() or read_excel() for Excel data

    • read_json() for JSON

  • Write to a .csv with .to_csv()

If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!

If you are on Google Colab, head back to our course website for our next lesson!