Published

2025-03-31

Open In Colab

Data Storage and Basic SQL

Justin Post


Before we get into Big Data, we need to understand the most common way to store many different, related, datasets. We’ve already looked at basic ways to store data in a single file:

  • .csv (comma separated value)
  • .txt (generic, usually delimited, data)
  • .dat (generic, usually delimited, data)
  • .JSON (think dictionary)

When we have multiple data sets (or sources) a database is often used along with a database management system (DBMS).

Consider the diagram below. This shows many different tables (think data frames) such as playlists, playlist_track, and tracks. These tables are linked through keys such as the Playlistid or Trackid.

  • The data, the DBMS, and the applications associated with them are often simply called a ‘database’
  • As the data tables are generally linked, we call this a relational database
  • The DBMS software is used to create, edit, and maintain database files and records, enabling easier file and record creation, data entry, data editing, updating, and reporting.

In this notebook we explore databases and the basics of the common structured query language (SQL) used to interact with them.

Note: These types of webpages are built from Jupyter notebooks (.ipynb files). You can access your own versions of them by clicking here. It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you’d like!

Common Database Software

Many common types of relational databases management systems (RDBMS) exist. Some are free and some are not. A few common ones are:

Most RDBMS have their own Structured Query Language (SQL), however the basic functionality is similar across them!

Actions on Databases

There are a few common actions we often want to perform on a database. The acronym CRUD is used to describe four actions:

  • Create data
  • Read data
  • Update data
  • Delete data

We will write SQL code to do these actions!

You could imagine that, with multiple users possibly accessing a database, we need to be very careful in how we do these actions. There are four properties that relational database transactions must have:

  • Atomicity defines all the elements that make up a complete database transaction.
  • Consistency defines the rules for maintaining data points in a correct state after a transaction.
  • Isolation keeps the effect of a transaction invisible to others until it is committed, to avoid confusion.
  • Durability ensures that data changes become permanent once the transaction is committed.

This is commonly referred to as ACID properties.

Let’s explore using SQLite with python!

Accessing an SQLite Database in python

The sqlite3 module provides the ability to connect and perform actions on an SQLite database.

  • We need to import this module
  • Then we just need a path to our database file (if we already have one, otherwise one will be created at the path - if possible)
  • We’ll look at the commonly used chinook database the diagram above described.
  • As we need to have the ability to write to the database from our notebook, we can’t share the same database. That means you’ll need to download the chinook.db file and upload it to your folder area on the left!
  • If you want the changes to remain after you are done, you’d need to download that file before closing the session!
#bring in the module
import sqlite3
#make the connection to the .db file. My file is in the main folder on the left
con = sqlite3.connect("chinook.db")

Accessing the Table Schema

Every SQLite database contains a “schema table” with information about that database. This describes the - tables - indices (special lookup tables to improve efficiency of queries) - triggers (named database object that is executed automatically when an INSERT, UPDATE or DELETE statement is issued against the associated table) - views (read-only tables, combinations of tables, etc.)

A schema file contains one row for each table, index, view, and trigger in the schema.

We can get the schema by issuing an SQL command! A common SQL command for querying a database looks like this: - SELECT column1, column2 FROM table WHERE logical_of_some_kind;

For SQLite from python, usually we’ll follow this structure:

  1. Create a cursor object using cursor = con.cursor()
  2. Write our SQL command as a string
  3. Execute the SQL code using the .execute() method (cursor.execute()) on our cursor object
  4. Use the .fetchall() method (cursor.fetchall()) to actually return the data requested
  5. Close the conneciton made by the cursor (cursor.close())
  • Below we use the multiline comment (three quotation marks to start and end) in order to write more legible SQL code
#create a 'cursor' object from our connection
cursor = con.cursor()

#SQL query to return all table names in the data base
#The * indicates we want to select everything
get_schema = '''
        SELECT *
        FROM sqlite_schema
        WHERE type = "table";
        '''

#execute the SQL query on the database!
cursor.execute(get_schema)

#The information for the query is stored in memory. We use the fetchall() method to actually return the information
result = cursor.fetchall()

#finall we close the connection the cursor made with that query
cursor.close()

We can now look at what is stored in result. Here it is actually a list that we can cycle through and print information from.

print(type(result))
for i in result:
  print(i)
<class 'list'>
('table', 'albums', 'albums', 2, 'CREATE TABLE "albums"\r\n(\r\n    [AlbumId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [Title] NVARCHAR(160)  NOT NULL,\r\n    [ArtistId] INTEGER  NOT NULL,\r\n    FOREIGN KEY ([ArtistId]) REFERENCES "artists" ([ArtistId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\r\n)')
('table', 'sqlite_sequence', 'sqlite_sequence', 3, 'CREATE TABLE sqlite_sequence(name,seq)')
('table', 'artists', 'artists', 4, 'CREATE TABLE "artists"\r\n(\r\n    [ArtistId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [Name] NVARCHAR(120)\r\n)')
('table', 'customers', 'customers', 5, 'CREATE TABLE "customers"\r\n(\r\n    [CustomerId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [FirstName] NVARCHAR(40)  NOT NULL,\r\n    [LastName] NVARCHAR(20)  NOT NULL,\r\n    [Company] NVARCHAR(80),\r\n    [Address] NVARCHAR(70),\r\n    [City] NVARCHAR(40),\r\n    [State] NVARCHAR(40),\r\n    [Country] NVARCHAR(40),\r\n    [PostalCode] NVARCHAR(10),\r\n    [Phone] NVARCHAR(24),\r\n    [Fax] NVARCHAR(24),\r\n    [Email] NVARCHAR(60)  NOT NULL,\r\n    [SupportRepId] INTEGER,\r\n    FOREIGN KEY ([SupportRepId]) REFERENCES "employees" ([EmployeeId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\r\n)')
('table', 'employees', 'employees', 8, 'CREATE TABLE "employees"\r\n(\r\n    [EmployeeId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [LastName] NVARCHAR(20)  NOT NULL,\r\n    [FirstName] NVARCHAR(20)  NOT NULL,\r\n    [Title] NVARCHAR(30),\r\n    [ReportsTo] INTEGER,\r\n    [BirthDate] DATETIME,\r\n    [HireDate] DATETIME,\r\n    [Address] NVARCHAR(70),\r\n    [City] NVARCHAR(40),\r\n    [State] NVARCHAR(40),\r\n    [Country] NVARCHAR(40),\r\n    [PostalCode] NVARCHAR(10),\r\n    [Phone] NVARCHAR(24),\r\n    [Fax] NVARCHAR(24),\r\n    [Email] NVARCHAR(60),\r\n    FOREIGN KEY ([ReportsTo]) REFERENCES "employees" ([EmployeeId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\r\n)')
('table', 'genres', 'genres', 10, 'CREATE TABLE "genres"\r\n(\r\n    [GenreId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [Name] NVARCHAR(120)\r\n)')
('table', 'invoices', 'invoices', 11, 'CREATE TABLE "invoices"\r\n(\r\n    [InvoiceId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [CustomerId] INTEGER  NOT NULL,\r\n    [InvoiceDate] DATETIME  NOT NULL,\r\n    [BillingAddress] NVARCHAR(70),\r\n    [BillingCity] NVARCHAR(40),\r\n    [BillingState] NVARCHAR(40),\r\n    [BillingCountry] NVARCHAR(40),\r\n    [BillingPostalCode] NVARCHAR(10),\r\n    [Total] NUMERIC(10,2)  NOT NULL,\r\n    FOREIGN KEY ([CustomerId]) REFERENCES "customers" ([CustomerId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\r\n)')
('table', 'invoice_items', 'invoice_items', 13, 'CREATE TABLE "invoice_items"\r\n(\r\n    [InvoiceLineId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [InvoiceId] INTEGER  NOT NULL,\r\n    [TrackId] INTEGER  NOT NULL,\r\n    [UnitPrice] NUMERIC(10,2)  NOT NULL,\r\n    [Quantity] INTEGER  NOT NULL,\r\n    FOREIGN KEY ([InvoiceId]) REFERENCES "invoices" ([InvoiceId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION,\r\n    FOREIGN KEY ([TrackId]) REFERENCES "tracks" ([TrackId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\r\n)')
('table', 'media_types', 'media_types', 15, 'CREATE TABLE "media_types"\r\n(\r\n    [MediaTypeId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [Name] NVARCHAR(120)\r\n)')
('table', 'playlists', 'playlists', 16, 'CREATE TABLE "playlists"\r\n(\r\n    [PlaylistId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [Name] NVARCHAR(120)\r\n)')
('table', 'playlist_track', 'playlist_track', 17, 'CREATE TABLE "playlist_track"\r\n(\r\n    [PlaylistId] INTEGER  NOT NULL,\r\n    [TrackId] INTEGER  NOT NULL,\r\n    CONSTRAINT [PK_PlaylistTrack] PRIMARY KEY  ([PlaylistId], [TrackId]),\r\n    FOREIGN KEY ([PlaylistId]) REFERENCES "playlists" ([PlaylistId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION,\r\n    FOREIGN KEY ([TrackId]) REFERENCES "tracks" ([TrackId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\r\n)')
('table', 'tracks', 'tracks', 20, 'CREATE TABLE "tracks"\r\n(\r\n    [TrackId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\r\n    [Name] NVARCHAR(200)  NOT NULL,\r\n    [AlbumId] INTEGER,\r\n    [MediaTypeId] INTEGER  NOT NULL,\r\n    [GenreId] INTEGER,\r\n    [Composer] NVARCHAR(220),\r\n    [Milliseconds] INTEGER  NOT NULL,\r\n    [Bytes] INTEGER,\r\n    [UnitPrice] NUMERIC(10,2)  NOT NULL,\r\n    FOREIGN KEY ([AlbumId]) REFERENCES "albums" ([AlbumId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION,\r\n    FOREIGN KEY ([GenreId]) REFERENCES "genres" ([GenreId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION,\r\n    FOREIGN KEY ([MediaTypeId]) REFERENCES "media_types" ([MediaTypeId]) \r\n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\r\n)')
('table', 'sqlite_stat1', 'sqlite_stat1', 864, 'CREATE TABLE sqlite_stat1(tbl,idx,stat)')

Often we want to put this information into our common pandas data frame format!

import pandas as pd
#create a data frame
schema_df = pd.DataFrame(result)
schema_df
0 1 2 3 4
0 table albums albums 2 CREATE TABLE "albums"\r\n(\r\n [AlbumId] IN...
1 table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
2 table artists artists 4 CREATE TABLE "artists"\r\n(\r\n [ArtistId] ...
3 table customers customers 5 CREATE TABLE "customers"\r\n(\r\n [Customer...
4 table employees employees 8 CREATE TABLE "employees"\r\n(\r\n [Employee...
5 table genres genres 10 CREATE TABLE "genres"\r\n(\r\n [GenreId] IN...
6 table invoices invoices 11 CREATE TABLE "invoices"\r\n(\r\n [InvoiceId...
7 table invoice_items invoice_items 13 CREATE TABLE "invoice_items"\r\n(\r\n [Invo...
8 table media_types media_types 15 CREATE TABLE "media_types"\r\n(\r\n [MediaT...
9 table playlists playlists 16 CREATE TABLE "playlists"\r\n(\r\n [Playlist...
10 table playlist_track playlist_track 17 CREATE TABLE "playlist_track"\r\n(\r\n [Pla...
11 table tracks tracks 20 CREATE TABLE "tracks"\r\n(\r\n [TrackId] IN...
12 table sqlite_stat1 sqlite_stat1 864 CREATE TABLE sqlite_stat1(tbl,idx,stat)

Pretty good, just missing column names. For the schema table, those are standardized.

#current column names
schema_df.columns
RangeIndex(start=0, stop=5, step=1)
schema_df.rename(columns={0: 'type', 1: 'name', 2: 'tbl_name', 3: 'rootpage', 4: 'sql'}, inplace = True)
schema_df
type name tbl_name rootpage sql
0 table albums albums 2 CREATE TABLE "albums"\r\n(\r\n [AlbumId] IN...
1 table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
2 table artists artists 4 CREATE TABLE "artists"\r\n(\r\n [ArtistId] ...
3 table customers customers 5 CREATE TABLE "customers"\r\n(\r\n [Customer...
4 table employees employees 8 CREATE TABLE "employees"\r\n(\r\n [Employee...
5 table genres genres 10 CREATE TABLE "genres"\r\n(\r\n [GenreId] IN...
6 table invoices invoices 11 CREATE TABLE "invoices"\r\n(\r\n [InvoiceId...
7 table invoice_items invoice_items 13 CREATE TABLE "invoice_items"\r\n(\r\n [Invo...
8 table media_types media_types 15 CREATE TABLE "media_types"\r\n(\r\n [MediaT...
9 table playlists playlists 16 CREATE TABLE "playlists"\r\n(\r\n [Playlist...
10 table playlist_track playlist_track 17 CREATE TABLE "playlist_track"\r\n(\r\n [Pla...
11 table tracks tracks 20 CREATE TABLE "tracks"\r\n(\r\n [TrackId] IN...
12 table sqlite_stat1 sqlite_stat1 864 CREATE TABLE sqlite_stat1(tbl,idx,stat)

Alternatively, we can use the function read_sql() from pandas to automatically put the result into a data frame!

  • Here we don’t need a cursor object, we just pass the SQL string and the connection object.
schema_df2 = pd.read_sql(get_schema, con)
schema_df2
type name tbl_name rootpage sql
0 table albums albums 2 CREATE TABLE "albums"\r\n(\r\n [AlbumId] IN...
1 table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
2 table artists artists 4 CREATE TABLE "artists"\r\n(\r\n [ArtistId] ...
3 table customers customers 5 CREATE TABLE "customers"\r\n(\r\n [Customer...
4 table employees employees 8 CREATE TABLE "employees"\r\n(\r\n [Employee...
5 table genres genres 10 CREATE TABLE "genres"\r\n(\r\n [GenreId] IN...
6 table invoices invoices 11 CREATE TABLE "invoices"\r\n(\r\n [InvoiceId...
7 table invoice_items invoice_items 13 CREATE TABLE "invoice_items"\r\n(\r\n [Invo...
8 table media_types media_types 15 CREATE TABLE "media_types"\r\n(\r\n [MediaT...
9 table playlists playlists 16 CREATE TABLE "playlists"\r\n(\r\n [Playlist...
10 table playlist_track playlist_track 17 CREATE TABLE "playlist_track"\r\n(\r\n [Pla...
11 table tracks tracks 20 CREATE TABLE "tracks"\r\n(\r\n [TrackId] IN...
12 table sqlite_stat1 sqlite_stat1 864 CREATE TABLE sqlite_stat1(tbl,idx,stat)

Ahh, that’s better!

Querying a Table

Now that we know which tables exist, we can query them! Let’s return all the albums in the albums table.

We go through the same process as before (as we closed our cursor object we need to create a new one):

  1. Create our cursor using con.cursor()
  2. Write our SQL command as a string
  3. Execute the SQL code using the .execute() method (cursor.execute()) on our cursor object
  4. Use the .fetchall() method (cursor.fetchall()) to actually return the data requested
  5. Close the conneciton made by the cursor (cursor.close())

The FROM function allows us to choose which table to query. LIMIT 20 says to limit what is returned to 20 rows.

#create the cursor instance
cursor = con.cursor()
#create the SQL string
get_albums = '''
        SELECT *
        FROM albums
        LIMIT 20;
        '''
#execute the query
albums = cursor.execute(get_albums)
#grab the results
album_results = albums.fetchall()
#close the cursor
cursor.close()
#check the results
album_results
[(1, 'For Those About To Rock We Salute You', 1),
 (2, 'Balls to the Wall', 2),
 (3, 'Restless and Wild', 2),
 (4, 'Let There Be Rock', 1),
 (5, 'Big Ones', 3),
 (6, 'Jagged Little Pill', 4),
 (7, 'Facelift', 5),
 (8, 'Warner 25 Anos', 6),
 (9, 'Plays Metallica By Four Cellos', 7),
 (10, 'Audioslave', 8),
 (11, 'Out Of Exile', 8),
 (12, 'BackBeat Soundtrack', 9),
 (13, 'The Best Of Billy Cobham', 10),
 (14, 'Alcohol Fueled Brewtality Live! [Disc 1]', 11),
 (15, 'Alcohol Fueled Brewtality Live! [Disc 2]', 11),
 (16, 'Black Sabbath', 12),
 (17, 'Black Sabbath Vol. 4 (Remaster)', 12),
 (18, 'Body Count', 13),
 (19, 'Chemical Wedding', 14),
 (20, 'The Best Of Buddy Guy - The Millenium Collection', 15)]

Ok, when just reading a table, let’s go with pd.read_sql() instead since it returns things in a nicer format!

album_results = pd.read_sql(get_albums, con)
album_results
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
3 4 Let There Be Rock 1
4 5 Big Ones 3
5 6 Jagged Little Pill 4
6 7 Facelift 5
7 8 Warner 25 Anos 6
8 9 Plays Metallica By Four Cellos 7
9 10 Audioslave 8
10 11 Out Of Exile 8
11 12 BackBeat Soundtrack 9
12 13 The Best Of Billy Cobham 10
13 14 Alcohol Fueled Brewtality Live! [Disc 1] 11
14 15 Alcohol Fueled Brewtality Live! [Disc 2] 11
15 16 Black Sabbath 12
16 17 Black Sabbath Vol. 4 (Remaster) 12
17 18 Body Count 13
18 19 Chemical Wedding 14
19 20 The Best Of Buddy Guy - The Millenium Collection 15

Process for doing a CRUD activity

(Create, Read, Update, Delete)

If we wanted to do something other than just read a table, we need to follow our general structure from above:

  • Create a connection using sqlite3.connect(path)
  • Create a cursor object associated with the connection
  • Write an SQL query as a string
  • Use cursor.execute() (or cursor.executemany()) to execute the SQL
  • Close the cursor object

Common SQL commands

  • CREATE TABLE - creates a new table
  • INSERT INTO - adds records to a table
  • UPDATE - modify existing records
  • DELETE FROM - deletes data
  • DROP TABLE - removes a table
  • SELECT - reads data (use fetchone(), fetchall(), or the returned value as an iterator but we’ll use pd.read_sql() for simple read commands)

Also many important joins we’ll cover shortly.

Let’s go through a few of these actions to show how we can write our SQL code and execute it! Note that SQL code is not case sensitive.

Create a Table

Let’s start by creating a table. We write the SQL code and specify the name of the table and and variables we want to create.

  • We specify the type of data the variable will hold after naming the variable
  • Check this reference for possible data types!
#create the cursor instance
cursor = con.cursor()
#write our SQL to create a table
#here we also specify two variables: album and artist along with the type of data they'll hold
ct = """
    CREATE TABLE IF NOT EXISTS justin_music (
        album TEXT,
        artist TEXT);
"""
#execute the SQL code
cursor.execute(ct)
<sqlite3.Cursor at 0x7a23d8f825c0>

Let’s check that it worked before we close the connection!

pd.read_sql('''
        SELECT *
        FROM justin_music;
        ''', con)
album artist

Cool, now let’s close the connection.

cursor = con.cursor()

You can see we’ll do this process a lot. It may be best to create a function to help us out. (Modified from https://realpython.com/python-sql-libraries/)

def execute_query(connection, query):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        print("Query executed successfully")
    except Error as e:
        print(f"The error '{e}' occurred")
    cursor.close()

The function will execute the query for us and close the cursor.

execute_query(con, ct)
Query executed successfully

INSERT INTO

Now lets add some data to our table using INSERT INTO. We need to pass the table name, optionally the columns we’ll specify, and the values to fill with.

create_rows = """
       INSERT INTO
           justin_music (album, artist)
       VALUES
           ("Sixteen Stone", "Bush"),
           ("Listener Supported", "Dave Matthews Band"),
           ("Chris Stapleton", "Traveler"),
           ("1989", "Taylor Swift");
"""
execute_query(con, create_rows)
Query executed successfully

Let’s check if it worked!

pd.read_sql("SELECT * FROM justin_music", con)
album artist
0 Sixteen Stone Bush
1 Listener Supported Dave Matthews Band
2 Chris Stapleton Traveler
3 1989 Taylor Swift

Now let’s add a row with only an artist.

create_row = """
       INSERT INTO
           justin_music (artist)
       VALUES
           ("Taylor Swift");
"""

execute_query(con, create_row)
Query executed successfully

This inserts a None into the data frame but note that the value in the original table is actually a NULL.

pd.read_sql("SELECT * FROM justin_music", con)
album artist
0 Sixteen Stone Bush
1 Listener Supported Dave Matthews Band
2 Chris Stapleton Traveler
3 1989 Taylor Swift
4 None Taylor Swift

UPDATE

We can use update to modify an existing row of data. We use

  • UPDATE to specify the table
  • SET to specify the values of the columns
  • WHERE to describe which records (rows) to modify

We can use AND, OR, IN, and other logical operators in our SQL code! (See Operators and Parse-Affecting Attributes)

mod_row = """
    UPDATE justin_music
    SET album = "Red (Taylor's Version)", artist = "Taylor Swift"
    WHERE (album is null) AND (artist = "Taylor Swift");
"""

execute_query(con, mod_row)
Query executed successfully
pd.read_sql("SELECT * FROM justin_music", con)
album artist
0 Sixteen Stone Bush
1 Listener Supported Dave Matthews Band
2 Chris Stapleton Traveler
3 1989 Taylor Swift
4 Red (Taylor's Version) Taylor Swift

DELETE FROM

We can also delete certain rows. We use

  • DELETE FROM to specify the table
  • a WHERE condition to determine what is deleted
del_row = """
    DELETE FROM justin_music
    WHERE artist = "Taylor Swift";
    """

execute_query(con, del_row)
Query executed successfully
pd.read_sql("SELECT * FROM justin_music", con)
album artist
0 Sixteen Stone Bush
1 Listener Supported Dave Matthews Band
2 Chris Stapleton Traveler

DROP TABLE

And of course we can remove an entire table using DROP TABLE. Just specify the table to remove!

execute_query(con, "DROP TABLE justin_music")
Query executed successfully
pd.read_sql(get_schema, con)
type name tbl_name rootpage sql
0 table albums albums 2 CREATE TABLE "albums"\r\n(\r\n [AlbumId] IN...
1 table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
2 table artists artists 4 CREATE TABLE "artists"\r\n(\r\n [ArtistId] ...
3 table customers customers 5 CREATE TABLE "customers"\r\n(\r\n [Customer...
4 table employees employees 8 CREATE TABLE "employees"\r\n(\r\n [Employee...
5 table genres genres 10 CREATE TABLE "genres"\r\n(\r\n [GenreId] IN...
6 table invoices invoices 11 CREATE TABLE "invoices"\r\n(\r\n [InvoiceId...
7 table invoice_items invoice_items 13 CREATE TABLE "invoice_items"\r\n(\r\n [Invo...
8 table media_types media_types 15 CREATE TABLE "media_types"\r\n(\r\n [MediaT...
9 table playlists playlists 16 CREATE TABLE "playlists"\r\n(\r\n [Playlist...
10 table playlist_track playlist_track 17 CREATE TABLE "playlist_track"\r\n(\r\n [Pla...
11 table tracks tracks 20 CREATE TABLE "tracks"\r\n(\r\n [TrackId] IN...
12 table sqlite_stat1 sqlite_stat1 864 CREATE TABLE sqlite_stat1(tbl,idx,stat)

My table is gone :(

SELECT

As a statistician, you’d spend most of your time pulling data of interest and then summarizing it, running models, etc.

SELECT is the workhorse for that task! As we are just reading in data from the database, we can just use pd.read_sql().

With SELECT we

  • SELECT the column(s) we want
  • FROM the table of interest
  • Specifing the records (rows) of interest with a WHERE

First, let’s see a way to look at all columns in a table.

pd.read_sql('SELECT * FROM albums', con)
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
3 4 Let There Be Rock 1
4 5 Big Ones 3
... ... ... ...
342 343 Respighi:Pines of Rome 226
343 344 Schubert: The Late String Quartets & String Qu... 272
344 345 Monteverdi: L'Orfeo 273
345 346 Mozart: Chamber Music 274
346 347 Koyaanisqatsi (Soundtrack from the Motion Pict... 275

347 rows × 3 columns

Now let’s select the Title and ArtistID columns only. LIMIT is a good way to only return (up to) a certain number of results.

pd.read_sql("SELECT Title, artistID FROM albums LIMIT 10;", con)
Title ArtistId
0 For Those About To Rock We Salute You 1
1 Balls to the Wall 2
2 Restless and Wild 2
3 Let There Be Rock 1
4 Big Ones 3
5 Jagged Little Pill 4
6 Facelift 5
7 Warner 25 Anos 6
8 Plays Metallica By Four Cellos 7
9 Audioslave 8

WHERE allows us to specify conditions on rows.

pd.read_sql("SELECT * FROM albums WHERE artistID = 2 LIMIT 10;", con)
AlbumId Title ArtistId
0 2 Balls to the Wall 2
1 3 Restless and Wild 2

Clean Up

If we want to save any changes made to our database, we need to do a commit() to the connection via con.commit(). This saves the changes made and releases any locks on the data. Other connections to the database can only see changes you’ve made if you commit them.

When you are done working you should also close your connection. This is done with the close() method.

con.close()

We skimmed over some of the syntax that SQL follows. You should read over the arithmetic operators, comparison operators, and the other operators linked on the left of that page.


Recap

Database is a term usually used to represent a set of related tables and the software used to interact with them.

In python you can connect to most any kind of database. We looked at connecting to the easy, light-weight, sqlite database.

The most important actions are given by CRUD (Create, Read, Update, and Delete).

We’ll get much more into python when we start dealing with pyspark!

If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!

If you are on Google Colab, head back to our course website for our next lesson!