class: center, middle, inverse, title-slide .title[ # SQL Style Joins ] .author[ ### Justin Post ] --- layout: true <div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div> --- # Relational Databases - Often want to combine data from multiple tables to summarize/model <img src="data:image/png;base64,#img/dbDiagram.jpg" width="600px" style="display: block; margin: auto;" /> --- - The common types of joins we do are given below! (Using `dplyr` not the particular SQL language.) <img src="data:image/png;base64,#img/joins.PNG" width="325px" style="display: block; margin: auto;" /> --- - We often need some different logic to make our joins work. That exists in `dplyr` as well! <img src="data:image/png;base64,#img/joins2.PNG" width="400px" style="display: block; margin: auto;" /> --- # Joins - Let's go through our common joins! - Inner Join: Returns records with matching keys in both tables <img src="data:image/png;base64,#img/inner_join.png" width="350px" style="display: block; margin: auto;" /> --- # Inner Join Make our connection and look at the tables ```r library(DBI) library(dplyr) ``` ```r con <- dbConnect(RSQLite::SQLite(), "data/lahman.db") dbListTables(con) ``` ``` ## [1] "AllstarFull" "Appearances" "AwardsManagers" ## [4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers" ## [7] "Batting" "BattingPost" "CollegePlaying" ## [10] "Fielding" "FieldingOF" "FieldingOFsplit" ## [13] "FieldingPost" "HallOfFame" "HomeGames" ## [16] "LahmanData" "Managers" "ManagersHalf" ## [19] "Parks" "People" "Pitching" ## [22] "PitchingPost" "Salaries" "Schools" ## [25] "SeriesPost" "Teams" "TeamsFranchises" ## [28] "TeamsHalf" "battingLabels" "fieldingLabels" ## [31] "pitchingLabels" ``` --- # Inner Join Combine the `Batting` table and the `Pitching` table on common variables ```r #note this code differs slightly from what was in the video! inner_join(tbl(con, "Batting") |> filter(yearID == 2000), tbl(con, "Pitching") |> filter(yearID == 2000), by = c("playerID", "stint", "teamID", "lgID")) |> collect() ``` ``` ## # A tibble: 677 x 48 ## playerID yearID.x stint teamID lgID G.x AB R.x H.x X2B X3B HR.x ## <chr> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> ## 1 abbotpa~ 2000 1 SEA AL 35 5 1 2 1 0 0 ## 2 aceveju~ 2000 1 MIL NL 62 1 1 0 0 0 0 ## 3 adamste~ 2000 1 LAN NL 66 2 0 0 0 0 0 ## 4 aguilri~ 2000 1 CHN NL 54 0 0 0 0 0 0 ## 5 aldresc~ 2000 1 PHI NL 23 0 0 0 0 0 0 ## # i 672 more rows ## # i 36 more variables: RBI <int>, SB <int>, CS <int>, BB.x <int>, SO.x <int>, ## # IBB.x <int>, HBP.x <int>, SH.x <int>, SF.x <int>, GIDP.x <int>, ## # yearID.y <int>, W <int>, L <int>, G.y <int>, GS <int>, CG <int>, SHO <int>, ## # SV <int>, IPouts <int>, H.y <int>, ER <int>, HR.y <int>, BB.y <int>, ## # SO.y <int>, BAOpp <dbl>, ERA <dbl>, IBB.y <int>, WP <int>, HBP.y <int>, ## # BK <int>, BFP <int>, GF <int>, R.y <int>, SH.y <int>, SF.y <int>, ... ``` --- # Can Write SQL code instead - (I'm not a great SQL programmer) ```r tbl(con, sql( "SELECT p.playerID as pplayerID, p.stint as pstint, p.teamID as pteamID, p.lgID as plgID, p.G as pG, p.HR as pHR, p.BB as pBB, p.SO as pSO, p.HBP as pHBP, p.R as pR, p.SF as pSF, p.GIDP as pGIDP, p.IBB as pIBB, p.SH as pSH, p.W, p.L, p.GS, p.CG, p.SHO, p.SV, p.IPouts, p.ER, p.BAopp, p.ERA, p.WP, p.BK, p.BFP, p.GF, b.* FROM Pitching as p INNER JOIN Batting as b on ((p.playerID = b.playerID) AND (pstint = b.stint) AND (pteamID = b.teamID) AND (plgID = b.lgID)) WHERE b.yearID = 2000 AND p.yearID = 2000" )) ``` --- # Joins - Left Join: Returns all records from the 'left' table and any matching records from the 'right' table <img src="data:image/png;base64,#img/left_join.png" width="350px" style="display: block; margin: auto;" /> --- # Left Join: Return left table and matching right records ```r left_join(tbl(con, "Batting") |> filter(yearID == 2000), tbl(con, "Pitching") |> filter(yearID == 2000), by = c("playerID", "stint", "teamID", "lgID")) |> collect() |> select(playerID, ERA, everything()) ``` ``` ## # A tibble: 1,384 x 48 ## playerID ERA yearID.x stint teamID lgID G.x AB R.x H.x X2B X3B ## <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> ## 1 abbotje~ NA 2000 1 CHA AL 80 215 31 59 15 1 ## 2 abbotku~ NA 2000 1 NYN NL 79 157 22 34 7 1 ## 3 abbotpa~ 4.22 2000 1 SEA AL 35 5 1 2 1 0 ## 4 abreubo~ NA 2000 1 PHI NL 154 576 103 182 42 10 ## 5 aceveju~ 3.81 2000 1 MIL NL 62 1 1 0 0 0 ## # i 1,379 more rows ## # i 36 more variables: HR.x <int>, RBI <int>, SB <int>, CS <int>, BB.x <int>, ## # SO.x <int>, IBB.x <int>, HBP.x <int>, SH.x <int>, SF.x <int>, GIDP.x <int>, ## # yearID.y <int>, W <int>, L <int>, G.y <int>, GS <int>, CG <int>, SHO <int>, ## # SV <int>, IPouts <int>, H.y <int>, ER <int>, HR.y <int>, BB.y <int>, ## # SO.y <int>, BAOpp <dbl>, IBB.y <int>, WP <int>, HBP.y <int>, BK <int>, ## # BFP <int>, GF <int>, R.y <int>, SH.y <int>, SF.y <int>, GIDP.y <int> ``` --- # Joins - Right Join: Returns all records from the 'right' table and any matching records from the 'left' table <img src="data:image/png;base64,#img/right_join.png" width="350px" style="display: block; margin: auto;" /> --- # Right Join - Just do a left join and switch the table (or use `right_join()`) ```r right_join(tbl(con, "Batting") |> filter(yearID == 2000), tbl(con, "Pitching") |> filter(yearID == 2000), by = c("playerID", "stint", "teamID", "lgID")) |> collect() |> select(playerID, ERA, everything()) ``` ``` ## # A tibble: 677 x 48 ## playerID ERA yearID.x stint teamID lgID G.x AB R.x H.x X2B X3B ## <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> ## 1 abbotpa~ 4.22 2000 1 SEA AL 35 5 1 2 1 0 ## 2 aceveju~ 3.81 2000 1 MIL NL 62 1 1 0 0 0 ## 3 adamste~ 3.52 2000 1 LAN NL 66 2 0 0 0 0 ## 4 aguilri~ 4.91 2000 1 CHN NL 54 0 0 0 0 0 ## 5 aldresc~ 5.75 2000 1 PHI NL 23 0 0 0 0 0 ## # i 672 more rows ## # i 36 more variables: HR.x <int>, RBI <int>, SB <int>, CS <int>, BB.x <int>, ## # SO.x <int>, IBB.x <int>, HBP.x <int>, SH.x <int>, SF.x <int>, GIDP.x <int>, ## # yearID.y <int>, W <int>, L <int>, G.y <int>, GS <int>, CG <int>, SHO <int>, ## # SV <int>, IPouts <int>, H.y <int>, ER <int>, HR.y <int>, BB.y <int>, ## # SO.y <int>, BAOpp <dbl>, IBB.y <int>, WP <int>, HBP.y <int>, BK <int>, ## # BFP <int>, GF <int>, R.y <int>, SH.y <int>, SF.y <int>, GIDP.y <int> ``` --- # Joins - Outer Join: Returns all records when there is a match from the 'left' or 'right' table (also called a **full join**) <img src="data:image/png;base64,#img/outer_join.png" width="350px" style="display: block; margin: auto;" /> --- # Outer Join: Return all matches from both tables (All players are in the Batting table even if they have no at bats!) ```r full_join(tbl(con, "Batting") |> filter(yearID == 2000), tbl(con, "Pitching") |> filter(yearID == 2000), by = c("playerID", "stint", "teamID", "lgID")) |> collect() ``` ``` ## # A tibble: 1,384 x 48 ## playerID yearID.x stint teamID lgID G.x AB R.x H.x X2B X3B HR.x ## <chr> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> ## 1 abbotje~ 2000 1 CHA AL 80 215 31 59 15 1 3 ## 2 abbotku~ 2000 1 NYN NL 79 157 22 34 7 1 6 ## 3 abbotpa~ 2000 1 SEA AL 35 5 1 2 1 0 0 ## 4 abreubo~ 2000 1 PHI NL 154 576 103 182 42 10 25 ## 5 aceveju~ 2000 1 MIL NL 62 1 1 0 0 0 0 ## # i 1,379 more rows ## # i 36 more variables: RBI <int>, SB <int>, CS <int>, BB.x <int>, SO.x <int>, ## # IBB.x <int>, HBP.x <int>, SH.x <int>, SF.x <int>, GIDP.x <int>, ## # yearID.y <int>, W <int>, L <int>, G.y <int>, GS <int>, CG <int>, SHO <int>, ## # SV <int>, IPouts <int>, H.y <int>, ER <int>, HR.y <int>, BB.y <int>, ## # SO.y <int>, BAOpp <dbl>, ERA <dbl>, IBB.y <int>, WP <int>, HBP.y <int>, ## # BK <int>, BFP <int>, GF <int>, R.y <int>, SH.y <int>, SF.y <int>, ... ``` --- # Other Joins Those are the major joins covered by `dplyr`. Lots of other joins out there! - [See here for examples](https://www.sqlitetutorial.net/sqlite-join/)! + The right sidebar has more than the standard joins. - Also ways to do [if then else type logic](https://www.sqlitetutorial.net/sqlite-case/), [intersections](https://www.sqlitetutorial.net/sqlite-intersect/), etc. in SQL - Can do basic [summaries using SQL](https://www.sqlitetutorial.net/sqlite-avg/) as well (including [grouping](https://www.sqlitetutorial.net/sqlite-group-by/)), but we'll just use `dplyr` for that! --- # Recap - Joins are combining two tables - inner_join - match records that appear in both tables - left/right join - full outer join