Manipulating Data with `tidyr`

Published

2025-06-05

We now have a good handle on common actions we want to take on our data frames. However, we’ve been treating our data as though it is already in long format where each row consists of one observation and each column one variable.

This isn’t always the case! Sometimes we have wide format data where we may have more than one observation in a given row.

A data frame in long format is shown. This data frame has one observation per row - here patient, weight and year. This is transformed to wide format where multiple observations exist in one row - here patient, weight in 2018, weight in 2019, and weight in 2020.

You might see wide data if you deal with pivot tables in excel. It is often a nicer way to display data, but almost all of the plotting, summarizing, and modeling we do in statistics expects data to be in long form. Luckily, the tidyr package gives us nice functionality for switching between these two forms!

`tidyr` Package

This pacakge allows us to easily manipulate data via

pivot_longer() - lengthens data by increasing the number of rows and decreasing the number of columns
- Most important as analysis methods often prefer this form
pivot_wider() - widens data by increasing the number of columns and decreasing the number of rows

We’ll also look at a couple other functions from tidyr that can be useful.

`pivot_longer()`

Consider the data set called cityTemps.txt available via the URL below.

library(readr)
temps_data <- read_table(file = "https://www4.stat.ncsu.edu/~online/datasets/cityTemps.txt")


── Column specification ────────────────────────────────────────────────────────
cols(
  city = col_character(),
  sun = col_double(),
  mon = col_double(),
  tue = col_double(),
  wed = col_double(),
  thr = col_double(),
  fri = col_double(),
  sat = col_double()
)

temps_data

# A tibble: 6 × 8
  city        sun   mon   tue   wed   thr   fri   sat
  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 atlanta      81    87    83    79    88    91    94
2 baltimore    73    75    70    78    73    75    79
3 charlotte    82    80    75    82    83    88    93
4 denver       72    71    67    68    72    71    58
5 ellington    51    42    47    52    55    56    59
6 frankfort    70    70    72    70    74    74    79

This data is in wide format as more than one observation on a city is in each row.

Switch to ‘Long’ form with pivot_longer(). Checking the help, the major arguments are:
- cols = columns to pivot to longer format (cols = 2:8)
- names_to = new name(s) for columns created (names_to = "day")
- values_to = new name(s) for data values (values_to = "temp")

library(tidyr)
temps_data |>
  pivot_longer(cols = 2:8, 
               names_to = "day", 
               values_to = "temp")

# A tibble: 42 × 3
   city      day    temp
   <chr>     <chr> <dbl>
 1 atlanta   sun      81
 2 atlanta   mon      87
 3 atlanta   tue      83
 4 atlanta   wed      79
 5 atlanta   thr      88
 6 atlanta   fri      91
 7 atlanta   sat      94
 8 baltimore sun      73
 9 baltimore mon      75
10 baltimore tue      70
# ℹ 32 more rows

That’s better! Now each row has one observation in it. Recall we had a lot of functionality for selecting columns within the tidyverse. That holds here as well!

temps_data |>
  pivot_longer(cols = sun:sat, 
               names_to = "day", 
               values_to = "temp")

# A tibble: 42 × 3
   city      day    temp
   <chr>     <chr> <dbl>
 1 atlanta   sun      81
 2 atlanta   mon      87
 3 atlanta   tue      83
 4 atlanta   wed      79
 5 atlanta   thr      88
 6 atlanta   fri      91
 7 atlanta   sat      94
 8 baltimore sun      73
 9 baltimore mon      75
10 baltimore tue      70
# ℹ 32 more rows

`pivot_wider()`

Occasionally we’ll want to make our data wider for display purposes. We can make this switch to ‘Wide’ form with pivot_wider(). There are two major arguments we must specify:

names_from = column(s) to get the names used in the output columns
values_from = column(s) to get the cell values from

Let’s consider our batting data set from the dplyr notes.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(Lahman)

Warning: package 'Lahman' was built under R version 4.4.2

batting_tbl <- as_tibble(Batting)
batting_tbl

# A tibble: 113,799 × 22
   playerID  yearID stint teamID lgID      G    AB     R     H   X2B   X3B    HR
   <chr>      <int> <int> <fct>  <fct> <int> <int> <int> <int> <int> <int> <int>
 1 aardsda01   2004     1 SFN    NL       11     0     0     0     0     0     0
 2 aardsda01   2006     1 CHN    NL       45     2     0     0     0     0     0
 3 aardsda01   2007     1 CHA    AL       25     0     0     0     0     0     0
 4 aardsda01   2008     1 BOS    AL       47     1     0     0     0     0     0
 5 aardsda01   2009     1 SEA    AL       73     0     0     0     0     0     0
 6 aardsda01   2010     1 SEA    AL       53     0     0     0     0     0     0
 7 aardsda01   2012     1 NYA    AL        1     0     0     0     0     0     0
 8 aardsda01   2013     1 NYN    NL       43     0     0     0     0     0     0
 9 aardsda01   2015     1 ATL    NL       33     1     0     0     0     0     0
10 aaronha01   1954     1 ML1    NL      122   468    58   131    27     6    13
# ℹ 113,789 more rows
# ℹ 10 more variables: RBI <int>, SB <int>, CS <int>, BB <int>, SO <int>,
#   IBB <int>, HBP <int>, SH <int>, SF <int>, GIDP <int>

We may want to get just the data for one team (say the Pirates) and display each players number of hits across the years 2018 to 2020.

Let’s subset the data to get just the pirates (teamID == "PIT")
Then we’ll select only their hits and year columns (playerID, H, and yearID)
Then we need to pivot that data set wider so that we have the year across the top (names_from), the players as the rows, and the entries as the hits (values_from)

batting_tbl |>
  filter(yearID %in% 2018:2020, teamID == "PIT") |>
  select(playerID, yearID, H) |>
  pivot_wider(names_from = yearID, values_from = "H")

# A tibble: 96 × 4
   playerID  `2019` `2020` `2018`
   <chr>      <int>  <int>  <int>
 1 agrazda01      2     NA     NA
 2 alforan01     NA      3     NA
 3 anderta01     NA     NA      0
 4 archech01      4     NA      2
 5 baronst01      2     NA     NA
 6 bashlty01     NA      0     NA
 7 belljo02     146     44    131
 8 bostich01     NA     NA      0
 9 braulst01     14      0      3
10 brubajt01     NA      0     NA
# ℹ 86 more rows

Great! You can see that missing values are filled in for those that didn’t play in a given year. Let’s subset this to remove any rows with missing values (so we only get players that played for the pirates in all three years).

The tidyr function drop_na() does this exact thing for us!

batting_tbl |>
  filter(yearID %in% 2018:2020, teamID == "PIT") |>
  select(playerID, yearID, H) |>
  pivot_wider(names_from = yearID, values_from = "H") |>
  drop_na()

# A tibble: 17 × 4
   playerID  `2019` `2020` `2018`
   <chr>      <int>  <int>  <int>
 1 belljo02     146     44    131
 2 braulst01     14      0      3
 3 burdini01      0      0      0
 4 crickky01      0      0      0
 5 felizmi01      0      0      0
 6 fraziad01    154     48     88
 7 holmecl01      0      0      0
 8 kelake01       0      0      0
 9 moranco01    129     44    115
10 musgrjo01      8      0      5
11 neverdo01      0      0      0
12 newmake01    152     35     19
13 osunajo01     69     16     24
14 polangr01     37     24    117
15 rodriri05      0      0      0
16 stallja01     50     31      8
17 willitr01      6      0      5

Let’s also remove those with 0 hits:

batting_tbl |>
  filter(yearID %in% 2018:2020, teamID == "PIT", H > 0) |>
  select(playerID, yearID, H) |>
  pivot_wider(names_from = yearID, values_from = "H") |>
  drop_na()

# A tibble: 7 × 4
  playerID  `2019` `2020` `2018`
  <chr>      <int>  <int>  <int>
1 belljo02     146     44    131
2 fraziad01    154     48     88
3 moranco01    129     44    115
4 newmake01    152     35     19
5 osunajo01     69     16     24
6 polangr01     37     24    117
7 stallja01     50     31      8

The column names 2018, 2019, and 2020 are non-standard names as they start with a number. That’s not ideal so let’s rename those too!

batting_tbl |>
  filter(yearID %in% 2018:2020, teamID == "PIT", H > 0) |>
  select(playerID, yearID, H) |>
  pivot_wider(names_from = yearID, values_from = "H") |>
  drop_na() |>
  rename('year2018' = `2018`,
         'year2019' = `2019`,
         'year2020' = `2020`)

# A tibble: 7 × 4
  playerID  year2019 year2020 year2018
  <chr>        <int>    <int>    <int>
1 belljo02       146       44      131
2 fraziad01      154       48       88
3 moranco01      129       44      115
4 newmake01      152       35       19
5 osunajo01       69       16       24
6 polangr01       37       24      117
7 stallja01       50       31        8

Would be better with actual player names (we’ll learn about how to combine this data set with another one that has their actual names soon!)

batting_tbl |>
  filter(yearID %in% 2018:2020, teamID == "PIT", H > 0) |>
  select(playerID, yearID, H) |>
  pivot_wider(names_from = yearID, values_from = "H") |>
  drop_na() |>
  rename('year2018' = `2018`,
         'year2019' = `2019`,
         'year2020' = `2020`) |>
  dplyr::inner_join(select(People, playerID, nameFirst, nameLast)) |>
  select(nameFirst, nameLast, everything())

Joining with `by = join_by(playerID)`

# A tibble: 7 × 6
  nameFirst nameLast  playerID  year2019 year2020 year2018
  <chr>     <chr>     <chr>        <int>    <int>    <int>
1 Josh      Bell      belljo02       146       44      131
2 Adam      Frazier   fraziad01      154       48       88
3 Colin     Moran     moranco01      129       44      115
4 Kevin     Newman    newmake01      152       35       19
5 Jose      Osuna     osunajo01       69       16       24
6 Gregory   Polanco   polangr01       37       24      117
7 Jacob     Stallings stallja01       50       31        8

Column Manipulations with `tidyr`

Separate a column using separate_wider_delim() (a few other variants exist as well)
Combine two columns with unite()

A data frame is visualized with a date column. Dates are in the form MM/DD/YYYY. This column is then separated into three columns: month, day, and year.

Consider data set on air pollution in Chicago

chicago_data <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/Chicago.csv")

Rows: 1461 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): city, date, season
dbl (8): X, death, temp, dewpoint, pm10, o3, time, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

chicago_data

# A tibble: 1,461 × 11
       X city  date      death  temp dewpoint  pm10    o3  time season  year
   <dbl> <chr> <chr>     <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <chr>  <dbl>
 1  3654 chic  1/1/1997    137  36      37.5  13.1   5.66  3654 winter  1997
 2  3655 chic  1/2/1997    123  45      47.2  41.9   5.53  3655 winter  1997
 3  3656 chic  1/3/1997    127  40      38    27.0   6.29  3656 winter  1997
 4  3657 chic  1/4/1997    146  51.5    45.5  25.1   7.54  3657 winter  1997
 5  3658 chic  1/5/1997    102  27      11.2  15.3  20.8   3658 winter  1997
 6  3659 chic  1/6/1997    127  17       5.75  9.36 14.9   3659 winter  1997
 7  3660 chic  1/7/1997    116  16       7    20.2  11.9   3660 winter  1997
 8  3661 chic  1/8/1997    118  19      17.8  33.1   8.68  3661 winter  1997
 9  3662 chic  1/9/1997    148  26      24    12.1  13.4   3662 winter  1997
10  3663 chic  1/10/1997   121  16       5.38 24.8  10.4   3663 winter  1997
# ℹ 1,451 more rows

Although we saw that we should treat date variables as date objects (say from lubridate), we could manually separate out the dates we see here. We can notice that the month comes first followed by a /, then the day, a /, and the year.
- We can split on the delimiter /
- The arguments to give separate_wider_delim() are:
  - cols = the columns we want
  - delim = the delimiter
  - names = new names for the split variables
  - cols_remove - binary, whether to remove the original column or not

chicago_data |>
  separate_wider_delim(cols = date, 
                       delim = "/", 
                       names = c("Month", "Day", "Year"), 
                       cols_remove = FALSE)

# A tibble: 1,461 × 14
       X city  Month Day   Year  date     death  temp dewpoint  pm10    o3  time
   <dbl> <chr> <chr> <chr> <chr> <chr>    <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
 1  3654 chic  1     1     1997  1/1/1997   137  36      37.5  13.1   5.66  3654
 2  3655 chic  1     2     1997  1/2/1997   123  45      47.2  41.9   5.53  3655
 3  3656 chic  1     3     1997  1/3/1997   127  40      38    27.0   6.29  3656
 4  3657 chic  1     4     1997  1/4/1997   146  51.5    45.5  25.1   7.54  3657
 5  3658 chic  1     5     1997  1/5/1997   102  27      11.2  15.3  20.8   3658
 6  3659 chic  1     6     1997  1/6/1997   127  17       5.75  9.36 14.9   3659
 7  3660 chic  1     7     1997  1/7/1997   116  16       7    20.2  11.9   3660
 8  3661 chic  1     8     1997  1/8/1997   118  19      17.8  33.1   8.68  3661
 9  3662 chic  1     9     1997  1/9/1997   148  26      24    12.1  13.4   3662
10  3663 chic  1     10    1997  1/10/19…   121  16       5.38 24.8  10.4   3663
# ℹ 1,451 more rows
# ℹ 2 more variables: season <chr>, year <dbl>

Nice! These are character strings so we might want to turn them into numbers but, again, we’d really want to use date type data for these anyway.

unite() allows us to combine two columns into one
- Perhaps we want a new column with the date and the season together (for display purposes)
- We just pass unite() the name of the new column (col =), the columns we want to combine, and the separator to use (sep =)

chicago_data |>
  unite(col = "season_date", season, date, sep = ": ") |>
  select(season_date, everything())

# A tibble: 1,461 × 10
   season_date           X city  death  temp dewpoint  pm10    o3  time  year
   <chr>             <dbl> <chr> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
 1 winter: 1/1/1997   3654 chic    137  36      37.5  13.1   5.66  3654  1997
 2 winter: 1/2/1997   3655 chic    123  45      47.2  41.9   5.53  3655  1997
 3 winter: 1/3/1997   3656 chic    127  40      38    27.0   6.29  3656  1997
 4 winter: 1/4/1997   3657 chic    146  51.5    45.5  25.1   7.54  3657  1997
 5 winter: 1/5/1997   3658 chic    102  27      11.2  15.3  20.8   3658  1997
 6 winter: 1/6/1997   3659 chic    127  17       5.75  9.36 14.9   3659  1997
 7 winter: 1/7/1997   3660 chic    116  16       7    20.2  11.9   3660  1997
 8 winter: 1/8/1997   3661 chic    118  19      17.8  33.1   8.68  3661  1997
 9 winter: 1/9/1997   3662 chic    148  26      24    12.1  13.4   3662  1997
10 winter: 1/10/1997  3663 chic    121  16       5.38 24.8  10.4   3663  1997
# ℹ 1,451 more rows

Recap!

pivot_wider() & pivot_longer() great for reshaping data
unite() & separate_wider_*() nice for dealing with columns
tidyr Cheat Sheet

Use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!

tidyr Package

pivot_longer()

pivot_wider()

Column Manipulations with tidyr

Recap!

`tidyr` Package

`pivot_longer()`

`pivot_wider()`

Column Manipulations with `tidyr`