Relational data: a quick review

Relational data consists of multiple tables that, when combined, answer research questions. The relations between the tables, not just the individual tables, are the important element. Each relation is defined between a pair of tables, and more complex structures can be built up from relations among more than two tables. In many situations, data is stored in a relational format because doing otherwise would introduce redundancy and use unnecessary storage space.

This data structure requires relational verbs to combine data across tables. Mutating joins add new variables to one data frame from matching observations in another, whereas filtering joins filter observations from one data frame based on whether or not they match an observation in the other table.

superheroes and publishers

Let’s review how these different types of joining operations work with relational data on comic books. Load the rcfss library, which includes two data frames containing data on comic books.

library(tidyverse)
library(rcfss)

superheroes
## # A tibble: 7 x 4
##   name     alignment gender publisher    
##   <chr>    <chr>     <chr>  <chr>        
## 1 Magneto  bad       male   Marvel       
## 2 Storm    good      female Marvel       
## 3 Mystique bad       female Marvel       
## 4 Batman   good      male   DC           
## 5 Joker    bad       male   DC           
## 6 Catwoman bad       female DC           
## 7 Sabrina  good      female Archie Comics
publishers
## # A tibble: 3 x 2
##   publisher yr_founded
##   <chr>          <dbl>
## 1 DC              1934
## 2 Marvel          1939
## 3 Image           1992
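
If the rcfss package is not installed, the two data frames can be rebuilt by hand. The sketch below copies the values from the printed output above using tribble(); it assumes only that the tibble package is available.

```r
library(tibble)

# Values copied from the printed output above
superheroes <- tribble(
  ~name,      ~alignment, ~gender,  ~publisher,
  "Magneto",  "bad",      "male",   "Marvel",
  "Storm",    "good",     "female", "Marvel",
  "Mystique", "bad",      "female", "Marvel",
  "Batman",   "good",     "male",   "DC",
  "Joker",    "bad",      "male",   "DC",
  "Catwoman", "bad",      "female", "DC",
  "Sabrina",  "good",     "female", "Archie Comics"
)

publishers <- tribble(
  ~publisher, ~yr_founded,
  "DC",       1934,
  "Marvel",   1939,
  "Image",    1992
)

# Quick check that the shapes match the originals: 7 x 4 and 3 x 2
dim(superheroes)
dim(publishers)
```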

Would it make sense to store these two data frames in the same tibble? No! This is because each data frame contains substantively different information:

  • superheroes contains data on superheroes
  • publishers contains data on publishers

The units of analysis are completely different. Just as it made sense to split Minard’s data into two separate data frames, it also makes sense to store these separately. That said, depending on the type of analysis you seek to perform, it can make sense to join the data frames together temporarily. How should we join them? Well, it depends on the question you want to answer. Let’s look at the results of several different join operations.

Mutating joins

Inner join

inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.
(ijsp <- inner_join(x = superheroes, y = publishers))
## Joining, by = "publisher"
## # A tibble: 6 x 5
##   name     alignment gender publisher yr_founded
##   <chr>    <chr>     <chr>  <chr>          <dbl>
## 1 Magneto  bad       male   Marvel          1939
## 2 Storm    good      female Marvel          1939
## 3 Mystique bad       female Marvel          1939
## 4 Batman   good      male   DC              1934
## 5 Joker    bad       male   DC              1934
## 6 Catwoman bad       female DC              1934

We lose Sabrina in the join because, although she appears in x = superheroes, her publisher Archie Comics does not appear in y = publishers. The join result has all variables from x = superheroes plus yr_founded, from y.
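
The comic book data has no repeated keys in y, so it does not show what happens when there are multiple matches. Here is a minimal sketch with toy data invented for illustration (the tables x and y and their columns are not part of rcfss). The key "a" appears once in x and twice in y, so the single x row is repeated once per match.

```r
library(dplyr)
library(tibble)

# Toy tables, made up for this illustration
x <- tibble(key = c("a", "b"), x_val = c(1, 2))
y <- tibble(key = c("a", "a", "c"), y_val = c(10, 20, 30))

# "a" matches two rows of y, so it appears twice in the result;
# "b" (only in x) and "c" (only in y) are dropped
matched <- inner_join(x, y, by = "key")
matched
```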

Left join

left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.
(ljsp <- left_join(x = superheroes, y = publishers))
## Joining, by = "publisher"
## # A tibble: 7 x 5
##   name     alignment gender publisher     yr_founded
##   <chr>    <chr>     <chr>  <chr>              <dbl>
## 1 Magneto  bad       male   Marvel              1939
## 2 Storm    good      female Marvel              1939
## 3 Mystique bad       female Marvel              1939
## 4 Batman   good      male   DC                  1934
## 5 Joker    bad       male   DC                  1934
## 6 Catwoman bad       female DC                  1934
## 7 Sabrina  good      female Archie Comics         NA

We basically get x = superheroes back, but with the addition of the variable yr_founded, which is unique to y = publishers. Sabrina, whose publisher does not appear in y = publishers, has an NA for yr_founded.
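
The "Joining, by" message appears because dplyr picked the key from the column names the two tables share. Here is a minimal sketch of naming the key explicitly with the by argument and then pulling out the rows that found no match. The data frames are rebuilt inline (values copied from the output above) so the block runs without rcfss; the name unmatched is just for illustration.

```r
library(dplyr)
library(tibble)

superheroes <- tribble(
  ~name,      ~alignment, ~gender,  ~publisher,
  "Magneto",  "bad",      "male",   "Marvel",
  "Storm",    "good",     "female", "Marvel",
  "Mystique", "bad",      "female", "Marvel",
  "Batman",   "good",     "male",   "DC",
  "Joker",    "bad",      "male",   "DC",
  "Catwoman", "bad",      "female", "DC",
  "Sabrina",  "good",     "female", "Archie Comics"
)
publishers <- tribble(
  ~publisher, ~yr_founded,
  "DC",       1934,
  "Marvel",   1939,
  "Image",    1992
)

# Naming the key explicitly silences the "Joining, by" message
ljsp <- left_join(superheroes, publishers, by = "publisher")

# Rows of x without a match in y carry NA in the columns that come from y
unmatched <- filter(ljsp, is.na(yr_founded))
unmatched$name
```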

Right join

right_join(x, y): Return all rows from y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.
(rjsp <- right_join(x = superheroes, y = publishers))
## Joining, by = "publisher"
## # A tibble: 7 x 5
##   name     alignment gender publisher yr_founded
##   <chr>    <chr>     <chr>  <chr>          <dbl>
## 1 Magneto  bad       male   Marvel          1939
## 2 Storm    good      female Marvel          1939
## 3 Mystique bad       female Marvel          1939
## 4 Batman   good      male   DC              1934
## 5 Joker    bad       male   DC              1934
## 6 Catwoman bad       female DC              1934
## 7 <NA>     <NA>      <NA>   Image           1992

We basically get y = publishers back, but with the addition of the variables name, alignment, and gender, which are unique to x = superheroes. Image, which did not publish any of the characters in superheroes, has an NA for each of the new variables.

We could also accomplish virtually the same thing using left_join() by reversing the order of the data frames in the function:

left_join(x = publishers, y = superheroes)
## Joining, by = "publisher"
## # A tibble: 7 x 5
##   publisher yr_founded name     alignment gender
##   <chr>          <dbl> <chr>    <chr>     <chr> 
## 1 DC              1934 Batman   good      male  
## 2 DC              1934 Joker    bad       male  
## 3 DC              1934 Catwoman bad       female
## 4 Marvel          1939 Magneto  bad       male  
## 5 Marvel          1939 Storm    good      female
## 6 Marvel          1939 Mystique bad       female
## 7 Image           1992 <NA>     <NA>      <NA>  

Doing so returns the same basic data frame, with the column orders reversed. right_join() is not used as commonly as left_join(), but works well in a piped operation when you perform several functions on x but then want to join it with y and only keep rows that appear in y.
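
The piped use case can be sketched as follows. This is only an illustration: the filter on alignment is an arbitrary choice, the name good_by_publisher is made up, and the data frames are rebuilt inline so the block runs without rcfss.

```r
library(dplyr)
library(tibble)

superheroes <- tribble(
  ~name,      ~alignment, ~gender,  ~publisher,
  "Magneto",  "bad",      "male",   "Marvel",
  "Storm",    "good",     "female", "Marvel",
  "Mystique", "bad",      "female", "Marvel",
  "Batman",   "good",     "male",   "DC",
  "Joker",    "bad",      "male",   "DC",
  "Catwoman", "bad",      "female", "DC",
  "Sabrina",  "good",     "female", "Archie Comics"
)
publishers <- tribble(
  ~publisher, ~yr_founded,
  "DC",       1934,
  "Marvel",   1939,
  "Image",    1992
)

# Subset x first, then keep every row of y whether or not it matched
good_by_publisher <- superheroes %>%
  filter(alignment == "good") %>%
  right_join(publishers, by = "publisher")

good_by_publisher
```

Sabrina is dropped because Archie Comics is not in publishers, while Image is kept with NA in the superhero columns.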

Full join

full_join(x, y): Return all rows and all columns from both x and y. Where there are no matching values, the missing side is filled with NA. This is a mutating join.
(fjsp <- full_join(x = superheroes, y = publishers))
## Joining, by = "publisher"
## # A tibble: 8 x 5
##   name     alignment gender publisher     yr_founded
##   <chr>    <chr>     <chr>  <chr>              <dbl>
## 1 Magneto  bad       male   Marvel              1939
## 2 Storm    good      female Marvel              1939
## 3 Mystique bad       female Marvel              1939
## 4 Batman   good      male   DC                  1934
## 5 Joker    bad       male   DC                  1934
## 6 Catwoman bad       female DC                  1934
## 7 Sabrina  good      female Archie Comics         NA
## 8 <NA>     <NA>      <NA>   Image               1992

We get all rows of x = superheroes plus a new row from y = publishers, containing the publisher “Image”. We get all variables from x = superheroes AND all variables from y = publishers. Any row that derives solely from one table or the other carries NAs in the variables found only in the other table.
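
Because unmatched rows carry NAs, those NAs can be used to label where each row came from. The sketch below assumes the original tables contain no missing values of their own (true for this example, but not in general); the column name source and its labels are made up for illustration, and the data frames are rebuilt inline so the block runs without rcfss.

```r
library(dplyr)
library(tibble)

superheroes <- tribble(
  ~name,      ~alignment, ~gender,  ~publisher,
  "Magneto",  "bad",      "male",   "Marvel",
  "Storm",    "good",     "female", "Marvel",
  "Mystique", "bad",      "female", "Marvel",
  "Batman",   "good",     "male",   "DC",
  "Joker",    "bad",      "male",   "DC",
  "Catwoman", "bad",      "female", "DC",
  "Sabrina",  "good",     "female", "Archie Comics"
)
publishers <- tribble(
  ~publisher, ~yr_founded,
  "DC",       1934,
  "Marvel",   1939,
  "Image",    1992
)

fjsp <- full_join(superheroes, publishers, by = "publisher")

# Label each row by the table(s) in which its publisher was found
fjsp <- fjsp %>%
  mutate(source = case_when(
    is.na(yr_founded) ~ "superheroes only",
    is.na(name)       ~ "publishers only",
    TRUE              ~ "both"
  ))

count(fjsp, source)
```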

Filtering joins

Semi join

semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y (potentially duplicating rows in x), whereas a semi join will never duplicate rows of x. This is a filtering join.
(sjsp <- semi_join(x = superheroes, y = publishers))
## Joining, by = "publisher"
## # A tibble: 6 x 4
##   name     alignment gender publisher
##   <chr>    <chr>     <chr>  <chr>    
## 1 Magneto  bad       male   Marvel   
## 2 Storm    good      female Marvel   
## 3 Mystique bad       female Marvel   
## 4 Batman   good      male   DC       
## 5 Joker    bad       male   DC       
## 6 Catwoman bad       female DC

We get a similar result as with inner_join(), but the join result contains only the variables originally found in x = superheroes.
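
With a single key column, a semi join behaves like filtering x on whether its key appears in y. The sketch below compares the two approaches and also checks that repeating rows in y does not repeat rows in the result. The names via_filter and publishers_twice are made up for illustration, and the data frames are rebuilt inline so the block runs without rcfss.

```r
library(dplyr)
library(tibble)

superheroes <- tribble(
  ~name,      ~alignment, ~gender,  ~publisher,
  "Magneto",  "bad",      "male",   "Marvel",
  "Storm",    "good",     "female", "Marvel",
  "Mystique", "bad",      "female", "Marvel",
  "Batman",   "good",     "male",   "DC",
  "Joker",    "bad",      "male",   "DC",
  "Catwoman", "bad",      "female", "DC",
  "Sabrina",  "good",     "female", "Archie Comics"
)
publishers <- tribble(
  ~publisher, ~yr_founded,
  "DC",       1934,
  "Marvel",   1939,
  "Image",    1992
)

sjsp <- semi_join(superheroes, publishers, by = "publisher")
via_filter <- filter(superheroes, publisher %in% publishers$publisher)

# Both approaches keep the same characters
identical(sjsp$name, via_filter$name)

# Even if y lists every publisher twice, each x row is returned only once
publishers_twice <- bind_rows(publishers, publishers)
nrow(semi_join(superheroes, publishers_twice, by = "publisher"))
```

An inner join against the doubled table would instead return each matched character once per matching row of y.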

Anti join

anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.
(ajsp <- anti_join(x = superheroes, y = publishers))
## Joining, by = "publisher"
## # A tibble: 1 x 4
##   name    alignment gender publisher    
##   <chr>   <chr>     <chr>  <chr>        
## 1 Sabrina good      female Archie Comics

We keep only Sabrina now (and do not get yr_founded).
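
Anti joins are handy for diagnosing mismatched keys before running a mutating join. Here is a minimal sketch that runs the anti join in both directions; the result names are made up for illustration, and the data frames are rebuilt inline so the block runs without rcfss.

```r
library(dplyr)
library(tibble)

superheroes <- tribble(
  ~name,      ~alignment, ~gender,  ~publisher,
  "Magneto",  "bad",      "male",   "Marvel",
  "Storm",    "good",     "female", "Marvel",
  "Mystique", "bad",      "female", "Marvel",
  "Batman",   "good",     "male",   "DC",
  "Joker",    "bad",      "male",   "DC",
  "Catwoman", "bad",      "female", "DC",
  "Sabrina",  "good",     "female", "Archie Comics"
)
publishers <- tribble(
  ~publisher, ~yr_founded,
  "DC",       1934,
  "Marvel",   1939,
  "Image",    1992
)

# Characters whose publisher is missing from publishers
heroes_without_publisher <- anti_join(superheroes, publishers, by = "publisher")

# Publishers with no characters in superheroes
publishers_without_heroes <- anti_join(publishers, superheroes, by = "publisher")

heroes_without_publisher$name
publishers_without_heroes$publisher
```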

Acknowledgments

Session Info

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.4 (2021-02-15)
##  os       macOS Big Sur 10.16         
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2021-05-25                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  blogdown      1.3     2021-04-14 [1] CRAN (R 4.0.2)
##  bookdown      0.22    2021-04-22 [1] CRAN (R 4.0.2)
##  bslib         0.2.5   2021-05-12 [1] CRAN (R 4.0.4)
##  cachem        1.0.5   2021-05-15 [1] CRAN (R 4.0.2)
##  callr         3.7.0   2021-04-20 [1] CRAN (R 4.0.2)
##  cli           2.5.0   2021-04-26 [1] CRAN (R 4.0.2)
##  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.2)
##  desc          1.3.0   2021-03-05 [1] CRAN (R 4.0.2)
##  devtools      2.4.1   2021-05-05 [1] CRAN (R 4.0.2)
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.0.2)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
##  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.0.2)
##  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
##  here          1.0.1   2020-12-13 [1] CRAN (R 4.0.2)
##  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.0.2)
##  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.2)
##  knitr         1.33    2021-04-24 [1] CRAN (R 4.0.2)
##  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.2)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
##  memoise       2.0.0   2021-01-26 [1] CRAN (R 4.0.2)
##  pkgbuild      1.2.0   2020-12-15 [1] CRAN (R 4.0.2)
##  pkgload       1.2.1   2021-04-06 [1] CRAN (R 4.0.2)
##  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
##  processx      3.5.2   2021-04-30 [1] CRAN (R 4.0.2)
##  ps            1.6.0   2021-02-28 [1] CRAN (R 4.0.2)
##  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
##  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
##  remotes       2.3.0   2021-04-01 [1] CRAN (R 4.0.2)
##  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.0.2)
##  rmarkdown     2.8     2021-05-07 [1] CRAN (R 4.0.2)
##  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.2)
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.0.2)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
##  stringi       1.6.1   2021-05-10 [1] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
##  testthat      3.0.2   2021-02-14 [1] CRAN (R 4.0.2)
##  usethis       2.0.1   2021-02-10 [1] CRAN (R 4.0.2)
##  withr         2.4.2   2021-04-18 [1] CRAN (R 4.0.2)
##  xfun          0.23    2021-05-15 [1] CRAN (R 4.0.2)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
## 
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library