R Exercise: Joining Tables

1. Introduction

We can merge two datasets in R by using family of join function in dplyr package. The datasets must have same column names on which the merging happens. By utilizing these gunctions, we can improve our dataset in a more efficient manner for analyses. These functions are similar to database join operation in SQL. According to the goal of the merge, there are six different join types:
1. Left Join
2. Right Join
3. Inner Join
4. Semi Join
5. Full Join
6. Anti Join
We’ll now continue with an example of using each of these join types on our two tables nemaly superheroes, and publishers. These exercises are made based on the tutorial. The first dataset superheroes contains the information about fictional characters from graphic novels. There are seven different characters from different publishers. The variables in this dataset are name, alignment, gender, and the publisher. The second dataset publishers is about publishers and the foundation year of each. These two datasets help us to construct a perfect example to introduce the join functions since they have a common column and they have both similar and disparate values. After having a glimpse on the datasets, we will continue with the specific examples for each join function type.

We first create the datasets.

library(tidyverse) 

superheroes <- tibble::tribble(
       ~name, ~alignment,  ~gender,          ~publisher,
   "Magneto",      "bad",   "male",            "Marvel",
     "Storm",     "good", "female",            "Marvel",
  "Mystique",      "bad", "female",            "Marvel",
    "Batman",     "good",   "male",                "DC",
     "Joker",      "bad",   "male",                "DC",
  "Catwoman",      "bad", "female",                "DC",
   "Hellboy",     "good",   "male", "Dark Horse Comics"
  )

publishers <- tibble::tribble(
  ~publisher, ~yr_founded,
        "DC",       1934L,
    "Marvel",       1939L,
     "Image",       1992L
  )

We investigate the first dataset.

superheroes %>% glimpse()

## Rows: 7
## Columns: 4
## $ name      <chr> "Magneto", "Storm", "Mystique", "Batman", "Joker", "Catwo...
## $ alignment <chr> "bad", "good", "bad", "good", "bad", "bad", "good"
## $ gender    <chr> "male", "female", "female", "male", "male", "female", "male"
## $ publisher <chr> "Marvel", "Marvel", "Marvel", "DC", "DC", "DC", "Dark Hor...

print(superheroes)

## # A tibble: 7 x 4
##   name     alignment gender publisher        
##   <chr>    <chr>     <chr>  <chr>            
## 1 Magneto  bad       male   Marvel           
## 2 Storm    good      female Marvel           
## 3 Mystique bad       female Marvel           
## 4 Batman   good      male   DC               
## 5 Joker    bad       male   DC               
## 6 Catwoman bad       female DC               
## 7 Hellboy  good      male   Dark Horse Comics

We investigate the second dataset.

publishers %>% glimpse()

## Rows: 3
## Columns: 2
## $ publisher  <chr> "DC", "Marvel", "Image"
## $ yr_founded <int> 1934, 1939, 1992

print(publishers)

## # A tibble: 3 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 DC              1934
## 2 Marvel          1939
## 3 Image           1992

2. Join functions

We call datasets left and right, or X and Y where X usually denotes left and Y usually denotes right. We may use different positions for the datasets according to our purpose however in these examples we will use mainly superheroes dataset as the left dataset.

2.1. Left Join and Right Join

left_join(x, y) means return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join. We basically get x = superheroes back, but with the addition of variable yr_founded, which is unique to y = publishers. Hellboy, whose publisher does not appear in y = publishers, has an NA for yr_founded. In the left join, all rows belong to x are preserved and the only relevant columns of y is used. If there is no matching of the x values on the y table, this rows returns NA values. The logic behind the right join is the same as the left join, therefore in order to prevent confusion, using only one of them is suggested.

left_join(superheroes, publishers, by = "publisher")

## # A tibble: 7 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA

2.2. Inner Join

inner_join(x, y) means return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join. We lose Hellboy in the join because, although he appears in x = superheroes, his publisher Dark Horse Comics does not appear in y = publishers. The join result has all variables from x = superheroes plus yr_founded, from y. Inner join is similar to left join. The only difference is that this command does not return the rows with NA values.

inner_join(superheroes, publishers, by = "publisher")

## # A tibble: 6 x 5
##   name     alignment gender publisher yr_founded
##   <chr>    <chr>     <chr>  <chr>          <int>
## 1 Magneto  bad       male   Marvel          1939
## 2 Storm    good      female Marvel          1939
## 3 Mystique bad       female Marvel          1939
## 4 Batman   good      male   DC              1934
## 5 Joker    bad       male   DC              1934
## 6 Catwoman bad       female DC              1934

2.3. Semi Join

semi_join(x, y) means return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join. We get a similar result as with inner_join() but the join result contains only the variables originally found in x = superheroes.

semi_join(superheroes, publishers, by = "publisher")

## # A tibble: 6 x 4
##   name     alignment gender publisher
##   <chr>    <chr>     <chr>  <chr>    
## 1 Magneto  bad       male   Marvel   
## 2 Storm    good      female Marvel   
## 3 Mystique bad       female Marvel   
## 4 Batman   good      male   DC       
## 5 Joker    bad       male   DC       
## 6 Catwoman bad       female DC

2.4. Full Join

full_join(x, y) means return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. This is a mutating join. We get all rows of x = superheroes plus a new row from y = publishers , containing the publisher Image. We get all variables from x = superheroes and all variables from y = publishers . Any row that derives solely from one table or the other carries NA s in the variables found only in the other table.

full_join(superheroes, publishers, by = "publisher")

## # A tibble: 8 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA
## 8 <NA>     <NA>      <NA>   Image                   1992

2.5. Anti Join

anti_join(x, y) means return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join. We keep only Hellboy now and do not get yr_founded.

anti_join(superheroes, publishers, by = "publisher")

## # A tibble: 1 x 4
##   name    alignment gender publisher        
##   <chr>   <chr>     <chr>  <chr>            
## 1 Hellboy good      male   Dark Horse Comics

anti_join(publishers, superheroes, by="publisher")

## # A tibble: 1 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 Image           1992

3. References

The examples and the data are taken from the notes of STAT 545 by Jenny Bryan. In order to visit the page, you can click here.