Introduction

An SQL join clause - corresponding to a join operation in relational algebra - combines columns from one or more tables in a relational database. It creates a set that can be saved as a table or used as it is. A JOIN is a means for combining columns from one (self-join) or more tables by using values common to each. Relational databases are usually normalized to eliminate duplication of information such as when entity types have one-to-many relationships.

Overview of Datasets superheroes and publishers

We have 2 datasets. Out first dataset superheroes contains information about fictional characters from graphic novels. There are 7 characters from different publishers. Information about those characters include name, alignment (good or bad), gender (male or female), and the publishers where they belong to.

superheroes %>% glimpse()
## Rows: 7
## Columns: 4
## $ name      <chr> "Magneto", "Storm", "Mystique", "Batman", "Joker", "Catwo...
## $ alignment <chr> "bad", "good", "bad", "good", "bad", "bad", "good"
## $ gender    <chr> "male", "female", "female", "male", "male", "female", "male"
## $ publisher <chr> "Marvel", "Marvel", "Marvel", "DC", "DC", "DC", "Dark Hor...
print(superheroes)
## # A tibble: 7 x 4
##   name     alignment gender publisher        
##   <chr>    <chr>     <chr>  <chr>            
## 1 Magneto  bad       male   Marvel           
## 2 Storm    good      female Marvel           
## 3 Mystique bad       female Marvel           
## 4 Batman   good      male   DC               
## 5 Joker    bad       male   DC               
## 6 Catwoman bad       female DC               
## 7 Hellboy  good      male   Dark Horse Comics

Our second dataset is about publishers and the year they are founded. These two datasets are the perfect minimal example to introduce data table joins since they have a common (key) column and they have both similar and disparate values.

publishers %>% glimpse()
## Rows: 3
## Columns: 2
## $ publisher  <chr> "DC", "Marvel", "Image"
## $ yr_founded <int> 1934, 1939, 1992
print(publishers)
## # A tibble: 3 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 DC              1934
## 2 Marvel          1939
## 3 Image           1992

Joins

We call datasets left and right (hence left join and right join), or X and Y (usually X denotes left and Y denotes right). Here we will use superheroes and publishers in different positions but usually superheroes will be on the left.

Left Join

Keep all columns on left, add matching values from right, NA for missing, multiply right if multiple combinations (similarly for right_join).

left_join(superheroes, publishers, by = "publisher")
## # A tibble: 7 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA

Inner Join

Keep both columns, only common values (drop NA).

inner_join(superheroes, publishers, by = "publisher")
## # A tibble: 6 x 5
##   name     alignment gender publisher yr_founded
##   <chr>    <chr>     <chr>  <chr>          <int>
## 1 Magneto  bad       male   Marvel          1939
## 2 Storm    good      female Marvel          1939
## 3 Mystique bad       female Marvel          1939
## 4 Batman   good      male   DC              1934
## 5 Joker    bad       male   DC              1934
## 6 Catwoman bad       female DC              1934

Semi Join

Semi Join is very similar to inner join but without columns from Y.

semi_join(superheroes, publishers, by = "publisher")
## # A tibble: 6 x 4
##   name     alignment gender publisher
##   <chr>    <chr>     <chr>  <chr>    
## 1 Magneto  bad       male   Marvel   
## 2 Storm    good      female Marvel   
## 3 Mystique bad       female Marvel   
## 4 Batman   good      male   DC       
## 5 Joker    bad       male   DC       
## 6 Catwoman bad       female DC

Full Join

Keep all columns on both, NA for missing, use both right and left, multiply both if multiple combinations.

full_join(superheroes, publishers, by = "publisher")
## # A tibble: 8 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA
## 8 <NA>     <NA>      <NA>   Image                   1992

Anti Join

Keep all columns on left, retain rows if there is not a match.

anti_join(superheroes, publishers, by = "publisher")
## # A tibble: 1 x 4
##   name    alignment gender publisher        
##   <chr>   <chr>     <chr>  <chr>            
## 1 Hellboy good      male   Dark Horse Comics

References

End of the document. Visit my Progress Journal