An SQL join clause - corresponding to a join operation in relational algebra - combines columns from one or more tables in a relational database. It creates a set that can be saved as a table or used as it is. A JOIN is a means for combining columns from one (self-join) or more tables by using values common to each. Relational databases are usually normalized to eliminate duplication of information such as when entity types have one-to-many relationships.
superheroes
and publishers
We have 2 datasets. Out first dataset superheroes
contains information about fictional characters from graphic novels. There are 7 characters from different publishers. Information about those characters include name, alignment (good or bad), gender (male or female), and the publishers where they belong to.
superheroes %>% glimpse()
## Rows: 7
## Columns: 4
## $ name <chr> "Magneto", "Storm", "Mystique", "Batman", "Joker", "Catwo...
## $ alignment <chr> "bad", "good", "bad", "good", "bad", "bad", "good"
## $ gender <chr> "male", "female", "female", "male", "male", "female", "male"
## $ publisher <chr> "Marvel", "Marvel", "Marvel", "DC", "DC", "DC", "Dark Hor...
print(superheroes)
## # A tibble: 7 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Magneto bad male Marvel
## 2 Storm good female Marvel
## 3 Mystique bad female Marvel
## 4 Batman good male DC
## 5 Joker bad male DC
## 6 Catwoman bad female DC
## 7 Hellboy good male Dark Horse Comics
Our second dataset is about publishers
and the year they are founded. These two datasets are the perfect minimal example to introduce data table joins since they have a common (key) column and they have both similar and disparate values.
publishers %>% glimpse()
## Rows: 3
## Columns: 2
## $ publisher <chr> "DC", "Marvel", "Image"
## $ yr_founded <int> 1934, 1939, 1992
print(publishers)
## # A tibble: 3 x 2
## publisher yr_founded
## <chr> <int>
## 1 DC 1934
## 2 Marvel 1939
## 3 Image 1992
We call datasets left and right (hence left join and right join), or X and Y (usually X denotes left and Y denotes right). Here we will use superheroes
and publishers
in different positions but usually superheroes
will be on the left.
Keep all columns on left, add matching values from right, NA for missing, multiply right if multiple combinations (similarly for right_join).
left_join(superheroes, publishers, by = "publisher")
## # A tibble: 7 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
Keep both columns, only common values (drop NA).
inner_join(superheroes, publishers, by = "publisher")
## # A tibble: 6 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
Semi Join is very similar to inner join but without columns from Y.
semi_join(superheroes, publishers, by = "publisher")
## # A tibble: 6 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Magneto bad male Marvel
## 2 Storm good female Marvel
## 3 Mystique bad female Marvel
## 4 Batman good male DC
## 5 Joker bad male DC
## 6 Catwoman bad female DC
Keep all columns on both, NA for missing, use both right and left, multiply both if multiple combinations.
full_join(superheroes, publishers, by = "publisher")
## # A tibble: 8 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
## 8 <NA> <NA> <NA> Image 1992
Keep all columns on left, retain rows if there is not a match.
anti_join(superheroes, publishers, by = "publisher")
## # A tibble: 1 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Hellboy good male Dark Horse Comics
End of the document. Visit my Progress Journal