We can merge two datasets in R by using family of join function in dplyr package. The datasets must have same column names on which the merging happens. By utilizing these gunctions, we can improve our dataset in a more efficient manner for analyses. These functions are similar to database join operation in SQL. According to the goal of the merge, there are six different join types:
1. Left Join
2. Right Join
3. Inner Join
4. Semi Join
5. Full Join
6. Anti Join
We’ll now continue with an example of using each of these join types on our two tables nemaly superheroes, and publishers. These exercises are made based on the tutorial. The first dataset superheroes contains the information about fictional characters from graphic novels. There are seven different characters from different publishers. The variables in this dataset are name, alignment, gender, and the publisher. The second dataset publishers is about publishers and the foundation year of each. These two datasets help us to construct a perfect example to introduce the join functions since they have a common column and they have both similar and disparate values. After having a glimpse on the datasets, we will continue with the specific examples for each join function type.
We first create the datasets.
library(tidyverse)
tibble::tribble(
superheroes <-~name, ~alignment, ~gender, ~publisher,
"Magneto", "bad", "male", "Marvel",
"Storm", "good", "female", "Marvel",
"Mystique", "bad", "female", "Marvel",
"Batman", "good", "male", "DC",
"Joker", "bad", "male", "DC",
"Catwoman", "bad", "female", "DC",
"Hellboy", "good", "male", "Dark Horse Comics"
)
tibble::tribble(
publishers <-~publisher, ~yr_founded,
"DC", 1934L,
"Marvel", 1939L,
"Image", 1992L
)
We investigate the first dataset.
%>% glimpse() superheroes
## Rows: 7
## Columns: 4
## $ name <chr> "Magneto", "Storm", "Mystique", "Batman", "Joker", "Catwo...
## $ alignment <chr> "bad", "good", "bad", "good", "bad", "bad", "good"
## $ gender <chr> "male", "female", "female", "male", "male", "female", "male"
## $ publisher <chr> "Marvel", "Marvel", "Marvel", "DC", "DC", "DC", "Dark Hor...
print(superheroes)
## # A tibble: 7 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Magneto bad male Marvel
## 2 Storm good female Marvel
## 3 Mystique bad female Marvel
## 4 Batman good male DC
## 5 Joker bad male DC
## 6 Catwoman bad female DC
## 7 Hellboy good male Dark Horse Comics
We investigate the second dataset.
%>% glimpse() publishers
## Rows: 3
## Columns: 2
## $ publisher <chr> "DC", "Marvel", "Image"
## $ yr_founded <int> 1934, 1939, 1992
print(publishers)
## # A tibble: 3 x 2
## publisher yr_founded
## <chr> <int>
## 1 DC 1934
## 2 Marvel 1939
## 3 Image 1992
We call datasets left and right, or X and Y where X usually denotes left and Y usually denotes right. We may use different positions for the datasets according to our purpose however in these examples we will use mainly superheroes dataset as the left dataset.
left_join(x, y)
means return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join. We basically get x = superheroes
back, but with the addition of variable yr_founded
, which is unique to y = publishers
. Hellboy, whose publisher does not appear in y = publishers
, has an NA
for yr_founded
. In the left join, all rows belong to x are preserved and the only relevant columns of y is used. If there is no matching of the x values on the y table, this rows returns NA
values. The logic behind the right join is the same as the left join, therefore in order to prevent confusion, using only one of them is suggested.
left_join(superheroes, publishers, by = "publisher")
## # A tibble: 7 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
inner_join(x, y)
means return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join. We lose Hellboy in the join because, although he appears in x = superheroes
, his publisher Dark Horse Comics does not appear in y = publishers
. The join result has all variables from x = superheroes
plus yr_founded
, from y. Inner join is similar to left join. The only difference is that this command does not return the rows with NA
values.
inner_join(superheroes, publishers, by = "publisher")
## # A tibble: 6 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
semi_join(x, y)
means return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join. We get a similar result as with inner_join()
but the join result contains only the variables originally found in x = superheroes
.
semi_join(superheroes, publishers, by = "publisher")
## # A tibble: 6 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Magneto bad male Marvel
## 2 Storm good female Marvel
## 3 Mystique bad female Marvel
## 4 Batman good male DC
## 5 Joker bad male DC
## 6 Catwoman bad female DC
full_join(x, y)
means return all rows and all columns from both x and y. Where there are not matching values, returns NA
for the one missing. This is a mutating join. We get all rows of x = superheroes
plus a new row from y = publishers
, containing the publisher Image. We get all variables from x = superheroes
and all variables from y = publishers
. Any row that derives solely from one table or the other carries NA
s in the variables found only in the other table.
full_join(superheroes, publishers, by = "publisher")
## # A tibble: 8 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
## 8 <NA> <NA> <NA> Image 1992
anti_join(x, y)
means return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join. We keep only Hellboy now and do not get yr_founded
.
anti_join(superheroes, publishers, by = "publisher")
## # A tibble: 1 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Hellboy good male Dark Horse Comics
anti_join(publishers, superheroes, by="publisher")
## # A tibble: 1 x 2
## publisher yr_founded
## <chr> <int>
## 1 Image 1992
The examples and the data are taken from the notes of STAT 545 by Jenny Bryan. In order to visit the page, you can click here.