To get information about the join()
function, which belongs to tidyverse
package, this exercises are made by using tutorial.
There are some join functions:
left_join()
and right_join()
inner_join()
semi_join()
full_join()
anti_join()
library(tidyverse) ## dplyr provides the join functions
superheroes <- tibble::tribble(
~name, ~alignment, ~gender, ~publisher,
"Magneto", "bad", "male", "Marvel",
"Storm", "good", "female", "Marvel",
"Mystique", "bad", "female", "Marvel",
"Batman", "good", "male", "DC",
"Joker", "bad", "male", "DC",
"Catwoman", "bad", "female", "DC",
"Hellboy", "good", "male", "Dark Horse Comics"
)
publishers <- tibble::tribble(
~publisher, ~yr_founded,
"DC", 1934L,
"Marvel", 1939L,
"Image", 1992L
)
superheroes
## # A tibble: 7 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Magneto bad male Marvel
## 2 Storm good female Marvel
## 3 Mystique bad female Marvel
## 4 Batman good male DC
## 5 Joker bad male DC
## 6 Catwoman bad female DC
## 7 Hellboy good male Dark Horse Comics
publishers
## # A tibble: 3 x 2
## publisher yr_founded
## <chr> <int>
## 1 DC 1934
## 2 Marvel 1939
## 3 Image 1992
There are two different data sets, i.e., superheroes and publishers. By using these two data set, we can obtain a data set which is the combination of the superheroes and publishers. To make this merge process, we need to use common data of data sets that is publisher.
In these examples:
x = superheroes, and y=publishers
In the left join, the all rows of X are preserved and the only relevant columns of y is used. If there is no matching of the x values on the y table, this rows returns NA
values.
Left Join: All rows of X are preserved, only relevant rows Y and multiply rows if there are matching.
Note that, the logic behind the right join is same as the left join. For this reason, to prevent confusion, only one of them can be used.
left_join(superheroes, publishers, by="publisher")
## # A tibble: 7 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
Inner Join: Only rows with common value are returned and rws rae multiplied if there are multiple matchings
inner_join(superheroes, publishers, by="publisher")
## # A tibble: 6 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
Inner join is similar with the left join, but there is difference in the NA values. If there is an NA value in the rows, inner join does not return this rows in the merged data frame.
Semi Join: Very similar to inner join but without columns from Y
semi_join(superheroes, publishers, by="publisher")
## # A tibble: 6 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Magneto bad male Marvel
## 2 Storm good female Marvel
## 3 Mystique bad female Marvel
## 4 Batman good male DC
## 5 Joker bad male DC
## 6 Catwoman bad female DC
This function provide the merged of the data sets but, it does not include the NA value of the X data set, it removes this data and moreover,semi join function does not show column of the y, publisher, data set.
Full join returns all rows and columns from both X and Y and both multiple mathcings and compensates for missing
full_join(superheroes, publishers, by = "publisher")
## # A tibble: 8 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
## 8 <NA> <NA> <NA> Image 1992
Full join, merged all data of the superheroes and publishers. For this reason, we can see more NA values in the merged data frame.
Anti join returns all rows from X which do not have information (based on key column) in Y and returns only columns from X.
anti_join(superheroes, publishers, by = "publisher")
## # A tibble: 1 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Hellboy good male Dark Horse Comics
By using anti join function, we can obtain values, which have no matching in the second data frame.. In this example, Dark Horse Comics does not take part in the publishers data set, it is not macth.
anti_join(publishers, superheroes, by="publisher")
## # A tibble: 1 x 2
## publisher yr_founded
## <chr> <int>
## 1 Image 1992
In the second anti join function, we use publishers as x values. For this reason, the function shows that the value which does not take part in the y data set, i.e., superheroes in this trial.
This examples are made according to the information in the IE48A class and also the data used from the study of Stat545.