Join First Exercises

To get information about the join() function, which belongs to tidyverse package, this exercises are made by using tutorial.

There are some join functions:

left_join() and right_join()
inner_join()
semi_join()
full_join()
anti_join()

library(tidyverse) ## dplyr provides the join functions

superheroes <- tibble::tribble(
       ~name, ~alignment,  ~gender,          ~publisher,
   "Magneto",      "bad",   "male",            "Marvel",
     "Storm",     "good", "female",            "Marvel",
  "Mystique",      "bad", "female",            "Marvel",
    "Batman",     "good",   "male",                "DC",
     "Joker",      "bad",   "male",                "DC",
  "Catwoman",      "bad", "female",                "DC",
   "Hellboy",     "good",   "male", "Dark Horse Comics"
  )

publishers <- tibble::tribble(
  ~publisher, ~yr_founded,
        "DC",       1934L,
    "Marvel",       1939L,
     "Image",       1992L
  )

superheroes

## # A tibble: 7 x 4
##   name     alignment gender publisher        
##   <chr>    <chr>     <chr>  <chr>            
## 1 Magneto  bad       male   Marvel           
## 2 Storm    good      female Marvel           
## 3 Mystique bad       female Marvel           
## 4 Batman   good      male   DC               
## 5 Joker    bad       male   DC               
## 6 Catwoman bad       female DC               
## 7 Hellboy  good      male   Dark Horse Comics

publishers

## # A tibble: 3 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 DC              1934
## 2 Marvel          1939
## 3 Image           1992

There are two different data sets, i.e., superheroes and publishers. By using these two data set, we can obtain a data set which is the combination of the superheroes and publishers. To make this merge process, we need to use common data of data sets that is publisher.

In these examples:

x = superheroes, and y=publishers

1. Left Join

In the left join, the all rows of X are preserved and the only relevant columns of y is used. If there is no matching of the x values on the y table, this rows returns NA values.

Left Join: All rows of X are preserved, only relevant rows Y and multiply rows if there are matching.

Note that, the logic behind the right join is same as the left join. For this reason, to prevent confusion, only one of them can be used.

left_join(superheroes, publishers, by="publisher")

## # A tibble: 7 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA

2. Inner Join

Inner Join: Only rows with common value are returned and rws rae multiplied if there are multiple matchings

inner_join(superheroes, publishers, by="publisher")

## # A tibble: 6 x 5
##   name     alignment gender publisher yr_founded
##   <chr>    <chr>     <chr>  <chr>          <int>
## 1 Magneto  bad       male   Marvel          1939
## 2 Storm    good      female Marvel          1939
## 3 Mystique bad       female Marvel          1939
## 4 Batman   good      male   DC              1934
## 5 Joker    bad       male   DC              1934
## 6 Catwoman bad       female DC              1934

Inner join is similar with the left join, but there is difference in the NA values. If there is an NA value in the rows, inner join does not return this rows in the merged data frame.

3. Semi Join

Semi Join: Very similar to inner join but without columns from Y

semi_join(superheroes, publishers, by="publisher")

## # A tibble: 6 x 4
##   name     alignment gender publisher
##   <chr>    <chr>     <chr>  <chr>    
## 1 Magneto  bad       male   Marvel   
## 2 Storm    good      female Marvel   
## 3 Mystique bad       female Marvel   
## 4 Batman   good      male   DC       
## 5 Joker    bad       male   DC       
## 6 Catwoman bad       female DC

This function provide the merged of the data sets but, it does not include the NA value of the X data set, it removes this data and moreover,semi join function does not show column of the y, publisher, data set.

4. Full Join

Full join returns all rows and columns from both X and Y and both multiple mathcings and compensates for missing

full_join(superheroes, publishers, by = "publisher")

## # A tibble: 8 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA
## 8 <NA>     <NA>      <NA>   Image                   1992

Full join, merged all data of the superheroes and publishers. For this reason, we can see more NA values in the merged data frame.

5. Anti Join

Anti join returns all rows from X which do not have information (based on key column) in Y and returns only columns from X.

anti_join(superheroes, publishers, by = "publisher")

## # A tibble: 1 x 4
##   name    alignment gender publisher        
##   <chr>   <chr>     <chr>  <chr>            
## 1 Hellboy good      male   Dark Horse Comics

By using anti join function, we can obtain values, which have no matching in the second data frame.. In this example, Dark Horse Comics does not take part in the publishers data set, it is not macth.

anti_join(publishers, superheroes, by="publisher")

## # A tibble: 1 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 Image           1992

In the second anti join function, we use publishers as x values. For this reason, the function shows that the value which does not take part in the y data set, i.e., superheroes in this trial.

6. Reference

This examples are made according to the information in the IE48A class and also the data used from the study of Stat545.