What is rvest?

rvest is an R package for parsing the HTML of web pages. Data is often not offered as a downloadable file such as an Excel sheet but is embedded in one or more pages of a web site, and rvest lets us extract it from there.

Main sources

rvest Usage Examples

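rvest belongs to the tidyverse family and re-exports the pipe operator (%>%), so its functions chain easily. Note that library(tidyverse) attaches only the core packages, so rvest has to be loaded separately.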

library(tidyverse)
library(rvest)

In the first example, we will extract tables from the sectoral development reports of BKM (Interbank Card Center) (see the website). Let’s start by fetching the web page.

the_url <- "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2020&filter_month=1&List=Listele"
html_obj <- read_html(the_url) 

When we print the object itself, we see that it is a parsed HTML document. We are on a good path.

html_obj

We can examine the HTML structure with the html_structure() function from xml2, the package rvest is built on. Its output is rather verbose, so run it and see for yourself.

html_obj %>% xml2::html_structure()

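We can extract the tables with html_table(), which returns one data frame per table on the page. If you get an error about inconsistent fields, use the argument fill=TRUE (rvest 1.0.0 and later always fill missing cells, so the argument only matters on older versions).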

html_obj %>% html_table(fill=TRUE)
## [[1]]
##                                 X1                               X2
## 1 SEÇİLEN AYA AİT SEKTÖREL GELİŞİM Seçilen Aya Ait Sektörel Gelişim
## 2 Seçilen Aya Ait Sektörel Gelişim                             <NA>
##                                 X3
## 1 Seçilen Aya Ait Sektörel Gelişim
## 2                             <NA>
## 
## [[2]]
##                                 X1
## 1 Seçilen Aya Ait Sektörel Gelişim
## 
## [[3]]
##                                                                                                                              X1
## 1 Yıl:\n                    2020201920182017201620152014201320122011Ay:\n                                                123456
## 
## [[4]]
##                                        X1                       X2
## 1                            İşyeri Grubu              İşlem Adedi
## 2                            İşyeri Grubu İşlem Adedi(Kredi Kartı)
## 3                          ARABA KİRALAMA                  341.271
## 4  ARAÇ KİRALAMA-SATIŞ/SERVİS/YEDEK PARÇA                3.497.791
## 5            BENZİN VE YAKIT İSTASYONLARI               27.488.661
## 6                      BIREYSEL EMEKLILIK                2.078.609
## 7                            ÇEŞİTLİ GIDA               33.777.181
## 8                      DOĞRUDAN PAZARLAMA                  627.712
## 9   EĞİTİM / KIRTASİYE / OFİS MALZEMELERİ                6.274.208
## 10   ELEKTRİK-ELEKTRONİK EŞYA, BİLGİSAYAR                8.809.733
## 11                      GİYİM VE AKSESUAR               26.750.805
## 12                            HAVAYOLLARI                2.013.912
## 13                      HİZMET SEKTÖRLERİ               25.890.561
## 14                   KAMU/VERGI ODEMELERI                7.879.506
## 15                              KONAKLAMA                1.863.395
## 16       KULÜP / DERNEK /SOSYAL HİZMETLER                1.353.113
## 17                KUMARHANE/İÇKİLİ YERLER                  567.888
## 18                             KUYUMCULAR                  846.168
## 19         MARKET VE ALIŞVERİŞ MERKEZLERİ              108.748.150
## 20                  MOBİLYA VE DEKORASYON                4.712.723
## 21                       MÜTEAHHİT İŞLERİ                  829.057
## 22        SAĞLIK/SAĞLIK ÜRÜNLERİ/KOZMETİK               17.637.257
## 23         SEYAHAT ACENTELERİ/TAŞIMACILIK                8.115.027
## 24                                SİGORTA                5.090.080
## 25                       TELEKOMÜNİKASYON               19.596.112
## 26  YAPI MALZEMELERİ, HIRDAVAT, NALBURİYE                3.314.053
## 27                                  YEMEK               49.032.939
## 28                                  DİĞER                8.444.282
## 29                                 TOPLAM              375.580.194
##                           X3                                           X4
## 1                İşlem Adedi                     İşlem Tutarı (Milyon TL)
## 2  İşlem Adedi (Banka Kartı) İşlem Tutarı \n                (Kredi Kartı)
## 3                     70.947                                       212,39
## 4                    969.202                                     2.675,77
## 5                 11.642.138                                     5.607,72
## 6                      2.043                                       771,85
## 7                 21.569.718                                     5.284,27
## 8                     60.381                                       728,13
## 9                  3.565.787                                     2.044,27
## 10                 2.609.689                                     4.184,70
## 11                12.487.879                                     5.669,53
## 12                   589.812                                     2.033,19
## 13                 8.073.696                                     5.513,96
## 14                 1.325.853                                     4.339,11
## 15                 1.047.125                                     1.273,61
## 16                   359.911                                       300,30
## 17                   502.004                                       118,88
## 18                   427.264                                     1.033,15
## 19                65.886.572                                    12.736,09
## 20                 1.881.289                                     2.240,39
## 21                   435.806                                       826,08
## 22                 9.096.864                                     3.095,37
## 23                 3.635.735                                     1.945,17
## 24                    37.618                                     3.734,59
## 25                 3.672.814                                     2.118,63
## 26                 1.261.101                                     2.726,61
## 27                46.378.312                                     2.994,79
## 28                 1.473.111                                     2.767,58
## 29               199.062.671                                    76.976,12
##                                                  X5
## 1                          İşlem Tutarı (Milyon TL)
## 2  İşlem Tutarı \n                    (Banka Kartı)
## 3                                             23,18
## 4                                            217,14
## 5                                          1.059,71
## 6                                              1,03
## 7                                          1.123,63
## 8                                             11,15
## 9                                            337,33
## 10                                           522,79
## 11                                         1.956,91
## 12                                           754,91
## 13                                           920,45
## 14                                           201,53
## 15                                           353,27
## 16                                            46,15
## 17                                            48,58
## 18                                           601,21
## 19                                         3.235,18
## 20                                           305,51
## 21                                            51,42
## 22                                           856,75
## 23                                           496,54
## 24                                            14,85
## 25                                           288,14
## 26                                           251,68
## 27                                         1.582,02
## 28                                           440,51
## 29                                        15.701,58
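
To make the shape of the html_table() result easier to see, here is a minimal sketch on a made-up page written inline, so it runs without network access. The page content and the names toy_page, tables and toy_df are assumptions for illustration only; they are not part of the BKM site.

```r
library(rvest)

# A hypothetical page with two tables, written inline (no network needed)
toy_page <- read_html('
  <html><body>
    <table>
      <tr><th>menu</th></tr>
      <tr><td>Home</td></tr>
    </table>
    <table>
      <tr><th>category</th><th>count</th></tr>
      <tr><td>FOOD</td><td>12</td></tr>
      <tr><td>FUEL</td><td>34</td></tr>
    </table>
  </body></html>
')

# html_table() on a whole document returns a list, one element per <table>
tables <- html_table(toy_page)
length(tables)

# Pick the table of interest by position, as the BKM example does with item 4
toy_df <- tables[[2]]
names(toy_df)
nrow(toy_df)
```

Selecting by position is simple but fragile: if the site adds or removes a table, the index changes, so it is worth checking the list again whenever the page layout might have changed.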

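The fourth item is the table we are after, but it is not up to our quality standard yet: the first two rows are header rows, the columns have generic names, and the numbers are stored as text in Turkish format. Let’s deploy dplyr functions to make it better.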

html_df <- read_html(the_url) %>% html_table(fill=TRUE) %>% `[[`(4) 

html_df %>% 
  # Since we do not have too many columns let's rename them manually 
  # number (num) or value (val) of transactions (txn) 
  # by credit card (cc) or debit card (dc)
  rename(category = 1, num_txn_cc = 2, num_txn_dc = 3, val_txn_cc = 4, val_txn_dc = 5) %>%
  # remove the first two rows because they are actually titles
  slice(-(1:2)) %>%
  # then convert every numeric value by using parse_number function from readr
  mutate(
    across(-category, 
           ~readr::parse_number(.,
              locale=readr::locale(decimal_mark=",",grouping_mark = ".")
            )
           )
    )
##                                  category num_txn_cc num_txn_dc val_txn_cc
## 1                          ARABA KİRALAMA     341271      70947     212.39
## 2  ARAÇ KİRALAMA-SATIŞ/SERVİS/YEDEK PARÇA    3497791     969202    2675.77
## 3            BENZİN VE YAKIT İSTASYONLARI   27488661   11642138    5607.72
## 4                      BIREYSEL EMEKLILIK    2078609       2043     771.85
## 5                            ÇEŞİTLİ GIDA   33777181   21569718    5284.27
## 6                      DOĞRUDAN PAZARLAMA     627712      60381     728.13
## 7   EĞİTİM / KIRTASİYE / OFİS MALZEMELERİ    6274208    3565787    2044.27
## 8    ELEKTRİK-ELEKTRONİK EŞYA, BİLGİSAYAR    8809733    2609689    4184.70
## 9                       GİYİM VE AKSESUAR   26750805   12487879    5669.53
## 10                            HAVAYOLLARI    2013912     589812    2033.19
## 11                      HİZMET SEKTÖRLERİ   25890561    8073696    5513.96
## 12                   KAMU/VERGI ODEMELERI    7879506    1325853    4339.11
## 13                              KONAKLAMA    1863395    1047125    1273.61
## 14       KULÜP / DERNEK /SOSYAL HİZMETLER    1353113     359911     300.30
## 15                KUMARHANE/İÇKİLİ YERLER     567888     502004     118.88
## 16                             KUYUMCULAR     846168     427264    1033.15
## 17         MARKET VE ALIŞVERİŞ MERKEZLERİ  108748150   65886572   12736.09
## 18                  MOBİLYA VE DEKORASYON    4712723    1881289    2240.39
## 19                       MÜTEAHHİT İŞLERİ     829057     435806     826.08
## 20        SAĞLIK/SAĞLIK ÜRÜNLERİ/KOZMETİK   17637257    9096864    3095.37
## 21         SEYAHAT ACENTELERİ/TAŞIMACILIK    8115027    3635735    1945.17
## 22                                SİGORTA    5090080      37618    3734.59
## 23                       TELEKOMÜNİKASYON   19596112    3672814    2118.63
## 24  YAPI MALZEMELERİ, HIRDAVAT, NALBURİYE    3314053    1261101    2726.61
## 25                                  YEMEK   49032939   46378312    2994.79
## 26                                  DİĞER    8444282    1473111    2767.58
## 27                                 TOPLAM  375580194  199062671   76976.12
##    val_txn_dc
## 1       23.18
## 2      217.14
## 3     1059.71
## 4        1.03
## 5     1123.63
## 6       11.15
## 7      337.33
## 8      522.79
## 9     1956.91
## 10     754.91
## 11     920.45
## 12     201.53
## 13     353.27
## 14      46.15
## 15      48.58
## 16     601.21
## 17    3235.18
## 18     305.51
## 19      51.42
## 20     856.75
## 21     496.54
## 22      14.85
## 23     288.14
## 24     251.68
## 25    1582.02
## 26     440.51
## 27   15701.58
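
The locale argument is what makes the conversion work, so it may help to see it in isolation. The sketch below uses a few values copied from the table above; the names raw_values, tr_locale and parsed are only illustrative.

```r
library(readr)

# Sample strings in the number format used by the BKM table:
# "." separates thousands and "," marks the decimals
raw_values <- c("341.271", "2.675,77", "12.736,09")

# Tell readr which character plays which role
tr_locale <- locale(decimal_mark = ",", grouping_mark = ".")
parsed <- parse_number(raw_values, locale = tr_locale)
parsed

# With the default locale the roles are swapped ("." is the decimal mark),
# so the same string is not read as 2675.77
parse_number("2.675,77")
```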

In the second example, we will harvest the links from the domestic trips timetable page of Istanbul’s Şehir Hatları (ferry lines).

html_obj <- read_html("https://sehirhatlari.istanbul/en/timetables/domestic-trips")

Let’s get all the links by selecting the “a” nodes and reading their “href” attributes. We are looking for the “domestic trips” links. Anchors without an href attribute show up as NA.

links_vec <- html_obj %>% html_nodes("a") %>% html_attr("href")
links_vec
##  [1] "/en/corporate/privacy-and-cookie-policy-744"                                                      
##  [2] "javascript:__doPostBack('ctl00$Cookie$btnCerezler','')"                                           
##  [3] NA                                                                                                 
##  [4] "/tr"                                                                                              
##  [5] "/en"                                                                                              
##  [6] "#"                                                                                                
##  [7] "/en/corporate/sehir-hatlari-616"                                                                  
##  [8] "/en/corporate/presidents-message-720"                                                             
##  [9] "/en/corporate/general-manager-721"                                                                
## [10] "/en/corporate/vision-and-values-217"                                                              
## [11] "/en/corporate/management-systems-policy-364"                                                      
## [12] "#"                                                                                                
## [13] "/en/timetables/domestic-trips"                                                                    
## [14] "/en/timetables/ferry-vehicle"                                                                     
## [15] "/en/timetables/bosphorus-tours"                                                                   
## [16] "/en/price-list"                                                                                   
## [17] "#"                                                                                                
## [18] "/en/information/disabled-guests-186"                                                              
## [19] "/en/information/corporate-film-618"                                                               
## [20] "/en/frequently-asked-questions"                                                                   
## [21] "#"                                                                                                
## [22] "/en/corporate/rental-services-208"                                                                
## [23] "/en/corporate/shipyard-services-617"                                                              
## [24] "/en/corporate/advertising-areas-209"                                                              
## [25] "/en/corporate/filming-210"                                                                        
## [26] "#"                                                                                                
## [27] "/en/contact"                                                                                      
## [28] "/en/applies"                                                                                      
## [29] "https://twitter.com/sehir_hatlari"                                                                
## [30] "https://www.facebook.com/sehirhatlari"                                                            
## [31] "https://www.instagram.com/sehir_hatlari/"                                                         
## [32] "https://www.youtube.com/user/sehirhatlari1"                                                       
## [33] NA                                                                                                 
## [34] "/tr"                                                                                              
## [35] "#"                                                                                                
## [36] "#"                                                                                                
## [37] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26"                                      
## [38] "/en/timetables/domestic-trips/bosphorus-lines-52"                                                 
## [39] "/en/timetables/domestic-trips/adalar-princes-islands-lines-176"                                   
## [40] "/en"                                                                                              
## [41] "/en/timetables"                                                                                   
## [42] "#ichatlartarifeleri"                                                                              
## [43] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26"                                      
## [44] "/en/timetables/domestic-trips/bosphorus-lines-52"                                                 
## [45] "/en/timetables/domestic-trips/adalar-princes-islands-lines-176"                                   
## [46] "/en/timetables/ferry-vehicle"                                                                     
## [47] "/en/timetables/bosphorus-tours"                                                                   
## [48] "#ucrettarifeleri"                                                                                 
## [49] "/en/price-list/ferry-lines-79"                                                                    
## [50] "/en/price-list/bosphorus-tours-78"                                                                
## [51] "/en/price-list/ferry-vehicle-line-77"                                                             
## [52] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoy-besiktas-165"                    
## [53] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/halic-hatti-37"                          
## [54] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/bostanci-karakoy-kabatas-166"            
## [55] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoykarakoyeminonu-813"               
## [56] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/uskudarkarakoyeminonu-815"               
## [57] "/en/timetables/domestic-trips/bosphorus-lines/to-bosphorus-from-bosphorus-167"                    
## [58] "/en/timetables/domestic-trips/bosphorus-lines/sariyer-rumeli-kavagi-anadolu-kavagi-168"           
## [59] "/en/timetables/domestic-trips/bosphorus-lines/kucuksu-besiktas-kabatas-169"                       
## [60] "/en/timetables/domestic-trips/bosphorus-lines/cengelkoy-istinye-170"                              
## [61] "/en/timetables/domestic-trips/bosphorus-lines/kadikoy-sariyer-171"                                
## [62] "/en/timetables/domestic-trips/bosphorus-lines/anadolu-kavagi-uskudar-172"                         
## [63] "/en/timetables/domestic-trips/bosphorus-lines/rumeli-kavagi-eminonu-174"                          
## [64] "/en/timetables/domestic-trips/bosphorus-lines/kucuksu-istinye-175"                                
## [65] "/en/timetables/domestic-trips/adalar-princes-islands-lines/adalar-kabatas-817"                    
## [66] "/en/timetables/domestic-trips/adalar-princes-islands-lines/adalarbesiktas-818"                    
## [67] "/en/timetables/domestic-trips/adalar-princes-islands-lines/bostanci-adalar-ring-lines-819"        
## [68] "/uploads/pdf\\tarife.pdf"                                                                         
## [69] "https://www.ibb.istanbul"                                                                         
## [70] "tel:+153"                                                                                         
## [71] "http://ibb.tv/yayin"                                                                              
## [72] "/en/corporate/sehir-hatlari-616"                                                                  
## [73] "/en/corporate/presidents-message-720"                                                             
## [74] "/en/corporate/general-manager-721"                                                                
## [75] "/en/corporate/vision-and-values-217"                                                              
## [76] "/en/corporate/management-systems-policy-364"                                                      
## [77] "/en/ferries"                                                                                      
## [78] "/en/piers"                                                                                        
## [79] "/en/journeys/domestic-trips"                                                                      
## [80] "/en/journeys/bosphorus-tours"                                                                     
## [81] "/en/journeys/ferry-vehicle"                                                                       
## [82] "/en/price-list"                                                                                   
## [83] "/en/information/disabled-guests-186"                                                              
## [84] "en/information/corporate-film-618"                                                                
## [85] "/en/frequently-asked-questions"                                                                   
## [86] "/en/corporate/filming-210"                                                                        
## [87] "/en/corporate/advertising-areas-209"                                                              
## [88] "/en/corporate/rental-services-208"                                                                
## [89] "/en/corporate/shipyard-services-617"                                                              
## [90] "/en/contact"                                                                                      
## [91] "/en/applies"                                                                                      
## [92] "http://www.sehirhatlari.com.tr/sanaltur/index.htm"                                                
## [93] "https://itunes.apple.com/tr/app/%C5%9Fehir-hatlar%C4%B1/id783673371?l=tr&mt=8"                    
## [94] "https://play.google.com/store/apps/details?id=com.spexco.flexcoder2.sehirhatlari.activities&hl=tr"
## [95] "https://www.facebook.com/sehirhatlari"                                                            
## [96] "https://www.instagram.com/sehir_hatlari/"                                                         
## [97] "https://twitter.com/sehir_hatlari"                                                                
## [98] "https://www.youtube.com/user/sehirhatlari1"
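
A bare vector of href values loses the text of each link. As a small sketch, assuming a made-up page written inline (toy_page and links_df are illustrative names), the visible text and the target can be kept side by side:

```r
library(rvest)

# A hypothetical page with three anchors; the last one has no href attribute
toy_page <- read_html('
  <html><body>
    <a href="/en/timetables/domestic-trips/line-1">Line 1</a>
    <a href="#">Menu</a>
    <a>No target</a>
  </body></html>
')

anchors <- html_nodes(toy_page, "a")

# One row per anchor: the visible text and the link target
links_df <- data.frame(
  text = html_text(anchors, trim = TRUE),
  href = html_attr(anchors, "href"),  # NA when the attribute is missing
  stringsAsFactors = FALSE
)
links_df
```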

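With a simple pattern match and by prepending the root URL, we can get all the relevant links from the web page. Note that some links appear more than once because the page repeats them in its menus.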

domestic_trips_links <- paste0("https://sehirhatlari.istanbul",links_vec[grepl("/domestic-trips/",links_vec)])
domestic_trips_links
##  [1] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26"                              
##  [2] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines-52"                                         
##  [3] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines-176"                           
##  [4] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26"                              
##  [5] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines-52"                                         
##  [6] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines-176"                           
##  [7] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoy-besiktas-165"            
##  [8] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/halic-hatti-37"                  
##  [9] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/bostanci-karakoy-kabatas-166"    
## [10] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoykarakoyeminonu-813"       
## [11] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/uskudarkarakoyeminonu-815"       
## [12] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/to-bosphorus-from-bosphorus-167"            
## [13] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/sariyer-rumeli-kavagi-anadolu-kavagi-168"   
## [14] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kucuksu-besiktas-kabatas-169"               
## [15] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/cengelkoy-istinye-170"                      
## [16] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kadikoy-sariyer-171"                        
## [17] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/anadolu-kavagi-uskudar-172"                 
## [18] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/rumeli-kavagi-eminonu-174"                  
## [19] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kucuksu-istinye-175"                        
## [20] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/adalar-kabatas-817"            
## [21] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/adalarbesiktas-818"            
## [22] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/bostanci-adalar-ring-lines-819"
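
Since the result contains repeated entries, a variation of the same step can drop them and let xml2 resolve the paths against the site root instead of pasting strings. This is only a sketch on a short hand-written vector; the sample paths are taken from the output above and the names is_trip, trip_paths and trip_urls are illustrative.

```r
library(xml2)

# A short stand-in for links_vec, including a duplicate, an NA and a "#"
links_vec <- c(
  "/en/timetables/domestic-trips/bosphorus-lines-52",
  "/en/timetables/domestic-trips/bosphorus-lines-52",
  "/en/price-list",
  NA,
  "#"
)

# grepl() returns FALSE for NA, so missing hrefs are dropped automatically
is_trip <- grepl("/domestic-trips/", links_vec, fixed = TRUE)

# Remove repeated paths before building the full URLs
trip_paths <- unique(links_vec[is_trip])

# Resolve each path against the site root
trip_urls <- url_absolute(trip_paths, base = "https://sehirhatlari.istanbul/")
trip_urls
```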

You can click all the links below.

paste0("+ ",domestic_trips_links," <br>",collapse=" ")

+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines-52
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines-176
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines-52
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines-176
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoy-besiktas-165
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/halic-hatti-37
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/bostanci-karakoy-kabatas-166
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoykarakoyeminonu-813
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/uskudarkarakoyeminonu-815
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/to-bosphorus-from-bosphorus-167
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/sariyer-rumeli-kavagi-anadolu-kavagi-168
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kucuksu-besiktas-kabatas-169
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/cengelkoy-istinye-170
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kadikoy-sariyer-171
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/anadolu-kavagi-uskudar-172
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/rumeli-kavagi-eminonu-174
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kucuksu-istinye-175
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/adalar-kabatas-817
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/adalarbesiktas-818
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/bostanci-adalar-ring-lines-819

Exercises

  • Collect multiple periods from the BKM page and analyze them.
  • Collect all timetables from Şehir Hatları domestic lines and create a timetable Shiny app.
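
As a possible starting point for the first exercise, the sketch below builds one report URL per month from the query parameters visible in the example URL (filter_year and filter_month). It assumes that the site accepts other months in the same way and that the data table stays at position 4, neither of which is verified here. fetch_period is a hypothetical helper; it is only defined, not called, so the block runs without network access.

```r
# Query pattern copied from the example URL in the first example
base_url <- "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/"

# One row per period; here the first three months of 2020
periods <- expand.grid(month = 1:3, year = 2020L)

urls <- sprintf(
  "%s?filter_year=%d&filter_month=%d&List=Listele",
  base_url, periods$year, periods$month
)
urls

# Hypothetical helper: download one period and return the fourth table.
# Assumes the page layout matches the first example.
fetch_period <- function(url) {
  tables <- rvest::html_table(rvest::read_html(url), fill = TRUE)
  tables[[4]]
}
```

When calling such a helper in a loop, a short pause between requests (for example with Sys.sleep()) keeps the load on the site low.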