Main sources
- rvest main development page: https://rvest.tidyverse.org/
- Tutorial on rvest, httr and RSelenium
rvest (see ?rvest) is a very useful R package for parsing the HTML of web pages. Often the information we need is not provided directly in Excel files but is embedded in one or more pages of a web site.
Usage Examples
Since rvest is part of the tidyverse, we can easily use pipes with it.
library(tidyverse)
library(rvest)
In the first example, we will extract tables from BKM’s (Interbank Card Center) sectoral development reports (see the website). Let’s start with getting the web page.
the_url <- "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2020&filter_month=1&List=Listele"
html_obj <- read_html(the_url)
When we print the object itself, we see that it is a parsed HTML document. We are on the right path.
html_obj
We can examine the HTML structure with the powerful html_structure() function. Since its output is rather verbose, run it and see for yourself.
html_obj %>% html_structure()
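To try these steps without depending on the live BKM page, the same calls can be run on a small literal HTML string. This is only a minimal sketch: the HTML snippet and the names toy_doc and structure_lines are made up for illustration, and html_structure() is called through xml2 here on the assumption that xml2 is installed alongside rvest.

```r
library(rvest)

# A made-up page with one heading and one paragraph (illustration only)
toy_doc <- read_html(
  "<html><body><h1 id='title'>Report</h1><p class='note'>Hello</p></body></html>"
)

# The parsed object is an xml_document, just like html_obj above
class(toy_doc)

# Print the tag hierarchy; capture.output() keeps the printed lines in a vector
structure_lines <- capture.output(xml2::html_structure(toy_doc))
structure_lines

# Pull the text of a single node by CSS selector
heading_text <- toy_doc %>% html_node("h1") %>% html_text()
heading_text
```

On a real page the structure listing can run to hundreds of lines, so capturing it in a vector makes it easier to search.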
We can extract every table on the page with html_table(), which returns a list with one data frame per table. If you get an error about inconsistent fields, use the parameter fill=TRUE (in rvest 1.0.0 and later, fill is deprecated because missing cells are always filled automatically).
html_obj %>% html_table(fill=TRUE)
## [[1]]
## X1 X2
## 1 SEÇİLEN AYA AİT SEKTÖREL GELİŞİM Seçilen Aya Ait Sektörel Gelişim
## 2 Seçilen Aya Ait Sektörel Gelişim <NA>
## X3
## 1 Seçilen Aya Ait Sektörel Gelişim
## 2 <NA>
##
## [[2]]
## X1
## 1 Seçilen Aya Ait Sektörel Gelişim
##
## [[3]]
## X1
## 1 Yıl:\n 2020201920182017201620152014201320122011Ay:\n 123456
##
## [[4]]
## X1 X2
## 1 İşyeri Grubu İşlem Adedi
## 2 İşyeri Grubu İşlem Adedi(Kredi Kartı)
## 3 ARABA KİRALAMA 341.271
## 4 ARAÇ KİRALAMA-SATIŞ/SERVİS/YEDEK PARÇA 3.497.791
## 5 BENZİN VE YAKIT İSTASYONLARI 27.488.661
## 6 BIREYSEL EMEKLILIK 2.078.609
## 7 ÇEŞİTLİ GIDA 33.777.181
## 8 DOĞRUDAN PAZARLAMA 627.712
## 9 EĞİTİM / KIRTASİYE / OFİS MALZEMELERİ 6.274.208
## 10 ELEKTRİK-ELEKTRONİK EŞYA, BİLGİSAYAR 8.809.733
## 11 GİYİM VE AKSESUAR 26.750.805
## 12 HAVAYOLLARI 2.013.912
## 13 HİZMET SEKTÖRLERİ 25.890.561
## 14 KAMU/VERGI ODEMELERI 7.879.506
## 15 KONAKLAMA 1.863.395
## 16 KULÜP / DERNEK /SOSYAL HİZMETLER 1.353.113
## 17 KUMARHANE/İÇKİLİ YERLER 567.888
## 18 KUYUMCULAR 846.168
## 19 MARKET VE ALIŞVERİŞ MERKEZLERİ 108.748.150
## 20 MOBİLYA VE DEKORASYON 4.712.723
## 21 MÜTEAHHİT İŞLERİ 829.057
## 22 SAĞLIK/SAĞLIK ÜRÜNLERİ/KOZMETİK 17.637.257
## 23 SEYAHAT ACENTELERİ/TAŞIMACILIK 8.115.027
## 24 SİGORTA 5.090.080
## 25 TELEKOMÜNİKASYON 19.596.112
## 26 YAPI MALZEMELERİ, HIRDAVAT, NALBURİYE 3.314.053
## 27 YEMEK 49.032.939
## 28 DİĞER 8.444.282
## 29 TOPLAM 375.580.194
## X3 X4
## 1 İşlem Adedi İşlem Tutarı (Milyon TL)
## 2 İşlem Adedi (Banka Kartı) İşlem Tutarı \n (Kredi Kartı)
## 3 70.947 212,39
## 4 969.202 2.675,77
## 5 11.642.138 5.607,72
## 6 2.043 771,85
## 7 21.569.718 5.284,27
## 8 60.381 728,13
## 9 3.565.787 2.044,27
## 10 2.609.689 4.184,70
## 11 12.487.879 5.669,53
## 12 589.812 2.033,19
## 13 8.073.696 5.513,96
## 14 1.325.853 4.339,11
## 15 1.047.125 1.273,61
## 16 359.911 300,30
## 17 502.004 118,88
## 18 427.264 1.033,15
## 19 65.886.572 12.736,09
## 20 1.881.289 2.240,39
## 21 435.806 826,08
## 22 9.096.864 3.095,37
## 23 3.635.735 1.945,17
## 24 37.618 3.734,59
## 25 3.672.814 2.118,63
## 26 1.261.101 2.726,61
## 27 46.378.312 2.994,79
## 28 1.473.111 2.767,58
## 29 199.062.671 76.976,12
## X5
## 1 İşlem Tutarı (Milyon TL)
## 2 İşlem Tutarı \n (Banka Kartı)
## 3 23,18
## 4 217,14
## 5 1.059,71
## 6 1,03
## 7 1.123,63
## 8 11,15
## 9 337,33
## 10 522,79
## 11 1.956,91
## 12 754,91
## 13 920,45
## 14 201,53
## 15 353,27
## 16 46,15
## 17 48,58
## 18 601,21
## 19 3.235,18
## 20 305,51
## 21 51,42
## 22 856,75
## 23 496,54
## 24 14,85
## 25 288,14
## 26 251,68
## 27 1.582,02
## 28 440,51
## 29 15.701,58
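Now we are getting somewhere: the fourth item is the table we want, but it is not yet up to our quality standard. Its first two rows are actually headers, and the numbers are stored as text in Turkish format (a dot as the grouping mark and a comma as the decimal mark). Let’s use dplyr functions to clean it up.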
html_df <- read_html(the_url) %>% html_table(fill=TRUE) %>% `[[`(4)
html_df %>%
  # Since we do not have too many columns, let's rename them manually:
  # number (num) or value (val) of transactions (txn)
  # by credit card (cc) or debit card (dc)
  rename(category = 1, num_txn_cc = 2, num_txn_dc = 3, val_txn_cc = 4, val_txn_dc = 5) %>%
  # remove the first two rows because they are actually titles
  slice(-(1:2)) %>%
  # then convert every numeric column using parse_number() from readr
  mutate(
    across(-category,
      ~readr::parse_number(.,
        locale=readr::locale(decimal_mark=",",grouping_mark = ".")
      )
    )
  )
## category num_txn_cc num_txn_dc val_txn_cc
## 1 ARABA KİRALAMA 341271 70947 212.39
## 2 ARAÇ KİRALAMA-SATIŞ/SERVİS/YEDEK PARÇA 3497791 969202 2675.77
## 3 BENZİN VE YAKIT İSTASYONLARI 27488661 11642138 5607.72
## 4 BIREYSEL EMEKLILIK 2078609 2043 771.85
## 5 ÇEŞİTLİ GIDA 33777181 21569718 5284.27
## 6 DOĞRUDAN PAZARLAMA 627712 60381 728.13
## 7 EĞİTİM / KIRTASİYE / OFİS MALZEMELERİ 6274208 3565787 2044.27
## 8 ELEKTRİK-ELEKTRONİK EŞYA, BİLGİSAYAR 8809733 2609689 4184.70
## 9 GİYİM VE AKSESUAR 26750805 12487879 5669.53
## 10 HAVAYOLLARI 2013912 589812 2033.19
## 11 HİZMET SEKTÖRLERİ 25890561 8073696 5513.96
## 12 KAMU/VERGI ODEMELERI 7879506 1325853 4339.11
## 13 KONAKLAMA 1863395 1047125 1273.61
## 14 KULÜP / DERNEK /SOSYAL HİZMETLER 1353113 359911 300.30
## 15 KUMARHANE/İÇKİLİ YERLER 567888 502004 118.88
## 16 KUYUMCULAR 846168 427264 1033.15
## 17 MARKET VE ALIŞVERİŞ MERKEZLERİ 108748150 65886572 12736.09
## 18 MOBİLYA VE DEKORASYON 4712723 1881289 2240.39
## 19 MÜTEAHHİT İŞLERİ 829057 435806 826.08
## 20 SAĞLIK/SAĞLIK ÜRÜNLERİ/KOZMETİK 17637257 9096864 3095.37
## 21 SEYAHAT ACENTELERİ/TAŞIMACILIK 8115027 3635735 1945.17
## 22 SİGORTA 5090080 37618 3734.59
## 23 TELEKOMÜNİKASYON 19596112 3672814 2118.63
## 24 YAPI MALZEMELERİ, HIRDAVAT, NALBURİYE 3314053 1261101 2726.61
## 25 YEMEK 49032939 46378312 2994.79
## 26 DİĞER 8444282 1473111 2767.58
## 27 TOPLAM 375580194 199062671 76976.12
## val_txn_dc
## 1 23.18
## 2 217.14
## 3 1059.71
## 4 1.03
## 5 1123.63
## 6 11.15
## 7 337.33
## 8 522.79
## 9 1956.91
## 10 754.91
## 11 920.45
## 12 201.53
## 13 353.27
## 14 46.15
## 15 48.58
## 16 601.21
## 17 3235.18
## 18 305.51
## 19 51.42
## 20 856.75
## 21 496.54
## 22 14.85
## 23 288.14
## 24 251.68
## 25 1582.02
## 26 440.51
## 27 15701.58
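The same cleaning steps can be checked offline on a tiny inline table. The sketch below is not the BKM page itself: toy_html is a hypothetical two-row miniature that borrows two categories and their credit card figures from the output above, and toy_df is just an illustrative name. It assumes dplyr 1.0 or later (for across()) and an installed readr.

```r
library(rvest)
library(dplyr)

# Hypothetical miniature of the BKM table: two title rows, Turkish number format
toy_html <- "<table>
  <tr><td>Group</td><td>Count</td><td>Amount</td></tr>
  <tr><td>Group</td><td>Count (CC)</td><td>Amount (CC)</td></tr>
  <tr><td>KONAKLAMA</td><td>1.863.395</td><td>1.273,61</td></tr>
  <tr><td>KUYUMCULAR</td><td>846.168</td><td>1.033,15</td></tr>
</table>"

toy_df <- read_html(toy_html) %>%
  html_table() %>%   # list with one element per <table>
  `[[`(1) %>%        # take the only table
  rename(category = 1, num_txn_cc = 2, val_txn_cc = 3) %>%
  slice(-(1:2)) %>%  # drop the two title rows
  mutate(
    across(-category,
      # "." is the grouping mark and "," the decimal mark in the source data
      ~ readr::parse_number(., locale = readr::locale(decimal_mark = ",", grouping_mark = "."))
    )
  )

toy_df
```

Without the locale, parse_number() would read the dot as a decimal mark, so stating both marks explicitly is what keeps counts such as 1.863.395 from being parsed as a number below 2.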
In the second example, we will harvest the links from the domestic trips timetable page of Istanbul’s Şehir Hatları (ferry lines).
html_obj <- read_html("https://sehirhatlari.istanbul/en/timetables/domestic-trips")
Let’s get all the links using “a” nodes and “href” attributes. We are looking for “domestic trips” links.
links_vec <- html_obj %>% html_nodes("a") %>% html_attr("href")
links_vec
## [1] "/en/corporate/privacy-and-cookie-policy-744"
## [2] "javascript:__doPostBack('ctl00$Cookie$btnCerezler','')"
## [3] NA
## [4] "/tr"
## [5] "/en"
## [6] "#"
## [7] "/en/corporate/sehir-hatlari-616"
## [8] "/en/corporate/presidents-message-720"
## [9] "/en/corporate/general-manager-721"
## [10] "/en/corporate/vision-and-values-217"
## [11] "/en/corporate/management-systems-policy-364"
## [12] "#"
## [13] "/en/timetables/domestic-trips"
## [14] "/en/timetables/ferry-vehicle"
## [15] "/en/timetables/bosphorus-tours"
## [16] "/en/price-list"
## [17] "#"
## [18] "/en/information/disabled-guests-186"
## [19] "/en/information/corporate-film-618"
## [20] "/en/frequently-asked-questions"
## [21] "#"
## [22] "/en/corporate/rental-services-208"
## [23] "/en/corporate/shipyard-services-617"
## [24] "/en/corporate/advertising-areas-209"
## [25] "/en/corporate/filming-210"
## [26] "#"
## [27] "/en/contact"
## [28] "/en/applies"
## [29] "https://twitter.com/sehir_hatlari"
## [30] "https://www.facebook.com/sehirhatlari"
## [31] "https://www.instagram.com/sehir_hatlari/"
## [32] "https://www.youtube.com/user/sehirhatlari1"
## [33] NA
## [34] "/tr"
## [35] "#"
## [36] "#"
## [37] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26"
## [38] "/en/timetables/domestic-trips/bosphorus-lines-52"
## [39] "/en/timetables/domestic-trips/adalar-princes-islands-lines-176"
## [40] "/en"
## [41] "/en/timetables"
## [42] "#ichatlartarifeleri"
## [43] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26"
## [44] "/en/timetables/domestic-trips/bosphorus-lines-52"
## [45] "/en/timetables/domestic-trips/adalar-princes-islands-lines-176"
## [46] "/en/timetables/ferry-vehicle"
## [47] "/en/timetables/bosphorus-tours"
## [48] "#ucrettarifeleri"
## [49] "/en/price-list/ferry-lines-79"
## [50] "/en/price-list/bosphorus-tours-78"
## [51] "/en/price-list/ferry-vehicle-line-77"
## [52] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoy-besiktas-165"
## [53] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/halic-hatti-37"
## [54] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/bostanci-karakoy-kabatas-166"
## [55] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoykarakoyeminonu-813"
## [56] "/en/timetables/domestic-trips/inner-istanbul-ferry-lines/uskudarkarakoyeminonu-815"
## [57] "/en/timetables/domestic-trips/bosphorus-lines/to-bosphorus-from-bosphorus-167"
## [58] "/en/timetables/domestic-trips/bosphorus-lines/sariyer-rumeli-kavagi-anadolu-kavagi-168"
## [59] "/en/timetables/domestic-trips/bosphorus-lines/kucuksu-besiktas-kabatas-169"
## [60] "/en/timetables/domestic-trips/bosphorus-lines/cengelkoy-istinye-170"
## [61] "/en/timetables/domestic-trips/bosphorus-lines/kadikoy-sariyer-171"
## [62] "/en/timetables/domestic-trips/bosphorus-lines/anadolu-kavagi-uskudar-172"
## [63] "/en/timetables/domestic-trips/bosphorus-lines/rumeli-kavagi-eminonu-174"
## [64] "/en/timetables/domestic-trips/bosphorus-lines/kucuksu-istinye-175"
## [65] "/en/timetables/domestic-trips/adalar-princes-islands-lines/adalar-kabatas-817"
## [66] "/en/timetables/domestic-trips/adalar-princes-islands-lines/adalarbesiktas-818"
## [67] "/en/timetables/domestic-trips/adalar-princes-islands-lines/bostanci-adalar-ring-lines-819"
## [68] "/uploads/pdf\\tarife.pdf"
## [69] "https://www.ibb.istanbul"
## [70] "tel:+153"
## [71] "http://ibb.tv/yayin"
## [72] "/en/corporate/sehir-hatlari-616"
## [73] "/en/corporate/presidents-message-720"
## [74] "/en/corporate/general-manager-721"
## [75] "/en/corporate/vision-and-values-217"
## [76] "/en/corporate/management-systems-policy-364"
## [77] "/en/ferries"
## [78] "/en/piers"
## [79] "/en/journeys/domestic-trips"
## [80] "/en/journeys/bosphorus-tours"
## [81] "/en/journeys/ferry-vehicle"
## [82] "/en/price-list"
## [83] "/en/information/disabled-guests-186"
## [84] "en/information/corporate-film-618"
## [85] "/en/frequently-asked-questions"
## [86] "/en/corporate/filming-210"
## [87] "/en/corporate/advertising-areas-209"
## [88] "/en/corporate/rental-services-208"
## [89] "/en/corporate/shipyard-services-617"
## [90] "/en/contact"
## [91] "/en/applies"
## [92] "http://www.sehirhatlari.com.tr/sanaltur/index.htm"
## [93] "https://itunes.apple.com/tr/app/%C5%9Fehir-hatlar%C4%B1/id783673371?l=tr&mt=8"
## [94] "https://play.google.com/store/apps/details?id=com.spexco.flexcoder2.sehirhatlari.activities&hl=tr"
## [95] "https://www.facebook.com/sehirhatlari"
## [96] "https://www.instagram.com/sehir_hatlari/"
## [97] "https://twitter.com/sehir_hatlari"
## [98] "https://www.youtube.com/user/sehirhatlari1"
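With a simple regex and by prepending the root URL, we can get all the relevant links from the web page. Note that some links appear more than once in the result (for example, items 1-3 and 4-6) because the page lists them in more than one place; wrapping the result in unique() would drop the repeats.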
domestic_trips_links <- paste0("https://sehirhatlari.istanbul",links_vec[grepl("/domestic-trips/",links_vec)])
domestic_trips_links
## [1] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26"
## [2] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines-52"
## [3] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines-176"
## [4] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26"
## [5] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines-52"
## [6] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines-176"
## [7] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoy-besiktas-165"
## [8] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/halic-hatti-37"
## [9] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/bostanci-karakoy-kabatas-166"
## [10] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoykarakoyeminonu-813"
## [11] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/uskudarkarakoyeminonu-815"
## [12] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/to-bosphorus-from-bosphorus-167"
## [13] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/sariyer-rumeli-kavagi-anadolu-kavagi-168"
## [14] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kucuksu-besiktas-kabatas-169"
## [15] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/cengelkoy-istinye-170"
## [16] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kadikoy-sariyer-171"
## [17] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/anadolu-kavagi-uskudar-172"
## [18] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/rumeli-kavagi-eminonu-174"
## [19] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kucuksu-istinye-175"
## [20] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/adalar-kabatas-817"
## [21] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/adalarbesiktas-818"
## [22] "https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/bostanci-adalar-ring-lines-819"
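The link-harvesting steps above can also be sketched on a small literal page, which makes the handling of missing href attributes and duplicates easy to see. The page below is made up for illustration (only the paths are borrowed from the output above), and toy_page, toy_links and toy_trip_links are hypothetical names.

```r
library(rvest)

# A made-up page: one <a> without href, one anchor-only link, one duplicate
toy_page <- read_html("<html><body>
  <a href='/en/timetables/domestic-trips'>Overview</a>
  <a>No href here</a>
  <a href='#'>Menu</a>
  <a href='/en/timetables/domestic-trips/bosphorus-lines-52'>Bosphorus</a>
  <a href='/en/timetables/domestic-trips/bosphorus-lines-52'>Bosphorus (footer)</a>
  <a href='https://twitter.com/sehir_hatlari'>Twitter</a>
</body></html>")

# html_attr() returns NA for nodes that lack the attribute
toy_links <- toy_page %>% html_nodes("a") %>% html_attr("href")

# grepl() returns FALSE for NA, so missing hrefs drop out of the filter;
# the trailing slash in the pattern excludes the overview page itself
is_trip <- grepl("/domestic-trips/", toy_links)

# Prepend the root URL and remove repeated links
toy_trip_links <- unique(paste0("https://sehirhatlari.istanbul", toy_links[is_trip]))
toy_trip_links
```

Applying unique() after paste0() is a design choice for readability; calling it on the filtered paths before pasting would give the same result.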
You can click all the links below. The call pastes them into a single string, and the <br> tag after each link puts it on its own line when the string is rendered as HTML.
paste0("+ ",domestic_trips_links," <br>",collapse=" ")
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines-52
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines-176
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines-26
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines-52
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines-176
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoy-besiktas-165
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/halic-hatti-37
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/bostanci-karakoy-kabatas-166
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/kadikoykarakoyeminonu-813
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/inner-istanbul-ferry-lines/uskudarkarakoyeminonu-815
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/to-bosphorus-from-bosphorus-167
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/sariyer-rumeli-kavagi-anadolu-kavagi-168
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kucuksu-besiktas-kabatas-169
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/cengelkoy-istinye-170
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kadikoy-sariyer-171
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/anadolu-kavagi-uskudar-172
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/rumeli-kavagi-eminonu-174
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/bosphorus-lines/kucuksu-istinye-175
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/adalar-kabatas-817
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/adalarbesiktas-818
+ https://sehirhatlari.istanbul/en/timetables/domestic-trips/adalar-princes-islands-lines/bostanci-adalar-ring-lines-819