Airbnb is an online marketplace since 2008, which connects people who want to rent their homes with people who are looking for accommodations in a particular location. It covers more than 81,000 cities and 191 countries worldwide. The company ,which is based in San Francisco, California, does not own any of the property listings, but it receives commissions from each booking like a broker. The name “Airbnb” comes from “air mattress Bed and Breakfast.” The Airbnb logo is called the Bélo, which is a short version for saying ‘Belong Anywhere’. Airbnb hosts list many different kinds of properties such as private rooms, apartments, shared rooms, houseboats, entire houses, etc.
This dataset describes the listing activity and metrics in NYC for 2019. It includes all the necessary information in order to find out more about hosts, prices, geographical availability, and necessary information to make predictions and draw conclusions for NYC. The explanation of the variables in our data, which consists of 16 columns and 48,895 rows, will be made in the next part. The data used in this assignment is called New York City Airbnb Open Data which is downloaded from Kaggle. This public dataset is a part of Airbnb, and the original source can be found on this website.
In this assignment, we will perform an exploratory data analysis(EDA) in order to investigate each of the variables and also come up with a conclusion for the relationship between variables. The main purpose is to identify which variables affect the price mostly. In addition to these, we will explore which neighborhood groups and room types are the most popular ones among the guests, and which hosts are the most preferred ones. The processes during the assignment can be listed as below:
We have used several packages during the analysis of the historical data of Airbnb in NYC in order to make data manipulation and visualization. The list of packages used in this assignment can be seen below:
pti <- c("tidyverse", "lubridate", "tinytex", "wordcloud", "shiny", "knitr", "data.table", "tm", "SnowballC", "corpus")
pti <- pti[!(pti %in% installed.packages())]
if(length(pti)>0){
install.packages(pti)
}
library(tidyverse)
library(lubridate)
library(tinytex)
library(wordcloud)
library(shiny)
library(knitr)
library(data.table)
library(tm)
library(SnowballC)
library(corpus)
After the importing data, to investigate variables in the data frame,i.e., airbnb data set, we use glimpse()
function.
file <- if(file.exists("AB_NYC_2019.csv")) {
"AB_NYC_2019.csv"
} else {
url('https://raw.githubusercontent.com/pjournal/boun01g-data-mine-r-s/gh-pages/Assignment/AB_NYC_2019.csv')
}
airbnb = read_csv(file)
airbnb$last_review<-as.POSIXct(airbnb$last_review,format="%Y-%m-%d")
airbnb %>% glimpse()
## Rows: 48,895
## Columns: 16
## $ id <dbl> 2539, 2595, 3647, 3831, 5022, 5099, ...
## $ name <chr> "Clean & quiet apt home by the park"...
## $ host_id <dbl> 2787, 2845, 4632, 4869, 7192, 7322, ...
## $ host_name <chr> "John", "Jennifer", "Elisabeth", "Li...
## $ neighbourhood_group <chr> "Brooklyn", "Manhattan", "Manhattan"...
## $ neighbourhood <chr> "Kensington", "Midtown", "Harlem", "...
## $ latitude <dbl> 40.64749, 40.75362, 40.80902, 40.685...
## $ longitude <dbl> -73.97237, -73.98377, -73.94190, -73...
## $ room_type <chr> "Private room", "Entire home/apt", "...
## $ price <dbl> 149, 225, 150, 89, 80, 200, 60, 79, ...
## $ minimum_nights <dbl> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1, 5, 2...
## $ number_of_reviews <dbl> 9, 45, 0, 270, 9, 74, 49, 430, 118, ...
## $ last_review <dttm> 2018-10-19 03:00:00, 2019-05-21 03:...
## $ reviews_per_month <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0....
## $ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, ...
## $ availability_365 <dbl> 365, 355, 365, 194, 0, 129, 0, 220, ...
The glimpse()
is a function of the dplyr()
. If you do not use the dplyr() package, you can use str()
function in the base R as an alternative. These two functions give the same results.
## tibble [48,895 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:48895] 2539 2595 3647 3831 5022 ...
## $ name : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : num [1:48895] 2787 2845 4632 4869 7192 ...
## $ host_name : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : POSIXct[1:48895], format: "2018-10-19 03:00:00" "2019-05-21 03:00:00" ...
## $ reviews_per_month : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. name = col_character(),
## .. host_id = col_double(),
## .. host_name = col_character(),
## .. neighbourhood_group = col_character(),
## .. neighbourhood = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. room_type = col_character(),
## .. price = col_double(),
## .. minimum_nights = col_double(),
## .. number_of_reviews = col_double(),
## .. last_review = col_date(format = ""),
## .. reviews_per_month = col_double(),
## .. calculated_host_listings_count = col_double(),
## .. availability_365 = col_double()
## .. )
This dataset contains 16 features/variables about Airbnb listings within New York City. Below are the features with their descriptions:
id
: Listing ID (numeric variable)name
: Listing Title (categorical variable)host_id
: ID of Host (numeric variable)host_name
: Name of Host (categorical Variable)neighbourhood_group
: Neighbourhood group that contains listing (categorical variable)neighbourhood
: Neighbourhood group that contains listing (categorical variable)latitude
: Latitude of listing (numeric variable)longitude
: Longitude of listing (numeric variable)room_type
: Type of the offered property (categorical variable)price
: Price per night in USD (numeric variable)minimum_nights
: Minimum number of nights required to book listing (numeric variable)number_of_reviews
: Total number of reviews that listing has (numeric variable)last_review
: Last rent date of the listing (date variable)reviews_per_month
: Total number of reviews divided by the number of months that the listing is active (numeric variable)calculated_host_listings_count
: Amount of listing per host (numeric variable)availability_365
: Number of days per year the listing is active (numeric variable)Our data set has almost 49.000 rows. Therefore it may include duplicate and/or missing values. To check them, we run the following codes.
There are 10052 values missing in the dataset and all of them are in the reviews_per_month
column.
## [1] 0
There is no duplicated row in this dataset.
The summary of the data set can be seen below.
## id name host_id host_name
## Min. : 2539 Length:48895 Min. : 2438 Length:48895
## 1st Qu.: 9471945 Class :character 1st Qu.: 7822033 Class :character
## Median :19677284 Mode :character Median : 30793816 Mode :character
## Mean :19017143 Mean : 67620011
## 3rd Qu.:29152178 3rd Qu.:107434423
## Max. :36487245 Max. :274321313
##
## neighbourhood_group neighbourhood latitude longitude
## Length:48895 Length:48895 Min. :40.50 Min. :-74.24
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.98
## Mode :character Mode :character Median :40.72 Median :-73.96
## Mean :40.73 Mean :-73.95
## 3rd Qu.:40.76 3rd Qu.:-73.94
## Max. :40.91 Max. :-73.71
##
## room_type price minimum_nights number_of_reviews
## Length:48895 Min. : 0.0 Min. : 1.00 Min. : 0.00
## Class :character 1st Qu.: 69.0 1st Qu.: 1.00 1st Qu.: 1.00
## Mode :character Median : 106.0 Median : 3.00 Median : 5.00
## Mean : 152.7 Mean : 7.03 Mean : 23.27
## 3rd Qu.: 175.0 3rd Qu.: 5.00 3rd Qu.: 24.00
## Max. :10000.0 Max. :1250.00 Max. :629.00
##
## last_review reviews_per_month calculated_host_listings_count
## Min. :2011-03-28 02:00:00 Min. : 0.010 Min. : 1.000
## 1st Qu.:2018-07-08 03:00:00 1st Qu.: 0.190 1st Qu.: 1.000
## Median :2019-05-19 03:00:00 Median : 0.720 Median : 1.000
## Mean :2018-10-04 04:47:23 Mean : 1.373 Mean : 7.144
## 3rd Qu.:2019-06-23 03:00:00 3rd Qu.: 2.020 3rd Qu.: 2.000
## Max. :2019-07-08 03:00:00 Max. :58.500 Max. :327.000
## NA's :10052 NA's :10052
## availability_365
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 45.0
## Mean :112.8
## 3rd Qu.:227.0
## Max. :365.0
##
Before starting our analysis, we also want to check the outlier points in this dataset and we take the quantile 1 and 3 as references.
qtl1 = quantile(airbnb$price, 0.25)
qtl3 = quantile(airbnb$price, 0.75)
iqr = qtl3 - qtl1
lower = qtl1 - iqr * 1.5
upper = qtl3 + iqr * 1.5
lower
## 25%
## -90
## 75%
## 334
airbnb %>%
filter(price < lower | price > upper) %>%
top_n(10, price) %>%
select(neighbourhood_group, neighbourhood, price) %>%
arrange(desc(price)) %>%
kable(col.names = c("Neighbourhood Group", "Neighbourhood", "Price"))
Neighbourhood Group | Neighbourhood | Price |
---|---|---|
Queens | Astoria | 10000 |
Brooklyn | Greenpoint | 10000 |
Manhattan | Upper West Side | 10000 |
Manhattan | East Harlem | 9999 |
Manhattan | Lower East Side | 9999 |
Manhattan | Lower East Side | 9999 |
Manhattan | Tribeca | 8500 |
Brooklyn | Clinton Hill | 8000 |
Manhattan | Upper East Side | 7703 |
Manhattan | Battery Park City | 7500 |
Brooklyn | East Flatbush | 7500 |
When we analyze the lower and upper bound of the non-outliers data, the lower bound is obtained as minus 90. In our data set, as we consider the price of the airbnb room, there is no negative price. For this reason, we only consider the upper bound. The upper bound address the 334. This means that, if the price value is greater than 334, it becomes an outlier value.In this data set, there are 2972 outliers and the top ten with the highest price is listed as above.
In order to see the location of airbnb rooms, we use coordinates (latitude and longitude) and color the neighborhood groups. Moreover, to see the density of the rooms in each neighborhood group, we use feature of the geom_point()
, which is alpha
.
ggplot(airbnb, aes(latitude, longitude, color = neighbourhood_group)) +
geom_point(alpha = 0.6) +
theme_minimal() +
labs(title = "Coordinates of Airbnb Rooms According to the Neighbourhood Group",
subtitle = "2019 NYC Airbnb Data",
x = "Latitude",
y = "Longitude",
color = "Neighbourhood Group")
Bronx and Staten Island have less room than the others. The room densities of Brooklyn and Manhattan are distributed balanced in their regions.
By using quantile function, we divide price interval into five. Then, we define values in this intervals as very low, low, medium, high, very high. Then by using this categorical value, we prepare pie chart for each neighborhood group.
quant = quantile(airbnb$price, seq(0, 1, 0.2))
#quant
airbnb_price_group = airbnb %>%
mutate(price_group = case_when(
price < quant[2] ~ "Very Low",
price < quant[3] ~ "Low",
price < quant[4] ~ "Medium",
price < quant[5] ~ "High",
TRUE ~ "Very High"
)) %>%
mutate(price_group = factor(price_group, levels = c("Very Low", "Low", "Medium", "High", "Very High")))
airbnb_price_group %>%
group_by(neighbourhood_group, price_group) %>%
summarize(counter = n()) %>%
ggplot(., aes(x = '', y = counter, fill = price_group)) +
geom_bar(width = 1, stat = "identity", position = "fill") +
coord_polar("y") +
theme_void() +
theme(plot.title = element_text(vjust = 0.5)) +
facet_wrap(~neighbourhood_group) +
labs(title = "Price Group Analyses of Neighborhood Groups",
subtitle = "2019 NYC Airbnb Data",
fill = "Price Group")
We summarize the results as follow:
In the previous analysis, we try to define the percentage of the price group in each neighborhood group. To illustrate the price group change by location in each neighborhood group, the following plots are conducted.
airbnb_price_group %>%
ggplot(., aes(latitude, longitude, color = price_group)) +
geom_point() +
theme_minimal() +
facet_wrap(~neighbourhood_group, scales = "free") +
labs(title = "Spread of the Price Group In Each Neighborhood Group",
subtitle = "2019 NYC Airbnb Data",
x = "Latitude",
y = "Longtitude",
color = "Price Group")
Very high price group in Manhattan concentrates in a particular area, although there are homogeneous spread of price groups in Bronx, Brooklyn, Queens, and Staten Island.
Before the detailed explanatory data analysis, we obtain average, minimum and maximum price for each neighborhood group. Moreover, we can add the average availability and average number of reviews. These values give general information about the airbnb rooms.
airbnb %>%
group_by(neighbourhood_group) %>%
summarise(min_price = min(price),
mean_price = round(mean(price), digits = 2),
max_price = max(price),
average_availability = round(mean(availability_365), digits = 2),
average_review = round(mean(number_of_reviews), digits = 2)) %>%
select(neighbourhood_group, min_price, mean_price, max_price, average_availability, average_review ) %>%
arrange(desc(mean_price)) %>%
kable(col.names = c("Neighborhood Group", "Min Price", "Mean Price", "Max Price", "Average Availability", "Average Review"))
Neighborhood Group | Min Price | Mean Price | Max Price | Average Availability | Average Review |
---|---|---|---|---|---|
Manhattan | 0 | 196.88 | 10000 | 111.98 | 20.99 |
Brooklyn | 0 | 124.38 | 10000 | 100.23 | 24.20 |
Staten Island | 13 | 114.81 | 5000 | 199.68 | 30.94 |
Queens | 10 | 99.52 | 10000 | 144.45 | 27.70 |
Bronx | 0 | 87.50 | 2500 | 165.76 | 26.00 |
After the neighborhood group analysis, same preparation can be made by using room types. This table presents a more comprehensive analysis.
airbnb %>%
group_by(neighbourhood_group, room_type) %>%
summarise(min_price = min(price),
mean_price = round(mean(price), digits = 2),
max_price = max(price),
average_availability = round(mean(availability_365), digits = 2),
average_review = round(mean(number_of_reviews), digits = 2)) %>%
select(neighbourhood_group,room_type, min_price, mean_price, max_price, average_availability, average_review ) %>%
kable(col.names = c("Neighborhood Group", "Room Type", "Min Price", "Mean Price", "Max Price", "Average Availability", "Average Review"))
Neighborhood Group | Room Type | Min Price | Mean Price | Max Price | Average Availability | Average Review |
---|---|---|---|---|---|---|
Bronx | Entire home/apt | 28 | 127.51 | 1000 | 158.00 | 30.68 |
Bronx | Private room | 0 | 66.79 | 2500 | 171.33 | 25.02 |
Bronx | Shared room | 20 | 59.80 | 800 | 154.22 | 7.20 |
Brooklyn | Entire home/apt | 0 | 178.33 | 10000 | 97.21 | 27.95 |
Brooklyn | Private room | 0 | 76.50 | 7500 | 99.92 | 21.09 |
Brooklyn | Shared room | 0 | 50.53 | 725 | 178.01 | 14.03 |
Manhattan | Entire home/apt | 0 | 249.24 | 10000 | 117.14 | 17.82 |
Manhattan | Private room | 10 | 116.78 | 9999 | 101.85 | 26.20 |
Manhattan | Shared room | 10 | 88.98 | 1000 | 138.57 | 21.40 |
Queens | Entire home/apt | 10 | 147.05 | 2600 | 132.27 | 28.93 |
Queens | Private room | 10 | 71.76 | 10000 | 149.22 | 27.75 |
Queens | Shared room | 11 | 69.02 | 1800 | 192.19 | 13.86 |
Staten Island | Entire home/apt | 48 | 173.85 | 5000 | 178.07 | 33.28 |
Staten Island | Private room | 20 | 62.29 | 300 | 226.36 | 30.16 |
Staten Island | Shared room | 13 | 57.44 | 150 | 64.78 | 1.56 |
We can obtain the most and least expensive neighborhoods according to the mean price. To provide more understandable results, we use bar chart that illustrates the neighborhoods and its neighborhood groups.
airbnb %>%
group_by(neighbourhood_group,neighbourhood)%>%
summarise(mean_price = mean(price))%>%
arrange(desc(mean_price))%>%
head(15)%>%
ggplot(., aes(x = reorder(neighbourhood, -mean_price) , y = mean_price, fill = neighbourhood_group)) +
geom_col() +
theme_minimal() +
geom_text(aes(label = format(mean_price,digits=3)), size=4, position = position_dodge(0.9),vjust = 5) +
theme(axis.text.x = element_text(angle = 45), legend.position = "right") +
labs(title = "Top 15 Most Expensive Neighbourhoods",
subtitle ="2019 NYC Airbnb Data",
x = "Neighbourhood",
y = "Mean price",
fill = "Neighbourhood Group")
The most expensive neighborhood is Fort Wadsworth with the average price $800. The other inference obtained from the bar chart is that the most expensive rooms are located in Manhattan and Staten Island. Same analysis can be made for the least expensive neighborhoods.
airbnb %>%
group_by(neighbourhood_group,neighbourhood)%>%
summarise(mean_price = mean(price))%>%
arrange(mean_price) %>%
head(15)%>%
ggplot(., aes(x = reorder(neighbourhood, mean_price) , y = mean_price, fill = neighbourhood_group)) +
geom_col() +
theme_minimal() +
geom_text(aes(label = format(mean_price,digits=3)), size=4, position = position_dodge(0.9),vjust = 5) +
theme(axis.text.x = element_text(angle = 45), legend.position = "right") +
labs(title = "Top 15 Least Expensive Neighbourhoods",
subtitle ="2019 NYC Airbnb Data",
x = "Neighbourhood",
y = "Mean price",
fill = "Neighbourhood Group")
The least expensive neighborhood is Bull’s Head with the average price $47.3. The other inference obtained from the bar chart is that the least expensive rooms are located in Bronx and Staten Island. Moreover, there is no room belongs to Manhattan in the least expensive neighborhoods.
These results show that, the price of rooms in Staten Island has a wide range.
The neighborhoods are also investigated by using average room availability. In this part of the report, we give the most and least available neighborhoods according to the neighborhood groups.
airbnb %>%
group_by(neighbourhood, neighbourhood_group)%>%
summarise(mean_availability = mean(availability_365))%>%
arrange(desc(mean_availability))%>%
head(15)%>%
ggplot(., aes(x = reorder(neighbourhood,-mean_availability) , y = mean_availability, fill = neighbourhood_group)) +
geom_col() +
theme_minimal() +
geom_text(aes(label = format(mean_availability, digits = 3)), size=4, position = position_dodge(0.9),vjust = 5) +
theme(axis.text.x = element_text(angle = 45), legend.position = "right") +
labs(title = "Top 15 Most Available Neighbourhoods",
subtitle ="2019 NYC Airbnb Data",
x = "Neighbourhood",
y = "Mean availability",
fill = "Neighbourhood Group")
By using average availability of the rooms, the graph shows that Staten Island is the most available neighborhood group in the top 15. Manhattan, on the other hand, does not have any neighborhood in the top 15.
airbnb %>%
group_by(neighbourhood, neighbourhood_group)%>%
summarise(mean_availability = mean(availability_365))%>%
arrange(mean_availability)%>%
head(15)%>%
ggplot(., aes(x = reorder(neighbourhood,mean_availability) , y = mean_availability, fill = neighbourhood_group)) +
geom_col() +
theme_minimal() +
geom_text(aes(label = format(mean_availability, digits = 3)), size=4, position = position_dodge(0.9),vjust = 5) +
theme(axis.text.x = element_text(angle = 45), legend.position = "right") +
labs(title = "Top 15 Least Available Neighbourhoods",
subtitle ="2019 NYC Airbnb Data",
x = "Neighbourhood",
y = "Mean availability",
fill = "Neighbourhood Group")
There are many neighborhoods in the data set with zero availability. Bay Terrace (Staten Island) and New Dorp (Staten Island) do not have availability.
We use box plot to illustrate the log(price) of the different room types for each neighborhood group.
ggplot(airbnb, aes(x = room_type, y = price, fill = room_type)) + scale_y_log10() +
geom_boxplot() +
theme_minimal() +
labs (x="", y= "Price") +
facet_wrap(~neighbourhood_group) +
facet_grid(.~ neighbourhood_group) +
theme(axis.text.x = element_text(angle = 90), legend.position = "right") +
labs(title = "Room Type Analysis of Neighborhood Groups",
subtitle = "2019 NYC Airbnb Data",
fill = "Room Type")
Results show that:
To see the availability of different room types, we use geom_jitter()
function and also check the density of each room type.
airbnb %>%
ggplot(., aes(x = room_type, y = availability_365, color = room_type)) +
geom_jitter() +
theme_minimal() +
theme(legend.position="bottom", plot.title = element_text(vjust = 0.5)) +
labs(title = "Availability of Room Types",
subtitle = "2019 NYC Airbnb Data",
x = "Room Type",
y = "Availability",
color = " ")
Entire home and private room have homogeneous distribution of availability, while the shared room accumulates on the edge of the intervals. To make more analysis, we also plot histogram. In this histogram, we want to analyze the availability of room types according to the neighborhood groups. It can be said that entire home/apt and private room can be reached every day in a year, whereas, shared room is not always accessible.
There are almost 50000 rooms in our data set. As we want to find the number of rooms and compare with each other, first we draw a pie chart and then we summarize in the table to provide clear difference.
airbnb %>%
group_by(neighbourhood_group) %>%
summarise(count = n(), percentage = n()/nrow(airbnb)) %>%
ggplot(., aes(x = '', y = count, fill = neighbourhood_group)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y") +
theme_void() +
#geom_text(aes(label = scales::percent(round(percentage,2))), position = position_stack(vjust = 0.5)) +
theme(legend.position="bottom", plot.title = element_text(vjust = 0.5)) +
labs(title = "The Comparison of the Number of Room",
subtitle = "2019 NYC Airbnb Data",
fill = "Neighborhood Group")
airbnb %>%
group_by(neighbourhood_group) %>%
summarise(count = n())%>%
transmute(neighbourhood_group,count, percentage = round(100*(count/nrow(airbnb)),digits = 2)) %>%
kable(col.names = c("Neighborhood Group", "Number", "Percentage"))
Neighborhood Group | Number | Percentage |
---|---|---|
Bronx | 1091 | 2.23 |
Brooklyn | 20104 | 41.12 |
Manhattan | 21661 | 44.30 |
Queens | 5666 | 11.59 |
Staten Island | 373 | 0.76 |
The results illustrate that the rooms in Manhattan and Brooklyn constitute the huge majority, i.e., the sum of these two percentage is equal to 85.42.
We analyze the number of room in each neighborhood group in previous graph. We can enlarge this analysis by using room type.
airbnb %>%
group_by(neighbourhood_group, room_type) %>%
summarize(room_type_count = n()) %>%
mutate(room_type_percentage = room_type_count / sum(room_type_count)) %>%
ggplot(., aes(x = neighbourhood_group, y = room_type_percentage, fill = room_type)) +
geom_bar(position = "fill",stat = "identity") +
scale_y_continuous(labels = scales::percent_format()) +
geom_text(aes(label = scales::percent(round(room_type_percentage, 4))),
position = position_stack(vjust = .5)) +
theme_minimal() +
labs(title = "The Number of Room Percentage for Different Room Type \n in Each Neighborhood Group",
subtitle = "2019 NYC Airbnb Data",
x = "Neighborhood Group",
y = "Percentage of Room Types",
fill = "Room Type ")
Following results can be obtained:
Like the numerical values, airbnb data includes verbal information such as name
. By using this information, we can obtain the most used words in name column which describes the room features.
wordcloudfunction = function(namesSparse, seed = 123){
set.seed(seed)
m2 <- as.matrix(namesSparse)
v2 <- sort(colSums(m2),decreasing=TRUE)
d2 <- data.frame(word = names(v2),freq=v2)
wordcloud(words = d2$word, freq = d2$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
}
With the wordcloudfunction
, we tried to make the wordcloud process reproducible. After geting the data frame of the frequencies, it shows the wordcloud plot. There will be some randomness in plotting. So, we set the seed before plotting.
To be able to plot the wordcloud, we need to prepare the data. Like in the Unit 5 of Analytics Edge, we need to apply these steps:
removeSparseTerms(frequencies, 0.995)
. 0.995 means the ratio of the word occured in all sentence.)corpus = VCorpus(VectorSource(airbnb$name))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("airbnb", stopwords("english")))
frequencies = DocumentTermMatrix(corpus)
namesSparse = as.data.frame(as.matrix(frequencies))
colnames(namesSparse) = make.names(colnames(namesSparse))
wordcloudfunction(namesSparse)
As you can see in the plot, “bedroom”, “room” and “private” words are the most common words in the name
column. It means that most of the customers of Airbnb looks for the private rooms, so that these listings have these words in their names. As you can see from the plot that “brooklyn” and “manhattan” words are common in the name of the listings. We can infer that Brooklyn and Manhattan would have more listing than the others.
corpus = tm_map(corpus, stemDocument)
frequencies = DocumentTermMatrix(corpus)
namesSparse = as.data.frame(as.matrix(frequencies))
colnames(namesSparse) = make.names(colnames(namesSparse))
wordcloudfunction(namesSparse)
In this plot, we applied the stemming operation and then plot the words.As you can see, the word “apartment” is turned into “apart” and its occurence in the data becomes approximately the same with “privat” which stands for the “private” or “privates” words.
In this part, we want to analyze the neighborhoods according to the average night of the minimum_nights
variable.
airbnb %>%
group_by(neighbourhood, neighbourhood_group)%>%
summarise(average_night = mean(minimum_nights))%>%
arrange(average_night) %>%
head(25)%>%
ggplot(., aes(y=reorder(neighbourhood, average_night), x= average_night, fill = neighbourhood_group)) +
geom_col() +
theme_minimal() +
labs(title = "Average Minimum Nights and Neighborhood Relationship",
subtitle = "2019 NYC Airbnb Data",
x = "Average Minimum Night",
y = "Neighborhood",
fill = "Neighborhood Group")
When we search information about NYC to make more clear analysis by reading this link, we realize that the most of the landmarks are located in Manhattan. Thus, we expect the more staying in this place. To check the assumption, we order the neighborhoods according to their average minimum nights. The above plot shows that neighborhoods in Manhattan are not included daily hosting.
Like we find the most popular neighborhoods, we can also determine the most popular host in NYC according to listing counts.
top_10_listing_counts = airbnb %>%
group_by(host_id) %>%
summarise(listing_count = n()) %>%
arrange(desc(listing_count))
id_name = distinct(airbnb[, c("host_id", "host_name")])
top_10_listing_counts[1:10, ] %>%
left_join(., id_name, by = "host_id") %>%
ggplot(., aes(x = reorder(host_name, -listing_count) , y = listing_count, fill = host_name)) +
geom_col() +
theme_minimal() +
geom_text(aes(label = format(listing_count,digits=3)), size=4, position = position_dodge(0.9),vjust = 2) +
theme(axis.text.x = element_text(angle = 45), legend.position = "right") +
labs(title = "Top 10 Hosts in NYC",
subtitle = "2019 NYC Airbnb Data",
x = "Host Names",
y = "Listing Counts",
fill = "Host Name")
The another analysis can be made by using number of reviews.
airbnb %>%
group_by(neighbourhood, neighbourhood_group) %>%
summarise(mean_review = mean(number_of_reviews)) %>%
arrange(desc(mean_review)) %>%
head(30) %>%
ggplot(aes(x=mean_review, y = reorder(neighbourhood,mean_review), fill = neighbourhood_group)) +
geom_col() +
theme_minimal() +
labs(title = "Top 30 Neighborhood According to The Average Number of Reviews",
subtitle = "2019 NYC Airbnb Data",
x = "Average Number of Reviews",
y = "Neighborhood",
fill = "Neighbourhood Group")
The results show that Bronx and Staten Island take the most of the reviews. On the other hand, there is no neighborhood from Manhattan in the top 30.
In 2019, the fluctuation of the average price is getting smaller after April.
airbnb1 <- na.omit(airbnb)
airbnb1 %>%
group_by(last_review) %>%
transmute(last_review, average_price = mean(price)) %>%
filter(lubridate::year(last_review) > 2018) %>%
ggplot(aes(x=last_review, y = average_price)) +
geom_line(color = "#F08090") +
theme_minimal() +
labs(title = "Average Price of Airbnb Room Last Reviewed Date",
subtitle = "2019 NYC Airbnb Data",
x = "Last Review Date",
y = "Average Price")
In this study, we address the explanatory analysis of the airbnb data with several key features such as price, neighborhood, neighborhood group, room type, number of reviews, etc. By using these data,
The other analysis made to calculate following relationships:
To prepare this report we use some directive notes, reports, and web pages that are listed below:
The Extra Notebooks in Kaggle