In this report, Airbnb data in New York will be examined using mainly the ggplot2 and dplyr packages. More packages can be used for a better analysis.
library(dplyr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(png)
library(grid)
library(tm)
library(SnowballC)
library("wordcloud")
library("RColorBrewer")
library(rio)
library(lubridate)
First, we read the CSV file and save it as an RDS file. Throughout the analysis, the data frames are created from the RDS file.
object_csv<-read.csv("AB_NYC_2019.csv",sep=",")
saveRDS(object_csv, file = "my_data.rds")
object_rds<-readRDS(file = "my_data.rds")
glimpse(object_rds)
## Rows: 48,895
## Columns: 16
## $ id <int> 2539, 2595, 3647, 3831, 5022, 5099, ...
## $ name <chr> "Clean & quiet apt home by the park"...
## $ host_id <int> 2787, 2845, 4632, 4869, 7192, 7322, ...
## $ host_name <chr> "John", "Jennifer", "Elisabeth", "Li...
## $ neighbourhood_group <chr> "Brooklyn", "Manhattan", "Manhattan"...
## $ neighbourhood <chr> "Kensington", "Midtown", "Harlem", "...
## $ latitude <dbl> 40.64749, 40.75362, 40.80902, 40.685...
## $ longitude <dbl> -73.97237, -73.98377, -73.94190, -73...
## $ room_type <chr> "Private room", "Entire home/apt", "...
## $ price <int> 149, 225, 150, 89, 80, 200, 60, 79, ...
## $ minimum_nights <int> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1, 5, 2...
## $ number_of_reviews <int> 9, 45, 0, 270, 9, 74, 49, 430, 118, ...
## $ last_review <chr> "2018-10-19", "2019-05-21", "", "201...
## $ reviews_per_month <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0....
## $ calculated_host_listings_count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, ...
## $ availability_365 <int> 365, 355, 365, 194, 0, 129, 0, 220, ...
A glimpse of object_rds shows that the data frame consists of 16 columns and 48,895 rows. The variables have the following names:
id: Post ID
name: Short description of the accommodation option
host_id: Host ID
host_name: Name of the host
neighbourhood_group: One of the five neighbourhood groups in NYC
neighbourhood: Neighbourhood name
latitude: Location of the option as latitude
longitude: Location of the option as longitude
room_type: One of the three room types
price: Price
minimum_nights: Minimum number of nights to be rented
number_of_reviews: Number of reviews written for a post
last_review: Date on which the post was last reviewed
reviews_per_month: Average number of reviews per month written for a post
calculated_host_listings_count: Calculated host listings count
availability_365: Number of available days in a year for a room/house
The variables mainly used for data visualisation are price, latitude, longitude, host_id, neighbourhood, neighbourhood_group, name, room_type, number_of_reviews and availability_365.
It is useful to change the data types of certain variables. The NA values of reviews_per_month can also be handled by setting them to zero, because a missing entry means that the related post has no reviews.
object_rds$last_review<-as.POSIXct(object_rds$last_review,format="%Y-%m-%d")
object_rds$reviews_per_month[is.na(object_rds$reviews_per_month)] <- 0
object_rds$room_type<-as.factor(object_rds$room_type)
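The three conversions above can also be written as a single dplyr pipeline. The following is a minimal sketch on a small made-up data frame (toy and toy_clean are hypothetical names, not part of the Airbnb data). It swaps in as.Date for as.POSIXct, on the assumption that only the calendar date of the last review matters.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data mimicking three columns of object_rds
toy <- data.frame(
  last_review = c("2018-10-19", "", "2019-05-21"),
  reviews_per_month = c(0.21, NA, 0.38),
  room_type = c("Private room", "Entire home/apt", "Private room"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(
    # empty strings are turned into NA before parsing the dates
    last_review = as.Date(na_if(last_review, ""), format = "%Y-%m-%d"),
    # a missing value means the post has no reviews
    reviews_per_month = replace_na(reviews_per_month, 0),
    room_type = as.factor(room_type)
  )

str(toy_clean)
```

Keeping the conversions in one mutate() call makes it easier to see which columns are changed before the analysis starts.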
Creating a new data frame in order to see the effect of neighbourhood on prices
data1<-object_rds %>% group_by(neighbourhood_group)%>%summarise(AvgPrice=mean(price))
Visualization of the relation between price and neighbourhood groups.
ggplot(data1,aes(x=neighbourhood_group,y=AvgPrice,group=1,fill=(neighbourhood_group)))+
ggtitle("Changes in Prices over Neighbourhood Groups")+ geom_point()+
geom_line()+
theme_minimal() +
labs(x = "Neighbourhood Groups",y = "Average Price Values", fill="Neighbourhood Groups") +
theme(axis.text.x = element_text(angle =90,size=7,vjust=0.4))
This plot makes it easier to see the outliers in room prices according to different neighbourhood groups.
ggplot(object_rds,aes(x=neighbourhood_group,y=price,group=1,color=room_type))+
ggtitle(label="Prices over Neighbourhood Groups for Room Types")+
geom_point()+
theme_minimal() +
labs(x = "Neighbourhood Groups",y = "Price Values", color="Room Type") +
# theme() comes after theme_minimal(), which would otherwise reset the title alignment
theme(plot.title = element_text(hjust=1),
axis.text.x = element_text(angle =90,size=7,vjust=0.4))
The average price of rooms changes across neighbourhood groups. Rooms are more expensive in Manhattan and Brooklyn than in the other neighbourhood groups.
ggplot(object_rds, aes(x=neighbourhood_group, y=availability_365, fill=neighbourhood_group)) +
labs(x="Neighbourhood Group",y="Availability",title="Relation between Availability of Rooms and Neighbourhood Groups")+
geom_boxplot()+
theme(axis.text.x = element_text(angle=90,size=7,vjust=0.4))
Room availability is low in Manhattan and Brooklyn, possibly because both are among the three most populous boroughs of New York City [1].
piedata<-aggregate(cbind(count = neighbourhood_group) ~ neighbourhood_group,
data = object_rds,
FUN = function(x){NROW(x)})
Visualization of number of rooms in different neighbourhood groups.
ggplot(piedata,aes(x="",y=count,fill=neighbourhood_group)) +
labs(x="",y="",title="Number of Rooms in Neighbourhood Groups", fill="Neighbourhood Groups")+
geom_bar(stat="identity",width=1) + coord_polar("y")
More rooms are advertised in Manhattan, Brooklyn and Queens than in the other two boroughs of New York. This may be related to the population of the boroughs [1].
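The per-borough counts behind the pie chart can also be obtained with dplyr's count(), which makes it easy to add each borough's share. This is a minimal sketch on made-up rows (toy_groups and shares are hypothetical names), not the authors' code.

```r
library(dplyr)

# Hypothetical toy data: one row per advertised room
toy_groups <- data.frame(
  neighbourhood_group = c("Manhattan", "Manhattan", "Brooklyn", "Queens"),
  stringsAsFactors = FALSE
)

shares <- toy_groups %>%
  count(neighbourhood_group, name = "count") %>%  # rows per borough
  mutate(share = count / sum(count))              # fraction of all rooms

print(shares)
```

The share column could be used for percentage labels on the pie slices.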
Creating a new data frame in order to see the effect of room types on prices.
data2<-object_rds %>% group_by(room_type)%>%summarise(AvgPrice=mean(price))
Visualization of the relation between average room prices for different room types.
ggplot(data2,aes(x=room_type,y=AvgPrice,fill=(room_type))) +
geom_bar(stat="identity",position="dodge") +
theme_minimal() +
labs(x="Room Types",y="Average Price Values",title="Average Prices for Different Room Types",
fill="Room Type") +
theme(axis.text.x = element_text(angle=90,size=7,vjust=0.4))
Room prices vary depending on the number of people who can stay and the quality of the room. Since an entire home/apartment can host the largest number of people, it has the highest average price. Private rooms are the second most expensive, as they are more luxurious than shared rooms.
Creating new data frames in order to visualize average prices of rooms according to their room type and their neighbourhood groups.
# select() keeps the columns; summarise() is meant to return one row per group
data3<-object_rds%>%filter(neighbourhood_group=="Bronx")%>%select(neighbourhood_group,price,room_type)
data4<-data3%>%group_by(room_type)%>%summarise(AvgPriceBronx=mean(price),.groups = 'drop')
data5<-object_rds%>%filter(neighbourhood_group=="Brooklyn")%>%select(price,room_type)
data6<-data5%>%group_by(room_type)%>%summarise(AvgPriceBrooklyn=mean(price),.groups = 'drop')
data7<-object_rds%>%filter(neighbourhood_group=="Manhattan")%>%select(price,room_type)
data8<-data7%>%group_by(room_type)%>%summarise(AvgPriceManhattan=mean(price),.groups = 'drop')
data9<-object_rds%>%filter(neighbourhood_group=="Queens")%>%select(price,room_type)
data10<-data9%>%group_by(room_type)%>%summarise(AvgPriceQueens=mean(price),.groups = 'drop')
data11<-object_rds%>%filter(neighbourhood_group=="Staten Island")%>%select(price,room_type)
data12<-data11%>%group_by(room_type)%>%summarise(AvgPriceStatenIsland=mean(price),.groups = 'drop')
Creating newdata by joining the per-borough averages on room_type with inner joins.
newdata<-inner_join(data4,data6,by="room_type")
newdata<-inner_join(newdata,data8,by="room_type")
newdata<-inner_join(newdata,data10,by="room_type")
newdata<-inner_join(newdata,data12,by="room_type")
Transposing newdata into mydf and changing the column names and row names in order to visualize the data in a better way.
mydf<-setNames(data.frame(t(newdata[,-1])),as.character(newdata[[1]]))
colnames(mydf) <- c("Entire home/apt","Private room","Shared room")
mydf<-mydf%>%rownames_to_column(var = "NeighbourhoodGroups")
mydf[1,1]="Bronx"
mydf[2,1]="Brooklyn"
mydf[3,1]="Manhattan"
mydf[4,1]="Queens"
mydf[5,1]="StatenIsland"
Filtering the mydf data frame and applying a pivot longer operation in order to visualize the data.
mydf %>% filter(NeighbourhoodGroups %in% c("Bronx","Brooklyn","Manhattan","Queens","StatenIsland"))%>%pivot_longer(.,cols=c("Entire home/apt","Private room","Shared room"))%>%
ggplot(.,aes(x=NeighbourhoodGroups,y=value,fill=name))+geom_bar(stat = "identity",position="stack")+theme_minimal() +
labs(x="Neighbourhood Groups",y="Average Prices",title="Average Prices for Different Room Types for Neighbourhood Groups") +
theme(axis.text.x = element_text(angle=90,size=7,vjust=0.4))
Average prices of entire homes/apartments are higher than those of private rooms and shared rooms, since more people can stay in an entire apartment than in a single room. Average prices of all room types are highest in Manhattan, followed by Brooklyn and Queens. This also reflects that average prices of the different room types are higher in the three boroughs with the highest populations [1].
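The same table of average prices per neighbourhood group and room type can be produced without separate per-borough data frames and joins. The sketch below is an alternative under stated assumptions, not the method used in this report: it runs on made-up prices (toy_prices, avg_long and avg_wide are hypothetical names) and groups by both variables at once.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the three relevant columns
toy_prices <- data.frame(
  neighbourhood_group = c("Bronx", "Bronx", "Manhattan", "Manhattan", "Manhattan"),
  room_type = c("Private room", "Shared room", "Private room", "Private room", "Shared room"),
  price = c(60, 40, 120, 100, 70),
  stringsAsFactors = FALSE
)

# Long format: one row per (group, room type) pair, ready for ggplot
avg_long <- toy_prices %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(AvgPrice = mean(price), .groups = "drop")

# Wide format: one row per group, one column per room type
avg_wide <- avg_long %>%
  pivot_wider(names_from = room_type, values_from = AvgPrice)

print(avg_wide)
```

The long format can be passed to ggplot directly with fill=room_type, so the transpose and pivot longer steps are not needed in this variant.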
Grouping the data frame by host is required to compare the hosts in different aspects. For this purpose, host_id is used because it is a unique value for each host. After grouping, the total number of houses/rooms posted by each host is calculated to visualize how active the hosts are.
by_host<-object_rds%>%select(id,neighbourhood,neighbourhood_group,
host_id,price,reviews_per_month,availability_365)%>%
group_by(host_id)%>%
tally()%>%
arrange(desc(n))%>%
head(10)
ggplot(by_host, aes (x="", y = n, fill = factor(host_id))) +
geom_bar(width = 1, stat = "identity") +
geom_text(aes(label = paste(round(n / sum(n) * 100, 1), "%")),
position = position_stack(vjust = 0.5)) +
theme(axis.line = element_blank(),
plot.title = element_text(hjust=0.5)) +
labs(fill = "Host ID",
x = NULL,
y = NULL,
title = "The Most Active Host Accounts by Number of Posts")+
coord_polar("y")
The 10 most active hosts are plotted on the pie chart by their ID number. The top 10 hosts have nearly 1250 posts in total, about a quarter of which belong to the #1 host with the ID 219517861. The #2 host also has a large share compared to the other 8 hosts.
A similar analysis can be made to show the popularity of hosts. After the same grouping process, the average value of reviews per month is calculated for each host as a measure of popularity. That is, the purpose is to find the hosts whose posts are reviewed most often.
by_host_popularity<-object_rds%>%select(id,neighbourhood,neighbourhood_group,
host_id,price,reviews_per_month,availability_365)%>%
group_by(host_id)%>%
summarise(avg_score=mean(reviews_per_month))%>%
arrange(desc(avg_score))%>%
head(10)
ggplot(by_host_popularity, aes (x="", y = avg_score, fill = factor(host_id))) +
geom_bar(width = 1, stat = "identity") +
geom_text(aes(label = paste(round(avg_score / sum(avg_score) * 100, 1), "%")),
position = position_stack(vjust = 0.5)) +
theme(axis.line = element_blank(),
plot.title = element_text(hjust=0.5)) +
labs(fill = "Host ID",
x = NULL,
y = NULL,
title = "The Most Popular Host Accounts by Average Reviews per Month")+
coord_polar("y")
The 10 most popular hosts are plotted on the pie chart by their ID number. The average reviews per month of the top 10 hosts sum to almost 150. The most popular host’s ID is 156684502. In contrast to the most active hosts analysis, the proportions are close to each other.
Location information can be obtained from the latitude and longitude of each post. Thus, parameters such as popularity, availability and expensiveness can be shown on a NYC map. However, we first have to show the locations of the neighbourhood groups in order to decide which posts belong to which neighbourhood group when analyzing the popularity, availability and expensiveness of the posts.
object_rds%>%
select(latitude,longitude,neighbourhood_group)%>%
ggplot(.,aes(x=longitude,y=latitude,color=neighbourhood_group))+
geom_point(alpha=0.5)+
labs(title = "House Locations by Neighbourhood Group",
color="Neighbourhood Group",
x="Longitude",
y="Latitude")
All the house/room posts are plotted on the scatter plot, colored by neighbourhood group, which gives a map based on their latitude and longitude.
To see how the price level changes with location, the latitude, longitude and price variables are taken into consideration. A NYC map image is added to the plot background for a better visualization [2]. Before starting the analysis, the upper limit of the prices is found to eliminate outliers. In this way, extreme values cannot affect the analysis.
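sorted<-sort(object_rds$price)
n<-length(sorted)
# quartile positions at (n+1)*p; with 48,895 rows these are 0.25*48896 and 0.75*48896
Q1<-0.25*(n+1)
Q3<-0.75*(n+1)
IQR<-sorted[Q3]-sorted[Q1]
UL<-sorted[Q3]+1.5*IQR
LL<-sorted[Q1]-1.5*IQR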
price_map<-object_rds%>%
select(longitude,latitude,price)%>%
filter(price<UL)
img<-readPNG("New_York_City_.png")
ggplot(price_map,aes(x=longitude,y=latitude,color=price))+
annotation_custom(rasterGrob(img,
width = unit(1,"npc"),
height = unit(1,"npc")),
-Inf, Inf, -Inf, Inf)+
geom_point(alpha=0.5)+
scale_color_gradient(low="green", high="red")+
labs(title = "Price Levels wrt Locations",
color="Price",
x="Longitude",
y="Latitude")
When we look at the House Locations by Neighbourhood Group and Price Levels wrt Locations plots together, it is clear that the price level is not so high in the Bronx, Queens, Staten Island and the southern region of Brooklyn. However, the price color turns to red in Manhattan and the northern region of Brooklyn, which means that houses/rooms are more expensive in this area.
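The upper and lower limits used to eliminate outliers follow the usual 1.5 times IQR rule. As a hedged sketch, the same limits can be computed with R's quantile() function instead of indexing a sorted vector. The prices below are made up, and type = 6 is chosen on the assumption that the quartile positions should be (n+1)*p as in the code above (R's default is type = 7, which can give slightly different quartiles).

```r
# Hypothetical toy prices with one extreme value
prices <- c(50, 60, 70, 80, 90, 100, 1000)

# type = 6 places the quartiles at positions (n + 1) * p of the sorted data
q <- quantile(prices, probs = c(0.25, 0.75), type = 6)
iqr <- unname(q[2] - q[1])

upper <- unname(q[2]) + 1.5 * iqr  # upper limit
lower <- unname(q[1]) - 1.5 * iqr  # lower limit

# keep only the values inside the limits
kept <- prices[prices >= lower & prices <= upper]
print(kept)
```

Because quantile() works on the vector directly, the limits do not depend on a hard-coded number of rows.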
To see how the number of available days changes with location, the latitude, longitude and availability_365 variables are taken into consideration. Similarly, the upper limit of availability_365 is found to eliminate outliers. In this way, extreme values cannot affect the analysis.
sorted_av<-sort(object_rds$availability_365)
n_av<-length(sorted_av)
# quartile positions at (n+1)*p instead of a hard-coded row count
Q1_av<-0.25*(n_av+1)
Q3_av<-0.75*(n_av+1)
IQR_av<-sorted_av[Q3_av]-sorted_av[Q1_av]
UL_av<-sorted_av[Q3_av]+1.5*IQR_av
LL_av<-sorted_av[Q1_av]-1.5*IQR_av
availability_map<-object_rds%>%
select(longitude,latitude,availability_365)%>%
filter(availability_365<=UL_av)
ggplot(availability_map,aes(x=longitude,y=latitude,color=availability_365))+
annotation_custom(rasterGrob(img,
width = unit(1,"npc"),
height = unit(1,"npc")),
-Inf, Inf, -Inf, Inf)+
geom_point(alpha=0.5)+
scale_color_gradient(low="green", high="red")+
labs(title = "Availability of Accommodation Options wrt Locations",
color="Available Days (yearly)",
x="Longitude",
y="Latitude")
The color difference between neighbourhood groups is not as large as in the price map. Nevertheless, green points outnumber red points in Manhattan and Brooklyn, while red points are more common in the other 3 neighbourhood groups. So, it is possible to associate the price map with the availability map: regions where rooms/houses are more available tend to have lower prices.
To see how popularity changes with location, the latitude, longitude and number_of_reviews variables are taken into consideration. As before, the upper limit of number_of_reviews is found to eliminate outliers. In this way, extreme values cannot affect the analysis.
sorted_pop<-sort(object_rds$number_of_reviews)
n_pop<-length(sorted_pop)
# quartile positions at (n+1)*p instead of a hard-coded row count
Q1_pop<-0.25*(n_pop+1)
Q3_pop<-0.75*(n_pop+1)
IQR_pop<-sorted_pop[Q3_pop]-sorted_pop[Q1_pop]
UL_pop<-sorted_pop[Q3_pop]+1.5*IQR_pop
LL_pop<-sorted_pop[Q1_pop]-1.5*IQR_pop
popularity_map<-object_rds%>%
select(longitude,latitude,number_of_reviews)%>%
filter(number_of_reviews<=UL_pop)
ggplot(popularity_map,aes(x=longitude,y=latitude,color=number_of_reviews))+
annotation_custom(rasterGrob(img,
width = unit(1,"npc"),
height = unit(1,"npc")),
-Inf, Inf, -Inf, Inf)+
geom_point(alpha=0.5)+
scale_color_continuous(limits = c(0,150),
breaks = c(150,125,100,75,50,25,0))+
labs(title = "Popularity of Accommodation Options wrt Locations",
color="Number of Reviews",
x="Longitude",
y="Latitude")
The Popularity of Accommodation Options wrt Locations plot seems inversely related to the price map. Expensive areas such as Manhattan and Brooklyn are colored dark blue, which denotes a low number of reviews, that is, less popular areas. The more popular areas in light blue coincide with the cheaper areas. This result may arise from the price level: people may prefer cheaper places to stay.
Hosts usually choose attractive and descriptive words in their post names. We can analyze this text data with a word cloud to see the words used most frequently in the posts. [3]
# Create a corpus
text <- Corpus(VectorSource(object_rds$name))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
text <- tm_map(text, toSpace, "/")
text <- tm_map(text, toSpace, ",")
text <- tm_map(text, toSpace, "!")
text <- tm_map(text, toSpace, "-")
text <- tm_map(text, toSpace, ":")
text <- tm_map(text, toSpace, "@")
text <- tm_map(text, toSpace, "\\|")
# Convert the text to lower case
text <- tm_map(text, content_transformer(tolower))
# Remove numbers
text <- tm_map(text, removeNumbers)
# Remove english common stopwords
text <- tm_map(text, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
text <- tm_map(text, removeWords, c("blabla1", "blabla2"))
# Remove punctuations
text <- tm_map(text, removePunctuation)
# Eliminate extra white spaces
text <- tm_map(text, stripWhitespace)
matrix <- as.matrix(TermDocumentMatrix(text))
v <- sort(rowSums(matrix),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
set.seed(1176)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The most frequently used words are room, bedroom, private and apartment. There are also prominent words that describe the location, such as Manhattan, Brooklyn, west and east, as well as words like sunny, beautiful, luxury, cozy and spacious that are meant to arouse customers’ interest.
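The word frequencies behind the word cloud come down to splitting the post names into lower-case words, dropping stop words and counting what remains. The following base R sketch illustrates that idea on three made-up post names. It is a simplified stand-in for the tm pipeline above, not a replacement, and the one-word stop-word list is hypothetical.

```r
# Hypothetical post names
post_names <- c("Cozy private room", "Sunny room in Brooklyn", "Private bedroom!")

# Replace everything that is not a letter or a space, lower-case, split on spaces
cleaned <- tolower(gsub("[^[:alpha:] ]", " ", post_names))
words <- unlist(strsplit(cleaned, " +"))
words <- words[nchar(words) > 0]

# Hypothetical stop-word list; tm's stopwords("english") is much longer
stop_words <- c("in")
words <- words[!words %in% stop_words]

# Count the words and sort by frequency
freq <- sort(table(words), decreasing = TRUE)
d_toy <- data.frame(word = names(freq), freq = as.integer(freq))
print(d_toy)
```

A data frame of this shape (a word column and a freq column) is what wordcloud() receives through its words and freq arguments.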
Another analysis is to demonstrate which neighbourhoods are the most expensive ones and to which neighbourhood groups they belong. For this purpose, the neighbourhood data frame is created.
neighbourhood<-object_rds%>%
select(neighbourhood,neighbourhood_group,price,
number_of_reviews,availability_365)%>%
group_by(neighbourhood,neighbourhood_group)%>%
summarise(mean_price=mean(price),
mean_availability=mean(availability_365),
mean_popularity=mean(number_of_reviews))
neighbourhood%>%
arrange(desc(mean_price))%>%
head(40)%>%
ggplot(.,aes(x=mean_price,y=reorder(neighbourhood,mean_price),fill=neighbourhood_group)) +
geom_col()+
labs(fill="Neighbourhood Group",
x="Average Prices",
y="Neighbourhood Name",
title="The Most Expensive Neighbourhood with their Group")+
theme_minimal()
Manhattan has the largest number of neighbourhoods in the top 40. However, the 2 most expensive neighbourhoods belong to Staten Island. So, Manhattan is the most expensive neighbourhood group in general, although some individual neighbourhoods in other neighbourhood groups are also very expensive.
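The arrange(desc(...)) followed by head(40) pattern used to pick the most expensive neighbourhoods can also be expressed with dplyr's slice_max(). This is a minimal sketch on made-up averages (toy_nb and top2 are hypothetical names, and n = 2 stands in for 40).

```r
library(dplyr)

# Hypothetical per-neighbourhood averages
toy_nb <- data.frame(
  neighbourhood = c("A", "B", "C"),
  neighbourhood_group = c("X", "X", "Y"),
  mean_price = c(100, 300, 200),
  stringsAsFactors = FALSE
)

# Keep the n rows with the highest mean_price, highest first
top2 <- toy_nb %>%
  slice_max(mean_price, n = 2)

print(top2)
```

Note that slice_max() keeps ties by default (with_ties = TRUE), so it can return more than n rows, whereas head() always cuts at exactly n.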
The same data frame created for the “The Most Expensive Neighbourhood with their Group” plot is used to demonstrate which neighbourhoods are the most available ones and to which neighbourhood groups they belong. Availability is calculated as the average number of days per year on which a house/room is available, for each neighbourhood.
neighbourhood%>%
arrange(desc(mean_availability))%>%
head(40)%>%
ggplot(.,aes(x=mean_availability,y=reorder(neighbourhood,mean_availability),fill=neighbourhood_group)) +
geom_col()+
labs(fill="Neighbourhood Group",
x="Average Number of Available Days (yearly)",
y="Neighbourhood Name",
title="The Most Available Neighbourhood with their Group")+
theme_minimal()
Pink is the most prominent color in the plot, which means Staten Island has the largest number of the most available neighbourhoods. However, except for Manhattan and Brooklyn, the other two neighbourhood groups are close to Staten Island in terms of availability.
The neighbourhood data frame is also used for the “The Most Popular Neighbourhood with their Group” plot to show the most popular neighbourhoods by their neighbourhood groups. Popularity is calculated as the average number of reviews of the posts in each neighbourhood.
neighbourhood%>%
arrange(desc(mean_popularity))%>%
head(40)%>%
ggplot(.,aes(x=mean_popularity,y=reorder(neighbourhood,mean_popularity),fill=neighbourhood_group)) +
geom_col()+
labs(fill="Neighbourhood Group",
x="Average Number of Reviews",
y="Neighbourhood Name",
title="The Most Popular Neighbourhood with their Group")+
theme_minimal()
Manhattan, the most expensive neighbourhood group, does not have any neighbourhood in the top 40 list in terms of popularity. Staten Island has the most neighbourhoods in the list, while Queens and the Bronx also have a large number.
In our analysis, we demonstrate:
[3]: Word Cloud
Lecture Notes: Berk Orbay IE48A