Group Members
1 Key Takeaways
2 Dataset
- 2.1 Loading Packages
- 2.2 Data Preprocessing
3 Data Visualization
4 Conclusion
5 Reference

Group Members

Alican Yılmaz
Egecan Esen
Fatma Nur Dumlupınar
Irmak Dai
Süheyla Şeker
Tümay Kır

1 Key Takeaways

We analyzed the New York City Airbnb Open Data.
We compared hosts, prices, popularity, availability, number of rooms/houses, neighbourhoods and neighbourhood groups.
Some of the results from the visualizations:
- The number of rooms/houses and the price level are highest in Manhattan and Brooklyn.
- The number of available days and the number of reviews are higher in the other three neighbourhood groups.
These results may imply:
- When many rooms/houses are offered for rent in an area, the number of reviews each one receives (used here as a proxy for clicks) decreases.
- Host profiles seem different in Manhattan and Brooklyn compared to the other neighbourhood groups. Although the number of rooms/houses is high, their availability is low. This may be because people in expensive areas want to earn money from their homes even though they can offer them for only a limited number of days.

2 Dataset

In this report, Airbnb data for New York is examined mainly with the ggplot2 and dplyr packages. Additional packages are loaded for text mining, image handling, data import and date handling.

2.1 Loading Packages

library(dplyr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(png)
library(grid)
library(tm)
library(SnowballC)
library("wordcloud")
library("RColorBrewer")
library(rio)
library(lubridate)

2.2 Data Preprocessing

Firstly, we read the csv file and saved it as rds. Throughout the analysis, we created the data frames from the rds file.

object_csv<-read.csv("AB_NYC_2019.csv",sep=",")

saveRDS(object_csv, file = "my_data.rds")
object_rds<-readRDS(file = "my_data.rds")
glimpse(object_rds)

## Rows: 48,895
## Columns: 16
## $ id                             <int> 2539, 2595, 3647, 3831, 5022, 5099, ...
## $ name                           <chr> "Clean & quiet apt home by the park"...
## $ host_id                        <int> 2787, 2845, 4632, 4869, 7192, 7322, ...
## $ host_name                      <chr> "John", "Jennifer", "Elisabeth", "Li...
## $ neighbourhood_group            <chr> "Brooklyn", "Manhattan", "Manhattan"...
## $ neighbourhood                  <chr> "Kensington", "Midtown", "Harlem", "...
## $ latitude                       <dbl> 40.64749, 40.75362, 40.80902, 40.685...
## $ longitude                      <dbl> -73.97237, -73.98377, -73.94190, -73...
## $ room_type                      <chr> "Private room", "Entire home/apt", "...
## $ price                          <int> 149, 225, 150, 89, 80, 200, 60, 79, ...
## $ minimum_nights                 <int> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1, 5, 2...
## $ number_of_reviews              <int> 9, 45, 0, 270, 9, 74, 49, 430, 118, ...
## $ last_review                    <chr> "2018-10-19", "2019-05-21", "", "201...
## $ reviews_per_month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0....
## $ calculated_host_listings_count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, ...
## $ availability_365               <int> 365, 355, 365, 194, 0, 129, 0, 220, ...

When getting a glimpse of object_rds, we can see the data frame consists of 16 columns and 48895 rows. The variables have the following names:

id: Post ID
name: Short description of the accommodation option
host_id: Host ID
host_name: Name of the host
neighbourhood_group: One of the five neighbourhood groups in NYC
neighbourhood: Neighbourhood name
latitude: Location of the option as latitude
longitude: Location of the option as longitude
room_type: One of the three room types
price: Price
minimum_nights: Minimum number of nights per booking
number_of_reviews: Number of reviews a post has received
last_review: Date of the most recent review of the post
reviews_per_month: Average number of reviews per month for a post
calculated_host_listings_count: Number of listings the host has
availability_365: Number of available days in a year for a room/house

The variables mainly used for data visualisation are price, latitude, longitude, host_id, neighbourhood, neighbourhood_group, name, room_type, number_of_reviews and availability_365.

It is useful to change the data types of certain variables. The NA values of reviews_per_month can also be handled by assigning zero to them, because a missing entry means that the related post has no reviews.

object_rds$last_review<-as.POSIXct(object_rds$last_review,format="%Y-%m-%d")
object_rds$reviews_per_month[is.na(object_rds$reviews_per_month)] <- 0
object_rds$room_type<-as.factor(object_rds$room_type)
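
The zero-fill above rests on the assumption that a missing reviews_per_month always coincides with zero reviews. A minimal sketch of how that assumption could be checked before overwriting, using a small made-up data frame (toy_listings and its values are hypothetical, not taken from the real data):

```r
# Toy stand-in for object_rds; the values are invented for illustration.
toy_listings <- data.frame(
  number_of_reviews = c(9L, 0L, 270L, 0L),
  reviews_per_month = c(0.21, NA, 4.64, NA)
)

# TRUE only if every NA in reviews_per_month belongs to a listing with zero reviews.
na_only_when_no_reviews <- all(
  toy_listings$number_of_reviews[is.na(toy_listings$reviews_per_month)] == 0
)

# Replace NA with 0 only when the assumption holds.
if (na_only_when_no_reviews) {
  toy_listings$reviews_per_month[is.na(toy_listings$reviews_per_month)] <- 0
}
```

If the check returned FALSE on the real data, the missing values would need a closer look instead of a blanket replacement.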

3 Data Visualization

3.1 Effect of Neighbourhood Groups on Prices

Creating a new data frame in order to see the effect of neighbourhood on prices

data1<-object_rds %>% group_by(neighbourhood_group)%>%summarise(AvgPrice=mean(price))

Visualization of the relation between price and neighbourhood groups.

ggplot(data1,aes(x=neighbourhood_group,y=AvgPrice,group=1,fill=(neighbourhood_group)))+
 ggtitle("Changes in Prices over Neighbourhood Groups")+ geom_point()+
 geom_line()+
 theme_minimal() +
 labs(x = "Neighbourhood Groups",y = "Average Price Values", fill="Neighbourhood Groups") +
 theme(axis.text.x = element_text(angle =90,size=7,vjust=0.4))

The next plot makes it easier to see the outliers in room prices across the neighbourhood groups.

ggplot(object_rds,aes(x=neighbourhood_group,y=price,color=room_type))+
  ggtitle(label="Prices over Neighbourhood Groups for Room Types")+
  geom_point()+
  theme_minimal() +
  labs(x = "Neighbourhood Groups",y = "Price Values", color="Room Type") +
  # theme_minimal() resets earlier theme() settings, so all tweaks come after it
  theme(plot.title = element_text(hjust=1),
        axis.text.x = element_text(angle =90,size=7,vjust=0.4))

The average room price changes with the neighbourhood group. Average prices are higher in Manhattan and Brooklyn than in the other neighbourhood groups.
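
Because the scatter plot shows some extreme prices, a group mean can be pulled upward by a few listings. As a side check, one could compare the mean with the median per group. The sketch below uses invented prices and placeholder group names "A" and "B" (both are assumptions for illustration only):

```r
# Toy data: group "A" contains one extreme price, group "B" does not.
toy_prices <- data.frame(
  neighbourhood_group = c("A", "A", "A", "B", "B", "B"),
  price = c(100, 120, 1000, 60, 70, 80)
)

groups <- sort(unique(toy_prices$neighbourhood_group))

# Mean and median price per group, side by side.
price_summary <- data.frame(
  neighbourhood_group = groups,
  mean_price = as.vector(
    tapply(toy_prices$price, toy_prices$neighbourhood_group, mean)[groups]),
  median_price = as.vector(
    tapply(toy_prices$price, toy_prices$neighbourhood_group, median)[groups])
)
```

In the toy data the mean of group "A" is far above its median, while both measures agree for group "B". A large gap of this kind on the real data would suggest that outliers drive the average.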

3.2 Relation between Neighbourhood Groups and Availability of Room

ggplot(object_rds, aes(x=neighbourhood_group, y=availability_365, fill=neighbourhood_group)) +
  labs(x="Neighbourhood Group",y="Availability",title="Relation between Availability of Rooms and Neighbourhood Groups")+
  geom_boxplot()+
  theme(axis.text.x = element_text(angle=90,size=7,vjust=0.4))

Room availability is low in Manhattan and Brooklyn, possibly because these are two of the three most populated boroughs in New York City [1].

3.3 Relation between Number of Rooms and Different Neighbourhood Groups

# Count the listings in each neighbourhood group
piedata<-object_rds%>%count(neighbourhood_group,name="count")

Visualization of number of rooms in different neighbourhood groups.

ggplot(piedata,aes(x="",y=count,fill=neighbourhood_group)) + 
  labs(x="",y="",title="Number of Rooms in Neighbourhood Groups", fill="Neighbourhood Groups")+
  geom_bar(stat="identity",width=1) + coord_polar("y")

More rooms are advertised in Manhattan, Brooklyn and Queens than in the other two boroughs of New York. This may be related to the population of the boroughs [1].
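
Slices of a pie chart can be hard to compare by eye, so it may help to compute the percentage share of each group from the counts. A minimal sketch with invented counts and placeholder group names (toy_counts is hypothetical and only mirrors the shape of piedata):

```r
# Toy counts with the same columns as piedata; the numbers are made up.
toy_counts <- data.frame(
  neighbourhood_group = c("A", "B", "C"),
  count = c(50L, 30L, 20L)
)

# Share of each group in the total number of listings, in percent.
toy_counts$share_pct <- round(100 * toy_counts$count / sum(toy_counts$count), 1)
```

The share_pct column could then be shown as a label on the chart or reported in a small table next to it.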

3.4 Relation between Room Types and Prices

Creating a new data frame in order to see the effect of room types on prices.

data2<-object_rds %>% group_by(room_type)%>%summarise(AvgPrice=mean(price))

Visualization of the relation between average room prices for different room types.

ggplot(data2,aes(x=room_type,y=AvgPrice,fill=(room_type))) +
  geom_bar(stat="identity",position="dodge") + 
  theme_minimal() + 
  labs(x="Room Types",y="Average Price Values",title="Average Prices for Different Room Types",
       fill="Room Type") + 
  theme(axis.text.x = element_text(angle=90,size=7,vjust=0.4))

Room prices vary with the number of people who can stay and the quality of the room. An entire home/apartment can host the largest number of people, so it has the highest average price. Private rooms are the second most expensive, as they are likely more comfortable than shared rooms.

3.5 Relation between Average Prices of Rooms According to Their Room Type and Neighbourhood Groups

Creating new data frames in order to visualize average prices of rooms according to their room type and their neighbourhood groups.

# select() keeps the needed columns; summarise() is reserved for the aggregation step
data3<-object_rds%>%filter(neighbourhood_group=="Bronx")%>%select(neighbourhood_group,price,room_type)
data4<-data3%>%group_by(room_type)%>%summarise(AvgPriceBronx=mean(price),.groups = 'drop')

data5<-object_rds%>%filter(neighbourhood_group=="Brooklyn")%>%select(price,room_type)
data6<-data5%>%group_by(room_type)%>%summarise(AvgPriceBrooklyn=mean(price),.groups = 'drop')

data7<-object_rds%>%filter(neighbourhood_group=="Manhattan")%>%select(price,room_type)
data8<-data7%>%group_by(room_type)%>%summarise(AvgPriceManhattan=mean(price),.groups = 'drop')

data9<-object_rds%>%filter(neighbourhood_group=="Queens")%>%select(price,room_type)
data10<-data9%>%group_by(room_type)%>%summarise(AvgPriceQueens=mean(price),.groups = 'drop')

data11<-object_rds%>%filter(neighbourhood_group=="Staten Island")%>%select(price,room_type)
data12<-data11%>%group_by(room_type)%>%summarise(AvgPriceStatenIsland=mean(price),.groups = 'drop')

Creating newdata by using inner join technique.

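# Join on room_type explicitly instead of relying on the guessed key
newdata<-inner_join(data4,data6,by="room_type")
newdata<-inner_join(newdata,data8,by="room_type")
newdata<-inner_join(newdata,data10,by="room_type")
newdata<-inner_join(newdata,data12,by="room_type")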

Transposing newdata into mydf, then changing the column names and row names in order to visualize the data more clearly.

mydf = setNames(data.frame(t(newdata[,-1])),as.character(newdata[[1]]))
colnames(mydf) <- c("Entire home/apt","Private room","Shared room")
mydf<-mydf%>%rownames_to_column(var = "NeighbourhoodGroups")
mydf[1,1]="Bronx"
mydf[2,1]="Brooklyn"
mydf[3,1]="Manhattan"
mydf[4,1]="Queens"
mydf[5,1]="StatenIsland"

Reshaping mydf to long format with pivot_longer in order to visualize the data.

mydf %>% pivot_longer(.,cols=c("Entire home/apt","Private room","Shared room"))%>%
 ggplot(.,aes(x=NeighbourhoodGroups,y=value,fill=name))+geom_bar(stat = "identity",position="stack")+theme_minimal() + 
  labs(x="Neighbourhood Groups",y="Average Prices",title="Average Prices for Different Room Types for Neighbourhood Groups") + 
  theme(axis.text.x = element_text(angle=90,size=7,vjust=0.4))

Average prices of entire homes/apartments are higher than those of private rooms and shared rooms, since more people can stay in an entire apartment than in a single room. Average prices of all room types are highest in Manhattan, followed by Brooklyn and Queens. This also suggests that average prices of the different room types are higher in the three boroughs with the highest populations [1].
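
The same averages can also be obtained in one step by grouping on both variables at once, which avoids the intermediate data frames, the joins and the transpose. This is a sketch of an alternative, not the approach used in this report; toy_listings and its prices are invented for illustration:

```r
library(dplyr)

# Toy stand-in for object_rds; prices are invented for illustration only.
toy_listings <- data.frame(
  neighbourhood_group = c("Bronx", "Bronx", "Manhattan", "Manhattan", "Manhattan"),
  room_type = c("Private room", "Entire home/apt", "Private room",
                "Entire home/apt", "Entire home/apt"),
  price = c(50, 100, 120, 200, 300)
)

# One row per (neighbourhood group, room type) pair with its mean price.
avg_price <- toy_listings %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(AvgPrice = mean(price), .groups = "drop")
```

The result is already in long format, so it could be passed to ggplot directly with aes(x = neighbourhood_group, y = AvgPrice, fill = room_type).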

3.6 The Most Active Host Accounts by Number of Posts

Grouping the data frame by hosts is required to make further analysis about comparison of the hosts in different aspects. For this purpose, host_id is used because it is a unique value for each host. After grouping, total number of houses/rooms posted by each host is calculated to visualize activeness of the hosts.

by_host<-object_rds%>%select(id,neighbourhood,neighbourhood_group,
                             host_id,price,reviews_per_month,availability_365)%>%
  group_by(host_id)%>%
  tally()%>%
  arrange(desc(n))%>%
  head(10)

 ggplot(by_host, aes (x="", y = n, fill = factor(host_id))) + 
  geom_bar(width = 1, stat = "identity") + 
  geom_text(aes(label = paste(round(n / sum(n) * 100, 1), "%")),
            position = position_stack(vjust = 0.5)) +
  theme(axis.line = element_blank(),
        plot.title = element_text(hjust=0.5)) + 
  labs(fill = "Host ID",
       x = NULL,
       y = NULL,
       title = "The Most Active Host Accounts by Number of Posts")+ 
   coord_polar("y")

The 10 most active hosts are plotted on the pie chart by their ID numbers. The top 10 hosts together have nearly 1250 posts, about a quarter of which belong to the #1 host with the ID 219517861. The #2 host also has a large share compared to the other 8 hosts.

3.7 The Most Popular Host Accounts by Average Reviews per Month

A similar analysis can be made to show the popularity of hosts. After the same grouping process, the average value of reviews per month is calculated for each host as a measure of popularity. That is, the purpose is to find the hosts whose posts attract the most attention, with reviews per month used as a proxy because the dataset contains no click counts.

by_host_popularity<-object_rds%>%select(id,neighbourhood,neighbourhood_group,
                             host_id,price,reviews_per_month,availability_365)%>%
  group_by(host_id)%>%
  summarise(avg_score=mean(reviews_per_month))%>%
  arrange(desc(avg_score))%>%
  head(10)


 ggplot(by_host_popularity, aes (x="", y = avg_score, fill = factor(host_id))) + 
  geom_bar(width = 1, stat = "identity") + 
  geom_text(aes(label = paste(round(avg_score / sum(avg_score) * 100, 1), "%")),
            position = position_stack(vjust = 0.5)) +
  theme(axis.line = element_blank(),
        plot.title = element_text(hjust=0.5)) + 
  labs(fill = "Host ID",
       x = NULL,
       y = NULL,
       title = "The Most Popular Host Accounts by Average Reviews per Month")+ 
   coord_polar("y")

The 10 most popular hosts are plotted on the pie chart by their ID numbers. The average reviews per month of the top 10 hosts add up to almost 150. The most popular host’s ID is 156684502. The proportions are close to each other, in contrast to the most active hosts analysis.

3.8 House Locations by Neighbourhood Group

Location information can be obtained from the latitude and longitude of each post. Thus, parameters such as popularity, availability and expensiveness can be shown on an NYC map. However, we first have to show the locations of the neighbourhood groups in order to decide which posts belong to which neighbourhood group when analyzing the popularity, availability and expensiveness of the posts.

object_rds%>%
  select(latitude,longitude,neighbourhood_group)%>%
ggplot(.,aes(x=longitude,y=latitude,color=neighbourhood_group))+
  geom_point(alpha=0.5)+
  labs(title = "House Locations by Neighbourhood Group",
    color="Neighbourhood Group",
        x="Longitude",
        y="Latitude")

All the house/room posts are plotted on the scatter plot and colored by neighbourhood group, which yields a map based on their coordinates.

3.9 Price Levels wrt Locations

To see how the price level changes with location, the latitude, longitude and price variables are taken into consideration. An NYC map image is added to the plot background for better visualization [2]. Before starting the analysis, the upper limit of prices is found in order to eliminate outliers, so that extreme values do not affect the analysis.

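sorted<-sort(object_rds$price)
n<-length(sorted)
# Positions of the first and third quartiles in the sorted vector
Q1<-round(0.25*(n+1))
Q3<-round(0.75*(n+1))
# Interquartile range and the 1.5*IQR fences
IQR<-sorted[Q3]-sorted[Q1]
UL<-sorted[Q3]+1.5*IQR
LL<-sorted[Q1]-1.5*IQR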

price_map<-object_rds%>%
  select(longitude,latitude,price)%>%
  filter(price<UL)


img<-readPNG("New_York_City_.png")
  
ggplot(price_map,aes(x=longitude,y=latitude,color=price))+
  annotation_custom(rasterGrob(img, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)+
  geom_point(alpha=0.5)+
  scale_color_gradient(low="green", high="red")+
  labs(title = "Price Levels wrt Locations",
    color="Price",
        x="Longitude",
        y="Latitude")

When we look at the House Locations by Neighbourhood Group and Price Levels wrt Locations plots together, it is clear that the price level is not very high in the Bronx, Queens, Staten Island and the southern part of Brooklyn. However, the price color turns red in Manhattan and the northern part of Brooklyn, which means that houses/rooms are more expensive in this area.
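
The 1.5 × IQR cut-off used for the price map can be wrapped in a small helper so that it does not depend on a hard-coded row count. The function name upper_fence and the toy prices below are assumptions for illustration. Note that quantile() interpolates between observations by default, so on real data its result may differ slightly from the sorted-index approach above.

```r
# Upper outlier fence: third quartile plus k times the interquartile range.
upper_fence <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q[2] + k * (q[2] - q[1])
}

# Toy prices with one extreme value; invented for illustration.
toy_prices <- c(50, 60, 70, 80, 90, 100, 110, 120, 5000)

cutoff <- upper_fence(toy_prices)

# Keep only the prices below the fence, as in the price map filter.
kept <- toy_prices[toy_prices < cutoff]
```

For the toy vector the quartiles are 70 and 110, so the fence is 110 + 1.5 × 40 = 170 and only the extreme value 5000 is dropped.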

3.10 Availability of Accommodation Options wrt Locations

To see number of available days change with location, latitude, longitude and availability_365 variables are taken into consideration. Similarly, the upper limit of availability_365 is found to eliminate outliers. In this way, extreme values cannot affect the analysis.

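sorted_av<-sort(object_rds$availability_365)
n_av<-length(sorted_av)
# Positions of the first and third quartiles in the sorted vector
Q1_av<-round(0.25*(n_av+1))
Q3_av<-round(0.75*(n_av+1))
# Interquartile range and the 1.5*IQR fences
IQR_av<-sorted_av[Q3_av]-sorted_av[Q1_av]
UL_av<-sorted_av[Q3_av]+1.5*IQR_av
LL_av<-sorted_av[Q1_av]-1.5*IQR_av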

availability_map<-object_rds%>%
  select(longitude,latitude,availability_365)%>%
    filter(availability_365<=UL_av)


ggplot(availability_map,aes(x=longitude,y=latitude,color=availability_365))+
  annotation_custom(rasterGrob(img, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)+
  geom_point(alpha=0.5)+
  scale_color_gradient(low="green", high="red")+
   labs(title = "Availability of Accommodation Options wrt Locations",
    color="Available Days (yearly)",
        x="Longitude",
        y="Latitude")

The color difference between neighbourhood groups is not as large as in the price map. Nevertheless, green points (low availability) outnumber red points in Manhattan and Brooklyn, while red points (high availability) are more common in the other 3 neighbourhood groups. So, it is possible to associate the price map with the availability map. The comparison suggests that prices tend to be lower in regions where the availability of rooms/houses is higher.

3.11 Popularity of Accommodation Options wrt Locations

To see popularity with location, latitude, longitude and number_of_reviews variables are taken into consideration. Same as before, the upper limit of number_of_reviews is found to eliminate outliers. In this way, extreme values cannot affect the analysis.

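sorted_pop<-sort(object_rds$number_of_reviews)
n_pop<-length(sorted_pop)
# Positions of the first and third quartiles in the sorted vector
Q1_pop<-round(0.25*(n_pop+1))
Q3_pop<-round(0.75*(n_pop+1))
# Interquartile range and the 1.5*IQR fences
IQR_pop<-sorted_pop[Q3_pop]-sorted_pop[Q1_pop]
UL_pop<-sorted_pop[Q3_pop]+1.5*IQR_pop
LL_pop<-sorted_pop[Q1_pop]-1.5*IQR_pop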

popularity_map<-object_rds%>%
  select(longitude,latitude,number_of_reviews)%>%
  filter(number_of_reviews<=UL_pop)

ggplot(popularity_map,aes(x=longitude,y=latitude,color=number_of_reviews))+
  annotation_custom(rasterGrob(img, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)+
  geom_point(alpha=0.5)+
  scale_color_continuous(limits = c(0,150), 
                         breaks = c(150,125,100,75,50,25,0))+
  labs(title = "Popularity of Accommodation Options wrt Locations",
    color="Number of Reviews",
        x="Longitude",
        y="Latitude")

The Popularity of Accommodation Options wrt Locations plot seems inversely related to the price map. Expensive areas such as Manhattan and Brooklyn are colored dark blue, which denotes fewer reviews and therefore lower popularity. The more popular areas, shown in light blue, coincide with the cheaper areas. This may be a consequence of the price level: people may prefer cheaper places to stay.

3.12 The Most Frequently Used Words in the Posts

Hosts usually choose attractive and descriptive words in their post names. We can analyze this text data with a word cloud to see the words used most frequently in the posts. [3]

  # Create a corpus  
  text <- Corpus(VectorSource(object_rds$name))

  toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
  text <- tm_map(text, toSpace, "/")
  text <- tm_map(text, toSpace, ",")
  text <- tm_map(text, toSpace, "!")
  text <- tm_map(text, toSpace, "-")
  text <- tm_map(text, toSpace, ":")
  text <- tm_map(text, toSpace, "@")
  text <- tm_map(text, toSpace, "\\|")
  
  # Convert the text to lower case
  text <- tm_map(text, content_transformer(tolower))
  # Remove numbers
  text <- tm_map(text, removeNumbers)
  # Remove english common stopwords
  text <- tm_map(text, removeWords, stopwords("english"))
  # Remove your own stop word
  # specify your stopwords as a character vector
  text <- tm_map(text, removeWords, c("blabla1", "blabla2")) 
  # Remove punctuations
  text <- tm_map(text, removePunctuation)
  # Eliminate extra white spaces
  text <- tm_map(text, stripWhitespace)
  

  matrix <- as.matrix(TermDocumentMatrix(text))
  v <- sort(rowSums(matrix),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)
  head(d, 10)

  set.seed(1176)
  wordcloud(words = d$word, freq = d$freq, min.freq = 1,
            max.words=200, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))

The most frequently used words are room, bedroom, private and apartment. There are also prominent words about location such as Manhattan, Brooklyn, west and east, as well as other words like sunny, beautiful, luxury, cozy and spacious that are meant to arouse customers’ interest.

3.13 The Most Expensive Neighbourhood with their Group

Another analysis shows which neighbourhoods are the most expensive and to which neighbourhood groups they belong. For this purpose, the neighbourhood data frame is created.

neighbourhood<-object_rds%>%
  select(neighbourhood,neighbourhood_group,price,
         number_of_reviews,availability_365)%>%
  group_by(neighbourhood,neighbourhood_group)%>%
  summarise(mean_price=mean(price),
            mean_availability=mean(availability_365),
            mean_popularity=mean(number_of_reviews),
            .groups="drop")
  

neighbourhood%>%
  arrange(desc(mean_price))%>%
  head(40)%>%
ggplot(.,aes(x=mean_price,y=reorder(neighbourhood,mean_price),fill=neighbourhood_group)) +
  geom_col()+
  labs(fill="Neighbourhood Group",
       x="Average Prices",
       y="Neighbourhood Name",
       title="The Most Expensive Neighbourhood with their Group")+
  theme_minimal()

Manhattan has the largest number of neighbourhoods in the top 40. However, the 2 most expensive neighbourhoods belong to Staten Island. So, Manhattan is the most expensive neighbourhood group in general, although some individual neighbourhoods in other neighbourhood groups are also very expensive.
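
A neighbourhood average can rest on very few listings, which may be one reason why small neighbourhoods appear at the top of such a ranking. This has not been verified on the actual data. A minimal sketch of how the listing count could be reported next to the mean, using invented prices and placeholder neighbourhood names "X" and "Y":

```r
# Toy data: neighbourhood "X" has a single expensive listing, "Y" has four.
toy_listings <- data.frame(
  neighbourhood = c("X", "Y", "Y", "Y", "Y"),
  price = c(800, 150, 200, 250, 200)
)

mean_price <- tapply(toy_listings$price, toy_listings$neighbourhood, mean)
n_listings <- tapply(toy_listings$price, toy_listings$neighbourhood, length)

# Mean price and number of listings per neighbourhood, most expensive first.
by_nbhd <- data.frame(
  neighbourhood = names(mean_price),
  mean_price = as.vector(mean_price),
  n_listings = as.vector(n_listings)
)
by_nbhd <- by_nbhd[order(-by_nbhd$mean_price), ]
```

In the toy data "X" ranks first with a mean of 800 that is based on a single listing, which shows why the count is worth reading together with the average.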

3.14 The Most Available Neighbourhood with their Group

The same data frame created for “The Most Expensive Neighbourhood with their Group” plot is used to show which neighbourhoods are the most available and to which neighbourhood groups they belong. Availability is found by calculating the average number of available days of a house/room in a year for each neighbourhood.

neighbourhood%>%
  arrange(desc(mean_availability))%>%
  head(40)%>%
ggplot(.,aes(x=mean_availability,y=reorder(neighbourhood,mean_availability),fill=neighbourhood_group)) +
  geom_col()+
  labs(fill="Neighbourhood Group",
       x="Average Number of Available Days (yearly)",
       y="Neighbourhood Name",
       title="The Most Available Neighbourhood with their Group")+
  theme_minimal()

Pink color is the most apparent in the plot, which means Staten Island has the maximum number of the most available neighbourhoods. However, except for Manhattan and Brooklyn, the other two neighbourhood groups are close to Staten Island in terms of availability.

3.15 The Most Popular Neighbourhood with their Group

Neighbourhood data frame is also used for “The Most Popular Neighbourhood with their Group” plot to show the most popular neighbourhoods by their neighbourhood groups. Popularity is calculated as average number of reviews of posts for each neighbourhood.

neighbourhood%>%
  arrange(desc(mean_popularity))%>%
  head(40)%>%
ggplot(.,aes(x=mean_popularity,y=reorder(neighbourhood,mean_popularity),fill=neighbourhood_group)) +
  geom_col()+
  labs(fill="Neighbourhood Group",
       x="Average Number of Reviews",
       y="Neighbourhood Name",
       title="The Most Popular Neighbourhood with their Group")+
  theme_minimal()

Manhattan, the most expensive neighbourhood group, does not have any neighbourhood in the top 40 list in terms of popularity. Staten Island has the most neighbourhoods in the list, while Queens and the Bronx also have a large number.

4 Conclusion

In our analysis, we demonstrate:

- Effect of neighbourhood group on availability, popularity and prices of rooms/houses
- Distribution of the number of rooms/houses and room types over the neighbourhood groups
- Activeness and popularity of hosts
- Availability, popularity and price level of rooms/houses on the NYC map
- The most frequently used words in post descriptions, shown with a word cloud
- The most expensive, available and popular neighbourhoods associated with their neighbourhood groups

Group Assignment:

Analysis with New York City Airbnb Open Data - Kaggle

bıktık R’tık

August 30, 2020