Group Members

1 Key Takeaways

2 Dataset

This report examines Airbnb listing data for New York City, mainly with the ggplot2 and dplyr packages. A few additional packages are loaded for text mining and image handling.

2.1 Loading Packages

library(dplyr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(png)
library(grid)
library(tm)
library(SnowballC)
library("wordcloud")
library("RColorBrewer")
library(rio)
library(lubridate)

2.2 Data Preprocessing

Firstly, we read the csv file and saved it as rds. Throughout the analysis, we created the data frames from the rds file.

object_csv<-read.csv("AB_NYC_2019.csv",sep=",")

saveRDS(object_csv, file = "my_data.rds")
object_rds<-readRDS(file = "my_data.rds")
glimpse(object_rds)
## Rows: 48,895
## Columns: 16
## $ id                             <int> 2539, 2595, 3647, 3831, 5022, 5099, ...
## $ name                           <chr> "Clean & quiet apt home by the park"...
## $ host_id                        <int> 2787, 2845, 4632, 4869, 7192, 7322, ...
## $ host_name                      <chr> "John", "Jennifer", "Elisabeth", "Li...
## $ neighbourhood_group            <chr> "Brooklyn", "Manhattan", "Manhattan"...
## $ neighbourhood                  <chr> "Kensington", "Midtown", "Harlem", "...
## $ latitude                       <dbl> 40.64749, 40.75362, 40.80902, 40.685...
## $ longitude                      <dbl> -73.97237, -73.98377, -73.94190, -73...
## $ room_type                      <chr> "Private room", "Entire home/apt", "...
## $ price                          <int> 149, 225, 150, 89, 80, 200, 60, 79, ...
## $ minimum_nights                 <int> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1, 5, 2...
## $ number_of_reviews              <int> 9, 45, 0, 270, 9, 74, 49, 430, 118, ...
## $ last_review                    <chr> "2018-10-19", "2019-05-21", "", "201...
## $ reviews_per_month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0....
## $ calculated_host_listings_count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, ...
## $ availability_365               <int> 365, 355, 365, 194, 0, 129, 0, 220, ...

When getting a glimpse of object_rds, we can see the data frame consists of 16 columns and 48895 rows. The variables have the following names:

  • id: Post ID
  • name: Short description of the accommodation option
  • host_id: Host ID
  • host_name: Name of the host
  • neighbourhood_group: One of the five neighbourhood groups in NYC
  • neighbourhood: Neighbourhood name
  • latitude: Location of the option as latitude
  • longitude: Location of the option as longitude
  • room_type: One of the three room types
  • price: Price
  • minimum_nights: Number of minimum nights to be rented
  • number_of_reviews: Number of reviews performed for a post
  • last_review: The date which the post is reviewed lastly
  • reviews_per_month: Average number of reviews per month performed for a post
  • calculated_host_listings_count: Calculated host listings count
  • availability_365: Number of available days in a year for a room/house

The variables mainly used for data visualisation are price, latitude, longitude, host_id, neighbourhood, neighbourhood_group, name, room_type, number_of_reviews and availability_365.

It is useful to change the data types of certain variables. The NA values of reviews_per_month can also be replaced with zero, because a missing entry means that the post has no reviews.

object_rds$last_review<-as.POSIXct(object_rds$last_review,format="%Y-%m-%d")
object_rds$reviews_per_month[is.na(object_rds$reviews_per_month)] <- 0
object_rds$room_type<-as.factor(object_rds$room_type)
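
The same conversions can also be written as a single dplyr step. The following is a minimal sketch on made-up rows, not the report's own code: the `listings` and `cleaned` names are hypothetical, and `as.Date` is swapped in for `as.POSIXct` on the assumption that only the calendar date matters.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for object_rds (values are made up for illustration)
listings <- tibble(
  last_review       = c("2018-10-19", "", "2019-05-21"),
  reviews_per_month = c(0.21, NA, 0.38),
  room_type         = c("Private room", "Entire home/apt", "Private room")
)

cleaned <- listings %>%
  mutate(
    # Turn empty strings into NA first, then parse the remaining dates
    last_review       = as.Date(na_if(last_review, ""), format = "%Y-%m-%d"),
    # A missing rate is treated as "no reviews", i.e. zero
    reviews_per_month = replace_na(reviews_per_month, 0),
    room_type         = factor(room_type)
  )
```

Keeping the three conversions in one `mutate()` call makes it easier to see at a glance which columns were altered before plotting.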

3 Data Visualization

3.1 Effect of Neighbourhood Groups on Prices

Creating a new data frame in order to see the effect of neighbourhood on prices

data1<-object_rds %>% group_by(neighbourhood_group)%>%summarise(AvgPrice=mean(price))

Visualization of the relation between price and neighbourhood groups.

# group=1 lets geom_line() connect the points across the discrete x axis
ggplot(data1,aes(x=neighbourhood_group,y=AvgPrice,group=1))+
 ggtitle("Changes in Prices over Neighbourhood Groups")+ geom_point()+
 geom_line()+
 theme_minimal() +
 labs(x = "Neighbourhood Groups",y = "Average Price Values") +
 theme(axis.text.x = element_text(angle =90,size=7,vjust=0.4))

The following plot shows the price of every listing, which makes it easier to see the outliers in room prices within each neighbourhood group.

ggplot(object_rds,aes(x=neighbourhood_group,y=price,color=room_type))+
  ggtitle(label="Prices over Neighbourhood Groups for Room Types")+
  geom_point()+
  theme_minimal() +
  labs(x = "Neighbourhood Groups",y = "Price Values", color="Room Type") +
  # theme() comes after theme_minimal() so the title alignment is not overwritten
  theme(plot.title = element_text(hjust=1),
        axis.text.x = element_text(angle =90,size=7,vjust=0.4))

The average room price differs across neighbourhood groups. Listings in Manhattan and Brooklyn are more expensive on average than those in the other neighbourhood groups.
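
Because a mean is pulled upwards by a few very expensive listings, it can help to report the median next to it. The sketch below uses made-up prices and a hypothetical `MedianPrice` column; it only illustrates the idea and is not a result from the Airbnb data.

```r
library(dplyr)

# Made-up prices: one extreme listing in Manhattan
toy <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan", "Bronx", "Bronx"),
  price               = c(100, 150, 5000, 60, 80)
)

price_summary <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(
    AvgPrice    = mean(price),    # sensitive to the 5000 outlier
    MedianPrice = median(price),  # robust to it
    .groups     = "drop"
  )
```

In this toy example the Manhattan mean is far above its median, which is the pattern to look for when outliers are suspected.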

3.2 Relation between Neighbourhood Groups and Availability of Room

ggplot(object_rds, aes(x=neighbourhood_group, y=availability_365, fill=neighbourhood_group)) +
  labs(x="Neighbourhood Group",y="Availability",title="Relation between Availability of Rooms and Neighbourhood Groups")+
  geom_boxplot()+
  theme(axis.text.x = element_text(angle=90,size=7,vjust=0.4))

Room availability is low in Manhattan and Brooklyn, possibly because these are two of the three most populous boroughs in New York City [1].

3.3 Relation between Number of Rooms and Different Neighbourhood Groups

# Count the listings in each neighbourhood group
piedata<-object_rds %>% count(neighbourhood_group, name = "count")

Visualization of number of rooms in different neighbourhood groups.

ggplot(piedata,aes(x="",y=count,fill=neighbourhood_group)) + 
  labs(x="",y="",title="Number of Rooms in Neighbourhood Groups", fill="Neighbourhood Groups")+
  geom_bar(stat="identity",width=1) + coord_polar("y")

More rooms are advertised in Manhattan, Brooklyn and Queens than in the other two boroughs of New York. This may be related to the population of the boroughs [1].

3.4 Relation between Room Types and Prices

Creating a new data frame in order to see the effect of room types on prices.

data2<-object_rds %>% group_by(room_type)%>%summarise(AvgPrice=mean(price))

Visualization of the relation between average room prices for different room types.

ggplot(data2,aes(x=room_type,y=AvgPrice,fill=(room_type))) +
  geom_bar(stat="identity",position="dodge") + 
  theme_minimal() + 
  labs(x="Room Types",y="Average Price Values",title="Average Prices for Different Room Types",
       fill="Room Type") + 
  theme(axis.text.x = element_text(angle=90,size=7,vjust=0.4))

Room prices vary with the number of people who can stay and the quality of the room. Entire homes/apartments can host the largest number of people, so they have the highest average price. Private rooms are the second most expensive, presumably because they are more luxurious than shared rooms.

3.5 Relation between Average Prices of Rooms According to Their Room Type and Neighbourhood Groups

Creating new data frames in order to visualize average prices of rooms according to their room type and their neighbourhood groups.

# select() keeps the needed columns; summarise() is used only for the averages
data3<-object_rds%>%filter(neighbourhood_group=="Bronx")%>%select(neighbourhood_group,price,room_type)
data4<-data3%>%group_by(room_type)%>%summarise(AvgPriceBronx=mean(price),.groups = 'drop')

data5<-object_rds%>%filter(neighbourhood_group=="Brooklyn")%>%select(price,room_type)
data6<-data5%>%group_by(room_type)%>%summarise(AvgPriceBrooklyn=mean(price),.groups = 'drop')

data7<-object_rds%>%filter(neighbourhood_group=="Manhattan")%>%select(price,room_type)
data8<-data7%>%group_by(room_type)%>%summarise(AvgPriceManhattan=mean(price),.groups = 'drop')

data9<-object_rds%>%filter(neighbourhood_group=="Queens")%>%select(price,room_type)
data10<-data9%>%group_by(room_type)%>%summarise(AvgPriceQueens=mean(price),.groups = 'drop')

data11<-object_rds%>%filter(neighbourhood_group=="Staten Island")%>%select(price,room_type)
data12<-data11%>%group_by(room_type)%>%summarise(AvgPriceStatenIsland=mean(price),.groups = 'drop')

Creating newdata by using inner join technique.

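# Join on room_type explicitly instead of relying on the guessed key
newdata<-inner_join(data4,data6,by="room_type")
newdata<-inner_join(newdata,data8,by="room_type")
newdata<-inner_join(newdata,data10,by="room_type")
newdata<-inner_join(newdata,data12,by="room_type")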

Transposing newdata to create mydf, then changing the column names and row names in order to visualize the data in a better way.

# Transpose the averages so that each row is a neighbourhood group
mydf<-data.frame(t(newdata[,-1]))
# newdata[[1]] extracts room_type as a vector (newdata[,1] would return a tibble)
colnames(mydf)<-as.character(newdata[[1]])
mydf<-mydf%>%rownames_to_column(var = "NeighbourhoodGroups")
# Replace the "AvgPrice..." row labels with plain borough names
mydf$NeighbourhoodGroups<-c("Bronx","Brooklyn","Manhattan","Queens","StatenIsland")

Reshaping the mydf data frame to long format with pivot_longer in order to visualize the data.

mydf %>% pivot_longer(cols=c("Entire home/apt","Private room","Shared room"),names_to="room_type",values_to="AvgPrice")%>%
  ggplot(aes(x=NeighbourhoodGroups,y=AvgPrice,fill=room_type))+
  # position="dodge": stacking would add the averages together, which has no meaning
  geom_bar(stat = "identity",position="dodge")+theme_minimal() + 
  labs(x="Neighbourhood Groups",y="Average Prices",title="Average Prices for Different Room Types for Neighbourhood Groups",fill="Room Type") + 
  theme(axis.text.x = element_text(angle=90,size=7,vjust=0.4))

Average prices of entire homes/apartments are higher than those of private rooms and shared rooms, since more people can stay in an entire apartment than in a single room. Average prices of all room types are highest in Manhattan, followed by Brooklyn and Queens. In other words, average prices for every room type are higher in the three boroughs with the highest populations [1].
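
The per-borough data frames, joins and transpose above can also be expressed as one grouped summary followed by a reshape. This is a sketch on made-up rows rather than the report's own code; `avg_long` and `avg_wide` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Made-up listings for two boroughs and two room types
toy <- tibble(
  neighbourhood_group = c("Bronx", "Bronx", "Bronx", "Queens", "Queens", "Queens"),
  room_type           = c("Private room", "Private room", "Shared room",
                          "Private room", "Shared room", "Shared room"),
  price               = c(50, 70, 30, 80, 40, 60)
)

# One row per (borough, room type) pair: already the long format ggplot2 expects
avg_long <- toy %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(AvgPrice = mean(price), .groups = "drop")

# Optional wide view: one row per borough, one column per room type
avg_wide <- avg_long %>%
  pivot_wider(names_from = room_type, values_from = AvgPrice)
```

Grouping by both variables at once avoids repeating the filter for each borough, and the long result can be passed to `ggplot()` without transposing or renaming rows.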

3.6 The Most Active Host Accounts by Number of Posts

To compare hosts with each other, the data frame is first grouped by host. host_id is used for this purpose because it is unique for each host. After grouping, the total number of houses/rooms posted by each host is calculated to show how active the hosts are.

by_host<-object_rds%>%select(id,neighbourhood,neighbourhood_group,
                             host_id,price,reviews_per_month,availability_365)%>%
  group_by(host_id)%>%
  tally()%>%
  arrange(desc(n))%>%
  head(10)

 ggplot(by_host, aes (x="", y = n, fill = factor(host_id))) + 
  geom_bar(width = 1, stat = "identity") + 
  geom_text(aes(label = paste(round(n / sum(n) * 100, 1), "%")),
            position = position_stack(vjust = 0.5)) +
  theme(axis.line = element_blank(),
        plot.title = element_text(hjust=0.5)) + 
  labs(fill = "Host ID",
       x = NULL,
       y = NULL,
       title = "The Most Active Host Accounts by Number of Posts")+ 
   coord_polar("y")

The 10 most active hosts are plotted on the pie chart by their ID number. Together, the top 10 hosts account for nearly 1250 posts, about a quarter of which belong to the #1 host with the ID 219517861. The #2 host also holds a large share compared with the other 8 hosts.
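
The percentages on the pie chart are shares within the top 10 only. To relate the most active hosts to the whole dataset, the share can be computed before the top rows are cut off. The sketch below uses made-up host IDs; `share_of_all` is a hypothetical column name.

```r
library(dplyr)

# Made-up listings: host 1 has three posts, host 2 has two, hosts 3 and 4 have one each
toy <- tibble(host_id = c(1, 1, 1, 2, 2, 3, 4))

top_hosts <- toy %>%
  count(host_id, sort = TRUE) %>%       # posts per host, most active first
  mutate(share_of_all = n / sum(n)) %>% # share of ALL posts, computed before truncating
  slice_head(n = 2)                     # keep the two most active hosts
```

Computing the share before `slice_head()` keeps the denominator equal to the full number of posts rather than the posts of the selected hosts.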

3.8 House Locations by Neighbourhood Group

Location information can be obtained from the latitude and longitude of each post. Thus, parameters such as popularity, availability and expensiveness can be shown on an NYC map. However, the locations of the neighbourhood groups have to be shown first, so that posts can be matched to their neighbourhood group when popularity, availability and expensiveness are analyzed.

object_rds%>%
  select(latitude,longitude,neighbourhood_group)%>%
ggplot(.,aes(x=longitude,y=latitude,color=neighbourhood_group))+
  geom_point(alpha=0.5)+
  labs(title = "House Locations by Neighbourhood Group",
    color="Neighbourhood Group",
        x="Longitude",
        y="Latitude")

All house/room posts are plotted on the scatter plot and coloured by neighbourhood group. Because the axes are longitude and latitude, the result is a map of the posts.

3.9 Price Levels wrt Locations

To see how price levels change with location, the latitude, longitude and price variables are taken into consideration. An NYC map image is added to the plot background for a better visualization [2]. Before starting the analysis, an upper limit for prices (the third quartile plus 1.5 times the interquartile range) is computed to eliminate outliers, so that extreme values do not dominate the colour scale.

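# Tukey fences for price. type=6 places each quartile at position p*(n+1),
# the same position as the hard-coded index 0.25*48896 it replaces
q_price<-quantile(object_rds$price,probs=c(0.25,0.75),type=6,names=FALSE)
IQR_price<-q_price[2]-q_price[1]
UL<-q_price[2]+1.5*IQR_price
LL<-q_price[1]-1.5*IQR_price
price_map<-object_rds%>%
  select(longitude,latitude,price)%>%
  filter(price<UL)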


img<-readPNG("New_York_City_.png")
  
ggplot(price_map,aes(x=longitude,y=latitude,color=price))+
  annotation_custom(rasterGrob(img, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)+
  geom_point(alpha=0.5)+
  scale_color_gradient(low="green", high="red")+
  labs(title = "Price Levels wrt Locations",
    color="Price",
        x="Longitude",
        y="Latitude")

When the House Locations by Neighbourhood Group and Price Levels wrt Locations plots are compared, it is clear that the price level is not very high in the Bronx, Queens, Staten Island and the southern part of Brooklyn. However, the colour turns red in Manhattan and the northern part of Brooklyn, which means that houses/rooms are more expensive in this area.
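
The upper limit used to drop extreme prices can be wrapped in a small helper so that the same rule is reusable for other columns. The sketch below is illustrative only: `tukey_upper` is a hypothetical name, the prices are made up, and it relies on R's default quantile definition, which can differ slightly from the quartile positions used in the report.

```r
# Hypothetical helper: upper Tukey fence, Q3 + k * IQR
tukey_upper <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q[2] + k * (q[2] - q[1])
}

# Made-up prices with one extreme value
prices <- c(50, 60, 70, 80, 90, 100, 110, 120, 10000)

limit <- tukey_upper(prices)     # Q1 = 70, Q3 = 110, so 110 + 1.5 * 40
kept  <- prices[prices < limit]  # the 10000 listing is dropped
```

For a variable with a natural ceiling, such as a count of days per year, the computed limit may lie above every observed value, in which case the filter removes nothing.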

3.10 Availability of Accommodation Options wrt Locations

To see how the number of available days changes with location, the latitude, longitude and availability_365 variables are taken into consideration. Similarly, the upper limit of availability_365 is computed to eliminate outliers, so that extreme values do not affect the analysis.

# Tukey fences for availability. type=6 places each quartile at position p*(n+1),
# the same position as the hard-coded index 0.25*48896 it replaces
q_av<-quantile(object_rds$availability_365,probs=c(0.25,0.75),type=6,names=FALSE)
IQR_av<-q_av[2]-q_av[1]
UL_av<-q_av[2]+1.5*IQR_av
LL_av<-q_av[1]-1.5*IQR_av
availability_map<-object_rds%>%
  select(longitude,latitude,availability_365)%>%
    filter(availability_365<=UL_av)


ggplot(availability_map,aes(x=longitude,y=latitude,color=availability_365))+
  annotation_custom(rasterGrob(img, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)+
  geom_point(alpha=0.5)+
  scale_color_gradient(low="green", high="red")+
   labs(title = "Availability of Accommodation Options wrt Locations",
    color="Available Days (yearly)",
        x="Longitude",
        y="Latitude")

The colour difference between neighbourhood groups is not as large as in the price map. Nevertheless, green points (low availability) appear more often than red points in Manhattan and Brooklyn, while red points (high availability) are more common in the other 3 neighbourhood groups. So, it is possible to associate the price map with the availability map: where the availability of rooms/houses is higher, prices tend to be lower.

3.11 Popularity of Accommodation Options wrt Locations

To see how popularity changes with location, the latitude, longitude and number_of_reviews variables are taken into consideration. As before, the upper limit of number_of_reviews is computed to eliminate outliers, so that extreme values do not affect the analysis.

# Tukey fences for review counts. type=6 places each quartile at position p*(n+1),
# the same position as the hard-coded index 0.25*48896 it replaces
q_pop<-quantile(object_rds$number_of_reviews,probs=c(0.25,0.75),type=6,names=FALSE)
IQR_pop<-q_pop[2]-q_pop[1]
UL_pop<-q_pop[2]+1.5*IQR_pop
LL_pop<-q_pop[1]-1.5*IQR_pop
popularity_map<-object_rds%>%
  select(longitude,latitude,number_of_reviews)%>%
  filter(number_of_reviews<=UL_pop)

ggplot(popularity_map,aes(x=longitude,y=latitude,color=number_of_reviews))+
  annotation_custom(rasterGrob(img, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)+
  geom_point(alpha=0.5)+
  scale_color_continuous(limits = c(0,150), 
                         breaks = c(150,125,100,75,50,25,0))+
  labs(title = "Popularity of Accommodation Options wrt Locations",
    color="Number of Reviews",
        x="Longitude",
        y="Latitude")

The Popularity of Accommodation Options wrt Locations plot seems inversely related to the price map. Expensive areas such as Manhattan and Brooklyn are coloured dark blue, which denotes fewer reviews and therefore less popular areas. The more popular areas, shown in light blue, largely coincide with the cheaper areas. This result may arise from the price level: people may prefer cheaper places to stay.
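
The inverse relationships read off the maps (price against availability, and price against number of reviews) could be checked numerically with a rank correlation on neighbourhood-level averages. The sketch below uses made-up numbers and the hypothetical names `nb`, `rho_availability` and `rho_popularity`; it shows the calculation only and says nothing about the values in the real data.

```r
# Made-up neighbourhood-level averages, constructed to be perfectly monotone
nb <- data.frame(
  mean_price        = c(60, 80, 100, 150, 200),
  mean_availability = c(200, 180, 150, 120, 90),
  mean_popularity   = c(40, 35, 30, 20, 10)
)

# Spearman's rho compares ranks, so it is not driven by a few extreme prices
rho_availability <- cor(nb$mean_price, nb$mean_availability, method = "spearman")
rho_popularity   <- cor(nb$mean_price, nb$mean_popularity,   method = "spearman")
```

A coefficient close to -1 would support the inverse relationship, while a value near 0 would suggest that the pattern on the maps is weaker than it looks.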

3.12 The Most Frequently Used Words in the Posts

Hosts usually choose attractive and descriptive words for their post names. We can analyze this text data with a word cloud to see the words used most frequently in the posts. [3]

  # Create a corpus  
  text <- Corpus(VectorSource(object_rds$name))

  toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
  text <- tm_map(text, toSpace, "/")
  text <- tm_map(text, toSpace, ",")
  text <- tm_map(text, toSpace, "!")
  text <- tm_map(text, toSpace, "-")
  text <- tm_map(text, toSpace, ":")
  text <- tm_map(text, toSpace, "@")
  text <- tm_map(text, toSpace, "\\|")
  
  # Convert the text to lower case
  text <- tm_map(text, content_transformer(tolower))
  # Remove numbers
  text <- tm_map(text, removeNumbers)
  # Remove english common stopwords
  text <- tm_map(text, removeWords, stopwords("english"))
  # Remove custom stop words, given as a character vector
  # ("blabla1" and "blabla2" are placeholders; replace them with real terms if needed)
  text <- tm_map(text, removeWords, c("blabla1", "blabla2")) 
  # Remove punctuations
  text <- tm_map(text, removePunctuation)
  # Eliminate extra white spaces
  text <- tm_map(text, stripWhitespace)
  

  matrix <- as.matrix(TermDocumentMatrix(text))
  v <- sort(rowSums(matrix),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)
  head(d, 10)
  set.seed(1176)
  wordcloud(words = d$word, freq = d$freq, min.freq = 1,
            max.words=200, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))

The most frequently used words are room, bedroom, private and apartment. There are also prominent words that describe the location, such as Manhattan, Brooklyn, west and east, as well as words like sunny, beautiful, luxury, cozy and spacious that are meant to attract customers' interest.
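
The word frequencies behind the cloud can also be inspected directly. The following is a minimal base R sketch on made-up post names; the short `stop_words` vector is an assumption for illustration and is far smaller than the English stop word list used by tm.

```r
# Made-up post names
post_names <- c("Cozy private room in Brooklyn",
                "Sunny room near the park",
                "Private bedroom, cozy & quiet")

# Tiny illustrative stop word list (assumption, not tm's list)
stop_words <- c("in", "the", "near")

# Lower-case, then split on anything that is not a letter
tokens <- unlist(strsplit(tolower(post_names), "[^a-z]+"))
# Drop empty strings and stop words
tokens <- tokens[nzchar(tokens) & !tokens %in% stop_words]

# Frequency table, most frequent words first
freq <- sort(table(tokens), decreasing = TRUE)
```

A plain frequency table like this is a quick way to confirm which words dominate before choosing the parameters of the word cloud.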

3.13 The Most Expensive Neighbourhood with their Group

Another analysis shows which neighbourhoods are the most expensive and to which neighbourhood groups they belong. For this purpose, the neighbourhood data frame is created.

neighbourhood<-object_rds%>%
  select(neighbourhood,neighbourhood_group,price,
         number_of_reviews,availability_365)%>%
  group_by(neighbourhood,neighbourhood_group)%>%
  # .groups="drop" returns an ungrouped result and silences the grouping message
  summarise(mean_price=mean(price),
            mean_availability=mean(availability_365),
            mean_popularity=mean(number_of_reviews),
            .groups="drop")
  

neighbourhood%>%
  arrange(desc(mean_price))%>%
  head(40)%>%
ggplot(.,aes(x=mean_price,y=reorder(neighbourhood,mean_price),fill=neighbourhood_group)) +
  geom_col()+
  labs(fill="Neighbourhood Group",
       x="Average Prices",
       y="Neighbourhood Name",
       title="The Most Expensive Neighbourhood with their Group")+
  theme_minimal()

Manhattan has the largest number of neighbourhoods in the top 40. However, the 2 most expensive neighbourhoods belong to Staten Island. So, Manhattan is the most expensive neighbourhood group in general, although some individual neighbourhoods in other groups are also very expensive.
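
A neighbourhood with very few listings can reach the top of a mean-price ranking on the strength of a single expensive post, so it may be worth carrying the listing count along with the mean. The sketch below uses made-up data; `n_listings` and the `min_listings` threshold are hypothetical and would need to be chosen for the real dataset.

```r
library(dplyr)

# Made-up data: neighbourhood "A" has a single, very expensive listing
toy <- tibble(
  neighbourhood = c("A", "B", "B", "B", "C", "C", "C"),
  price         = c(800, 200, 220, 240, 100, 110, 120)
)

min_listings <- 3  # hypothetical threshold

ranked <- toy %>%
  group_by(neighbourhood) %>%
  summarise(mean_price = mean(price),
            n_listings = n(),
            .groups    = "drop") %>%
  filter(n_listings >= min_listings) %>%  # drop thinly populated neighbourhoods
  arrange(desc(mean_price))
```

Keeping `n_listings` in the summary also makes it possible to show the count on the plot instead of filtering, which avoids hiding neighbourhoods altogether.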

3.14 The Most Available Neighbourhood with their Group

The data frame created for the “The Most Expensive Neighbourhood with their Group” plot is reused to show which neighbourhoods are the most available and to which neighbourhood groups they belong. Availability is measured as the average number of days per year on which a house/room is available, calculated for each neighbourhood.

neighbourhood%>%
  arrange(desc(mean_availability))%>%
  head(40)%>%
ggplot(.,aes(x=mean_availability,y=reorder(neighbourhood,mean_availability),fill=neighbourhood_group)) +
  geom_col()+
  labs(fill="Neighbourhood Group",
       x="Average Number of Available Days (yearly)",
       y="Neighbourhood Name",
       title="The Most Available Neighbourhood with their Group")+
  theme_minimal()

Pink is the most prominent colour in the plot, which means that Staten Island has the largest number of neighbourhoods among the most available ones. However, apart from Manhattan and Brooklyn, the other two neighbourhood groups are close to Staten Island in terms of availability.

4 Conclusion

In our analysis, we demonstrate:

5 Reference