In-Class Exercise

Introduction

The aim of this document is to analyze Istanbul Property Dataset by using dplyr and ggplot2 mainly. You may find the libraries that used during the analysis in the below code chunk. You can use the button in the right upper corner of this document to show all code chunks or simply just press the code button of every code chunk that you want to see.

library(dplyr)
library(tidyverse)
library(ggplot2)
library(readxl)
library(lubridate)
library(reshape2)

Data Cleaning

There were some explanations at the end of our Excel file so n_max parameter was used to limit the number of rows to be read. Tarih column converted to Date format by using paste0 and as.Date functions. All column names changed to make them explanatory. Also we can see that there are some missing values which ggplot will help us to ignore them.

data<-read_xlsx(file.choose(),n_max = 130)
data$Tarih<-paste0(data$Tarih,as.character("-01"))
data$Tarih<-as.Date(data$Tarih, format="%Y-%m-%d")

data<-rename(data,Tarih=Tarih,
             Toplam_Konut=`TP AKONUTSAT1 T40`,
             Ipotek= `TP AKONUTSAT2 T40`,
             Ilk_El=`TP AKONUTSAT3 T40`,
             Ikinci_El=`TP AKONUTSAT4 T40`,
             Yabanci=`TP DISKONSAT ISTANBUL`,
             Yeni_Endeks=`TP HEDONIKYKFE IST`,
             TR10=`TP HKFE02`,
             Birim_Fiyat=`TP TCBF02 ISTANBUL`)

head(data)

## # A tibble: 6 x 9
##   Tarih      Toplam_Konut Ipotek Ilk_El Ikinci_El Yabanci Yeni_Endeks  TR10
##   <date>            <dbl>  <dbl>  <dbl>     <dbl>   <dbl>       <dbl> <dbl>
## 1 2010-01-01           NA     NA     NA        NA      NA        35.9  36  
## 2 2010-02-01           NA     NA     NA        NA      NA        36.6  36.2
## 3 2010-03-01           NA     NA     NA        NA      NA        37.4  36.5
## 4 2010-04-01           NA     NA     NA        NA      NA        38    36.9
## 5 2010-05-01           NA     NA     NA        NA      NA        38    37.1
## 6 2010-06-01           NA     NA     NA        NA      NA        37.6  37  
## # ... with 1 more variable: Birim_Fiyat <dbl>

sapply(data, function(x) sum(is.na(x)))

##        Tarih Toplam_Konut       Ipotek       Ilk_El    Ikinci_El      Yabanci 
##            0           36           36           36           36           36 
##  Yeni_Endeks         TR10  Birim_Fiyat 
##            1            1            1

Distribution of Variables

Below you can see the distribution of each variable. We can see that most of our variables skewed and there is no normally distributed variable. There are some outliers on the high-end of Yabanci_Satis and it may be the impact of confering citizenship to the foreign people in return of buying property/land.

data %>%
  select(Toplam_Konut:Birim_Fiyat) %>%
  gather() %>%
  ggplot(aes(x=value)) +
  geom_histogram(mapping = aes(y=..density..,fill=key), color="darkgray") +
  facet_wrap(~ key, scales = "free") +
  theme_minimal() +
  theme(legend.position = 'none',axis.text.x = element_text(angle=45,hjust=1,vjust=1))+
  scale_y_continuous(labels=function(x) format(x, big.mark = ".", decimal.mark = ",", scientific = FALSE))

Yearly Change of Variables

We can see that Toplam_Konut is the sum of Ikinci_El and Ilk_El. Since Ikinci_El increased after 2017, we can say that the decrease in Toplam_Konut after 2018 was affected by Ilk_El. Also, We can see that Birim_Fiyat, Yeni_Endeks and TR10 index increases year by year in a correlated manner.

yearly_data<-data%>%
                group_by(Year=year(Tarih)) %>% 
                summarise_at(vars(Toplam_Konut:Birim_Fiyat), sum, na.rm = FALSE)


plot1<-melt(yearly_data, id.vars="Year")%>%
  ggplot(aes(Year,value)) + 
  geom_point() + 
  stat_smooth(method="loess") +
  facet_wrap(~variable,scales = "free") +theme_minimal()


plot1+
theme(legend.position="top",axis.text.x = element_text(angle=60,hjust=1,vjust=1))+
scale_y_continuous(labels=function(x) format(x, big.mark = ".", decimal.mark = ",", scientific = FALSE))+
scale_x_continuous(breaks = c(2010:2020))

In-Class Exercise

Eren Melih Altun

11/4/2020

Introduction

Data Cleaning

Distribution of Variables

Yearly Change of Variables