Data Cleaning & Transforming Process Detailed Steps

Let’s read imported data set. We have 901894 rows and 35 columns at the imported dataset.

df_import = readRDS("df_import.rds") %>% glimpse()

## Rows: 901,894
## Columns: 35
## $ date                                <dttm> 2015-12-31, 2015-12-31, 2015-1...
## $ code                                <chr> "ABE", "AEA", "AEB", "AEC", "AE...
## $ fund_type                           <chr> "pension", "pension", "pension"...
## $ category                            <chr> "Variable Fund", "Gold Fund", "...
## $ name                                <chr> "ANADOLU HAYAT EMEKLILIK A.Ş.B....
## $ price                               <dbl> 0.012212, 0.011026, 0.033588, 0...
## $ shares                              <dbl> 3876175452, 10761804554, 219965...
## $ people                              <dbl> 25887, 97857, 13269, 101, 5030,...
## $ total_value                         <dbl> 47336763, 118662196, 73881771, ...
## $ p_Bank_Bills                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Other                             <dbl> 3.02, 94.31, 1.90, 0.39, 1.79, ...
## $ p_FX_Payable_Bills                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Bond                   <dbl> 0.66, 0.00, 0.84, 60.90, 66.88,...
## $ p_Foreign_Currency_Bills            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Eurobonds                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Commercial_Paper                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Fund_Participation_Certificate    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Real_Estate_Certificate           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Treasury_Bill                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Stock                             <dbl> 0.00, 0.88, 82.55, 4.76, 29.47,...
## $ p_Government_Bonds_and_Bills_FX     <dbl> 0.00, 0.00, 0.00, 11.09, 0.00, ...
## $ p_Participation_Account             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Lease_Certificates     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Precious_Metals                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Lease_Certificates <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Bond               <dbl> 0.00, 4.81, 0.00, 22.78, 0.00, ...
## $ p_Repo                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Derivatives                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_TMM                               <dbl> 0.00, 0.00, 6.61, 0.08, 0.00, 7...
## $ p_Reverse_Repo                      <dbl> 0.40, 0.00, 8.10, 0.00, 1.86, 2...
## $ p_Asset_Backed_Securities           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Term_Deposit                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Debt_Instruments          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Equity                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Securities                <dbl> 95.92, 0.00, 0.00, 0.00, 0.00, ...

unique number of funds : 993

df_import %>% distinct(code) %>% summarise(count=n())

## # A tibble: 1 x 1
##   count
##   <int>
## 1   993

Unique number of funds by types and categories

df_import %>% group_by(fund_type, category)%>%  distinct(code) %>% summarise(count=n())

## `summarise()` regrouping output by 'fund_type' (override with `.groups` argument)

## # A tibble: 32 x 3
## # Groups:   fund_type [2]
##    fund_type category                      count
##    <chr>     <chr>                         <int>
##  1 mutual    Debt Securities Umbrella Fund    71
##  2 mutual    Fund Of Funds Umbrella Fund      18
##  3 mutual    Hedge Umbrella Fund             295
##  4 mutual    Mixed Umbrella Fund               4
##  5 mutual    Money Market Umbrella Fund       27
##  6 mutual    Participation Umbrella Fund      32
##  7 mutual    Precious Metals Umbrella Fund    11
##  8 mutual    Stock Umbrella Fund              63
##  9 mutual    Variable Umbrella Fund           67
## 10 mutual    <NA>                              1
## # ... with 22 more rows

A mutual fund has no category, it is “FPR”. This fund category was not in the data set

This fund has not any data, it should be an obsolete entry.

df_import %>% filter(is.na(category)==TRUE)%>%arrange(price)

## # A tibble: 488 x 35
##    date                code  fund_type category name  price shares people
##    <dttm>              <chr> <chr>     <chr>    <chr> <dbl>  <dbl>  <dbl>
##  1 2018-12-31 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
##  2 2018-12-28 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
##  3 2018-12-27 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
##  4 2018-12-26 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
##  5 2018-12-25 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
##  6 2018-12-24 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
##  7 2018-12-21 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
##  8 2018-12-20 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
##  9 2018-12-19 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
## 10 2018-12-18 00:00:00 FPR   mutual    <NA>     <NA>     NA     NA     NA
## # ... with 478 more rows, and 27 more variables: total_value <dbl>,
## #   p_Bank_Bills <dbl>, p_Other <dbl>, p_FX_Payable_Bills <dbl>,
## #   p_Government_Bond <dbl>, p_Foreign_Currency_Bills <dbl>, p_Eurobonds <dbl>,
## #   p_Commercial_Paper <dbl>, p_Fund_Participation_Certificate <dbl>,
## #   p_Real_Estate_Certificate <dbl>, p_Treasury_Bill <dbl>, p_Stock <dbl>,
## #   p_Government_Bonds_and_Bills_FX <dbl>, p_Participation_Account <dbl>,
## #   p_Government_Lease_Certificates <dbl>, p_Precious_Metals <dbl>,
## #   p_Private_Sector_Lease_Certificates <dbl>, p_Private_Sector_Bond <dbl>,
## #   p_Repo <dbl>, p_Derivatives <dbl>, p_TMM <dbl>, p_Reverse_Repo <dbl>,
## #   p_Asset_Backed_Securities <dbl>, p_Term_Deposit <dbl>,
## #   p_Foreign_Debt_Instruments <dbl>, p_Foreign_Equity <dbl>,
## #   p_Foreign_Securities <dbl>

Let’s update our dataset and remove this fund altogether. 901894-488=901406 rows left

df_import=df_import %>% filter(code!="FPR")%>%glimpse()

## Rows: 901,406
## Columns: 35
## $ date                                <dttm> 2015-12-31, 2015-12-31, 2015-1...
## $ code                                <chr> "ABE", "AEA", "AEB", "AEC", "AE...
## $ fund_type                           <chr> "pension", "pension", "pension"...
## $ category                            <chr> "Variable Fund", "Gold Fund", "...
## $ name                                <chr> "ANADOLU HAYAT EMEKLILIK A.Ş.B....
## $ price                               <dbl> 0.012212, 0.011026, 0.033588, 0...
## $ shares                              <dbl> 3876175452, 10761804554, 219965...
## $ people                              <dbl> 25887, 97857, 13269, 101, 5030,...
## $ total_value                         <dbl> 47336763, 118662196, 73881771, ...
## $ p_Bank_Bills                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Other                             <dbl> 3.02, 94.31, 1.90, 0.39, 1.79, ...
## $ p_FX_Payable_Bills                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Bond                   <dbl> 0.66, 0.00, 0.84, 60.90, 66.88,...
## $ p_Foreign_Currency_Bills            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Eurobonds                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Commercial_Paper                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Fund_Participation_Certificate    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Real_Estate_Certificate           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Treasury_Bill                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Stock                             <dbl> 0.00, 0.88, 82.55, 4.76, 29.47,...
## $ p_Government_Bonds_and_Bills_FX     <dbl> 0.00, 0.00, 0.00, 11.09, 0.00, ...
## $ p_Participation_Account             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Lease_Certificates     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Precious_Metals                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Lease_Certificates <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Bond               <dbl> 0.00, 4.81, 0.00, 22.78, 0.00, ...
## $ p_Repo                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Derivatives                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_TMM                               <dbl> 0.00, 0.00, 6.61, 0.08, 0.00, 7...
## $ p_Reverse_Repo                      <dbl> 0.40, 0.00, 8.10, 0.00, 1.86, 2...
## $ p_Asset_Backed_Securities           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Term_Deposit                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Debt_Instruments          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Equity                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Securities                <dbl> 95.92, 0.00, 0.00, 0.00, 0.00, ...

Are any other NA values left in fund code, name or category fields? Let’s find out.

fund codes OK

df_import %>% filter(is.na(code)==TRUE)

## # A tibble: 0 x 35
## # ... with 35 variables: date <dttm>, code <chr>, fund_type <chr>,
## #   category <chr>, name <chr>, price <dbl>, shares <dbl>, people <dbl>,
## #   total_value <dbl>, p_Bank_Bills <dbl>, p_Other <dbl>,
## #   p_FX_Payable_Bills <dbl>, p_Government_Bond <dbl>,
## #   p_Foreign_Currency_Bills <dbl>, p_Eurobonds <dbl>,
## #   p_Commercial_Paper <dbl>, p_Fund_Participation_Certificate <dbl>,
## #   p_Real_Estate_Certificate <dbl>, p_Treasury_Bill <dbl>, p_Stock <dbl>,
## #   p_Government_Bonds_and_Bills_FX <dbl>, p_Participation_Account <dbl>,
## #   p_Government_Lease_Certificates <dbl>, p_Precious_Metals <dbl>,
## #   p_Private_Sector_Lease_Certificates <dbl>, p_Private_Sector_Bond <dbl>,
## #   p_Repo <dbl>, p_Derivatives <dbl>, p_TMM <dbl>, p_Reverse_Repo <dbl>,
## #   p_Asset_Backed_Securities <dbl>, p_Term_Deposit <dbl>,
## #   p_Foreign_Debt_Instruments <dbl>, p_Foreign_Equity <dbl>,
## #   p_Foreign_Securities <dbl>

2 rows missing fund names, but their data seems obsolete too.

df_import %>% filter(is.na(name)==TRUE)

## # A tibble: 2 x 35
##   date                code  fund_type category name  price shares people
##   <dttm>              <chr> <chr>     <chr>    <chr> <dbl>  <dbl>  <dbl>
## 1 2015-11-30 00:00:00 EIC   mutual    Variabl~ <NA>     NA     NA     NA
## 2 2015-11-16 00:00:00 AHI   mutual    Stock U~ <NA>     NA     NA     NA
## # ... with 27 more variables: total_value <dbl>, p_Bank_Bills <dbl>,
## #   p_Other <dbl>, p_FX_Payable_Bills <dbl>, p_Government_Bond <dbl>,
## #   p_Foreign_Currency_Bills <dbl>, p_Eurobonds <dbl>,
## #   p_Commercial_Paper <dbl>, p_Fund_Participation_Certificate <dbl>,
## #   p_Real_Estate_Certificate <dbl>, p_Treasury_Bill <dbl>, p_Stock <dbl>,
## #   p_Government_Bonds_and_Bills_FX <dbl>, p_Participation_Account <dbl>,
## #   p_Government_Lease_Certificates <dbl>, p_Precious_Metals <dbl>,
## #   p_Private_Sector_Lease_Certificates <dbl>, p_Private_Sector_Bond <dbl>,
## #   p_Repo <dbl>, p_Derivatives <dbl>, p_TMM <dbl>, p_Reverse_Repo <dbl>,
## #   p_Asset_Backed_Securities <dbl>, p_Term_Deposit <dbl>,
## #   p_Foreign_Debt_Instruments <dbl>, p_Foreign_Equity <dbl>,
## #   p_Foreign_Securities <dbl>

We need to remove them from dataset 901406-2=901404 rows left

df_import=df_import %>% filter(is.na(name)==FALSE)%>%glimpse()

## Rows: 901,404
## Columns: 35
## $ date                                <dttm> 2015-12-31, 2015-12-31, 2015-1...
## $ code                                <chr> "ABE", "AEA", "AEB", "AEC", "AE...
## $ fund_type                           <chr> "pension", "pension", "pension"...
## $ category                            <chr> "Variable Fund", "Gold Fund", "...
## $ name                                <chr> "ANADOLU HAYAT EMEKLILIK A.Ş.B....
## $ price                               <dbl> 0.012212, 0.011026, 0.033588, 0...
## $ shares                              <dbl> 3876175452, 10761804554, 219965...
## $ people                              <dbl> 25887, 97857, 13269, 101, 5030,...
## $ total_value                         <dbl> 47336763, 118662196, 73881771, ...
## $ p_Bank_Bills                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Other                             <dbl> 3.02, 94.31, 1.90, 0.39, 1.79, ...
## $ p_FX_Payable_Bills                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Bond                   <dbl> 0.66, 0.00, 0.84, 60.90, 66.88,...
## $ p_Foreign_Currency_Bills            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Eurobonds                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Commercial_Paper                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Fund_Participation_Certificate    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Real_Estate_Certificate           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Treasury_Bill                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Stock                             <dbl> 0.00, 0.88, 82.55, 4.76, 29.47,...
## $ p_Government_Bonds_and_Bills_FX     <dbl> 0.00, 0.00, 0.00, 11.09, 0.00, ...
## $ p_Participation_Account             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Lease_Certificates     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Precious_Metals                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Lease_Certificates <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Bond               <dbl> 0.00, 4.81, 0.00, 22.78, 0.00, ...
## $ p_Repo                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Derivatives                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_TMM                               <dbl> 0.00, 0.00, 6.61, 0.08, 0.00, 7...
## $ p_Reverse_Repo                      <dbl> 0.40, 0.00, 8.10, 0.00, 1.86, 2...
## $ p_Asset_Backed_Securities           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Term_Deposit                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Debt_Instruments          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Equity                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Securities                <dbl> 95.92, 0.00, 0.00, 0.00, 0.00, ...

Let’s check price, no NA values

df_import %>% filter(is.na(price)==TRUE)

## # A tibble: 0 x 35
## # ... with 35 variables: date <dttm>, code <chr>, fund_type <chr>,
## #   category <chr>, name <chr>, price <dbl>, shares <dbl>, people <dbl>,
## #   total_value <dbl>, p_Bank_Bills <dbl>, p_Other <dbl>,
## #   p_FX_Payable_Bills <dbl>, p_Government_Bond <dbl>,
## #   p_Foreign_Currency_Bills <dbl>, p_Eurobonds <dbl>,
## #   p_Commercial_Paper <dbl>, p_Fund_Participation_Certificate <dbl>,
## #   p_Real_Estate_Certificate <dbl>, p_Treasury_Bill <dbl>, p_Stock <dbl>,
## #   p_Government_Bonds_and_Bills_FX <dbl>, p_Participation_Account <dbl>,
## #   p_Government_Lease_Certificates <dbl>, p_Precious_Metals <dbl>,
## #   p_Private_Sector_Lease_Certificates <dbl>, p_Private_Sector_Bond <dbl>,
## #   p_Repo <dbl>, p_Derivatives <dbl>, p_TMM <dbl>, p_Reverse_Repo <dbl>,
## #   p_Asset_Backed_Securities <dbl>, p_Term_Deposit <dbl>,
## #   p_Foreign_Debt_Instruments <dbl>, p_Foreign_Equity <dbl>,
## #   p_Foreign_Securities <dbl>

Let’s check shares, no NA values

df_import %>% filter(is.na(shares)==TRUE)

## # A tibble: 0 x 35
## # ... with 35 variables: date <dttm>, code <chr>, fund_type <chr>,
## #   category <chr>, name <chr>, price <dbl>, shares <dbl>, people <dbl>,
## #   total_value <dbl>, p_Bank_Bills <dbl>, p_Other <dbl>,
## #   p_FX_Payable_Bills <dbl>, p_Government_Bond <dbl>,
## #   p_Foreign_Currency_Bills <dbl>, p_Eurobonds <dbl>,
## #   p_Commercial_Paper <dbl>, p_Fund_Participation_Certificate <dbl>,
## #   p_Real_Estate_Certificate <dbl>, p_Treasury_Bill <dbl>, p_Stock <dbl>,
## #   p_Government_Bonds_and_Bills_FX <dbl>, p_Participation_Account <dbl>,
## #   p_Government_Lease_Certificates <dbl>, p_Precious_Metals <dbl>,
## #   p_Private_Sector_Lease_Certificates <dbl>, p_Private_Sector_Bond <dbl>,
## #   p_Repo <dbl>, p_Derivatives <dbl>, p_TMM <dbl>, p_Reverse_Repo <dbl>,
## #   p_Asset_Backed_Securities <dbl>, p_Term_Deposit <dbl>,
## #   p_Foreign_Debt_Instruments <dbl>, p_Foreign_Equity <dbl>,
## #   p_Foreign_Securities <dbl>

Let’s check dates, no NA values, no 0 values

df_import %>% filter(is.na(date)==TRUE|date==0)

## # A tibble: 0 x 35
## # ... with 35 variables: date <dttm>, code <chr>, fund_type <chr>,
## #   category <chr>, name <chr>, price <dbl>, shares <dbl>, people <dbl>,
## #   total_value <dbl>, p_Bank_Bills <dbl>, p_Other <dbl>,
## #   p_FX_Payable_Bills <dbl>, p_Government_Bond <dbl>,
## #   p_Foreign_Currency_Bills <dbl>, p_Eurobonds <dbl>,
## #   p_Commercial_Paper <dbl>, p_Fund_Participation_Certificate <dbl>,
## #   p_Real_Estate_Certificate <dbl>, p_Treasury_Bill <dbl>, p_Stock <dbl>,
## #   p_Government_Bonds_and_Bills_FX <dbl>, p_Participation_Account <dbl>,
## #   p_Government_Lease_Certificates <dbl>, p_Precious_Metals <dbl>,
## #   p_Private_Sector_Lease_Certificates <dbl>, p_Private_Sector_Bond <dbl>,
## #   p_Repo <dbl>, p_Derivatives <dbl>, p_TMM <dbl>, p_Reverse_Repo <dbl>,
## #   p_Asset_Backed_Securities <dbl>, p_Term_Deposit <dbl>,
## #   p_Foreign_Debt_Instruments <dbl>, p_Foreign_Equity <dbl>,
## #   p_Foreign_Securities <dbl>

Are there any 0 price or share values? There are 68109 rows with 0 values

df_import%>%filter(price==0|shares==0|is.na(price)|is.na(shares))%>%relocate(date,code,price,shares,people)%>%arrange(desc(price))

## # A tibble: 68,109 x 35
##    date                code    price shares people fund_type category name 
##    <dttm>              <chr>   <dbl>  <dbl>  <dbl> <chr>     <chr>    <chr>
##  1 2017-05-08 00:00:00 YAY   36.4         0   3103 mutual    Stock U~ YAPI~
##  2 2015-12-07 00:00:00 AZB    0.0505      0      0 pension   Variabl~ ALLI~
##  3 2015-12-07 00:00:00 AZO    0.0279      0      0 pension   Mixed F~ ALLI~
##  4 2016-12-30 00:00:00 FEO    0           0      0 pension   Variabl~ FİBA~
##  5 2016-12-30 00:00:00 FEV    0           0      0 pension   Variabl~ FİBA~
##  6 2016-12-30 00:00:00 FEZ    0           0      0 pension   Variabl~ FİBA~
##  7 2016-12-30 00:00:00 FYL    0           0      0 pension   Partici~ AVİV~
##  8 2016-12-30 00:00:00 FYN    0           0      0 pension   Partici~ AVİV~
##  9 2016-12-30 00:00:00 FYU    0           0      0 pension   Partici~ ALLİ~
## 10 2016-12-30 00:00:00 FYY    0           0      0 pension   Partici~ ALLİ~
## # ... with 68,099 more rows, and 27 more variables: total_value <dbl>,
## #   p_Bank_Bills <dbl>, p_Other <dbl>, p_FX_Payable_Bills <dbl>,
## #   p_Government_Bond <dbl>, p_Foreign_Currency_Bills <dbl>, p_Eurobonds <dbl>,
## #   p_Commercial_Paper <dbl>, p_Fund_Participation_Certificate <dbl>,
## #   p_Real_Estate_Certificate <dbl>, p_Treasury_Bill <dbl>, p_Stock <dbl>,
## #   p_Government_Bonds_and_Bills_FX <dbl>, p_Participation_Account <dbl>,
## #   p_Government_Lease_Certificates <dbl>, p_Precious_Metals <dbl>,
## #   p_Private_Sector_Lease_Certificates <dbl>, p_Private_Sector_Bond <dbl>,
## #   p_Repo <dbl>, p_Derivatives <dbl>, p_TMM <dbl>, p_Reverse_Repo <dbl>,
## #   p_Asset_Backed_Securities <dbl>, p_Term_Deposit <dbl>,
## #   p_Foreign_Debt_Instruments <dbl>, p_Foreign_Equity <dbl>,
## #   p_Foreign_Securities <dbl>

We will remove 68109 rows with obsolete entries, 833295 rows left

df_import=df_import%>%filter(price>0 & shares>0 &!is.na(price)&!is.na(shares))%>%glimpse()

## Rows: 833,295
## Columns: 35
## $ date                                <dttm> 2015-12-31, 2015-12-31, 2015-1...
## $ code                                <chr> "ABE", "AEA", "AEB", "AEC", "AE...
## $ fund_type                           <chr> "pension", "pension", "pension"...
## $ category                            <chr> "Variable Fund", "Gold Fund", "...
## $ name                                <chr> "ANADOLU HAYAT EMEKLILIK A.Ş.B....
## $ price                               <dbl> 0.012212, 0.011026, 0.033588, 0...
## $ shares                              <dbl> 3876175452, 10761804554, 219965...
## $ people                              <dbl> 25887, 97857, 13269, 101, 5030,...
## $ total_value                         <dbl> 47336763, 118662196, 73881771, ...
## $ p_Bank_Bills                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Other                             <dbl> 3.02, 94.31, 1.90, 0.39, 1.79, ...
## $ p_FX_Payable_Bills                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Bond                   <dbl> 0.66, 0.00, 0.84, 60.90, 66.88,...
## $ p_Foreign_Currency_Bills            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Eurobonds                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Commercial_Paper                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Fund_Participation_Certificate    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Real_Estate_Certificate           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Treasury_Bill                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Stock                             <dbl> 0.00, 0.88, 82.55, 4.76, 29.47,...
## $ p_Government_Bonds_and_Bills_FX     <dbl> 0.00, 0.00, 0.00, 11.09, 0.00, ...
## $ p_Participation_Account             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Lease_Certificates     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Precious_Metals                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Lease_Certificates <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Bond               <dbl> 0.00, 4.81, 0.00, 22.78, 0.00, ...
## $ p_Repo                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Derivatives                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_TMM                               <dbl> 0.00, 0.00, 6.61, 0.08, 0.00, 7...
## $ p_Reverse_Repo                      <dbl> 0.40, 0.00, 8.10, 0.00, 1.86, 2...
## $ p_Asset_Backed_Securities           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Term_Deposit                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Debt_Instruments          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Equity                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Securities                <dbl> 95.92, 0.00, 0.00, 0.00, 0.00, ...

total number of funds decreased from 993 to 952 after removing obsolete data rows.

df_import%>%distinct(code)%>%count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1   952

Are fund names unique among fund codes? Are there more than 1 name for a fund code? We checked them and did not find any problem.

df_import %>% group_by(code, name)%>%  distinct(code) %>% summarise(count=n())%>%filter(count>1)%>%arrange(desc(count))

## `summarise()` regrouping output by 'code' (override with `.groups` argument)

## # A tibble: 0 x 3
## # Groups:   code [0]
## # ... with 3 variables: code <chr>, name <chr>, count <int>

Are there any funds which are not active any more? Let’s check how many funds have data in the last day of our data set.

Out of the 952, only 920 funds have up to date price data.

df_import %>% filter(date==max(date))%>%glimpse()

## Rows: 920
## Columns: 35
## $ date                                <dttm> 2020-11-16, 2020-11-16, 2020-1...
## $ code                                <chr> "AAJ", "ABE", "ACV", "AEA", "AE...
## $ fund_type                           <chr> "pension", "pension", "pension"...
## $ category                            <chr> "AES Standard Fund", "Variable ...
## $ name                                <chr> "AVİVASA EMEKLİLİK VE HAYAT A.Ş...
## $ price                               <dbl> 0.014412, 0.046772, 0.016043, 0...
## $ shares                              <dbl> 35933708934, 12476111398, 22690...
## $ people                              <dbl> 366643, 107295, 103359, 526168,...
## $ total_value                         <dbl> 517888264, 583533145, 36404063,...
## $ p_Bank_Bills                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Other                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_FX_Payable_Bills                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Bond                   <dbl> 50.21, 0.00, 0.00, 0.00, 0.00, ...
## $ p_Foreign_Currency_Bills            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Eurobonds                         <dbl> 0.00, 0.03, 0.00, 0.00, 0.00, 1...
## $ p_Commercial_Paper                  <dbl> 15.08, 0.00, 0.00, 0.00, 0.00, ...
## $ p_Fund_Participation_Certificate    <dbl> 13.10, 4.02, 0.00, 0.00, 2.59, ...
## $ p_Real_Estate_Certificate           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Treasury_Bill                     <dbl> 1.41, 0.00, 0.00, 0.00, 0.00, 0...
## $ p_Stock                             <dbl> 9.93, 0.00, 0.00, 0.00, 97.21, ...
## $ p_Government_Bonds_and_Bills_FX     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0...
## $ p_Participation_Account             <dbl> 0.00, 0.00, 55.79, 0.00, 0.00, ...
## $ p_Government_Lease_Certificates     <dbl> 0.00, 0.00, 21.96, 84.05, 0.00,...
## $ p_Precious_Metals                   <dbl> 1.15, 0.00, 0.00, 15.73, 0.00, ...
## $ p_Private_Sector_Lease_Certificates <dbl> 0.00, 0.00, 22.25, 0.00, 0.00, ...
## $ p_Private_Sector_Bond               <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0...
## $ p_Repo                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Derivatives                       <dbl> 0.00, 0.52, 0.00, 0.00, 0.20, 0...
## $ p_TMM                               <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0...
## $ p_Reverse_Repo                      <dbl> 1.94, 1.82, 0.00, 0.00, 0.00, 7...
## $ p_Asset_Backed_Securities           <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0...
## $ p_Term_Deposit                      <dbl> 7.18, 0.00, 0.00, 0.00, 0.00, 0...
## $ p_Foreign_Debt_Instruments          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Equity                    <dbl> 0.00, 93.05, 0.00, 0.22, 0.00, ...
## $ p_Foreign_Securities                <dbl> 0.00, 0.56, 0.00, 0.00, 0.00, 0...

Let’s locate these 32 funds, they have 6657 rows in the data set

df_uptodatefunds=df_import %>% filter(date==max(date))%>%distinct(code)#%>%glimpse()
df_inactive=anti_join(df_import,df_uptodatefunds)%>%glimpse()

## Joining, by = "code"

## Rows: 6,657
## Columns: 35
## $ date                                <dttm> 2015-12-01, 2015-12-01, 2016-1...
## $ code                                <chr> "KRS", "OSV", "AZZ", "AZZ", "KR...
## $ fund_type                           <chr> "mutual", "mutual", "mutual", "...
## $ category                            <chr> "Hedge Umbrella Fund", "Hedge U...
## $ name                                <chr> "KARE PORTFÖY SERBEST (DÖVİZ) F...
## $ price                               <dbl> 0.019549, 0.008313, 3.549100, 3...
## $ shares                              <dbl> 1080769072, 9032358776, 5685319...
## $ people                              <dbl> 0, 0, 0, 0, 4, 19, 4, 0, 0, 0, ...
## $ total_value                         <dbl> 21127451, 75087720, 20177765, 1...
## $ p_Bank_Bills                        <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Other                             <dbl> NA, NA, 82.09, 82.10, 9.89, 0.5...
## $ p_FX_Payable_Bills                  <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Government_Bond                   <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Foreign_Currency_Bills            <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Eurobonds                         <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Commercial_Paper                  <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Fund_Participation_Certificate    <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Real_Estate_Certificate           <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Treasury_Bill                     <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Stock                             <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Government_Bonds_and_Bills_FX     <dbl> NA, NA, 0.00, 0.00, 42.36, 54.2...
## $ p_Participation_Account             <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Government_Lease_Certificates     <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Precious_Metals                   <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Private_Sector_Lease_Certificates <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Private_Sector_Bond               <dbl> NA, NA, 0.00, 0.00, 14.96, 1.14...
## $ p_Repo                              <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Derivatives                       <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_TMM                               <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Reverse_Repo                      <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Asset_Backed_Securities           <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Term_Deposit                      <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Foreign_Debt_Instruments          <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Foreign_Equity                    <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ p_Foreign_Securities                <dbl> NA, NA, 17.91, 17.90, 32.79, 44...

Let’s check the total number of data points and date coverage for these funds. These are inactive funds all from umbrella hedge funds

df_inactive %>% group_by(code, fund_type, category) %>% summarise(count=n(), latest=max(date), earliest=min(date), length=max(date)-min(date))%>% arrange(desc(latest))

## `summarise()` regrouping output by 'code', 'fund_type' (override with `.groups` argument)

## # A tibble: 32 x 7
## # Groups:   code, fund_type [32]
##    code  fund_type category count latest              earliest            length
##    <chr> <chr>     <chr>    <int> <dttm>              <dttm>              <drtn>
##  1 UAP   mutual    Hedge U~   116 2020-11-13 00:00:00 2020-06-01 00:00:00 165 d~
##  2 CEY   mutual    Hedge U~   400 2020-11-10 00:00:00 2019-04-02 00:00:00 588 d~
##  3 CTP   mutual    Hedge U~    78 2020-10-27 00:00:00 2020-07-07 00:00:00 112 d~
##  4 NBF   mutual    Hedge U~    51 2020-10-27 00:00:00 2020-08-18 00:00:00  70 d~
##  5 IMV   mutual    Hedge U~   286 2020-10-21 00:00:00 2019-09-04 00:00:00 413 d~
##  6 GDL   mutual    Hedge U~   163 2020-10-16 00:00:00 2020-02-21 00:00:00 238 d~
##  7 IMY   mutual    Hedge U~   519 2020-10-13 00:00:00 2018-09-13 00:00:00 761 d~
##  8 IMZ   mutual    Hedge U~   343 2020-10-13 00:00:00 2019-05-28 00:00:00 504 d~
##  9 IMT   mutual    Hedge U~   319 2020-10-09 00:00:00 2019-07-02 00:00:00 465 d~
## 10 ICG   mutual    Hedge U~   272 2020-10-08 00:00:00 2019-09-11 00:00:00 393 d~
## # ... with 22 more rows

These are all “Serbest”/“hedge fund” types that are only accessible by “niteliklik yatırımcı”/“qualified investor” mostly institutional investors. This subset don’t have reliable data and behave very different than typical mutual funds which are accessible to general public. We will remove the ones with unreliable data from our analysis data set.

When we remove 6657 rows from our data set, we now have 826638 rows

df_import=semi_join(df_import,df_uptodatefunds)%>%glimpse()

## Joining, by = "code"

## Rows: 826,638
## Columns: 35
## $ date                                <dttm> 2015-12-31, 2015-12-31, 2015-1...
## $ code                                <chr> "ABE", "AEA", "AEB", "AEC", "AE...
## $ fund_type                           <chr> "pension", "pension", "pension"...
## $ category                            <chr> "Variable Fund", "Gold Fund", "...
## $ name                                <chr> "ANADOLU HAYAT EMEKLILIK A.Ş.B....
## $ price                               <dbl> 0.012212, 0.011026, 0.033588, 0...
## $ shares                              <dbl> 3876175452, 10761804554, 219965...
## $ people                              <dbl> 25887, 97857, 13269, 101, 5030,...
## $ total_value                         <dbl> 47336763, 118662196, 73881771, ...
## $ p_Bank_Bills                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Other                             <dbl> 3.02, 94.31, 1.90, 0.39, 1.79, ...
## $ p_FX_Payable_Bills                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Bond                   <dbl> 0.66, 0.00, 0.84, 60.90, 66.88,...
## $ p_Foreign_Currency_Bills            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Eurobonds                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Commercial_Paper                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Fund_Participation_Certificate    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Real_Estate_Certificate           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Treasury_Bill                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Stock                             <dbl> 0.00, 0.88, 82.55, 4.76, 29.47,...
## $ p_Government_Bonds_and_Bills_FX     <dbl> 0.00, 0.00, 0.00, 11.09, 0.00, ...
## $ p_Participation_Account             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Lease_Certificates     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Precious_Metals                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Lease_Certificates <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Bond               <dbl> 0.00, 4.81, 0.00, 22.78, 0.00, ...
## $ p_Repo                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Derivatives                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_TMM                               <dbl> 0.00, 0.00, 6.61, 0.08, 0.00, 7...
## $ p_Reverse_Repo                      <dbl> 0.40, 0.00, 8.10, 0.00, 1.86, 2...
## $ p_Asset_Backed_Securities           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Term_Deposit                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Debt_Instruments          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Equity                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Securities                <dbl> 95.92, 0.00, 0.00, 0.00, 0.00, ...

We need to add some additional data to better classify funds in our future analysis.

86 Pension Funds which has OKS in their names are “Otomatik Katılım sistemi” funds. these are special funds forf Automatic Participation system pension contracts.

165 funds which have “katılım” or “participation” either in their names or category are special funds which only invest in securities without interest returns.

28 Pension funds which have “katkı”, “devlet katkısı”, “contribution” or “state contribution” either in their names or category are special funds which government contributions are invested.

df_import=df_import%>%mutate(contribution=(grepl(" KATKI ",name)|grepl(" contribution ",category)), participation=(grepl(" KATILIM ",name)|grepl(" participation  ",category)), OKS=grepl(" OKS ",name))%>%relocate(OKS, participation, contribution)%>%glimpse()

## Rows: 826,638
## Columns: 38
## $ OKS                                 <lgl> FALSE, FALSE, FALSE, FALSE, FAL...
## $ participation                       <lgl> FALSE, TRUE, FALSE, FALSE, FALS...
## $ contribution                        <lgl> FALSE, FALSE, FALSE, FALSE, FAL...
## $ date                                <dttm> 2015-12-31, 2015-12-31, 2015-1...
## $ code                                <chr> "ABE", "AEA", "AEB", "AEC", "AE...
## $ fund_type                           <chr> "pension", "pension", "pension"...
## $ category                            <chr> "Variable Fund", "Gold Fund", "...
## $ name                                <chr> "ANADOLU HAYAT EMEKLILIK A.Ş.B....
## $ price                               <dbl> 0.012212, 0.011026, 0.033588, 0...
## $ shares                              <dbl> 3876175452, 10761804554, 219965...
## $ people                              <dbl> 25887, 97857, 13269, 101, 5030,...
## $ total_value                         <dbl> 47336763, 118662196, 73881771, ...
## $ p_Bank_Bills                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Other                             <dbl> 3.02, 94.31, 1.90, 0.39, 1.79, ...
## $ p_FX_Payable_Bills                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Bond                   <dbl> 0.66, 0.00, 0.84, 60.90, 66.88,...
## $ p_Foreign_Currency_Bills            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Eurobonds                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Commercial_Paper                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Fund_Participation_Certificate    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Real_Estate_Certificate           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Treasury_Bill                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Stock                             <dbl> 0.00, 0.88, 82.55, 4.76, 29.47,...
## $ p_Government_Bonds_and_Bills_FX     <dbl> 0.00, 0.00, 0.00, 11.09, 0.00, ...
## $ p_Participation_Account             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Lease_Certificates     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Precious_Metals                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Lease_Certificates <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Bond               <dbl> 0.00, 4.81, 0.00, 22.78, 0.00, ...
## $ p_Repo                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Derivatives                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_TMM                               <dbl> 0.00, 0.00, 6.61, 0.08, 0.00, 7...
## $ p_Reverse_Repo                      <dbl> 0.40, 0.00, 8.10, 0.00, 1.86, 2...
## $ p_Asset_Backed_Securities           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Term_Deposit                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Debt_Instruments          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Equity                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Securities                <dbl> 95.92, 0.00, 0.00, 0.00, 0.00, ...

df_import%>%group_by(OKS,participation,contribution)%>%distinct(code)%>%summarize(n())

## `summarise()` regrouping output by 'OKS', 'participation' (override with `.groups` argument)

## # A tibble: 6 x 4
## # Groups:   OKS, participation [4]
##   OKS   participation contribution `n()`
##   <lgl> <lgl>         <lgl>        <int>
## 1 FALSE FALSE         FALSE          698
## 2 FALSE FALSE         TRUE            18
## 3 FALSE TRUE          FALSE          108
## 4 FALSE TRUE          TRUE            10
## 5 TRUE  FALSE         FALSE           49
## 6 TRUE  TRUE          FALSE           37

We don’t have the names or ids of portfolio management companies in our data set. We need to parse their names from th “name” field. We can also simplify the long names of the funds which would help us during visualisation stage

There are 920 unique funds.

df_import%>%distinct(name)%>%count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1   920

From the TEFAS web site, names of the companies are inside the html code of the page. We just copied them to an excel file. Import the names of the owner companies.

df_companies=read_excel("companies.xlsx")
df_companies%>%arrange(company_name)

## # A tibble: 53 x 3
##    company_code company_name                    company_type
##    <chr>        <chr>                           <chr>       
##  1 PPY          ACTUS PORTFÖY YÖNETİMİ A.Ş.     mutual      
##  2 ANM          AEGON EMEKLİLİK VE HAYAT A.Ş.   pension     
##  3 AKP          AK PORTFÖY YÖNETİMİ A.Ş.        mutual      
##  4 PGA          ALBARAKA PORTFÖY YÖNETİMİ A.Ş.  mutual      
##  5 KHM          ALLİANZ HAYAT VE EMEKLİLİK A.Ş. pension     
##  6 HYS          ALLIANZ YAŞAM VE EMEKLİLİK A.Ş. pension     
##  7 AEM          ANADOLU HAYAT EMEKLİLİK A.Ş.    pension     
##  8 APY          ATA PORTFÖY YÖNETİMİ A.Ş.       mutual      
##  9 ALA          ATLAS PORTFÖY YÖNETİMİ A.Ş.     mutual      
## 10 TLR          AURA PORTFÖY YÖNETİMİ A.Ş.      mutual      
## # ... with 43 more rows

There are 53 entries. 36 of them mutual funds, and 17 of them are pension fund companies.

df_companies%>%group_by(company_type)%>%summarise(n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   company_type `n()`
##   <chr>        <int>
## 1 mutual          36
## 2 pension         17

TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. has triple entries.Other company entries are unique.

df_companies%>%group_by(company_name)%>%summarise(count=n())%>%arrange(desc(count))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 51 x 2
##    company_name                    count
##    <chr>                           <int>
##  1 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş.     3
##  2 ACTUS PORTFÖY YÖNETİMİ A.Ş.         1
##  3 AEGON EMEKLİLİK VE HAYAT A.Ş.       1
##  4 AK PORTFÖY YÖNETİMİ A.Ş.            1
##  5 ALBARAKA PORTFÖY YÖNETİMİ A.Ş.      1
##  6 ALLİANZ HAYAT VE EMEKLİLİK A.Ş.     1
##  7 ALLIANZ YAŞAM VE EMEKLİLİK A.Ş.     1
##  8 ANADOLU HAYAT EMEKLİLİK A.Ş.        1
##  9 ATA PORTFÖY YÖNETİMİ A.Ş.           1
## 10 ATLAS PORTFÖY YÖNETİMİ A.Ş.         1
## # ... with 41 more rows

TURKIYE HAYAT VE EMEKLILIK A.S. is the merger of 3 government owned Pension companies: ZEM: Ziraat Emeklilik VEM: Vakif Emeklilik HEM:Halk Emeklilik

For our initial analysis we will combine them under the code: “TEM” During our advanced phases of our analysis if we decide to dig deeper to compare performances of the original pension fund companies we may look for ways to differentiate them.In the fund names, ZEM, VEM and HEM are included, we can transform them if we need to.

df_companies%>%filter(company_name=="TÜRKİYE HAYAT VE EMEKLİLİK A.Ş.")

## # A tibble: 3 x 3
##   company_code company_name                    company_type
##   <chr>        <chr>                           <chr>       
## 1 ZEM          TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. pension     
## 2 VEM          TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. pension     
## 3 HEM          TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. pension

df_import%>%distinct(name)%>%filter(grepl("VEM|ZEM|HEM",name))

## # A tibble: 75 x 1
##    name                                                                         
##    <chr>                                                                        
##  1 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM KATILIM KATKI EMEKLİLİK YATIRIM FONU     
##  2 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM HİSSE SENEDİ EMEKLİLİK YATIRIM FONU      
##  3 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM KATILIM STANDART EMEKLİLİK YATIRIM FONU  
##  4 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM DİNAMİK DEĞİŞKEN EMEKLİLİK YATIRIM FONU  
##  5 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM KAMU DIŞ BORÇLANMA ARAÇLARI EMEKLİLİK YA~
##  6 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM KAMU BORÇLANMA ARAÇLARI STANDART EMEKLİL~
##  7 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM KATILIM HİSSE SENEDİ EMEKLİLİK YATIRIM F~
##  8 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM KATILIM DİNAMİK DEĞİŞKEN EMEKLİLİK YATIR~
##  9 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM PARA PİYASASI EMEKLİLİK YATIRIM FONU     
## 10 TÜRKİYE HAYAT VE EMEKLİLİK A.Ş. HEM KATKI EMEKLİLİK YATIRIM FONU             
## # ... with 65 more rows

Let’s remove rows with VEM and ZEM then change HEM as TEM. Now we have 51 unique companies and codes

df_companies=df_companies%>%filter(company_code!="ZEM" & company_code!="VEM")%>%mutate(company_code=ifelse(company_code=="HEM","TEM",company_code))

df_companies%>%arrange(company_code)

## # A tibble: 51 x 3
##    company_code company_name                    company_type
##    <chr>        <chr>                           <chr>       
##  1 AEM          ANADOLU HAYAT EMEKLİLİK A.Ş.    pension     
##  2 AHS          AVİVASA EMEKLİLİK VE HAYAT A.Ş. pension     
##  3 AHY          AXA HAYAT VE EMEKLİLİK A.Ş.     pension     
##  4 AKP          AK PORTFÖY YÖNETİMİ A.Ş.        mutual      
##  5 ALA          ATLAS PORTFÖY YÖNETİMİ A.Ş.     mutual      
##  6 ANM          AEGON EMEKLİLİK VE HAYAT A.Ş.   pension     
##  7 APY          ATA PORTFÖY YÖNETİMİ A.Ş.       mutual      
##  8 ASH          BEREKET EMEKLİLİK VE HAYAT A.Ş. pension     
##  9 DEN          METLİFE EMEKLİLİK VE HAYAT A.Ş. pension     
## 10 EPY          DENİZ PORTFÖY YÖNETİMİ A.Ş      mutual      
## # ... with 41 more rows

Can we parse the names of the companies from fund names? Data seems messy, names on fund name fields and company name fields are not consistent.

df_import%>%distinct(name)%>%filter(!grepl("A.Ş.|AŞ|PORTFÖY",name))

## # A tibble: 23 x 1
##    name                                                                 
##    <chr>                                                                
##  1 ANADOLU HAYAT EMEKLİLİK KATKI EMEK.YAT.FONU                          
##  2 BNP PARİBAS CARDİF EMEKLİLİK ALTIN EMEKLİLİK YATIRIM  FONU           
##  3 FİBA EMEKLİLİK VE HAYAT OKS KATILIM STANDART EMEKLİLİK YATIRIM FONU  
##  4 AZİMUT PYŞ YABANCI BYF FON SEPETİ FONU                               
##  5 AZİMUT PYŞ KISA VADELİ BORÇLANMA ARAÇLARI FONU                       
##  6 AZİMUT PYŞ KİRA SERTİFİKALARI (SUKUK) KATILIM FONU                   
##  7 AZİMUT PYŞ BİRİNCİ HİSSE SENEDİ FONU (HİSSE SENEDİ YOĞUN FON)        
##  8 AZİMUT PYŞ ÇOKLU VARLIK DEĞİŞKEN FON                                 
##  9 AZİMUT PYŞ KAR PAYI ÖDEYEN HİSSE SENEDİ FONU (HİSSE SENEDİ YOĞUN FON)
## 10 AZİMUT PYŞ BİRİNCİ BORÇLANMA ARAÇLARI FONU                           
## # ... with 13 more rows

Let’s get the starting X characters to be able to parse them by company name and matche them with company list. It seems sometimes “İ” and sometime “I” used in the same name.

df_import%>%distinct(name)%>%mutate(name_start=str_sub(name,1,10))%>%distinct(name_start)%>%arrange(name_start)

## # A tibble: 63 x 1
##    name_start
##    <chr>     
##  1 ACTUS PORT
##  2 AEGON  EME
##  3 AEGON EMEK
##  4 AK PORTFÖY
##  5 ALBARAKA P
##  6 ALLIANZ HA
##  7 ALLİANZ HA
##  8 ALLIANZ YA
##  9 ALLİANZ YA
## 10 ANADOLU HA
## # ... with 53 more rows

Get rid of “İ”, “I” duplications and other anoying stuff by changing all turkish characters to lating ascii. We used stri_trans_general function of stringi package.

df_import=df_import%>%mutate(name=stri_trans_general(name,"latin-ascii"))

df_companies=df_companies%>%mutate(company_name=stri_trans_general(company_name,"latin-ascii"))

Let’s try again to extract first several characters to see if we can match them. It seems 9 characters is the sweet spot with 54 just a few above unique number 51

for (i in 1:25) {
  


df_temp=df_import%>%distinct(name)%>%mutate(name_start=str_sub(name,1,i))%>%distinct(name_start)
print(paste("First ", i,"characters. Distinct counts: ",count(df_temp)[[1,1]]))
}

## [1] "First  1 characters. Distinct counts:  20"
## [1] "First  2 characters. Distinct counts:  41"
## [1] "First  3 characters. Distinct counts:  47"
## [1] "First  4 characters. Distinct counts:  47"
## [1] "First  5 characters. Distinct counts:  47"
## [1] "First  6 characters. Distinct counts:  48"
## [1] "First  7 characters. Distinct counts:  50"
## [1] "First  8 characters. Distinct counts:  50"
## [1] "First  9 characters. Distinct counts:  54"
## [1] "First  10 characters. Distinct counts:  57"
## [1] "First  11 characters. Distinct counts:  57"
## [1] "First  12 characters. Distinct counts:  102"
## [1] "First  13 characters. Distinct counts:  163"
## [1] "First  14 characters. Distinct counts:  220"
## [1] "First  15 characters. Distinct counts:  270"
## [1] "First  16 characters. Distinct counts:  324"
## [1] "First  17 characters. Distinct counts:  376"
## [1] "First  18 characters. Distinct counts:  403"
## [1] "First  19 characters. Distinct counts:  423"
## [1] "First  20 characters. Distinct counts:  453"
## [1] "First  21 characters. Distinct counts:  475"
## [1] "First  22 characters. Distinct counts:  486"
## [1] "First  23 characters. Distinct counts:  496"
## [1] "First  24 characters. Distinct counts:  504"
## [1] "First  25 characters. Distinct counts:  516"

First 9 characters also sufficient to differentiate 51 unique companies

for (i in 1:25) {
  


df_temp2=df_companies%>%distinct(company_name)%>%mutate(name_start=str_sub(company_name,1,i))%>%distinct(name_start)
print(paste("First ", i,"characters. Distinct counts: ",count(df_temp2)[[1,1]]))
}

## [1] "First  1 characters. Distinct counts:  20"
## [1] "First  2 characters. Distinct counts:  42"
## [1] "First  3 characters. Distinct counts:  48"
## [1] "First  4 characters. Distinct counts:  48"
## [1] "First  5 characters. Distinct counts:  48"
## [1] "First  6 characters. Distinct counts:  49"
## [1] "First  7 characters. Distinct counts:  49"
## [1] "First  8 characters. Distinct counts:  49"
## [1] "First  9 characters. Distinct counts:  51"
## [1] "First  10 characters. Distinct counts:  51"
## [1] "First  11 characters. Distinct counts:  51"
## [1] "First  12 characters. Distinct counts:  51"
## [1] "First  13 characters. Distinct counts:  51"
## [1] "First  14 characters. Distinct counts:  51"
## [1] "First  15 characters. Distinct counts:  51"
## [1] "First  16 characters. Distinct counts:  51"
## [1] "First  17 characters. Distinct counts:  51"
## [1] "First  18 characters. Distinct counts:  51"
## [1] "First  19 characters. Distinct counts:  51"
## [1] "First  20 characters. Distinct counts:  51"
## [1] "First  21 characters. Distinct counts:  51"
## [1] "First  22 characters. Distinct counts:  51"
## [1] "First  23 characters. Distinct counts:  51"
## [1] "First  24 characters. Distinct counts:  51"
## [1] "First  25 characters. Distinct counts:  51"

There are few unmatches, but it is quite obvious how to align them to our data when we merge them on df_companies

df_import_9=df_import%>%distinct(name)%>%mutate(name_start=str_sub(name,1,9))%>%distinct(name_start)%>%arrange(name_start)%>%glimpse()

## Rows: 54
## Columns: 1
## $ name_start <chr> "ACTUS POR", "AEGON  EM", "AEGON EME", "AK PORTFO", "ALB...

df_companies=df_companies%>%mutate(name_start=str_sub(company_name,1,9))%>%arrange(name_start)%>%glimpse()

## Rows: 51
## Columns: 4
## $ company_code <chr> "PPY", "ANM", "AKP", "PGA", "KHM", "HYS", "AEM", "APY"...
## $ company_name <chr> "ACTUS PORTFOY YONETIMI A.S.", "AEGON EMEKLILIK VE HAY...
## $ company_type <chr> "mutual", "pension", "mutual", "mutual", "pension", "p...
## $ name_start   <chr> "ACTUS POR", "AEGON EME", "AK PORTFO", "ALBARAKA ", "A...

df_companies=full_join(df_import_9,df_companies, by="name_start")%>%arrange(name_start)
df_companies

## # A tibble: 56 x 4
##    name_start  company_code company_name                    company_type
##    <chr>       <chr>        <chr>                           <chr>       
##  1 "ACTUS POR" PPY          ACTUS PORTFOY YONETIMI A.S.     mutual      
##  2 "AEGON  EM" <NA>         <NA>                            <NA>        
##  3 "AEGON EME" ANM          AEGON EMEKLILIK VE HAYAT A.S.   pension     
##  4 "AK PORTFO" AKP          AK PORTFOY YONETIMI A.S.        mutual      
##  5 "ALBARAKA " PGA          ALBARAKA PORTFOY YONETIMI A.S.  mutual      
##  6 "ALLIANZ H" KHM          ALLIANZ HAYAT VE EMEKLILIK A.S. pension     
##  7 "ALLIANZ Y" HYS          ALLIANZ YASAM VE EMEKLILIK A.S. pension     
##  8 "ANADOLU H" AEM          ANADOLU HAYAT EMEKLILIK A.S.    pension     
##  9 "ATA PORTF" APY          ATA PORTFOY YONETIMI A.S.       mutual      
## 10 "ATLAS POR" ALA          ATLAS PORTFOY YONETIMI A.S.     mutual      
## # ... with 46 more rows

AEGON EM = AEGON EME=ANM (spaces!) BEREKET= BEREKET E=ASH (spaces!)

AZIMUT PY = AZIMUT PO = GPO HSBC PYS = HSBC PORT=HPY CIGNA FIN=CIGNA SAG=FEM

df_companies=df_companies%>%mutate(company_code=ifelse(grepl("AEGON",name_start),"ANM",company_code), company_name=ifelse(grepl("AEGON",name_start),"AEGON EMEKLILIK VE HAYAT A.S.",company_name), company_type=ifelse(grepl("AEGON",name_start),"pension",company_type) )%>%mutate(company_code=ifelse(grepl("AZIMUT",name_start),"GPO",company_code), company_name=ifelse(grepl("AZIMUT",name_start),"AZIMUT PORTFOY YONETIMI A.S.",company_name), company_type=ifelse(grepl("AZIMUT",name_start),"mutual",company_type) )%>%mutate(company_code=ifelse(grepl("HSBC",name_start),"HPY",company_code), company_name=ifelse(grepl("HSBC",name_start),"HSBC PORTFOY YONETIMI A.S.",company_name), company_type=ifelse(grepl("HSBC",name_start),"mutual",company_type) )%>%mutate(company_code=ifelse(grepl("CIGNA",name_start),"FEM",company_code), company_name=ifelse(grepl("CIGNA",name_start),"CIGNA SAGLIK HAYAT VE EMEKLILIK A.S.",company_name), company_type=ifelse(grepl("CIGNA",name_start),"pension",company_type) )%>%mutate(company_code=ifelse(grepl("BEREKET",name_start),"ASH",company_code), company_name=ifelse(grepl("BEREKET",name_start),"BEREKET EMEKLILIK VE HAYAT A.S.",company_name), company_type=ifelse(grepl("BEREKET",name_start),"pension",company_type) )%>%arrange(name_start)

df_companies

## # A tibble: 56 x 4
##    name_start  company_code company_name                    company_type
##    <chr>       <chr>        <chr>                           <chr>       
##  1 "ACTUS POR" PPY          ACTUS PORTFOY YONETIMI A.S.     mutual      
##  2 "AEGON  EM" ANM          AEGON EMEKLILIK VE HAYAT A.S.   pension     
##  3 "AEGON EME" ANM          AEGON EMEKLILIK VE HAYAT A.S.   pension     
##  4 "AK PORTFO" AKP          AK PORTFOY YONETIMI A.S.        mutual      
##  5 "ALBARAKA " PGA          ALBARAKA PORTFOY YONETIMI A.S.  mutual      
##  6 "ALLIANZ H" KHM          ALLIANZ HAYAT VE EMEKLILIK A.S. pension     
##  7 "ALLIANZ Y" HYS          ALLIANZ YASAM VE EMEKLILIK A.S. pension     
##  8 "ANADOLU H" AEM          ANADOLU HAYAT EMEKLILIK A.S.    pension     
##  9 "ATA PORTF" APY          ATA PORTFOY YONETIMI A.S.       mutual      
## 10 "ATLAS POR" ALA          ATLAS PORTFOY YONETIMI A.S.     mutual      
## # ... with 46 more rows

Now add this company name and code information to our main data set. 826638 rows are the same as before join operation

df_import=left_join(df_import%>%mutate(name_start=str_sub(name,1,9)),df_companies, by="name_start")%>%select(-name_start, -company_type)%>%glimpse()

## Rows: 826,638
## Columns: 40
## $ OKS                                 <lgl> FALSE, FALSE, FALSE, FALSE, FAL...
## $ participation                       <lgl> FALSE, TRUE, FALSE, FALSE, FALS...
## $ contribution                        <lgl> FALSE, FALSE, FALSE, FALSE, FAL...
## $ date                                <dttm> 2015-12-31, 2015-12-31, 2015-1...
## $ code                                <chr> "ABE", "AEA", "AEB", "AEC", "AE...
## $ fund_type                           <chr> "pension", "pension", "pension"...
## $ category                            <chr> "Variable Fund", "Gold Fund", "...
## $ name                                <chr> "ANADOLU HAYAT EMEKLILIK A.S.B....
## $ price                               <dbl> 0.012212, 0.011026, 0.033588, 0...
## $ shares                              <dbl> 3876175452, 10761804554, 219965...
## $ people                              <dbl> 25887, 97857, 13269, 101, 5030,...
## $ total_value                         <dbl> 47336763, 118662196, 73881771, ...
## $ p_Bank_Bills                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Other                             <dbl> 3.02, 94.31, 1.90, 0.39, 1.79, ...
## $ p_FX_Payable_Bills                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Bond                   <dbl> 0.66, 0.00, 0.84, 60.90, 66.88,...
## $ p_Foreign_Currency_Bills            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Eurobonds                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Commercial_Paper                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Fund_Participation_Certificate    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Real_Estate_Certificate           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Treasury_Bill                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Stock                             <dbl> 0.00, 0.88, 82.55, 4.76, 29.47,...
## $ p_Government_Bonds_and_Bills_FX     <dbl> 0.00, 0.00, 0.00, 11.09, 0.00, ...
## $ p_Participation_Account             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Government_Lease_Certificates     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Precious_Metals                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Lease_Certificates <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Private_Sector_Bond               <dbl> 0.00, 4.81, 0.00, 22.78, 0.00, ...
## $ p_Repo                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Derivatives                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_TMM                               <dbl> 0.00, 0.00, 6.61, 0.08, 0.00, 7...
## $ p_Reverse_Repo                      <dbl> 0.40, 0.00, 8.10, 0.00, 1.86, 2...
## $ p_Asset_Backed_Securities           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Term_Deposit                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Debt_Instruments          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Equity                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ p_Foreign_Securities                <dbl> 95.92, 0.00, 0.00, 0.00, 0.00, ...
## $ company_code                        <chr> "AEM", "AEM", "AHS", "HYS", "KH...
## $ company_name                        <chr> "ANADOLU HAYAT EMEKLILIK A.S.",...

It matched perfectly, no NA in company related fields.

print(df_import%>%filter(is.na(company_name)))

## # A tibble: 0 x 40
## # ... with 40 variables: OKS <lgl>, participation <lgl>, contribution <lgl>,
## #   date <dttm>, code <chr>, fund_type <chr>, category <chr>, name <chr>,
## #   price <dbl>, shares <dbl>, people <dbl>, total_value <dbl>,
## #   p_Bank_Bills <dbl>, p_Other <dbl>, p_FX_Payable_Bills <dbl>,
## #   p_Government_Bond <dbl>, p_Foreign_Currency_Bills <dbl>, p_Eurobonds <dbl>,
## #   p_Commercial_Paper <dbl>, p_Fund_Participation_Certificate <dbl>,
## #   p_Real_Estate_Certificate <dbl>, p_Treasury_Bill <dbl>, p_Stock <dbl>,
## #   p_Government_Bonds_and_Bills_FX <dbl>, p_Participation_Account <dbl>,
## #   p_Government_Lease_Certificates <dbl>, p_Precious_Metals <dbl>,
## #   p_Private_Sector_Lease_Certificates <dbl>, p_Private_Sector_Bond <dbl>,
## #   p_Repo <dbl>, p_Derivatives <dbl>, p_TMM <dbl>, p_Reverse_Repo <dbl>,
## #   p_Asset_Backed_Securities <dbl>, p_Term_Deposit <dbl>,
## #   p_Foreign_Debt_Instruments <dbl>, p_Foreign_Equity <dbl>,
## #   p_Foreign_Securities <dbl>, company_code <chr>, company_name <chr>

print(df_import%>%filter(is.na(company_code)))

## # A tibble: 0 x 40
## # ... with 40 variables: OKS <lgl>, participation <lgl>, contribution <lgl>,
## #   date <dttm>, code <chr>, fund_type <chr>, category <chr>, name <chr>,
## #   price <dbl>, shares <dbl>, people <dbl>, total_value <dbl>,
## #   p_Bank_Bills <dbl>, p_Other <dbl>, p_FX_Payable_Bills <dbl>,
## #   p_Government_Bond <dbl>, p_Foreign_Currency_Bills <dbl>, p_Eurobonds <dbl>,
## #   p_Commercial_Paper <dbl>, p_Fund_Participation_Certificate <dbl>,
## #   p_Real_Estate_Certificate <dbl>, p_Treasury_Bill <dbl>, p_Stock <dbl>,
## #   p_Government_Bonds_and_Bills_FX <dbl>, p_Participation_Account <dbl>,
## #   p_Government_Lease_Certificates <dbl>, p_Precious_Metals <dbl>,
## #   p_Private_Sector_Lease_Certificates <dbl>, p_Private_Sector_Bond <dbl>,
## #   p_Repo <dbl>, p_Derivatives <dbl>, p_TMM <dbl>, p_Reverse_Repo <dbl>,
## #   p_Asset_Backed_Securities <dbl>, p_Term_Deposit <dbl>,
## #   p_Foreign_Debt_Instruments <dbl>, p_Foreign_Equity <dbl>,
## #   p_Foreign_Securities <dbl>, company_code <chr>, company_name <chr>

We have completed one more milestone. Let’s export final, clean dataframe as RDS for exploratory analysis.

saveRDS(df_import,"df_clean.rds")

We have uploaded the df_clean.rds file and other material for this milestone to the google drive. You can review them from this link: https://drive.google.com/drive/folders/1SJPoQqPOXHdLEUReML9zpWw7NaNrcJUY?usp=sharing

TEFAS Fund Data Clean & Transform Process

MEF BDA 503 - Champions Project Group

29 11 2020

Data Cleaning & Transforming Process Summary

Data Cleaning & Transforming Process Detailed Steps