Personal Information


      My name is Emirhan Şahin. I was born in Bursa in 1998 and lived there for the first 18 years of my life.

      I graduated from the Faculty of Economics (in English) at Istanbul University in August 2020. During my bachelor’s degree, I took part in the Erasmus programme to become more sociable and to broaden my view of the world. I went to Barcelona, one of my favorite cities and, to me, the most beautiful city in Spain. I also started working as a cashier at Zara (Inditex). Within just 3 months, I was promoted to an accountant position and then to the position responsible for administrative services. After working there for almost 3 years, I quit my job to complete my degree.

      Upon completing my bachelor’s degree, I moved to Izmir in the summer of 2020 for a job in which I was able to improve myself in many ways. I also had the chance to earn money while doing something I love: working with computers and other technologies. Along the way, I learned applications such as Microsoft Office, Microsoft PowerPoint, and various accounting and management programs. That is also how I first encountered data science and machine learning. I started to learn programming and data science on my own, but I decided that I needed formal training and a postgraduate degree in order to be ready for the industry. For this purpose, I began my master’s degree in Big Data Analytics at MEF University in September 2021.

      I am focusing on turning the theoretical knowledge I gained during my academic career into professional practice. I hope to apply this knowledge at a company that uses machine learning and big data. I would love to work on real-life artificial intelligence and machine learning projects.

      For further information, you can take a look at my LinkedIn page or ask me on Twitter.

Tree-Based Machine Learning for Insurance Pricing

      This part includes a review and summary of an R Consortium video.

      I have chosen this video to review because the topic is closely related to the real world and the main focus is on big data and machine learning.

      The presenter starts by describing the methods and formulas used to calculate the price of a premium, and then explains the goal of the presentation. He states that the Generalized Linear Model (GLM) of J. Nelder and R. Wedderburn is widely used in classical premium calculations. He then explains the machine learning methods and the parameters they use. He adapts the way frequency and severity are modelled in the GLM setting: for frequency, which follows a count distribution, he uses the Poisson deviance as the loss function, and for severity, which follows a skewed distribution, he uses the Gamma deviance.
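      The two loss functions mentioned above can be written in a few lines of R. This is only a sketch based on the standard definitions of the Poisson and Gamma deviance, not code from the video; the function names and the example vectors are made up for illustration.

```r
# Poisson deviance for claim counts (frequency); y = observed, mu = predicted
poisson_deviance <- function(y, mu) {
  # the term y * log(y / mu) is taken as 0 when y is 0
  log_term <- ifelse(y > 0, y * log(y / mu), 0)
  2 * sum(log_term - (y - mu))
}

# Gamma deviance for claim amounts (severity); assumes y > 0 and mu > 0
gamma_deviance <- function(y, mu) {
  2 * sum((y - mu) / mu - log(y / mu))
}

# Made-up observations and predictions, for illustration only
claim_counts  <- c(0, 1, 3)
pred_counts   <- c(0.5, 1.2, 2.5)
claim_amounts <- c(100, 250, 900)
pred_amounts  <- c(120, 300, 700)

poisson_deviance(claim_counts, pred_counts)
gamma_deviance(claim_amounts, pred_amounts)
```

      Both functions return 0 when the predictions equal the observations and grow as the two drift apart. They should agree with the sums of the dev.resids() values from the poisson() and Gamma() family objects in base R.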

      In the second half of the video, the presenter shows the results of the new ML techniques and compares them with the classical methods. He states that, even though both classical methods outperform 2 of the new ML techniques, the Gradient Boosting Machine (GBM) returns the most precise results of all 5 tests. He then walks through the results and graphs of the tests and explains how profitable it could be for a company to implement the GBM technique.

      In the end, he concludes by summarizing the pros and cons of using machine learning techniques.

Interesting R Posts

      In this part, I will review and summarize 3 R posts.

    1. Improving a Visualization

                This is a summary of Improving a Visualization by Jonathan Carroll

                 Visualization is a crucial way to make sense of the trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form that is easier to understand, highlighting trends and outliers. As data scientists and analysts, we will often face problems in our careers that require visualization, so it is important to learn alternative ways to visualize data. In this blog post, the blogger demonstrates alternative visualization techniques for the market share of streaming services, comparing 2020 to 2021.

                 He found that the original visualization was built in PowerPoint, and he set out to reproduce it in R. First he gathered the data, then he built a simple bar plot using ggplot2. He went on to refine the bar plot by changing the font and colors and adding the logos.

                 After reproducing the original visualization with better colors and fonts, he shows the same data as a pie chart and then as a horizontal bar graph. Although these earlier visualizations do show the current and previous market shares, he wanted a clearer picture of the one-year changes (losses and growth), so he centered the 2020 percentages and plotted the differences from 2021. The blogger then presents the data as a dumbbell plot, which aims to show the separation between the two values, and even animates the changes between the two years. The one I found most interesting, however, was the alluvial plot, because it clearly illustrates the transitions between service providers.
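                 To give a feel for the dumbbell idea, here is a minimal sketch in base R. The post itself uses ggplot2; base graphics are used here only to avoid package dependencies. The service names and percentages are placeholders invented for illustration, not the figures from the post.

```r
# Placeholder shares in percent; these are NOT the values from the post
shares <- data.frame(
  service    = c("Service A", "Service B", "Service C"),
  share_2020 = c(30, 25, 10),
  share_2021 = c(24, 27, 14)
)

# One-year change per service (negative = loss, positive = growth)
shares$change <- shares$share_2021 - shares$share_2020

# Dumbbell plot: one row per service, a segment joining the two years
y_pos <- seq_len(nrow(shares))
plot(NA,
     xlim = range(shares$share_2020, shares$share_2021),
     ylim = c(0.5, nrow(shares) + 0.5),
     xlab = "Market share (%)", ylab = "", yaxt = "n")
axis(2, at = y_pos, labels = shares$service, las = 1)
segments(shares$share_2020, y_pos, shares$share_2021, y_pos)
points(shares$share_2020, y_pos, pch = 19, col = "grey50")  # 2020
points(shares$share_2021, y_pos, pch = 19, col = "black")   # 2021
```

                 The length of each segment is the absolute one-year change, so the services that moved the most stand out immediately.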

                 This blog post shows how easy it is to code alternative visualizations of the same data in R.

    2. 5 Ways to Subset a Data Frame in R

                This is a review of 5 Ways to Subset a Data Frame in R by Douglas E. Rice

                 Subsetting a data frame is the process of selecting a set of rows and columns from the data frame. Subsetting is very useful because we often want to perform operations on subsets of our data, especially when the data is big. Another important purpose of subsetting is to save network bandwidth and storage space on the computer.


                 1. The blogger first explains the most basic way of subsetting a data frame in R, which uses square brackets.

                 [Figure: Subset]


                 He then continues with a hypothetical scenario, retrieves the required data from the data frame by subsetting, and creates a new data frame from the subset.
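                 As a minimal sketch of bracket subsetting, the example below uses a small made-up data frame. The column names follow the ones that appear later in this summary, but education_demo and all of its values are invented for illustration and are not the blogger's data.

```r
# Toy stand-in for the education data; values are made up
education_demo <- data.frame(
  State = c("ME", "NH", "OH", "IN", "TX", "CA"),
  Region = c(1, 1, 2, 2, 3, 4),
  Minor.Population = c(325, 323, 324, 329, 335, 307),
  Education.Expenditures = c(235, 231, 221, 264, 212, 332)
)

# data_frame[rows, columns]: keep rows 3 to 4 and columns 1, 3 and 4
by_position <- education_demo[3:4, c(1, 3:4)]
by_position
```

                 The part before the comma selects rows and the part after it selects columns; leaving either side empty keeps everything along that dimension.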


                 2. In the second method, he notes that we can get the required data from the same data frame simply by omitting the unnecessary data. It is quite similar to the first method, but he places a “-” sign before each c() vector so that the listed rows and columns are dropped rather than kept.

subtracting_data_frame <- education[-c(1:9, 22:50), -c(1, 3:5)]
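                 A quick way to convince yourself that dropping and keeping are two views of the same operation is to compare both on a small example. The data frame below is made up for illustration and is not the blogger's data.

```r
# Toy data frame; values are made up
education_demo <- data.frame(
  State = c("ME", "NH", "OH", "IN", "TX", "CA"),
  Region = c(1, 1, 2, 2, 3, 4),
  Minor.Population = c(325, 323, 324, 329, 335, 307),
  Education.Expenditures = c(235, 231, 221, 264, 212, 332)
)

kept    <- education_demo[3:4, c(1, 3:4)]        # name what to keep
dropped <- education_demo[-c(1:2, 5:6), -2]      # name what to drop

# Both routes should give the same rows and columns
same_result <- isTRUE(all.equal(kept, dropped))
same_result
```

                 Negative indices are convenient when the part to discard is shorter to describe than the part to keep.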


                 3. Often we work with a data set so large that it is not practical to count row and column numbers. In that case, the blogger suggests using the code below:

new_data_frame <- education[which(education$Region == 2), names(education) %in% c("State", "Minor.Population", "Education.Expenditures")]

                 In this method, the which() function returns the indices of the rows where the Region column of the education data frame is 2. The %in% operator then selects the columns of the education data frame whose names match the listed names.
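                 It can help to look at the two index vectors on their own before combining them. The sketch below does that on a small made-up data frame (education_demo and its values are invented for illustration).

```r
# Toy data frame; values are made up
education_demo <- data.frame(
  State = c("ME", "NH", "OH", "IN", "TX", "CA"),
  Region = c(1, 1, 2, 2, 3, 4),
  Minor.Population = c(325, 323, 324, 329, 335, 307),
  Education.Expenditures = c(235, 231, 221, 264, 212, 332)
)

# which() turns a logical condition into row positions
row_idx <- which(education_demo$Region == 2)

# %in% gives one TRUE/FALSE per column name
col_keep <- names(education_demo) %in%
  c("State", "Minor.Population", "Education.Expenditures")

row_idx    # positions of the rows with Region == 2
col_keep   # TRUE for the columns to keep
education_demo[row_idx, col_keep]
```

                 The row selection is an integer vector and the column selection is a logical vector; square brackets accept both kinds of index.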


                 4. In the next method, he shows an easier way that uses R’s built-in subset() function.

easier_data_frame <- subset(education, Region == 2, select = c("State", "Minor.Population", "Education.Expenditures"))
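                 One practical difference between these approaches is how they treat missing values in the condition. The sketch below uses a made-up data frame with one missing Region to illustrate it; the names education_demo and wanted are invented for this example.

```r
# Toy data frame with one missing Region on purpose; values are made up
education_demo <- data.frame(
  State = c("ME", "NH", "OH", "IN", "TX"),
  Region = c(1, 1, 2, 2, NA),
  Minor.Population = c(325, 323, 324, 329, 335),
  Education.Expenditures = c(235, 231, 221, 264, 212)
)
wanted <- c("State", "Minor.Population", "Education.Expenditures")

with_subset  <- subset(education_demo, Region == 2, select = wanted)
with_which   <- education_demo[which(education_demo$Region == 2), wanted]
with_logical <- education_demo[education_demo$Region == 2, wanted]

# subset() and which() both skip the row where the condition is NA
subset_matches_which <- isTRUE(all.equal(with_subset, with_which))
subset_matches_which

# A bare logical index keeps an extra row filled with NA
nrow(with_subset)
nrow(with_logical)
```

                 This is one reason to prefer subset() or which() over a bare logical condition when the filtering column may contain missing values.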


                 5. He states that the last method is the most useful for manipulating data once you get a grasp of it. This method is not included in base R, so we need to install the dplyr package.

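install.packages("dplyr")  # only needed once per R installation
library(dplyr)
# filter() keeps the rows where Region is 2; select() then keeps the listed columns
dplyr_data_frame <- select(filter(education, Region == 2), c(State, Minor.Population:Education.Expenditures))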

                 Once we have installed dplyr, we use the filter() and select() functions from the package. Even though this method requires an external package, it is arguably the easier and faster way to achieve the required output.

                 This blog helps us to understand multiple ways of subsetting in R in different situations.

    3. How to Remove Outliers in R

                This is a summary of How to Remove Outliers in R by Syed A. Hadi

                 An outlier is a data point that differs markedly from the other data points in a data set. Even though it sounds simple, determining what is or isn’t an outlier is fairly subjective and depends on the study. In this blog post, the blogger goes into detail about identifying, visualizing and removing outliers from a dataset. Removing outliers is crucial for data analysis since they can dramatically affect the model, the plot or the data output.

                 1. Looking at Outliers in R

                 Statisticians use different ways to locate the outliers in a dataset. The most common methods include the Z-score method and the Interquartile Range (IQR) method. In this blog post, the blogger uses the IQR method, where IQR = Q3 - Q1. In this method, outliers are the points that lie below [Q1 - (1.5)IQR] or above [Q3 + (1.5)IQR].
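                 The post does not use the Z-score method, but for comparison it can be sketched in a few lines. The cut-off of 3 standard deviations is a common convention assumed here, not something taken from the post, and the variable names are made up.

```r
# warpbreaks ships with R, so no download is needed
breaks <- warpbreaks$breaks

# Z-score: distance from the mean, measured in standard deviations
z_scores <- (breaks - mean(breaks)) / sd(breaks)

# Flag points more than 3 standard deviations from the mean (assumed cut-off)
z_outliers <- breaks[abs(z_scores) > 3]
z_outliers
```

                 Because the mean and standard deviation are themselves pulled around by extreme values, the Z-score rule and the IQR rule do not always flag the same points.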

                 [Figure: interquartile]


                 He starts by loading the built-in warpbreaks dataset in R using the data() function.

data("warpbreaks")


                 2. Visualizing Outliers in R

                 Next, he creates a boxplot to identify the outliers.

boxplot(warpbreaks)$out

[1] 70 67
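                 The numbers behind the boxplot can be inspected without drawing anything by calling boxplot.stats(), which boxplot() relies on. The sketch below is an addition to the post's code and the variable names are made up. Note that the box is built from the hinges returned by fivenum(), which can differ slightly from the quartiles returned by quantile().

```r
bp <- boxplot.stats(warpbreaks$breaks)

bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # points lying beyond the whiskers

# Hinges used by the boxplot versus the default quartiles
hinges    <- fivenum(warpbreaks$breaks)[c(2, 4)]
quartiles <- quantile(warpbreaks$breaks, probs = c(0.25, 0.75), names = FALSE)
hinges
quartiles
```

                 Because of this difference, the cut-offs computed from quantile() and IQR() are not guaranteed to match the boxplot's whiskers exactly, although for this dataset both approaches flag the same two points.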


                 3. Finding Outliers – Statistical Methods

                 Generally, the visualization method is considered easy, but it can become a real burden on the system with large datasets, so statistical methods are used heavily in big data analytics. He uses the quantile() function to find the 25th and 75th percentiles of the dataset, and the IQR() function, which gives the difference between the 75th and 25th percentiles. He then computes the cut-off values beyond which data points are treated as outliers.

Q <- quantile(warpbreaks$breaks, probs = c(.25, .75), na.rm = FALSE)
iqr <- IQR(warpbreaks$breaks)
up <- Q[2] + 1.5 * iqr   # Upper Range
low <- Q[1] - 1.5 * iqr  # Lower Range
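                 The same calculation can be wrapped in a small helper so that it can be reused on other columns. This is a sketch only: iqr_fences and its argument k are names made up for this example, not functions from the post or from base R.

```r
# Return the lower and upper cut-offs for the 1.5 * IQR rule
iqr_fences <- function(values, k = 1.5) {
  q <- quantile(values, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]                      # the interquartile range
  c(low = q[1] - k * spread, up = q[2] + k * spread)
}

fences <- iqr_fences(warpbreaks$breaks)
fences

# Logical vector marking the points outside the cut-offs
is_outlier <- warpbreaks$breaks < fences["low"] |
  warpbreaks$breaks > fences["up"]
warpbreaks$breaks[is_outlier]
```

                 A larger k makes the rule more tolerant; 1.5 is the conventional default and the value used in the post.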


                 4. Eliminating Outliers in R

                 Using the subset() function, he extracts the data points that are not outliers and then visualizes them on a boxplot.

eliminated <- subset(warpbreaks, breaks > (Q[1] - 1.5 * iqr) & breaks < (Q[2] + 1.5 * iqr))


                 R also offers other ways of removing outliers. One of them uses the boxplot() function to identify the outliers and the which() function to locate and remove them from the dataset.

boxplot(warpbreaks$breaks, plot = FALSE)$out  # identifying the outliers
outliers <- boxplot(warpbreaks$breaks, plot = FALSE)$out  # saving the outliers in a vector
x <- warpbreaks
x <- x[-which(x$breaks %in% outliers), ]  # dropping the rows whose breaks value is an outlier


                 Even though it requires a little more R knowledge, this may be the more efficient way to remove outliers in R.
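                 One caveat with negative indexing through which() is worth knowing: if no outliers are found, which() returns an empty vector and the negative index then selects no rows at all. The sketch below shows a logical-index alternative that avoids this; it is an addition to the post's code, and the variable names are made up.

```r
outliers <- boxplot.stats(warpbreaks$breaks)$out

# Keep every row whose breaks value is NOT among the outliers
cleaned <- warpbreaks[!(warpbreaks$breaks %in% outliers), ]
c(before = nrow(warpbreaks), after = nrow(cleaned))

# Edge case: pretend that no outliers were found
no_outliers <- numeric(0)
rows_with_which   <- nrow(warpbreaks[-which(warpbreaks$breaks %in% no_outliers), ])
rows_with_logical <- nrow(warpbreaks[!(warpbreaks$breaks %in% no_outliers), ])
rows_with_which    # every row is lost
rows_with_logical  # every row is kept
```

                 On warpbreaks both approaches give the same result because outliers exist; the difference only appears when the outlier vector is empty.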

                 This blog post explains why and how to remove outliers in R.


                                               Thanks for reading…



RMarkdown

