Assignment 1: RMarkdown Homework

About myself

My name is Can Aytöre and I’m from Eskişehir. I’m interested in basketball and in watching tennis. I like to travel and take photographs. I’m currently on a sailing team and take part in sailing regattas on the weekends.

I graduated from the Industrial Engineering department of Istanbul Technical University in June 2019. I started a master’s degree in Industrial Engineering at Boğaziçi University last year to gain more expertise in operations research, stochastic processes and data science. Next semester, I will be working on my master’s thesis.

I have been working in the marketing department of BSH for more than a year now. In addition to the usual work of the marketing department, I work on the big data that comes from market research companies, producing meaningful analyses and useful predictions. Of course, I use advanced statistical methods for that.

After my internships and work experience, I decided that I want to work in data science. Since data science appeals to me, I take care to choose my courses and studies in this field. I will try to fill this personal journal carefully with this perspective.

Here is my LinkedIn profile

Review of useR 2020: Visualization of missing data and imputations in time series (S. Moritz)

It is well known that missing data is nearly everywhere, and that is also true for time series. The reasons are manifold: the problem may occur in data recording, data transmission or data processing. Some packages, such as imputeTS, help to deal with missing values. With the imputation feature, missing values are replaced with reasonable values. With the visualization feature, missing data are easily detected. The first step is to get an impression of where the missing values are by visualizing the NA distribution, which shows where they are located in the time series. Once this first analysis is done, we can explore the imputation results. The visualization shows the replaced values within the time series, so we can easily check whether the imputed values are consistent with the rest of the series. Another nice feature of this package is that its plots are ggplot outputs. Since missing data can create problems for data analysis, imputation is seen as a good way to overcome them. Besides, this practice can be seen as a simple application of machine learning.
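The two steps described above (locating the NA values, then replacing them) can be sketched in a few lines. This is a minimal illustration in plain Python and not the imputeTS API: the function names `na_positions` and `interpolate_linear` are hypothetical, missing values are assumed to be stored as `None`, and linear interpolation is only one of many possible imputation methods.

```python
def na_positions(series):
    """Return the indices of the missing values (None) in the series."""
    return [i for i, value in enumerate(series) if value is None]


def interpolate_linear(series):
    """Return a copy of the series with each None replaced by linear interpolation.

    Gaps at the start or end have only one known neighbor, so they are
    filled with that nearest known value.
    """
    known = [i for i, value in enumerate(series) if value is not None]
    if not known:
        raise ValueError("series contains no observed values")

    filled = list(series)
    for i in na_positions(series):
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            # Weight by the position of i between its two known neighbors.
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


if __name__ == "__main__":
    series = [1.0, None, 3.0, None, None, 6.0, None]
    print("missing at:", na_positions(series))
    print("imputed:   ", interpolate_linear(series))
```

Printing the positions first mirrors the "look before you impute" order of the talk; comparing the imputed list with the original then plays the role of the consistency check.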

Link to “Review of useR 2020: Visualization of missing data and imputations in time series (S. Moritz)”

Three R-posts relevant to my interests:

1- Data visualization in R

Data visualization means using visual elements such as charts and maps to present data in a more understandable form. Today, data visualization is critical because it plays an important role in analyzing large amounts of information in the big data world and in making decisions based on that analysis. There are many ways to visualize data: scatter charts, line charts, bar charts and histograms are some examples. Each tool has a different purpose, so it is important to use the right tool at the right time. For example, scatter charts are used to understand the relationship between two numerical variables. Histograms show the distribution of numerical values and are very important in data science. Data visualization also makes outliers easier to detect. Data visualization tools are available in the R programming language, and more advanced packages such as ggplot2 are also available. As a result, data visualization is one of the fastest and most powerful ways of understanding the available information.
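To make the histogram idea concrete, here is a minimal sketch in plain Python (the post itself works in R) that counts values into equal-width bins and prints them as text bars. The function name `histogram_counts` and the sample data are made up for illustration; a real analysis would use a plotting package.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are identical; bins would have zero width")
    width = (highest - lowest) / n_bins

    counts = [0] * n_bins
    for value in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        index = min(int((value - lowest) / width), n_bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample: nine ordinary values and one suspicious one.
    values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
    for bin_number, count in enumerate(histogram_counts(values, 4)):
        print(f"bin {bin_number}: {'#' * count}")
```

In this sample the value 20 ends up alone in the last bin, far from the rest, which is the kind of outlier a histogram makes visible at a glance.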

Link to “Data visualization in R”

2- Cluster analysis in R

Cluster analysis means arranging a set of objects so that those with similar properties are in the same group. Unlike traditional methods, this process can be easily applied with statistical tools. Cluster analysis makes it possible to find the subgroups in a data set. There are many ways to do this analysis. One of the most common is hierarchical cluster analysis, which can itself be carried out with several methods. For example, the nearest neighbor approach lists the pairwise distances from smallest to largest and forms clusters from the pairs with the smallest distance. The distances between these clusters and the remaining observations are then compared, and the process ends when all objects have been merged. A dendrogram is used to visualize such analyses. Another method is K-means. In this analysis, the number of clusters the data will be divided into is fixed in advance; some methods can be used beforehand to find the optimal number of clusters. To sum up, cluster analysis is a statistical method used in many fields such as data mining, image analysis and machine learning.
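The nearest neighbor approach described above can be sketched as follows. This is a minimal, unoptimized illustration in plain Python, assuming Euclidean distance between points; the function name `single_linkage` and the sample points are hypothetical, and the list of merges it returns is the information a dendrogram would draw.

```python
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    The distance between two clusters is the smallest distance between
    any pair of their members (nearest neighbor / single linkage).
    Returns the final clusters (as lists of point indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters], merges


if __name__ == "__main__":
    # Hypothetical 2-D observations: two tight pairs and one isolated point.
    points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
    clusters, merges = single_linkage(points, 3)
    print("clusters:", clusters)
    for left, right, height in merges:
        print(f"merged {left} and {right} at distance {height}")
```

Calling the function with `n_clusters=1` runs the procedure to the end, where every object has been merged into a single group.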

Link to “Cluster analysis in R”

3- Network Analysis in R

Network models exist almost everywhere in our lives. Traffic infrastructure, the internet, food chains and artificial neural networks are some examples. In network models, however, some nodes are more important than others. So what does “most important node” mean, and is the answer always the same? There are several ways to make sense of this. The first is degree centrality, which finds the most connected node. The second is closeness centrality, which determines which node can spread a flow through the network fastest. It is found by summing the lengths of the shortest paths from one node to all the other nodes. A recent example is the super spreaders of highly contagious diseases such as COVID-19. The third is betweenness centrality, which finds out which node is most important in maintaining the connections across the network. As we can see, there is more than one definition of “most important”. To sum up, network analysis plays an important role in our understanding of the relationships between objects.
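The first two measures can be sketched in plain Python for a small network. This is a minimal illustration, assuming an undirected, unweighted graph stored as an adjacency dictionary; the function names and the five-node example graph are hypothetical, and betweenness centrality is left out to keep the sketch short. Closeness is computed here as the number of reachable nodes divided by the sum of the shortest path lengths, so a higher value means a more central node.

```python
from collections import deque


def degree_centrality(graph):
    """Number of direct neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def shortest_path_lengths(graph, source):
    """Breadth-first search: number of edges from source to each reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of shortest path lengths to them."""
    result = {}
    for node in graph:
        distance = shortest_path_lengths(graph, node)
        total = sum(distance.values())
        result[node] = (len(distance) - 1) / total if total > 0 else 0.0
    return result


if __name__ == "__main__":
    # Hypothetical network: B is a hub, E hangs off D.
    graph = {
        "A": ["B"],
        "B": ["A", "C", "D"],
        "C": ["B"],
        "D": ["B", "E"],
        "E": ["D"],
    }
    print("degree:   ", degree_centrality(graph))
    print("closeness:", closeness_centrality(graph))
```

In this small example both measures point to the same node, but on larger networks the rankings can differ, which is the point the post makes about the several meanings of “most important”.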

Link to “Network Analysis in R”

Other applications related to Data Science:

4- Optimizing Maritime Shipping Routes

HEC University of Montreal conducted a data analysis study to optimize the routes of logistics companies in maritime transport. CSL Group, to which the university provided consultancy, owns a large and diverse fleet, so creating Gantt charts for its ships is quite complicated. At this point, a mathematical model and an algorithm are needed to optimize these processes. With this model, planning becomes easier, transportation costs are minimized and customer needs are met better and on time. To build the model, some data had to be collected. Transportation costs were calculated by measuring the distances between the ports. Information such as the capacity and consumption of the ships was obtained. In addition, the model includes weather and cruise delays, which can be considered external factors. After the robustness of the model was tested, it was subjected to real-time simulations. Since the model was introduced to the company, it has been used quite efficiently by the planning teams. As historical data are added to the system, the model will give more and more accurate results. This problem, which is very difficult to solve with traditional methods, has been overcome in practice with data science.

Link to “Optimizing Maritime Shipping Routes”

5- Data Science for Targeted Advertising

The subject of this article is how relevant ads can be delivered effectively to users based on their past behavior. A good advertisement depends not only on its message but also on delivering it to the right people on the right channels at the right time. For example, dropping brochures from an aircraft over a crowd may seem like an effective idea at first, but it does not mean the advertisement will attract the crowd’s attention. In that case, both money and labor are wasted. For this reason, it has become important for marketers and advertisers to analyze big data from users and present ads to the right people at the right time. With machine learning, users’ behavior can be understood, and their future behavior and wishes can be predicted. There are several methods for this. The first is demographic targeting. This approach categorizes the target audience according to characteristics such as age, gender, income and location. It is very easy to apply, and it also makes the target audience easy to control. The second method is property targeting, which is quite simple and popular. The advertiser identifies the pages where the advertisement is deemed appropriate and, in this way, delivers the right ad to the target audience. The third method is behavioral targeting. Unlike the others, this method takes advantage of the user’s past behavior. Based on the traces left by the user, future behaviors and requests can also be predicted. The more data there is about a user, the better the targeting results for that user. Users leave their clearest traces online, so advertisers can easily obtain this data from companies such as Google and Yahoo. In addition, with machine learning, advertisements can be delivered to users in real time to meet their demands. Data science is widely and effectively used in this field.

Link to “Data Science for Targeted Advertising”