1 Assignment 1

Published

October 17, 2022

1.1 About Me

Hello everyone. I am Gülhan, and I currently work as a mathematics teacher. I have an undergraduate degree in mathematics education from Boğaziçi University. I became interested in data as a hobby. On my data journey, there were several areas to learn. One of them was Python, a programming language used to analyze data. Last February, I started the “100 Days of Code: The Complete Python Pro Bootcamp for 2022” certification program on Udemy. In addition to Python, the course also covers HTML and CSS. At first I studied for a few hours every day after work. As the course progressed, I started to enjoy it more, spent 4-5 hours on it each evening, and completed it. Then I started other data analytics courses on Udemy: first one on statistics and then one on SQL, to improve my knowledge of both data extraction and analysis. As time passed, I became sure that this is the job I should do. However, since I have 5 years of teaching experience and no academic background in data, I decided to pursue a master’s degree. So that’s why I’m here :)

Here is my LinkedIn profile

1.2 rstudio::conf(2022)

1.2.1 Garbage Data, And What To Do About Them by Jim Kloet

This video is about bad data. According to the video, bad data is data that cannot be used for analysis or decision making. It identifies the challenges you face when dealing with garbage data and suggests a framework for examining it. Here are the steps:

1. Identify the garbage data: It may not be useful for what you want to do, but it may be useful for something else. You need to decide. If it is useful for neither, you can discard it.

2. Identify the costs of garbage data: Some costs are unavoidable and some costs are avoidable, but these can be catastrophic.

3. Take action when you see garbage data: The options include being honest with your stakeholders and being empathetic toward them, so that you continue to have stakeholders.

1.3 R Posts

1.3.1 Creating your career with R – data science based jobs!

The post argues that your industry and geographical location are crucial factors affecting your career. That is why someone who wants to build modern data analytics skills such as Hadoop, Weka, Natural Language Processing or Machine Learning should be in a country that is open to new things. However, if you live in a more conservative country, SQL helps you find a job. The post says that R is the best way to start a data-based career and that it should be supported by other tools such as SQL and Hadoop.

Post Link

1.3.2 Top Job Boards for R users to take your career to the next level!

The post says that R is one of the most important programming languages for data science applications. According to the post, R is used for data manipulation, business analytics, machine learning algorithms and data visualization techniques. Here are some job boards that offer data science jobs requiring R programming skills: R-users.com, Dice, CareerBuilder, LinkedIn, Indeed, Monster, Naukri and Timesjobs.

Post Link

1.3.3 Three Different Ways in How to Store Your Datasets in R

data.frame, data.table and data_frame are three common ways to store data in R.

data.frame is part of base R, so it is available without installing any package. The drawback the post describes is that strings are imported as factors. That was the default before R 4.0.0; since then, stringsAsFactors defaults to FALSE, so in the example below the Names and Classes columns are kept as character. Example:

# rnorm(2) draws only 2 values; data.frame() recycles them to fill all 6 rows
mydf = data.frame(Names = c("Paul", "Kim", "Nora", "Sue", "Paul", "Kim"),
                  Classes = c("A", "A", "B", "B", "B", "C"),
                  Points = rnorm(2))
mydf

  Names Classes    Points
1  Paul       A -0.290930
2   Kim       A -1.960523
3  Nora       B -0.290930
4   Sue       B -1.960523
5  Paul       B -0.290930
6   Kim       C -1.960523

sapply(mydf, class)

      Names     Classes      Points 
"character" "character"   "numeric"

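To see the factor behavior the post describes, here is a minimal sketch that sets stringsAsFactors explicitly, so the result does not depend on the installed R version. The vectors and the names df_chr and df_fct are made up for this illustration and do not come from the post.

```r
# Hypothetical example data, not from the original post
student_names <- c("Paul", "Kim", "Nora")
student_classes <- c("A", "A", "B")

# Keep strings as character vectors (the default since R 4.0.0)
df_chr <- data.frame(Names = student_names, Classes = student_classes,
                     stringsAsFactors = FALSE)

# Convert strings to factors (the default before R 4.0.0)
df_fct <- data.frame(Names = student_names, Classes = student_classes,
                     stringsAsFactors = TRUE)

sapply(df_chr, class)  # both columns are "character"
sapply(df_fct, class)  # both columns are "factor"
```

Setting the argument explicitly is a reasonable habit when the same script has to run on both old and new R installations.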
data.table saves time and helps reduce errors because data can be manipulated with less code and fewer function calls. It has its own SQL-like syntax. Example:

library(data.table, warn.conflicts = FALSE)
# As above, the 2 values from rnorm(2) are recycled across the 6 rows
mytable = data.table(Names = c("Paul", "Kim", "Nora", "Sue", "Paul", "Kim"),
                     Classes = c("A", "A", "B", "B", "B", "C"),
                     Points = rnorm(2))
mytable

   Names Classes     Points
1:  Paul       A -0.8496979
2:   Kim       A -1.0848712
3:  Nora       B -0.8496979
4:   Sue       B -1.0848712
5:  Paul       B -0.8496979
6:   Kim       C -1.0848712

sapply(mytable, class)

      Names     Classes      Points 
"character" "character"   "numeric"

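The SQL-like structure mentioned above can be sketched with the general form DT[i, j, by], where i filters rows, j computes columns and by groups the result. This is only an illustration under assumed data: the table scores, its fixed Points values and the result name avg_by_class are hypothetical and not taken from the post.

```r
library(data.table)

# Hypothetical scores with fixed values so the result is reproducible
scores <- data.table(Names = c("Paul", "Kim", "Nora", "Sue", "Paul", "Kim"),
                     Classes = c("A", "A", "B", "B", "B", "C"),
                     Points = c(10, 20, 30, 40, 50, 60))

# i  = Points > 10            (like WHERE)
# j  = .(MeanPoints = mean()) (like SELECT with an aggregate)
# by = Classes                (like GROUP BY)
avg_by_class <- scores[Points > 10, .(MeanPoints = mean(Points)), by = Classes]
avg_by_class
```

The whole query fits in one bracket expression, which is what the post means by less code and fewer function calls.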
If you are using data_frame, you need columns of equal length. Otherwise, you will get an error message.

When I ran the code, I got a message saying “you should use tibble() instead of data_frame()”, because data_frame() is deprecated. That is why I used tibble() in the following example. Example:

library(tibble)
my_df = tibble(Names = c("Paul", "Kim", "Nora", "Sue", "Paul", "Kim"),
               Classes = c("A", "A", "B", "B", "B", "C"),
               Points = rnorm(6))
my_df

# A tibble: 6 × 3
  Names Classes Points
  <chr> <chr>    <dbl>
1 Paul  A        1.12 
2 Kim   A       -0.976
3 Nora  B       -1.57 
4 Sue   B        1.11 
5 Paul  B       -0.629
6 Kim   C        0.393

sapply(my_df, class)

      Names     Classes      Points 
"character" "character"   "numeric"

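The equal-length rule can be checked with a small sketch. It also shows why Points = rnorm(2) worked in the data.frame example, where the two values were repeated, but would not work with tibble(), which only recycles values of length one. The object names recycled, size_error and constant_col are hypothetical and chosen just for this illustration.

```r
library(tibble)

# data.frame() recycles a length-2 vector because 6 is a multiple of 2
recycled <- data.frame(id = 1:6, Points = c(0.5, -0.5))

# tibble() refuses the same input; capture the error instead of stopping
size_error <- tryCatch({
  tibble(id = 1:6, Points = c(0.5, -0.5))
  FALSE
}, error = function(e) TRUE)

# A single value is still recycled by tibble()
constant_col <- tibble(id = 1:6, Points = 0)

recycled$Points     # 0.5 -0.5 repeated three times
size_error          # TRUE: the length-2 column was rejected
nrow(constant_col)  # 6
```

The stricter rule in tibble() turns a silent repetition of values into a visible error, which makes accidental length mismatches easier to notice.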
Post Link