Interesting R Posts
In this part, I will review and summarize 3 R posts.
1. Improving a Visualization
This is a summary of Improving a Visualization by Jonathan Carroll
Visualization is a crucial way to make sense of the trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form that is easier to understand, highlighting trends and outliers. As data scientists and analysts, we will often face problems that call for visualization, so it is important to learn alternative ways to visualize the same data. In this blog, the blogger demonstrates alternative visualization techniques for streaming-service market share, comparing 2020 to 2021.
He found that the original visualization was built in PowerPoint and set out to reproduce it in R. First he gathered the data, then built a simple bar plot using ggplot2. He went on to refine the bar plot by changing the font and colors and adding the logos.
After reproducing the original visualization with better colors and fonts, he illustrates the same data as a pie chart and then as a horizontal bar graph. Even though these visualizations do the job of showing the current and previous market shares, he wanted to give a better idea of the one-year changes (loss and growth), so he centered the 2020 percentages and showed the differences from 2021. The blogger then presents the data as a dumbbell plot, which aims to show the separation between the two values, and even animates the changes between the two years. The one that interested me most was the alluvial plot, because it clearly illustrates the transitions between service providers.
This blog helps us understand how easy it is to code and demonstrate alternative visualizations in R.
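To make the dumbbell idea concrete, here is a minimal base R sketch. It is not the blogger's code (he works with ggplot2), and the service names and shares are made-up placeholders, not the figures from the post. It computes the change in percentage points and joins each service's two values with a segment.

```r
# Made-up placeholder shares (percent); NOT the figures from the original post.
shares <- data.frame(
  service = c("Service A", "Service B", "Service C", "Service D"),
  y2020   = c(30, 25, 10, 5),
  y2021   = c(24, 22, 13, 9)
)

# Change in percentage points, 2021 relative to 2020
shares$change <- shares$y2021 - shares$y2020

# Dumbbell plot: one row per service, a segment joining the two years
ypos <- seq_len(nrow(shares))
plot(NA, type = "n",
     xlim = range(shares$y2020, shares$y2021),
     ylim = c(0.5, nrow(shares) + 0.5),
     xlab = "Market share (%)", ylab = "", yaxt = "n")
axis(2, at = ypos, labels = shares$service, las = 1)
segments(shares$y2020, ypos, shares$y2021, ypos, col = "grey50", lwd = 2)
points(shares$y2020, ypos, pch = 19, col = "steelblue")
points(shares$y2021, ypos, pch = 19, col = "firebrick")
legend("topright", legend = c("2020", "2021"), pch = 19,
       col = c("steelblue", "firebrick"))
```

The length and direction of each segment show at a glance which services gained or lost share.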
2. 5 Ways to Subset a Data Frame in R
This is a review of 5 Ways to Subset a Data Frame in R by Douglas E. Rice
Subsetting a data frame is the process of selecting a set of rows and columns from the data frame. Subsetting is very useful because we often want to perform operations on subsets of our data, especially if it is big data. Another important purpose of subsetting is to save network bandwidth and storage space.
1. The blogger first explains the most basic way of subsetting a data frame in R, which uses square brackets.
Then he continues with a hypothetical scenario, retrieves the required data from the data frame by subsetting, and creates a new data frame from the subset.
2. In the second method, he points out that we can get the required data from the same data frame by omitting the unnecessary data. It is quite similar to the first method, but he puts a “-” sign before the c() vector function.
subtracting_data_frame <- education[-c(1:9,22:50),-c(1,3:5)]
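The first two methods can be compared side by side in a small sketch. The `edu` data frame below is a hypothetical stand-in for the education data (its column names and values are invented for illustration), so the index ranges differ from the ones in the post.

```r
# Toy stand-in for the `education` data; all names and values are hypothetical.
edu <- data.frame(
  State  = c("AA", "BB", "CC", "DD", "EE"),
  Region = c(1, 2, 2, 3, 2),
  Income = c(40, 45, 50, 55, 60),
  Spend  = c(200, 250, 300, 350, 400)
)

# Method 1: keep rows 2-3 and columns 1 and 4 by position
kept <- edu[2:3, c(1, 4)]

# Method 2: drop everything else with negative indices
dropped <- edu[-c(1, 4:5), -c(2:3)]

# Both routes should select the same rows and columns
all.equal(kept, dropped)
```

Positive indices say what to keep and negative indices say what to drop, so which one is shorter depends on how much of the data frame you want.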
3. Often we work with a data set so large that it is not practical to count row and column numbers. In that case, the blogger suggests using the code below:
new_data_frame <- education[which(education$Region == 2),names(education) %in% c("State","Minor.Population","Education.Expenditures")]
In this method, the which() function returns the indices of the rows where the Region column of the education data frame is 2. The %in% operator then selects the columns whose names in the education data frame match the listed names.
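It can help to look at the two intermediate pieces on their own. The sketch below uses a hypothetical toy data frame (not the education data) with one missing Region, which also shows a side effect of which() that the post does not discuss: it drops NA comparisons.

```r
# Hypothetical toy data with a missing Region value
edu <- data.frame(
  State  = c("AA", "BB", "CC", "DD"),
  Region = c(1, 2, NA, 2),
  Spend  = c(200, 250, 300, 350)
)

rows <- which(edu$Region == 2)               # integer positions of TRUE; NA is dropped
cols <- names(edu) %in% c("State", "Spend")  # one TRUE/FALSE per column

rows
cols
edu[rows, cols]

# Without which(), the NA comparison yields an extra row filled with NA
edu[edu$Region == 2, cols]
```

So which() turns a logical test into row positions, while %in% produces a logical mask over the column names.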
4. In the next method, he explains an easier way that uses subset(), a built-in function in R.
easier_data_frame <- subset(education, Region == 2, select = c("State","Minor.Population","Education.Expenditures"))
5. He states that the last method is the most useful for manipulating data once you get a grasp of it. This method is not included in base R, so we need to install the dplyr package.
install.packages("dplyr")
library(dplyr)
dplyr_data_frame <- select(filter(education, Region == 2),c(State,Minor.Population:Education.Expenditures))
Once we’ve installed and loaded dplyr, we use the filter() and select() functions from the package. Even though this method requires an external package, we can see that it is the easier and faster way to achieve the required output.
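The nested dplyr call can also be written with the pipe operator, which many people find easier to read. This is a sketch on a hypothetical toy data frame, not the blogger's code. It assumes dplyr is installed and skips the dplyr part otherwise, and it checks the result against subset() from method 4.

```r
# Hypothetical toy data standing in for `education`
edu <- data.frame(
  State  = c("AA", "BB", "CC", "DD"),
  Region = c(1, 2, 2, 3),
  Spend  = c(200, 250, 300, 350)
)

# Base R reference result (method 4)
base_result <- subset(edu, Region == 2, select = c(State, Spend))

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  # Same steps as select(filter(...)), read left to right
  piped <- edu %>%
    filter(Region == 2) %>%
    select(State, Spend)
  # Compare the column values; row names may differ between the two results
  print(identical(piped$State, base_result$State))
  print(identical(piped$Spend, base_result$Spend))
}
```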
This blog helps us to understand multiple ways of subsetting in R in different situations.
3. How to Remove Outliers in R
This is a summary of How to Remove Outliers in R by Syed A. Hadi
An outlier is a data point that differs markedly from the other data points in a data set. Even though it sounds easy, determining what is or isn’t an outlier is pretty subjective and depends on the study. In this blog, the blogger goes into detail about identifying, visualizing and removing outliers from a dataset. Dealing with outliers is crucial for data analysis, since an outlier can dramatically affect the model, the plot or the data output.
1. Looking at Outliers in R
Statisticians use and prefer different ways to locate the outliers in a dataset. The most common methods include the Z-score method and the Interquartile Range (IQR) method. In this blog, the blogger uses the IQR method. In this method, points that are below [Q1 - 1.5 × IQR] or above [Q3 + 1.5 × IQR] are considered outliers.
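The post does not show the Z-score method, so for comparison here is a minimal sketch of it on the same warpbreaks data. The cut-off of 3 standard deviations is an assumption (a common convention), not something taken from the post.

```r
data("warpbreaks")
x <- warpbreaks$breaks

# Standardize: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)

z_cutoff <- 3                        # assumed threshold, not from the post
z_outliers <- x[abs(z) > z_cutoff]   # values more than 3 SDs from the mean
z_outliers
```

Because the mean and standard deviation are themselves pulled by extreme values, the Z-score and IQR rules do not always flag the same points.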
He starts by loading the built-in warpbreaks dataset in R using the data() function.
data("warpbreaks")
2. Visualizing Outliers in R
Next, he creates a boxplot to identify the outliers.
boxplot(warpbreaks)$out
[1] 70 67
3. Finding Outliers – Statistical Methods
Visualization is generally considered the easy method, but it can become a real burden for the system, so statistical methods are used a lot in big data analytics. He uses the quantile() function to find the 25th and 75th percentiles of the dataset, and the IQR() function, which gives the difference between the 75th and 25th percentiles. He then calculates the cut-off ranges beyond which all data points are outliers.
Q <- quantile(warpbreaks$breaks, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(warpbreaks$breaks)
up  <- Q[2] + 1.5 * iqr # Upper Range
low <- Q[1] - 1.5 * iqr # Lower Range
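As a quick check of these cut-offs, the following sketch recomputes them and lists the values that fall outside. The variable names `breaks` and `flagged` are mine, and unname() is only there to strip the "25%" and "75%" labels that quantile() attaches.

```r
data("warpbreaks")
breaks <- warpbreaks$breaks

Q   <- quantile(breaks, probs = c(.25, .75))
iqr <- IQR(breaks)
up  <- unname(Q[2] + 1.5 * iqr)   # upper fence
low <- unname(Q[1] - 1.5 * iqr)   # lower fence

c(lower = low, upper = up)

# Values outside the fences
flagged <- breaks[breaks < low | breaks > up]
flagged
```

For this dataset the flagged values are 70 and 67, the same two points the boxplot reported earlier.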
4. Eliminating Outliers in R
Using the subset() function, he extracts the data points that are not outliers, then visualizes the result with a boxplot.
eliminated <- subset(warpbreaks, warpbreaks$breaks > (Q[1] - 1.5*iqr) & warpbreaks$breaks < (Q[2]+1.5*iqr))
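The post's snippet stops at the filtering step, so here is a small sketch of the follow-up: counting how many rows were dropped and drawing the boxplot of what remains. The plot title is my own choice.

```r
data("warpbreaks")
Q   <- quantile(warpbreaks$breaks, probs = c(.25, .75))
iqr <- IQR(warpbreaks$breaks)

# Keep only rows strictly inside the two fences
eliminated <- subset(warpbreaks,
                     breaks > (Q[1] - 1.5 * iqr) & breaks < (Q[2] + 1.5 * iqr))

# Number of rows removed by the filter
nrow(warpbreaks) - nrow(eliminated)

# Boxplot of the filtered values
boxplot(eliminated$breaks, main = "breaks after filtering")
```

Keep in mind that the quartiles are recomputed on the filtered data, so in general a new boxplot can flag points that were not outliers before.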
R also has other ways of removing outliers. One of them uses the boxplot() function to identify the outliers and the which() function to find and remove them from the dataset.
boxplot(warpbreaks$breaks, plot=FALSE)$out # identifying the outliers
outliers <- boxplot(warpbreaks$breaks, plot=FALSE)$out # saving the outliers in a vector
x <- warpbreaks
x <- x[-which(x$breaks %in% outliers),]
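One caveat with the -which() idiom is worth knowing, although the post does not mention it: if nothing matches, which() returns an empty vector and the negative index then selects zero rows instead of all of them. A sketch of an alternative that uses logical negation (my suggestion, not the blogger's method) avoids this:

```r
data("warpbreaks")
outliers <- boxplot(warpbreaks$breaks, plot = FALSE)$out

# Logical negation keeps every row whose value is not an outlier
cleaned <- warpbreaks[!(warpbreaks$breaks %in% outliers), ]
nrow(cleaned)

# Pitfall of -which(): with no matches it returns zero rows instead of all rows
no_match <- numeric(0)
nrow(warpbreaks[-which(warpbreaks$breaks %in% no_match), ])   # 0
nrow(warpbreaks[!(warpbreaks$breaks %in% no_match), ])        # 54
```

On warpbreaks both approaches give the same result because outliers exist, but the logical form stays correct when they do not.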
Even though it requires a little more R knowledge, this may be the more efficient way to remove outliers in R.
This blog explains why and how to remove outliers in R.
Thanks for reading…
