1 Assignment 1
About Me and Data Science in R
1.1 About Me
Hello everyone! My name is Yudum Paçin. I was born in Mudanya, Bursa, and I lived in Mudanya until I moved to Ankara for my university education. I have been working as an IT Business Analyst at VakifBank for 6 years, and nowadays I am about to switch my career path to become a Data Scientist 😀. My interest in data science started in 2019 with the famous Coursera course “Machine Learning” by Andrew Ng, which I think no longer exists as a single course but has been converted into a Machine Learning Specialization. But in fact, I was already working in this area before I realized it. Let me explain!
First of all, I am a mathematician. I graduated from the Mathematics department of Middle East Technical University in 2012. After graduation, I was sure that my profession would be in the IT sector, but I didn’t know where to start. To improve my knowledge of software development, I completed my master’s degree in Information Systems at METU, and I worked as a full-time Research Assistant while I was studying. During my graduate years, I was doing data manipulation and statistical tests, such as correlation tests, ANOVA, etc. However, that whole process was only a tool for me; I was focused on the results, not the process. Simply put, IBM SPSS was doing it for me. When I first understood the idea behind linear regression through the Machine Learning course on Coursera, I was fascinated by the idea of using mathematics in this way. As a data science enthusiast and future data scientist, I want to use machine learning skills to automate repetitive tasks. I think that if there is work that computers can handle as well as or better than humans, then we should find a way to automate it. Deep learning models are doing great in computer vision and NLP, and they are almost at the human cognitive level: AlphaGo, Ensembled deep learning model outperforms human experts in diagnosing.., Microsoft’s Deep Learning Project Outperforms Humans In Image Recognition
I am also planning to work on time series models and marketing analytics.
You can reach out to me on my LinkedIn page.
1.2 Advocating for Automation
This section is based on the RStudio talk “Advocating for Automation: Adapting Current Tools in Environmental Science through R” by Hannah Podzorski of GSI Environmental. Podzorski points out the importance of automation for teams with diverse skills and shows how it can be applied with R. You can watch the full version of the talk at the rstudio::conf(2022) link
The reason I chose this topic is that I am also trying to find ways to automate repetitive tasks and free up time for more important ones.
Automation has many benefits, such as:
- Reproducibility
- Simplicity
- Saving time
- Less human interaction, which means fewer errors
But where should automation start? Podzorski suggests starting small. She and her team members first decided to automate processes built around Microsoft Excel products using R.
{openxlsx}
: an R package designed to edit, read and create Excel files.
openxlsx::write.xlsx(data,"data.xlsx")
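For example, a minimal sketch of reading a file back and building a workbook with more control over sheets (the file and sheet names here are only placeholders):
library(openxlsx)
# read the first sheet back into a data frame
data_in <- read.xlsx("data.xlsx", sheet = 1)
# or build a workbook step by step
wb <- createWorkbook()
addWorksheet(wb, "results")
writeData(wb, "results", data_in)
saveWorkbook(wb, "results.xlsx", overwrite = TRUE)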
The other useful package is {officer}
. Officer is used to manipulate Word documents and PowerPoint files from R. Its GitHub page shows how to use it.
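A rough sketch of the typical workflow (document names and text are placeholders):
library(officer)
# add a paragraph to a new Word document and save it
doc <- read_docx()
doc <- body_add_par(doc, "Automated report section", style = "Normal")
print(doc, target = "report.docx")
# start a PowerPoint deck with an empty content slide
ppt <- read_pptx()
ppt <- add_slide(ppt, layout = "Title and Content", master = "Office Theme")
print(ppt, target = "slides.pptx")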
The last important package is {rvg}
, whose dml function wraps a ggplot object so that it stays editable when exported to a PowerPoint file.
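A minimal sketch of combining rvg with officer (the plot is only an example):
library(officer)
library(rvg)
library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
ppt <- read_pptx()
ppt <- add_slide(ppt, layout = "Title and Content", master = "Office Theme")
# dml() keeps the plot as editable vector shapes inside PowerPoint
ppt <- ph_with(ppt, dml(ggobj = p), location = ph_location_fullsize())
print(ppt, target = "editable_plot.pptx")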
Podzorski’s team later decided to use their automation skills to work with ProUCL more efficiently. ProUCL is a comprehensive statistical software package designed for analyzing environmental data. The team uses ProUCL by loading the data and exporting the results for further analysis or reports. With large data inputs, the software crashes or runs slowly. For these reasons, the team decided to automate this process with “Mini-Mouse Macro”. First, an R function subsets the large data set into more manageable chunks, then the mouse macro takes these input files, clicks through the software to compute the statistics, and finally saves the output file. This may seem like a small task to automate, but as the number of files grows, the time it takes a human to do it one by one becomes unmanageable. The team also automated the work of copy-pasting ProUCL outputs into Excel. As a result of this automation, the team saves more time for data analysis and for creating insights from the results.
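The talk does not share the exact code, but the chunking step could look roughly like the sketch below (the chunk size and folder name are assumptions for illustration):
# split a large data set into smaller files that ProUCL can handle
write_chunks <- function(df, chunk_size = 5000, out_dir = "proucl_input") {
  dir.create(out_dir, showWarnings = FALSE)
  chunk_id <- ceiling(seq_len(nrow(df)) / chunk_size)
  for (i in unique(chunk_id)) {
    chunk <- df[chunk_id == i, ]
    openxlsx::write.xlsx(chunk, file.path(out_dir, paste0("chunk_", i, ".xlsx")))
  }
}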
1.3 Three R Posts
In this section, the three R posts I chose will be summarized and discussed.
1.3.1 PCA vs Autoencoders for Dimensionality Reduction
You can reach the full version of the article from the link
When our data set has too many dimensions, it is a wise decision to keep the important ones and leave out the rest. But how do we choose the important ones? This article compares two methods for dimensionality reduction, PCA and autoencoders, in R using the Australian Institute of Sport data set.
Principal Components Analysis (PCA)
PCA reduces a data frame by orthogonally transforming the data into a set of principal components. The first principal component explains the largest share of the variation in the data, the second component explains the second largest share, and so on. By choosing the top k principal components that explain, say, 80-90% of the variation, the remaining components can be dropped since they do not significantly benefit the model.
To investigate the variation in the data, the data points can be plotted in 3 dimensions, which gives a better indication of the structure of the data. However, there can still be many dimensions that explain some of the variation and are not visualized; to cover them, all the various combinations of dimensions would need to be plotted. This is a drawback of PCA.
# min-max scale each numeric column to [0, 1] ("standardise" in the original post)
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
x_train <- apply(ais[, 1:11], 2, minmax)  # `ais` is the Australian Institute of Sport data set
# PCA
pca <- prcomp(x_train)
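To see how much variation each component explains, and thus pick k, the standard prcomp summary can be used (this is base R output, not code from the original post):
# proportion of variance explained by each principal component
summary(pca)
# cumulative proportion of variance, useful for choosing the top k components
cumsum(pca$sdev^2) / sum(pca$sdev^2)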
Autoencoder
The autoencoder can be constructed using the keras package. As with any neural network, there is a lot of flexibility in how autoencoders can be constructed, such as the number of hidden layers and the number of nodes in each. With each hidden layer, the network attempts to find new structure in the data. In general, autoencoders are symmetric, with the middle layer being the bottleneck. The first half of the autoencoder is the encoder and the second half is the decoder.
library(keras)

# set training data
x_train <- as.matrix(x_train)

# set model: a symmetric autoencoder with a 2-node bottleneck
model <- keras_model_sequential()
model %>%
  layer_dense(units = 6, activation = "tanh", input_shape = ncol(x_train)) %>%
  layer_dense(units = 2, activation = "tanh", name = "bottleneck") %>%
  layer_dense(units = 6, activation = "tanh") %>%
  layer_dense(units = ncol(x_train))

# view model layers
summary(model)

# compile model
model %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

# fit model: the input is also the target, so the network learns to reconstruct it
model %>% fit(
  x = x_train,
  y = x_train,
  epochs = 2000,
  verbose = 0
)

# evaluate the reconstruction error (mean squared error) on the training data
mse.ae2 <- evaluate(model, x_train, x_train)
mse.ae2
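To compare the learned representation with the first two principal components, the two-dimensional bottleneck output can be extracted. A minimal sketch, reusing the layer name defined above:
# build an encoder that stops at the bottleneck layer
encoder <- keras_model(inputs = model$input,
                       outputs = get_layer(model, "bottleneck")$output)
# two-dimensional codes for each observation, analogous to the first two PCs
codes <- predict(encoder, x_train)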
Image source: https://www.geeksforgeeks.org/ml-auto-encoders/
To summarize, some key differences for consideration between PCA and autoencoders are:
1. There are no guidelines for choosing the size of the bottleneck layer in the autoencoder, unlike in PCA, where the top k components can be chosen to account for x% of the variation. Often PCA can be used as a guide to choose k.
2. The autoencoder tends to perform better than PCA when k is small.
3. Autoencoders require more computation than PCA.
1.3.2 Audio classification with torch
In this article, audio classification using R torch is examined. I have tried audio classification with spectrograms using Keras and TensorFlow in Python before, so this topic caught my attention.
You can reach the full version of the article from the link
The data set for this study holds recordings of thirty different one- or two-syllable words, uttered by different speakers. There are about 65,000 audio files. Four properties are given for each file:
waveform, sample_rate, label_index, and label.
The waveform representation essentially shows how the loudness of the voice changes over time. The author suggests that even domain experts would agree this alone is not enough for sound classification. To get a more meaningful transformation, the Fourier transform is applied, which gives a representation of the sound that carries no information about time at all yet holds just as much information as the original signal. In R, the torch_fft_fft
function is used, where fft stands for Fast Fourier Transform.
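A minimal sketch of applying it, assuming waveform is a one-dimensional torch tensor of audio samples:
library(torch)
# complex-valued discrete Fourier transform of the waveform
dft <- torch_fft_fft(waveform)
# magnitude of each frequency component; time information is gone
magnitudes <- torch_abs(dft)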
From this alternate representation, the original sound wave can be recovered by taking the frequencies present in the signal, weighting them according to their coefficients, and adding them up. But in sound classification, timing information surely matters, so another representation is needed.
We can divide the signal into small chunks and run the Fourier transform on each of them. This representation is called the spectrogram.
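A naive sketch of that idea (the chunk size is arbitrary, and a real pipeline would use an optimized short-time Fourier transform rather than this loop):
# split the waveform into fixed-size chunks and Fourier-transform each one
naive_spectrogram <- function(waveform, chunk_size = 256) {
  n_chunks <- floor(length(waveform) / chunk_size)
  chunks <- torch_split(waveform[1:(n_chunks * chunk_size)], chunk_size)
  mags <- lapply(chunks, function(chunk) torch_abs(torch_fft_fft(chunk)))
  torch_stack(mags)  # rows = time steps, columns = frequency bins
}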
The spectrogram is a two-dimensional representation: an image. From here on, we can use convolutional neural networks for image recognition, built with library(torch).
1.3.3 Update Your Machine Learning Pipeline With vetiver and Quarto
The reason I chose this article is that I am curious about how models are created and deployed in real life. How does the machine learning pipeline process work?
You can reach the full version of the article from the link
Machine learning operations (MLOps) is a set of practices for running ML models in production environments. Vetiver, an open-source framework for the entire model life cycle, gives R and Python developers a unified way of working with ML models.
In this article, the Bike Prediction App is used. The app provides real-time predictions of the number of bikes at stations across Washington, D.C. The end-to-end machine learning pipeline uses R to import and modify the data and saves it as a pin (the pins package publishes data, models, and other R objects). The pipeline then develops a model and moves it to a deployable location.
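Pinning an object looks roughly like this (the board type and pin names are assumptions for illustration):
library(pins)
# connect to an RStudio Connect board and publish a data frame as a pin
board <- board_connect()
pin_write(board, station_info, name = "station_info")
# later, e.g. inside the Shiny app, read it back without touching the database
station_info <- pin_read(board, "station_info")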
This article focuses on updating the MLOps workflow using the new vetiver framework and Quarto.
Creating An End-to-End Machine Learning Pipeline
1. Create a custom package for pulling data
The data is pulled from the Capital Bikeshare API. The team developed an R package so that the functions that pull data from the API can be reused whenever needed.
2. Extract, transform, load process in R
The raw data coming from the API is written to a database. The station info is also written to a pin. This pin will be accessed by the Shiny app so that it can get the bike station info without connecting to the database.
This step is also called ETL Step 1 - Raw Data Refresh
3. Tidy and join datasets
In this phase, the raw bike data is preprocessed with tidyverse packages. Then the bike data is joined with the station data, and the resulting data is written to the database as a new table.
This step is also called ETL Step 2 - Tidy Data
4. Train and deploy the model
A random forest model is trained on the table resulting from Step 3. The model is saved to RStudio Connect as a pin (using vetiver) and then converted into an API endpoint (also using vetiver). Finally, the team deployed the API to RStudio Connect.
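With vetiver, that step looks roughly like the sketch below (the model object and names are placeholders, and the API is run locally here instead of being deployed to Connect):
library(vetiver)
library(pins)
library(plumber)
# wrap the fitted random forest (e.g. a tidymodels workflow) as a vetiver model
v <- vetiver_model(rf_fit, "bike_predictions")
# pin the versioned model to RStudio Connect
board <- board_connect()
vetiver_pin_write(board, v)
# expose the model as a Plumber API endpoint
pr() %>%
  vetiver_api(v) %>%
  pr_run(port = 8080)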
5. Create a model card
This step includes evaluating the model on the training and evaluation data with different methods. Vetiver’s model card template helps document essential facts and considerations about the deployed model.
6. Monitor model metrics
To ensure the model stays consistent, its metrics should be monitored. For this purpose, model performance is documented using vetiver and the metrics are written to a pin on RStudio Connect.
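vetiver provides helpers for exactly this; a rough sketch (the column names and pin name are assumptions):
library(vetiver)
# compute performance metrics over time on newly collected data
new_metrics <- vetiver_compute_metrics(
  new_bike_data,
  date_var = date,
  period = "week",
  truth = n_bikes,
  estimate = .pred
)
# append the metrics to a pin on the board for monitoring
vetiver_pin_metrics(board, new_metrics, metrics_pin_name = "bike_model_metrics")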
7. Deploy a Shiny app that displays real-time predictions
The API endpoint is used to serve predictions to a Shiny app interactively. Clicking on a station shows a line graph of time versus the predicted number of bikes.
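Inside the app, querying the deployed endpoint could look like this sketch (the URL and input data are placeholders):
library(vetiver)
# point at the deployed prediction API
endpoint <- vetiver_endpoint("https://connect.example.com/bike_predictions/predict")
# request predictions for the selected station and time window
preds <- predict(endpoint, new_station_data)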
8. Create project dashboard
The team created a dashboard to share the full context of this project.
Thanks for reading…