Stack Overflow Data Preparation

Stack Overflow survey data from the last five years have been brought together. Some questions had the same meaning but were asked under different names in different years, so the columns were analyzed and fused. In general, the work focused on combining the variables to be analyzed across years.
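Conceptually, the renaming can be captured as a small lookup from each year's historical column name to the common name used in the merged data. The sketch below is illustrative only (the FormalEducation/EdLevel pair is a hypothetical example, not taken from the script); the real pairs were found by inspecting each year's column names.

rename_map <- c(Respondent = "ResponseId",         # 2017-2020 id column vs. 2021's ResponseId
                FormalEducation = "EdLevel")       # hypothetical example of a renamed question

# Rename any column whose historical name appears in the map, leave the rest alone
harmonize_names <- function(df, map = rename_map) {
  hits <- names(df) %in% names(map)
  names(df)[hits] <- map[names(df)[hits]]
  df
}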

Some variables were answered on an interval basis in some years and as point values in others; age is one example.

After the variables were analyzed, the last five years of Stack Overflow survey data were uploaded to GitHub.

First, we read the CSV files to compare the years and the information they contain. The files have many different columns, so we had to decide how to handle them, and we determined the variables we want to analyze:

"ResponseId", "Year", "MainBranch", "Age", "Age1stCode", "CompFreq", "CompTotal", "Country", "DevType", "EdLevel", "Employment", "Ethnicity", "Gender", "OpSys", "OrgSize", "Sexuality", "SOAccount", "SOComm", "SOPartFreq", "SOVisitFreq", "SurveyEase", "SurveyLength", "Trans", "YearsCode", "YearsCodePro", "mainbranch", "age", "agefirst", "Edlevel", "country", "LanguageDesireNextYear", "LanguageWorkedWith", "databaseworkedwith"
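A minimal sketch, assuming the list above is stored in a character vector, of how each year's frame can be restricted to just these variables; any_of() keeps only the names that actually exist in a given year (the vector is abridged here, and keep_selected is a hypothetical helper, not part of the original script):

library(dplyr)

selected_cols <- c("ResponseId", "Year", "MainBranch", "Age", "Country",
                   "DevType", "EdLevel", "Employment", "Gender", "OrgSize",
                   "YearsCode", "YearsCodePro", "LanguageWorkedWith")   # abridged

# Keep only the chosen columns that a given year actually contains
keep_selected <- function(df) select(df, any_of(selected_cols))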

After that point we reviewed the data frames by downloading the files and loading them into the system. We changed column names so the frames merge correctly, for example ResponseId versus Respondent; there were many differently named columns across the years.

Below, each year's file is read, tagged with its year, and its column names are stored in a vector so the years can be compared.

library(dplyr)
library(data.table)   # fwrite() is used further below

# 2021 -- every year's official download is named survey_results_public.csv,
# so each read below must point at that year's copy of the file
Stackof2021 <- read.csv("survey_results_public.csv")
Stackof2021["Year"] <- 2021
Stackof2021 <- Stackof2021 %>% select(ResponseId, Year, everything())
colnames2021 <- colnames(Stackof2021)

# 2020
Stackof2020 <- read.csv("survey_results_public.csv")
Stackof2020["Year"] <- 2020
Stackof2020 <- Stackof2020 %>% select(Respondent, Year, everything())
colnames(Stackof2020)[1] <- "ResponseId"   # align the respondent id with 2021
colnames2020 <- colnames(Stackof2020)

# 2019

Stackof2019 <- read.csv("survey_results_public.csv")
Stackof2019["Year"] <- 2019
Stackof2019 <- Stackof2019 %>% select(Respondent, Year, everything())
colnames(Stackof2019)[1] <- "ResponseId"
colnames2019<-colnames(Stackof2019)

# 2018

Stackof2018 <- read.csv("survey_results_public.csv")
Stackof2018["Year"] <- 2018
Stackof2018 <- Stackof2018 %>% select(Respondent, Year, everything())
colnames(Stackof2018)[1] <- "ResponseId"
colnames2018<-colnames(Stackof2018)

# 2017

Stackof2017 <- read.csv("survey_results_public.csv")
Stackof2017["Year"] <- 2017
Stackof2017 <- Stackof2017 %>% select(Respondent, Year, everything())
colnames(Stackof2017)[1] <- "ResponseId"
colnames2017<-colnames(Stackof2017)

# Columns shared by the 2018-2021 files
Reduce(intersect, list(colnames2021, colnames2020, colnames2019, colnames2018))

# Here we look at the column names that all five years have in common
Reduce(intersect, list(colnames2021, colnames2020, colnames2019, colnames2018, colnames2017))
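As a complementary check (not part of the original script), setdiff() shows which columns exist in one year but not another, for example the names that appear only in the 2021 or only in the 2017 file:

# Columns present in 2021 but missing from 2017, and vice versa
setdiff(colnames2021, colnames2017)
setdiff(colnames2017, colnames2021)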

Here we started changing and merging the data from the different years. Age, for example, was answered as a number in 2019 and 2020, so below it is recoded into interval labels.

# Breaks chosen so the numeric ages fall into the labelled intervals
# ([0,17] = "Under 18 years old", (17,24] = "18 - 24 years old", and so on)
age_breaks <- c(0, 17, 24, 34, 44, 54, 64, Inf)
age_labels <- c("Under 18 years old", "18 - 24 years old", "25 - 34 years old",
                "35 - 44 years old", "45 - 54 years old", "55 - 64 years old",
                "65 years or older")

Stackof2020$Age <- cut(Stackof2020$Age, age_breaks, age_labels, include.lowest = TRUE)
Stackof2019$Age <- cut(Stackof2019$Age, age_breaks, age_labels, include.lowest = TRUE)

Stackof2019$Age <- as.character(Stackof2019$Age)
Stackof2020$Age <- as.character(Stackof2020$Age)
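A quick sanity check, assuming the 2021 file already ships Age as these category strings: every recoded 2020 label should also occur verbatim in 2021 (this check is an addition, not in the original script).

sort(unique(Stackof2020$Age))
sort(unique(Stackof2021$Age))

# TRUE only if the recoded 2020 labels match the 2021 wording exactly
all(na.omit(unique(Stackof2020$Age)) %in% unique(Stackof2021$Age))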

# 2017 recorded satisfaction on a 0-10 scale; recode it to the labels used later
sat_breaks <- c(0, 1, 2, 4, 5, 6, 9, 10)
sat_labels <- c("Extremely dissatisfied", "Moderately dissatisfied", "Slightly dissatisfied",
                "Neither satisfied nor dissatisfied", "Slightly satisfied",
                "Moderately satisfied", "Extremely satisfied")

Stackof2017$CareerSatisfaction <- cut(Stackof2017$CareerSatisfaction, sat_breaks, sat_labels,
                                      include.lowest = TRUE)
Stackof2017$CareerSatisfaction <- as.character(Stackof2017$CareerSatisfaction)

# Compare the recoded 2017 values with the levels already used in 2018
Stackof2017 %>% dplyr::distinct(CareerSatisfaction)
Stackof2018 %>% dplyr::distinct(CareerSatisfaction)


Stackof2017$JobSatisfaction <- cut(Stackof2017$JobSatisfaction, sat_breaks, sat_labels,
                                   include.lowest = TRUE)
Stackof2017$JobSatisfaction <- as.character(Stackof2017$JobSatisfaction)

Here we changed the column types because of inconsistencies between the data frames; Salary, for example, is converted to numeric for 2017 and 2018.

Stackof2017$Salary<-as.numeric(Stackof2017$Salary)
Stackof2018$Salary<-as.numeric(Stackof2018$Salary)

Stackof2017 %>% dplyr::distinct(Salary)
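Before binding, a sketch like the following (an addition, not in the original script) can flag any remaining type mismatches by comparing the class of every shared column across the yearly frames:

frames <- list(`2017` = Stackof2017, `2018` = Stackof2018, `2019` = Stackof2019,
               `2020` = Stackof2020, `2021` = Stackof2021)
shared <- Reduce(intersect, lapply(frames, colnames))

# One row per shared column, one column per year, holding the column's class
classes <- sapply(frames, function(df) sapply(df[shared], function(x) class(x)[1]))

# Shared columns whose class still differs between at least two years
classes[apply(classes, 1, function(r) length(unique(r)) > 1), , drop = FALSE]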

After merging, we checked whether any column or row was left completely blank.

Stackofbinded <- bind_rows(Stackof2017, Stackof2018, Stackof2019, Stackof2020, Stackof2021)

# Drop columns that are NA for every row of the merged data
Stackofbinded <- Stackofbinded[, colSums(is.na(Stackofbinded)) != nrow(Stackofbinded)]
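The line above removes columns that are NA for every respondent; for the blank rows mentioned above, a similar check works (an addition, not in the original script):

# Drop rows where every column is NA
Stackofbinded <- Stackofbinded[rowSums(!is.na(Stackofbinded)) > 0, ]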

Then we write the merged data to disk as a backup and read it back in.

fwrite(Stackofbinded, file = "Stackofbinded.csv", sep = ",")

Stackofbinded <- read.csv("Stackofbinded.csv")

colnamesall <- colnames(Stackofbinded)

colnamesall

# Keep one example row per survey year to eyeball the merged structure
unique_rows <- !duplicated(Stackofbinded[c("Year")])
unique.df <- Stackofbinded[unique_rows, ]
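Beyond pulling one sample row per year, a simple count per Year (an extra check, not in the original script) confirms that all five survey years made it into the merged file:

Stackofbinded %>% dplyr::count(Year)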

And our data is ready to go.