stackover_full includes the last 5 years stack overflow survey results. This data consist of 342 variable and 387030 observations.
When the number of participants by years is examined
%>%group_by(Year)%>%count() stackover_full
## # A tibble: 5 x 2
## # Groups: Year [5]
## Year n
## <int> <int>
## 1 2017 51392
## 2 2018 98855
## 3 2019 88883
## 4 2020 64461
## 5 2021 83439
The survey participants have different countries.
That there are looking at the unique value of countries, same country name spelled differently such as “United States” vs “United States of America”. This type of data should be converted.
head(unique(stackover_full$Country),20)
## [1] "United States" "United Kingdom" "Switzerland" "New Zealand"
## [5] "Poland" "Colombia" "France" "Canada"
## [9] "Germany" "Greece" "Brazil" "Israel"
## [13] "Italy" "Belgium" "India" "Chile"
## [17] "Croatia" "Argentina" "Netherlands" "Denmark"
=stackover_full%>%
stackover_fullmutate(Country=case_when(Country=="United States of America" ~ "United States",
=="United Kingdom of Great Britain and Northern Ireland" ~"United Kingdom",
Country=="Viet Nam" ~"Vietnam",
Country=="Iran, Islamic Republic of..." ~"Iran",
Country=="Congo, Republic of the..." ~"Democratic Republic of the Congo",
Country=="Venezuela, Bolivarian Republic of..." ~"Venezuela",
Country=="Dominican Republic" ~"Dominica",
Country=="The former Yugoslav Republic of Macedonia" ~"Macedonia",
Country=="Libyan Arab Jamahiriya" ~"Libya",
Country=="Slovak Republic" ~"Slovakia",
Country=="Azerbaidjan" ~"Azerbaijan",
Country=="Tadjikistan" ~"Tajikistan",
Country=="Syrian Arab Republic" ~"Syria",
Country=="Democratic People's Republic of Korea" ~"North Korea",
Country=="Republic of Korea" ~"South Korea",
Country=="United Republic of Tanzania" ~"Tanzania",
Country=="Hong Kong (S.A.R.)" ~"Hong Kong",
Country=="" ~"Not Responding",
Country=="I prefer not to say" ~"Not Responding",
CountryTRUE ~Country))
After data cleaning according to Country,
Let see the distirbution participants.
=stackover_full%>%
countrydistgroup_by(Country) %>%
count()%>%
arrange(desc(Country))
countrydist
## # A tibble: 233 x 2
## # Groups: Country [233]
## Country n
## <chr> <int>
## 1 Zimbabwe 165
## 2 Zambia 74
## 3 Zaire 2
## 4 Yemen 62
## 5 Virgin Islands (USA) 5
## 6 Virgin Islands (British) 2
## 7 Vietnam 1433
## 8 Venezuela 444
## 9 Vatican City State 1
## 10 Uzbekistan 232
## # ... with 223 more rows
Some country participant count is very low, so these data’s are eliminated.
Countries with a below-average number of participants were excluded from the analysis.
So, mean of participant count is 1653.974. After elimination, countrydist_elimination
data frame have 44 countries that have above-average number of participants.
Most of the participants are from United States, India, Germany, United Kingdom and Canada.
=countrydist %>%
countrydist_eliminationmutate(avg=mean(countrydist$n)) %>%
filter(n>avg) %>%
select(Country,n)%>%
arrange(desc(n))
countrydist_elimination
## # A tibble: 44 x 2
## # Groups: Country [44]
## Country n
## <chr> <int>
## 1 United States 80470
## 2 India 46893
## 3 Germany 25983
## 4 United Kingdom 24724
## 5 Canada 14224
## 6 France 11309
## 7 Brazil 9302
## 8 Poland 8417
## 9 Russian Federation 8002
## 10 Australia 7688
## # ... with 34 more rows
Let see the graph number of participants by top 10 country.
=countrydist %>%
countrydist_graphmutate(avg=mean(countrydist$n)) %>%
filter(n>avg) %>%
select(Country,n)%>%
arrange(desc(n)) %>%
head(15)
ggplot(countrydist_graph,
aes(fill = Country,
area = n,
label = Country)) +
geom_treemap() +
geom_treemap_text(colour = "white",
place = "centre") +
labs(title = "Country Distribution") +
theme(legend.position = "none")
The survey participants have different branches. That there are looking at the unique value of branches:
unique(stackover_full$MainBranch)
## [1] "Student"
## [2] "Professional developer"
## [3] "Professional non-developer who sometimes writes code"
## [4] "Used to be a professional developer"
## [5] "None of these"
## [6] NA
## [7] "I am a student who is learning to code"
## [8] "I am not primarily a developer, but I write code sometimes as part of my work"
## [9] "I am a developer by profession"
## [10] "I code primarily as a hobby"
## [11] "I used to be a developer by profession, but no longer am"
## [12] ""
There are 6 unique branches. But since the response of the survey to this variable changes over the years, some values need to be converted. For example “I am a developer by profession” is the same as “Professional developer”.
11 different categories reduced to 5 categories by doing the following conversion.
=stackover_full%>%
stackover_fullmutate(MainBranch=case_when(MainBranch=="I am a developer by profession" ~ "Professional developer",
=="Used to be a professional developer" ~ "Professional developer",
MainBranch=="I used to be a developer by profession, but no longer am" ~ "Professional developer",
MainBranch=="I am a student who is learning to code" ~"Student",
MainBranch=="Professional non-developer who sometimes writes code" ~ "Hobby",
MainBranch=="I code primarily as a hobby" ~ "Hobby",
MainBranch=="I am not primarily a developer, but I write code sometimes as part of my work" ~"Sometimes Coding",
MainBranch=="" ~"None of these",
MainBranchis.na(MainBranch)==TRUE~"None of these",
TRUE ~MainBranch))
unique(stackover_full$MainBranch)
## [1] "Student" "Professional developer" "Hobby"
## [4] "None of these" "Sometimes Coding"
Looking at the distribution of categories over all participants:
Stackoverflow is used by professional developers by %54.
=stackover_full%>%
mainbranchdistgroup_by(MainBranch)%>%
count()%>%
mutate(perc_n=round(n/NROW(stackover_full)*100,2))%>%
arrange(desc(n))
mainbranchdist
## # A tibble: 5 x 3
## # Groups: MainBranch [5]
## MainBranch n perc_n
## <chr> <int> <dbl>
## 1 Professional developer 212143 54.8
## 2 None of these 101133 26.1
## 3 Student 38412 9.92
## 4 Sometimes Coding 19619 5.07
## 5 Hobby 15723 4.06
The branch of the participants’ distribution chart:
hchart(mainbranchdist,hcaes(x=MainBranch,y=perc_n),type="column",name="Percentage",color="#80FF40") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Distrubition of Participants",align="center") %>%
hc_add_theme(hc_theme_elementary())
In Turkey’s situation:
=stackover_full%>%filter(Country=='Turkey')%>%
mainbranchdist2group_by(MainBranch)%>%
count()%>%
mutate(perc_n=round(n/NROW(stackover_full%>%filter(Country=='Turkey'))*100,2))%>%
arrange(desc(n))
mainbranchdist2
## # A tibble: 5 x 3
## # Groups: MainBranch [5]
## MainBranch n perc_n
## <chr> <int> <dbl>
## 1 Professional developer 2250 54.3
## 2 None of these 1024 24.7
## 3 Student 533 12.8
## 4 Hobby 185 4.46
## 5 Sometimes Coding 155 3.74
The branch of the participants’ distribution chart in Turkey:
hchart(mainbranchdist2,hcaes(x=MainBranch,y=perc_n),type="column",name="Percentage",color="#00FFFF") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Distrubition of Participants in Turkey",align="center") %>%
hc_add_theme(hc_theme_elementary())
This site is used by people from different age groups. By looking at the age distribution of the participants, it can be determined which age range is used more actively.
unique(stackover_full$Age)
## [1] NA "25 - 34 years old" "35 - 44 years old"
## [4] "" "18 - 24 years old" "45 - 54 years old"
## [7] "55 - 64 years old" "Under 18 years old" "65 years or older"
## [10] "55 - 64 Years Old" "25-34 years old" "18-24 years old"
## [13] "35-44 years old" "Prefer not to say" "45-54 years old"
## [16] "55-64 years old"
Considering the age values, there is a need for regrouping.
=stackover_full%>%
stackover_fullmutate(Age=gsub(" years old", "", tolower(Age))) %>%
mutate(Age=gsub(" - ", "-", tolower(Age))) %>%
mutate(Age=case_when(Age=="" ~"Prefer not to say",Age=="prefer not to say" ~"Prefer not to say",TRUE ~Age))
unique(stackover_full$Age)
## [1] NA "25-34" "35-44"
## [4] "Prefer not to say" "18-24" "45-54"
## [7] "55-64" "under 18" "65 years or older"
When the age distribution by years is investigated:
=stackover_full %>%
agedistgroup_by(Year,Age) %>%
count() %>%
arrange(desc(Year,Age))
agedist
## # A tibble: 33 x 3
## # Groups: Year, Age [33]
## Year Age n
## <int> <chr> <int>
## 1 2021 18-24 20993
## 2 2021 25-34 32568
## 3 2021 35-44 15183
## 4 2021 45-54 5472
## 5 2021 55-64 1819
## 6 2021 65 years or older 421
## 7 2021 Prefer not to say 1607
## 8 2021 under 18 5376
## 9 2020 18-24 10026
## 10 2020 25-34 22352
## # ... with 23 more rows
Since the participants did not specify their ages in 2017, no inference can be made for this year.
When the graph is examined, it can be said that the 24-34 age range is actively using stackoverflow.
In recent years, it has been seen that the “18-24” and “under 18” age group uses stackoverflow more. In this way, it can be said that there is a tendency towards programming in the younger generation.
highchart() %>%
hc_add_series(agedist, type = "bar", hcaes(x = Year, group = Age, y = n)) %>%
hc_xAxis(categories = agedist$Age)
In Turkey, Age Distribution
=stackover_full %>%filter(Country=='Turkey')%>%
agedist_Turkeygroup_by(Year,Age) %>%
count() %>%
arrange(desc(Year,Age))
agedist_Turkey
## # A tibble: 31 x 3
## # Groups: Year, Age [31]
## Year Age n
## <int> <chr> <int>
## 1 2021 18-24 332
## 2 2021 25-34 438
## 3 2021 35-44 182
## 4 2021 45-54 20
## 5 2021 55-64 4
## 6 2021 Prefer not to say 9
## 7 2021 under 18 69
## 8 2020 18-24 95
## 9 2020 25-34 302
## 10 2020 35-44 76
## # ... with 21 more rows
Turkey’s age distribution graph:
highchart() %>%
hc_add_series(agedist_Turkey, type = "bar", hcaes(x = Year, group = Age, y = n)) %>%
hc_xAxis(categories = agedist_Turkey$Age)
When the education levels of the participants are examined, 21 different categories are in data. Actually, some categories are the same. But it seems different due to missing punctuation marks or spelling. For example “Master’s degree (MA, MS, M.Eng., MBA, etc.)” is the same “Master’s degree (M.A., M.S., M.Eng., MBA, etc.)”, only punctuation marks are different.
unique(stackover_full$EdLevel)
## [1] "Secondary school"
## [2] "Some college/university study without earning a bachelor's degree"
## [3] "Bachelor's degree"
## [4] "Doctoral degree"
## [5] "Master's degree"
## [6] "Professional degree"
## [7] "Primary/elementary school"
## [8] "I prefer not to answer"
## [9] "I never completed any formal education"
## [10] "Bachelor<U+0092>s degree (BA, BS, B.Eng., etc.)"
## [11] "Associate degree"
## [12] "Some college/university study without earning a degree"
## [13] "Master<U+0092>s degree (MA, MS, M.Eng., MBA, etc.)"
## [14] "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)"
## [15] ""
## [16] "Professional degree (JD, MD, etc.)"
## [17] "Other doctoral degree (Ph.D, Ed.D., etc.)"
## [18] "Master<U+0092>s degree (M.A., M.S., M.Eng., MBA, etc.)"
## [19] "Bachelor<U+0092>s degree (B.A., B.S., B.Eng., etc.)"
## [20] "Associate degree (A.A., A.S., etc.)"
## [21] "Other doctoral degree (Ph.D., Ed.D., etc.)"
## [22] "Something else"
This data should be rearrange .
After data cleaning process, 21 different education level responses were reduced to 13.
$EdLevel=gsub("\\([^)]*)", "",stackover_full$EdLevel)
stackover_full$EdLevel=gsub("degree ","degree",stackover_full$EdLevel)
stackover_full$EdLevel=gsub("’","'",stackover_full$EdLevel)
stackover_full$EdLevel=gsub("without earning a bachelor's degree","without earning a degree",stackover_full$EdLevel)
stackover_full$EdLevel=gsub("school ","school",stackover_full$EdLevel)
stackover_fullunique(stackover_full$EdLevel)
## [1] "Secondary school"
## [2] "Some college/university study without earning a degree"
## [3] "Bachelor's degree"
## [4] "Doctoral degree"
## [5] "Master's degree"
## [6] "Professional degree"
## [7] "Primary/elementary school"
## [8] "I prefer not to answer"
## [9] "I never completed any formal education"
## [10] "Bachelor<U+393C><U+3E32>s degree"
## [11] "Associate degree"
## [12] "Master<U+393C><U+3E32>s degree"
## [13] ""
## [14] "Other doctoral degree"
## [15] "Something else"
These values were re-categorized according to the level of education.
=stackover_full %>%
stackover_fullmutate(EdLevel_Bucket=case_when(EdLevel %in% c("Secondary school","Primary/elementary school","Something else")~1,
%in% c("Associate degre","Bachelor's degree","Bachelor's degree ")~2,
EdLevel %in% c("Doctoral degree","Master's degree","Other doctoral degree","Professional degree")~3,
EdLevel %in% c("I never completed any formal education")~4,
EdLevel %in% c("Some college/university study without earning a degree")~4,
EdLevel %in% c("Some college/university study without earning a bachelor's degree")~4,
EdLevel %in% c("I prefer not to answer","")~5,
EdLevel TRUE ~6)) %>%
mutate(EdLevel_Bucket=ifelse(str_detect(EdLevel,"^Bachelor"),2,ifelse(str_detect(EdLevel,"^Associate"),2,EdLevel_Bucket)))
=stackover_full%>%group_by(Year,EdLevel_Bucket)%>%count() edleveldist
When the graph is examined, Stack Overflow mostly is used by people that have Associate degree or Bachelor’s degree.
highchart() %>%
hc_add_series(edleveldist, type = "bar", hcaes(x = Year, group = EdLevel_Bucket, y = n)) %>%
hc_xAxis(categories = edleveldist$EdLevel_Bucket)
In Turkey, Education Level Distribution: When the graph is examined, Turkey distribution is as same as global’s.
=stackover_full%>%filter(Country=='Turkey')%>%group_by(Year,EdLevel_Bucket)%>%count()
edleveldist_turkeyhighchart() %>%
hc_add_series(edleveldist_turkey, type = "bar", hcaes(x = Year, group = EdLevel_Bucket, y = n)) %>%
hc_xAxis(categories = edleveldist_turkey$EdLevel_Bucket)
In this section we will analyze respondents’ answers to questions in regards to database environments they have worked with and database environments that they wish to work. We wish to observe database trends and see the top database environments preferred in the last 5 years. Respondents’ answers to these questions have been kept in the columns specified below. Respondents were able to give more than one answer for each question.
Question: Which database environments have you done extensive development work in over the past year?
Related column: DatabaseWorkedWith
Question: Which database environments do you want to work in over the next year?
Related column: DatabaseDesireNextYear
We display values of the columns we are interested in to see if we need to make any further pre-processing.
Column: DatabaseWorkedWith
head(unique(stackover_full$DatabaseWorkedWith))
## [1] ""
## [2] "MySQL; SQLite"
## [3] "MySQL"
## [4] "MongoDB; Redis; SQL Server; MySQL; SQLite"
## [5] "SQLite"
## [6] "Redis; MySQL; PostgreSQL"
Column: DatabaseDesireNextYear
head(unique(stackover_full$DatabaseDesireNextYear))
## [1] ""
## [2] "MySQL; SQLite"
## [3] "MongoDB; Redis; SQL Server; MySQL; SQLite"
## [4] "MongoDB; SQL Server; PostgreSQL; SQLite"
## [5] "Cassandra; Redis"
## [6] "SQL Server"
We notice that there is a need to do some coding to split the data and trim some values due to redundant spacings.
In visualization below we see top 5 database environments that respondents have used most in the last 5 years worldwide.
%>%
stackover_full filter(!is.na(DatabaseWorkedWith)) %>%
mutate(DatabaseWorkedWith= trim(str_split(toupper(DatabaseWorkedWith),pattern=';'))) %>%
unnest(DatabaseWorkedWith) %>%
group_by(DatabaseWorkedWith) %>%
count() %>%
arrange(desc(n)) %>%
head(n=5) %>%
ggplot(aes(x= reorder(DatabaseWorkedWith,n),y=n,fill=DatabaseWorkedWith)) +
geom_bar(stat='identity') + coord_flip() +theme_classic() +
labs(title='Top 5 DBs respondents worked (2017-2021) - WW',y='frequency',x=NULL) + theme(legend.position ='none')
We were curious to see if top DB environments repondents in Turkey worked differ from preferences worldwide.
%>%
stackover_full filter(!is.na(DatabaseWorkedWith)) %>%
filter(Country == "Turkey") %>%
mutate(DatabaseWorkedWith= trim(str_split(toupper(DatabaseWorkedWith),pattern=';'))) %>%
unnest(DatabaseWorkedWith) %>%
group_by(DatabaseWorkedWith) %>%
count() %>%
arrange(desc(n)) %>%
head(n=5) %>%
ggplot(aes(x= reorder(DatabaseWorkedWith,n),y=n,fill=DatabaseWorkedWith)) +
geom_bar(stat='identity') + coord_flip() +theme_classic() +
labs(title='Top 5 DBs respondents worked (2017-2021) - TR',y='frequency',x=NULL) + theme(legend.position ='none')
We see that top 5 database environments that attendants worked with in Turkey are in line with what’s popular globally.
1. MYSQL
2. POSTGRESQL
3. SQLITE
4. MONGODB
5. MICROSOFT SQL SERVER
In visualization below we can observe popularity of top 3 database environments over time. We notice that MYSQL has been always at the top of the list over the last 5 years.
<- stackover_full %>%
db_use_2017 filter(!is.na(DatabaseWorkedWith)) %>%
filter(Year==2017) %>%
mutate(DatabaseWorkedWith= trim(str_split(toupper(DatabaseWorkedWith),pattern=';'))) %>%
unnest(DatabaseWorkedWith) %>%
group_by(DatabaseWorkedWith,Year) %>%
%>%
count arrange(desc(n)) %>%
head(n=3)
<- stackover_full %>%
db_use_2018 filter(!is.na(DatabaseWorkedWith)) %>%
filter(Year==2018) %>%
mutate(DatabaseWorkedWith= trim(str_split(toupper(DatabaseWorkedWith),pattern=';'))) %>%
unnest(DatabaseWorkedWith) %>%
group_by(DatabaseWorkedWith,Year) %>%
%>%
count arrange(desc(n)) %>%
head(n=3)
<- stackover_full %>%
db_use_2019 filter(!is.na(DatabaseWorkedWith)) %>%
filter(Year==2019) %>%
mutate(DatabaseWorkedWith= trim(str_split(toupper(DatabaseWorkedWith),pattern=';'))) %>%
unnest(DatabaseWorkedWith) %>%
group_by(DatabaseWorkedWith,Year) %>%
%>%
count arrange(desc(n)) %>%
head(n=3)
<- stackover_full %>%
db_use_2020 filter(!is.na(DatabaseWorkedWith)) %>%
filter(Year==2020) %>%
mutate(DatabaseWorkedWith= trim(str_split(toupper(DatabaseWorkedWith),pattern=';'))) %>%
unnest(DatabaseWorkedWith) %>%
group_by(DatabaseWorkedWith,Year) %>%
%>%
count arrange(desc(n)) %>%
head(n=3)
<- stackover_full %>%
db_use_2021 filter(!is.na(DatabaseWorkedWith)) %>%
filter(Year==2021) %>%
mutate(DatabaseWorkedWith= trim(str_split(toupper(DatabaseWorkedWith),pattern=';'))) %>%
unnest(DatabaseWorkedWith) %>%
group_by(DatabaseWorkedWith,Year) %>%
%>%
count arrange(desc(n)) %>%
head(n=3)
= rbind(db_use_2017,db_use_2018,db_use_2019,db_use_2020,db_use_2021)
db_use_all
%>%
db_use_all ggplot(aes(x = Year, ,y=n, fill = DatabaseWorkedWith)) + geom_col(position=position_dodge()) + scale_fill_discrete()
In below graph we analyze the most recent data we have from survey that has been conducted in 2021, based on respondents’ answers on which DB environment they wish to work next year, in 2022. You can see top 5 environments desired. According to the survey results, POSTGRESQL may be more popular than MYSQL in 2022.
%>%
stackover_full filter(!is.na(DatabaseDesireNextYear)) %>%
filter(Year==2021) %>%
mutate(DatabaseDesireNextYear= trim(str_split(toupper(DatabaseDesireNextYear),pattern=';'))) %>%
unnest(DatabaseDesireNextYear) %>%
group_by(DatabaseDesireNextYear,Year) %>%
%>%
count arrange(desc(n)) %>%
head(n=5) %>%
ggplot(aes(x=DatabaseDesireNextYear,y=n, fill = DatabaseDesireNextYear)) + geom_col(position=position_dodge()) + scale_fill_discrete()
This part of the Stack OverFlow Survey Analysis, The languages worked and wanted to be learned were examined.
Firstly, we made data preprocessing 3 parts. 1. LanguageWorked and LanguageDesire columns were splitted. 2. Spaces in values are removed. 3. Blank (Nas) valus are removed.
=stackover_full %>%
Main_Laguagefilter(!is.na(LanguageWorkedWith)) %>%
mutate(LanguageWorked = str_split(LanguageWorkedWith,pattern=';')) %>%
select(ResponseId,LanguageWorked)%>%
unnest(LanguageWorked) %>%
mutate(LanguageWorked=gsub("\\([^)]*)", "",LanguageWorked))%>%
filter(LanguageWorked !="")
=stackover_full %>%
Desired_Languagefilter(!is.na(LanguageDesireNextYear)) %>%
mutate(LanguageDesire = str_split(LanguageDesireNextYear,pattern=';')) %>%
select(ResponseId,LanguageDesire)%>%
unnest(LanguageDesire) %>%
mutate(LanguageDesire=gsub("\\([^)]*)", "",LanguageDesire))%>%
filter(LanguageDesire !="")
Grouping by Language and showing them as Percantage:
= Main_Laguage %>%mutate(Language = toupper(str_trim(LanguageWorked))) %>%
Main_Laguage group_by(Language) %>%
summarize(Perc_worked=round(n()/nrow(stackover_full)*100,2)) %>%
arrange(desc(Perc_worked))
= Desired_Language%>%mutate(Language = toupper(str_trim(LanguageDesire))) %>%
Desired_Language group_by(Language) %>%
summarize(Perc_desire=round(n()/nrow(stackover_full)*100,2)) %>%
arrange(desc(Perc_desire))
Merging Top20 desaired and main laguage columns;
=merge(x=Main_Laguage,y=Desired_Language,by="Language", all=TRUE )%>%arrange(desc(Perc_worked))%>% head(21)
final_language final_language
## Language Perc_worked Perc_desire
## 1 JAVASCRIPT 59.22 42.05
## 2 SQL 46.82 30.85
## 3 PYTHON 37.10 37.98
## 4 JAVA 35.71 22.36
## 5 HTML/CSS 35.63 22.42
## 6 C# 27.79 21.89
## 7 PHP 23.39 11.99
## 8 C++ 21.27 16.74
## 9 TYPESCRIPT 19.41 23.35
## 10 C 18.82 10.88
## 11 HTML 13.86 8.21
## 12 BASH/SHELL 13.84 8.99
## 13 BASH/SHELL/POWERSHELL 13.17 8.53
## 14 CSS 13.17 7.80
## 15 RUBY 7.28 7.28
## 16 NODE.JS 7.23 6.23
## 17 GO 7.03 17.25
## 18 SWIFT 5.69 9.05
## 19 ASSEMBLY 5.58 4.34
## 20 KOTLIN 5.29 11.76
## 21 R 4.90 6.38
As a final, visualizing it with pyramid;
=max(nchar(final_language$Language))
len_language
par(mar = pyramid.plot(final_language$Perc_worked, final_language$Perc_desire, labels =tolower(final_language$Language),
top.labels=c("Language Worked","","Language Desired"), main = "Language Worked with and will desire to work next",
gap=len_language, show.values = T))
## 60 60
In this part of study, we examined the salaries and programming language relations.
Because of the size of our Data, I started to select the columns that we need. After selecting them it is always good to quick look at data with head and summary function.
<-stackover_full[,c('Year','Salary','LanguageWorkedWith','Country','MainBranch','CompTotal','CompFreq','SalaryType','ConvertedComp', 'Currency')]
SalaryofStack
head(SalaryofStack)
## Year Salary LanguageWorkedWith Country
## 1 2017 NA Swift United States
## 2 2017 NA JavaScript; Python; Ruby; SQL United Kingdom
## 3 2017 113750 Java; PHP; Python United Kingdom
## 4 2017 NA Matlab; Python; R; SQL United States
## 5 2017 NA Switzerland
## 6 2017 NA JavaScript; PHP; Rust New Zealand
## MainBranch CompTotal CompFreq SalaryType ConvertedComp
## 1 Student NA <NA> <NA> NA
## 2 Student NA <NA> <NA> NA
## 3 Professional developer NA <NA> <NA> NA
## 4 Hobby NA <NA> <NA> NA
## 5 Professional developer NA <NA> <NA> NA
## 6 Student NA <NA> <NA> NA
## Currency
## 1
## 2 British pounds sterling (<U+00A3>)
## 3 British pounds sterling (<U+00A3>)
## 4
## 5
## 6
summary(SalaryofStack)
## Year Salary LanguageWorkedWith Country
## Min. :2017 Min. : 0.000e+00 Length:387030 Length:387030
## 1st Qu.:2018 1st Qu.: 1.320e+04 Class :character Class :character
## Median :2019 Median : 5.000e+04 Mode :character Mode :character
## Mean :2019 Mean :1.582e+100
## 3rd Qu.:2020 3rd Qu.: 9.950e+04
## Max. :2021 Max. :1.000e+105
## NA's :323812
## MainBranch CompTotal CompFreq SalaryType
## Length:387030 Min. : 0.000e+00 Length:387030 Length:387030
## Class :character 1st Qu.: 1.850e+04 Class :character Class :character
## Mode :character Median : 6.500e+04 Mode :character Mode :character
## Mean :8.054e+241
## 3rd Qu.: 1.280e+05
## Max. :1.111e+247
## NA's :249076
## ConvertedComp Currency
## Min. : 0 Length:387030
## 1st Qu.: 25356 Class :character
## Median : 55562 Mode :character
## Mean : 112456
## 3rd Qu.: 97288
## Max. :45241312
## NA's :201905
After selecting columns, I started to clean and merge columns because although I had corrected the data while merging there were still different and problematic columns in our data frame.
Also, I added to the tables the average currency of 2017 because there was no column for converted salary currency for 2017. Later I will come back to this.
<- SalaryofStack %>%
SalaryofStack2 mutate(Freqnew = coalesce(CompFreq,SalaryType)) %>% mutate(Freqnew = ifelse(CompFreq == '' ,SalaryType,Freqnew))
$ConvertedComp <- as.numeric(SalaryofStack2$ConvertedComp)
SalaryofStack2
#write_clip(SalaryofStack2 %>% distinct(ConvertedComp, Year))
=read.csv(file = 'https://raw.githubusercontent.com/pjournal/mef05g-rninjas/gh-pages/average_currency_2017.csv',stringsAsFactors = FALSE, header = TRUE,sep = ";", encoding="UTF-8")
Currency2017<-left_join(SalaryofStack2, Currency2017, by =c("Currency"="CurrencyType"))
SalaryofStack3
describe(SalaryofStack3)
## vars n mean sd median trimmed
## Year 1 387030 2.019080e+03 1.350000e+00 2019 2019.10
## Salary 2 63218 1.581828e+100 3.977220e+102 50000 57002.76
## LanguageWorkedWith* 3 387030 4.929239e+04 3.021072e+04 47566 50722.15
## Country* 4 387030 1.369000e+02 7.238000e+01 144 140.42
## MainBranch* 5 387030 2.910000e+00 9.300000e-01 3 2.81
## CompTotal 6 137954 8.054215e+241 Inf 65000 79908.26
## CompFreq* 7 236783 2.330000e+00 1.250000e+00 2 2.29
## SalaryType* 8 98855 2.000000e+00 1.190000e+00 2 1.87
## ConvertedComp 9 185125 1.124556e+05 3.392271e+05 55562 61842.67
## Currency* 10 233686 5.497000e+01 5.835000e+01 50 49.30
## Freqnew* 11 155487 2.030000e+00 9.800000e-01 2 2.04
## Currency2017 12 43563 1.630000e+00 2.850000e+00 1 0.96
## mad min max range skew kurtosis
## Year 1.48 2017.00 2.021000e+03 4.000000e+00 0.06 -1.20
## Salary 59487.04 0.00 1.000000e+105 1.000000e+105 NaN NaN
## LanguageWorkedWith* 45395.73 1.00 8.847800e+04 8.847700e+04 -0.27 -1.29
## Country* 103.78 1.00 2.330000e+02 2.320000e+02 -0.11 -1.49
## MainBranch* 0.00 1.00 5.000000e+00 4.000000e+00 0.62 0.62
## CompTotal 78577.80 0.00 1.111111e+247 1.111111e+247 NaN NaN
## CompFreq* 1.48 1.00 4.000000e+00 3.000000e+00 0.36 -1.52
## SalaryType* 1.48 1.00 4.000000e+00 3.000000e+00 0.82 -0.92
## ConvertedComp 50903.59 0.00 4.524131e+07 4.524131e+07 30.27 2494.12
## Currency* 72.65 1.00 1.680000e+02 1.670000e+02 0.70 -1.03
## Freqnew* 1.48 1.00 3.000000e+00 2.000000e+00 -0.06 -1.96
## Currency2017 0.00 0.02 1.891000e+01 1.889000e+01 4.43 20.64
## se
## Year 0.000000e+00
## Salary 1.581828e+100
## LanguageWorkedWith* 4.856000e+01
## Country* 1.200000e-01
## MainBranch* 0.000000e+00
## CompTotal Inf
## CompFreq* 0.000000e+00
## SalaryType* 0.000000e+00
## ConvertedComp 7.884200e+02
## Currency* 1.200000e-01
## Freqnew* 0.000000e+00
## Currency2017 1.000000e-02
Unfortunately, the calculations with currency do not show reliable results for 2017 when we compare the distribution of other years. Firstly I started with multiplying the salary with the currency mean of 2017.
<- SalaryofStack3 %>% mutate(Finalsalary = ifelse((Year==2017), (Salary)*(Currency2017), ConvertedComp))
SalaryofStack4
<- select(SalaryofStack4, c(-CompFreq, -SalaryType, -Currency2017, -CompTotal, -Currency, -Salary , -Freqnew ))
SalaryofStack4
<- select(SalaryofStack4, c(-ConvertedComp))
SalaryofStack4
str(SalaryofStack4)
## 'data.frame': 387030 obs. of 5 variables:
## $ Year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ LanguageWorkedWith: chr "Swift" "JavaScript; Python; Ruby; SQL" "Java; PHP; Python" "Matlab; Python; R; SQL" ...
## $ Country : chr "United States" "United Kingdom" "United Kingdom" "United States" ...
## $ MainBranch : chr "Student" "Student" "Professional developer" "Hobby" ...
## $ Finalsalary : num NA NA NA NA NA NA NA NA NA NA ...
<-SalaryofStack4 %>% filter(!is.na(Finalsalary))
SalaryofStack4
<-SalaryofStack4 %>% filter(between(Finalsalary, quantile(Finalsalary, 0.05), quantile(Finalsalary, 0.95)))
SalaryofStack5
hist(SalaryofStack5$Finalsalary)
$Finalsalary<-as.numeric(SalaryofStack5$Finalsalary)
SalaryofStack5
<-SalaryofStack5 %>% filter(Year==2017)
SalaryofStack5plot2017<-SalaryofStack5 %>% filter(Year==2018)
SalaryofStack5plot2018<-SalaryofStack5 %>% filter(Year==2019)
SalaryofStack5plot2019plot(density(SalaryofStack5plot2017$Finalsalary),col='Blue')
lines(density(SalaryofStack5plot2018$Finalsalary),col='Red')
lines(density(SalaryofStack5plot2019$Finalsalary),col='Black')
lines(density(SalaryofStack5$Finalsalary),col='Orange')
title(sub="Year Distribution - Blue is 2017")
Unfortunately, again the method does not seem correct when we looked at the distribution. Therefore just looked at the salary data.
<- SalaryofStack3 %>% mutate(Finalsalary = ifelse((Year==2017), (Salary), ConvertedComp))
SalaryofStack4
<- select(SalaryofStack4, c(-CompFreq, -SalaryType, -Currency2017, -CompTotal, -Currency, -Salary , -Freqnew ))
SalaryofStack4
<- select(SalaryofStack4, c(-ConvertedComp))
SalaryofStack4
str(SalaryofStack4)
## 'data.frame': 387030 obs. of 5 variables:
## $ Year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ LanguageWorkedWith: chr "Swift" "JavaScript; Python; Ruby; SQL" "Java; PHP; Python" "Matlab; Python; R; SQL" ...
## $ Country : chr "United States" "United Kingdom" "United Kingdom" "United States" ...
## $ MainBranch : chr "Student" "Student" "Professional developer" "Hobby" ...
## $ Finalsalary : num NA NA 113750 NA NA ...
<-SalaryofStack4 %>% filter(!is.na(Finalsalary))
SalaryofStack4
<-SalaryofStack4 %>% filter(between(Finalsalary, quantile(Finalsalary, 0.04), quantile(Finalsalary, 0.96)))
SalaryofStack5
hist(SalaryofStack5$Finalsalary)
$Finalsalary<-as.numeric(SalaryofStack5$Finalsalary)
SalaryofStack5
<-SalaryofStack5 %>% filter(Year==2017)
SalaryofStack5plot2017<-SalaryofStack5 %>% filter(Year==2018)
SalaryofStack5plot2018<-SalaryofStack5 %>% filter(Year==2019)
SalaryofStack5plot2019plot(density(SalaryofStack5plot2017$Finalsalary),col='Blue')
lines(density(SalaryofStack5plot2018$Finalsalary),col='Red')
lines(density(SalaryofStack5plot2019$Finalsalary),col='Black')
title(sub="Year Distribution - Blue is 2017")
If I do not multiply and just use the Salary column then the results are still cannot be dependable. As you can see from above and below two of the table blue one is show the salary distribution for 2017 and they are not compatible with other years and total.
Then, we changed the range of salary because we noticed that there is some missing and misleading answers in our data frame.
Due to lack of data information when we observed the distribution for 2017 cannot be reliable.
Hence we eliminated 2017’s salary data.
<-SalaryofStack5%>%filter(Year!=2017)
SalaryofStack5
<-SalaryofStack5%>%filter(Country=="Turkey")
SalaryofStack5Turkey
<-SalaryofStack5%>%filter(Country!="Turkey")
SalaryofStack5notTurkey
<-SalaryofStack5%>%filter(Country=="United States")
SalaryofStack5USA
<-SalaryofStack5%>%filter(Country=="Sweden") SalaryofStack5Sweeden
And group them for some countries.
Here we are looked at our data quickly to understand the frame.
plot(density(SalaryofStack5Turkey$Finalsalary),col='red')
lines(density(SalaryofStack5notTurkey$Finalsalary),col='blue')
$Cntry <- 'TR'
SalaryofStack5Turkey$Cntry <- 'Other'
SalaryofStack5notTurkey$Cntry<-'USA'
SalaryofStack5USA$Cntry<-'Sweden'
SalaryofStack5Sweeden
<- rbind(SalaryofStack5Turkey, SalaryofStack5notTurkey)
compareturkey options(scipen = 5)
ggplot(compareturkey, aes(Finalsalary, fill = Cntry)) + geom_density(alpha = 0.6)+
scale_x_continuous(limits = c(10000, 200000)) +
labs(
caption = "density summary" )
Here is our result distribution that shows Turkey and other countries’ differences. When it comes to Salary Turkey is observed as below the world average. While Turkey’s salaries are intense in 12-33k intervals world is between 27 and 91 k.
Here are some other countries comparison from world and summary tables:
<- rbind(SalaryofStack5Turkey, SalaryofStack5notTurkey, SalaryofStack5USA, SalaryofStack5Sweeden)
compareturkey
options(scipen = 5)
ggplot(compareturkey, aes(Finalsalary, fill = Cntry)) + geom_density(alpha = 0.5) +
scale_x_continuous(limits = c(10000, 250000)) +
labs(
caption = "density summary" )
<-summary(SalaryofStack5notTurkey$Finalsalary)
sumother<-summary(SalaryofStack5Turkey$Finalsalary)
sumtr<-summary(SalaryofStack5USA$Finalsalary)
sumusa<-summary(SalaryofStack5Sweeden$Finalsalary)
sumsweeden
print("Other")
## [1] "Other"
sumother
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4020 27923 55776 68263 91788 400000
print("TR")
## [1] "TR"
sumtr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4380 12816 20688 28414 33114 381468
print("USA")
## [1] "USA"
sumusa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4050 79000 107500 116885 143000 400000
print("Sweeden")
## [1] "Sweeden"
sumsweeden
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4367 46893 56340 60512 67501 331400
Here we compared different programs and salaries and started to look and change our data.
Because of the fact that one programmer knows more than one language, I separated these persons depending on the language. It is assumed that languages that are known are the main cause of salaries.
It is always good to look at the histogram of data to see the distribution.
library(splitstackshape)
<-SalaryofStack5 %>% filter(!is.na(LanguageWorkedWith))
SalaryofStack5
<- trim(cSplit(SalaryofStack5, "LanguageWorkedWith", sep = ";", direction = "long"))
SalaryofStack5language
<- SalaryofStack5language%>% mutate(LanguageWorkedWith=ifelse(LanguageWorkedWith=='Bash/Shell', 'Bash.Shell.PowerShell',LanguageWorkedWith))%>% mutate(LanguageWorkedWith=ifelse(LanguageWorkedWith=='Bash/Shell/PowerShell', 'Bash.Shell.PowerShell',LanguageWorkedWith))
SalaryofStack5language
<- SalaryofStack5language%>% filter(Country=="Turkey") SalaryofStack5languageTurkey
Because of the fast increase in USD-TL currency rate in 2018, there is a sharp decrease in salaries in Turkey for 2019. In 2021 the salaries are caching the previous year’s salary. But unfortunately, one of the most decreases in the value of TL has recently occurred in late 2021. Thus, again another decrease may be observed in the data of next year.
In line with the MEF Master’s program, I focused on Bash, R, and Python Languages and according to our studies Bash.Shell.Powershell is the winner of the comparison. Lack of R users shows fluctuation in salaries therefore it is hard to predict the salary power of R in Turkey. Python seems to more preferable in Turkey.
<-aggregate(SalaryofStack5languageTurkey[, Finalsalary], list(SalaryofStack5languageTurkey$LanguageWorkedWith), mean)
aggregatedlanguageturkey
<-SalaryofStack5languageTurkey %>% count(LanguageWorkedWith, sort=TRUE)
aggregatedlanguagecountturkey
<-left_join(aggregatedlanguagecountturkey,aggregatedlanguageturkey, by=c("LanguageWorkedWith"="Group.1"))
joinedlanguagecountandsalaryturkey
<-aggregate(SalaryofStack5languageTurkey[, c('Finalsalary')], list(SalaryofStack5languageTurkey$LanguageWorkedWith,SalaryofStack5languageTurkey$Year), mean)
aggregatedlanguageyearturkey
<- filter(aggregatedlanguageyearturkey, Group.1 %in% c('SQL','C#', 'R', 'C+', 'Java', 'Python', 'Bash.Shell.PowerShell'))
aggregatedlanguageyearfilterturkey
= ggplot(data=aggregatedlanguageyearfilterturkey, aes(x = Group.2))+
plotsmoothbylanguagepopulartr geom_smooth(aes(y = Finalsalary, color=as.character(Group.1)))+
scale_y_continuous(limits = c(10000, 90000))
plotsmoothbylanguagepopulartr
Now here, I also analyzed the data for the world because of the fact that there are more data that increase the dependability.
<-SalaryofStack5 %>% filter(!is.na(LanguageWorkedWith))
SalaryofStack5
<-aggregate(SalaryofStack5language[, Finalsalary], list(SalaryofStack5language$Country), mean)
aggregatedcountry
<-aggregate(SalaryofStack5language[, Finalsalary], list(SalaryofStack5language$LanguageWorkedWith), mean)
aggregatedlanguage<-SalaryofStack5language %>% count(LanguageWorkedWith, sort=TRUE)
aggregatedlanguagecount
<-(count(SalaryofStack5language, Country)%>%filter(n>300))
CNTR<- SalaryofStack5language %>% filter(Country==CNTR$Country)
SalaryofStack5languagecntr
<-aggregate(SalaryofStack5languagecntr[, Finalsalary], list(SalaryofStack5languagecntr$Country), mean)
aggregatedcountrYfianal
<-aggregate(SalaryofStack5language[, c('Finalsalary')], list(SalaryofStack5language$LanguageWorkedWith,SalaryofStack5language$Year), mean)
aggregatedlanguageyear
<-left_join(aggregatedlanguagecount,aggregatedlanguage, by=c("LanguageWorkedWith"="Group.1"))
joinedlanguagecountandsalary
#These are the most popular languages and we choose them to show in our smooth graph.
<- filter(aggregatedlanguageyear, Group.1 %in% c('SQL','C#', 'R', 'C+', 'Java', 'Rust', 'Python', 'Ruby', 'Go')) aggregatedlanguageyearfilter
= ggplot(data=aggregatedlanguageyearfilter, aes(x = Group.2))+
plotsmoothbylanguage geom_smooth(aes(y = Finalsalary, color=as.character(Group.1)))+
scale_y_continuous(limits = c(20000, 100000))
plotsmoothbylanguage
And lastly here are the codes that show the R Python, and Bash trend in recent years and a comparison for Turkey and World average.
<-filter(aggregatedlanguageyear, Group.1 %in% c("R", "Python", "Bash.Shell.PowerShell"))
RvsPythonvsBash
= ggplot(data=RvsPythonvsBash, aes(x = Group.2))+
RvsPythonvsBashpl geom_smooth(aes(y = Finalsalary, color=as.character(Group.1)))+
scale_y_continuous(limits = c(10000, 90000))
RvsPythonvsBashpl
<- filter(aggregatedlanguageyearfilterturkey, Group.1 %in% c("R", "Python", "Bash.Shell.PowerShell"))
RvsPythonvsBashtr
= ggplot(data=RvsPythonvsBashtr, aes(x = Group.2))+
RvsPythonvsBashtrpl geom_smooth(aes(y = Finalsalary, color=as.character(Group.1)))+
scale_y_continuous(limits = c(10000, 90000))
RvsPythonvsBashtrpl
This is the summary table of language popularity and salary mean in the world.
colnames(joinedlanguagecountandsalary)[2]<-"Popularity"
colnames(joinedlanguagecountandsalary)[3]<-"Salarymean"
joinedlanguagecountandsalary
## LanguageWorkedWith Popularity Salarymean
## 1: JavaScript 117802 66826.45
## 2: SQL 94637 67638.56
## 3: HTML/CSS 75284 65709.81
## 4: Python 68291 72674.63
## 5: Bash.Shell.PowerShell 63688 77876.20
## 6: Java 62619 65812.54
## 7: C# 55054 67776.55
## 8: TypeScript 45970 70445.32
## 9: PHP 41291 54252.80
## 10: C++ 32565 68485.41
## 11: HTML 29192 64552.90
## 12: CSS 27946 64236.07
## 13: C 27176 66728.40
## 14: Go 15899 88125.83
## 15: Ruby 15775 83792.09
## 16: Node.js 15520 69491.79
## 17: Kotlin 11499 67868.40
## 18: Swift 11004 72155.14
## 19: R 9019 72092.89
## 20: VBA 8631 64009.06
## 21: Objective-C 8330 76506.47
## 22: Assembly 7318 68419.79
## 23: Scala 6873 88505.13
## 24: Rust 6819 88018.26
## 25: PowerShell 5208 78610.54
## 26: Other(s): 4504 73026.90
## 27: Dart 4081 53631.01
## 28: Perl 4043 86117.73
## 29: Groovy 3865 84529.27
## 30: Matlab 3531 57844.66
## 31: VB.NET 2727 60814.22
## 32: Clojure 2426 97417.12
## 33: Haskell 2234 75775.44
## 34: Elixir 1766 88175.37
## 35: F# 1731 89884.60
## 36: CoffeeScript 1640 76239.11
## 37: Lua 1438 71582.51
## 38: Visual Basic 6 1414 58467.32
## 39: Erlang 1349 85019.62
## 40: Delphi 989 60054.49
## 41: Delphi/Object Pascal 984 58305.77
## 42: Julia 846 75958.85
## 43: WebAssembly 508 85174.32
## 44: LISP 484 88985.20
## 45: Cobol 269 66765.23
## 46: Crystal 226 86323.80
## 47: Ocaml 207 79481.40
## 48: COBOL 206 65016.13
## 49: APL 136 99375.16
## 50: Hack 94 90458.07
## LanguageWorkedWith Popularity Salarymean
This is the summary table of language popularity and salary mean in Turkey.
colnames(joinedlanguagecountandsalaryturkey)[2]<-"Popularity"
colnames(joinedlanguagecountandsalaryturkey)[3]<-"Salarymean"
joinedlanguagecountandsalaryturkey
## LanguageWorkedWith Popularity Salarymean
## 1: JavaScript 1208 28905.71
## 2: SQL 947 29178.99
## 3: Java 776 28440.70
## 4: HTML/CSS 770 27128.99
## 5: C# 763 28964.94
## 6: Python 605 28569.65
## 7: PHP 477 27118.58
## 8: TypeScript 404 30157.50
## 9: Bash.Shell.PowerShell 400 33211.27
## 10: C++ 392 26902.76
## 11: C 354 28874.39
## 12: HTML 216 32096.10
## 13: CSS 202 32446.61
## 14: Node.js 187 37128.46
## 15: Swift 171 28023.57
## 16: Go 147 36519.37
## 17: Kotlin 136 28897.93
## 18: Objective-C 119 34511.23
## 19: Assembly 106 30348.26
## 20: Dart 85 25966.73
## 21: Ruby 80 34374.11
## 22: R 63 27843.52
## 23: Matlab 61 29383.56
## 24: VBA 59 29819.02
## 25: Scala 54 28292.89
## 26: PowerShell 38 39685.50
## 27: Other(s): 33 24224.09
## 28: Delphi/Object Pascal 29 37093.31
## 29: Rust 28 26838.75
## 30: VB.NET 27 33991.74
## 31: Delphi 26 30476.88
## 32: Perl 25 60944.24
## 33: Groovy 21 57199.24
## 34: Visual Basic 6 19 34337.26
## 35: CoffeeScript 11 48831.09
## 36: Elixir 9 27928.00
## 37: Erlang 8 27076.25
## 38: Haskell 7 36338.86
## 39: LISP 7 28532.00
## 40: Cobol 6 57510.00
## 41: F# 6 27086.00
## 42: Julia 6 29179.67
## 43: Clojure 5 59613.60
## 44: Lua 4 91491.00
## 45: WebAssembly 4 14649.00
## 46: COBOL 3 30025.00
## 47: APL 2 21822.00
## 48: Crystal 2 22686.00
## 49: Hack 1 23844.00
## LanguageWorkedWith Popularity Salarymean
And this is the model that shows the regression popularity and salary. This is the data that shows the regression between language popularity and salary. We expect that if the language is well known than the salary is lower.
<-lm(Popularity~ Salarymean, data=joinedlanguagecountandsalary)
model model
##
## Call:
## lm(formula = Popularity ~ Salarymean, data = joinedlanguagecountandsalary)
##
## Coefficients:
## (Intercept) Salarymean
## 69660.1710 -0.6958
plot(joinedlanguagecountandsalary$Salarymean, joinedlanguagecountandsalary$Popularity,col = "green", main="The Relation Between Salary and Popularity-World")
abline (model, col="blue")
<-c(1,4,5,19)
selectedwtext(joinedlanguagecountandsalary$Salarymean[selectedw], joinedlanguagecountandsalary$Popularity[selectedw], labels = joinedlanguagecountandsalary$LanguageWorkedWith[selectedw], cex = 0.6, pos = 4, col = "blue")
<-lm(Popularity~ Salarymean, data=joinedlanguagecountandsalaryturkey)
modelturkey modelturkey
##
## Call:
## lm(formula = Popularity ~ Salarymean, data = joinedlanguagecountandsalaryturkey)
##
## Coefficients:
## (Intercept) Salarymean
## 340.706673 -0.004613
plot(joinedlanguagecountandsalaryturkey$Salarymean, joinedlanguagecountandsalaryturkey$Popularity,col = "blue", main="The Relation Between Salary and Popularity-Turkey")
abline (modelturkey, col="red")
<-c(1,2,6,9,22)
selectedtext(joinedlanguagecountandsalaryturkey$Salarymean[selected], joinedlanguagecountandsalaryturkey$Popularity[selected], labels = joinedlanguagecountandsalaryturkey$LanguageWorkedWith[selected], cex = 0.6, pos = 4, col = "red")
Unlike our expectation, the relation between language popularity and salary is not strong. It shows that if one chooses to learn a programming language s/he also needs to evaluate the demand of language in the sector. It is acceptable that there is a relation but it is not strong.