The esoph data set comes from a case-control study of esophageal cancer and its relationship to alcohol and tobacco consumption. It contains 88 rows and 5 columns: age group, alcohol consumption group, tobacco consumption group, number of cancer cases and number of controls. The analysis below looks at how alcohol and tobacco consumption relate to the number of cancer cases. The data set includes 6 age groups, 4 alcohol consumption groups and 4 tobacco consumption groups.
library(tidyverse)
library(lubridate)
library(rio)
library('MASS')
head(esoph)
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
summary(esoph)
## agegp alcgp tobgp ncases ncontrols
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 1.00
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 3.00
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 6.00
## 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean :11.08
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00
## 75+ :11 Max. :17.000 Max. :60.00
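First, the ratio of cancer cases to controls is computed for each age group and shown as a percentage. Because esoph is a case-control data set, this ratio is a relative indicator rather than an absolute risk. According to the graph below, the ratio increases after age 55 and is highest in the 65-74 age group.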
esoph %>%
group_by(agegp) %>% summarise(perc = 100*(round(sum(ncases)/sum(ncontrols),2))) %>%
ggplot(aes(x=agegp, y=perc ,fill=agegp)) +
geom_bar(stat="identity") +
labs(x= "Age Groups", y= "Percentage Of Cancer Cases", title = "The Rate of Cancer Cases according to Age Categories")+
scale_fill_manual(values = c("lightgreen", "green", "yellow", "orange", "red", "tomato4")) +
guides(fill=guide_legend(title="Age Groups"))
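The percentages behind the bar chart can also be tabulated directly, which makes it easier to check the reading of the plot. The following is a minimal sketch in base R; the object name `age_tab` is introduced here only for illustration, and the formula is the same `sum(ncases)/sum(ncontrols)` ratio used in the plotting code.

```r
# Sketch: tabulate the quantity shown in the bar chart, per age group.
# esoph ships with base R (package 'datasets'), so no extra library is needed.
age_tab <- aggregate(cbind(ncases, ncontrols) ~ agegp, data = esoph, FUN = sum)

# Ratio of cases to controls, expressed as a percentage (same formula as the plot).
age_tab$perc <- 100 * round(age_tab$ncases / age_tab$ncontrols, 2)

print(age_tab)
```

The exact values may depend on the version of the esoph data shipped with the installed R release, so no expected output is shown.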
esoph %>%
group_by(agegp, alcgp) %>% summarise(perc = 100*(round(sum(ncases)/sum(ncontrols),2))) %>%
ggplot(aes(x=agegp, y=perc ,fill=alcgp)) +
geom_bar(stat="identity", position = "dodge") +
labs(x= "Age Groups", y= "Percentage Of Cancer Cases", title = "The Rate of Cancer Cases according to Age Groups and Alcohol Consumption") + scale_fill_manual(values = c("lightpink1", "rosybrown", "red2", "tomato4")) +
guides(fill=guide_legend(title="Alcohol Usage"))
Looking at alcohol consumption in the graph above, groups with higher alcohol consumption show a higher ratio of cancer cases to controls.
esoph %>%
group_by(agegp, tobgp) %>% summarise(perc = 100*(round(sum(ncases)/sum(ncontrols),2))) %>%
ggplot(aes(x=agegp, y=perc ,fill=tobgp)) +
geom_bar(stat="identity", position = "dodge") +
labs(x= "Age Groups", y= "Percentage Of Cancer Cases", title = "The Rate of Cancer Cases according to Age Groups and Tobacco Consumption") + scale_fill_manual(values = c("lightpink1", "rosybrown", "red2", "tomato4")) +
guides(fill=guide_legend(title="Tobacco Usage"))
Turning to the tobacco-based graph above, higher tobacco consumption is also associated with a higher ratio of cancer cases. According to the graph, the 55-64 age group is the most affected.
In a nutshell, it can be inferred that older age and high levels of alcohol and tobacco consumption are associated with a higher risk of developing cancer compared with people who consume little or nothing.
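To back up this summary with numbers, the same case-to-control ratio can be computed for one factor at a time. The sketch below is only an illustration: `ratio_by()` is a hypothetical helper defined here, not part of the original analysis, and it ignores the other two factors when summarising one of them.

```r
# Sketch: case-to-control ratio (in percent) for one grouping variable at a time.
# ratio_by() is a hypothetical helper introduced only for this illustration.
ratio_by <- function(group) {
  cases    <- tapply(esoph$ncases, esoph[[group]], sum)
  controls <- tapply(esoph$ncontrols, esoph[[group]], sum)
  round(100 * cases / controls, 1)
}

alc_ratio <- ratio_by("alcgp")
tob_ratio <- ratio_by("tobgp")

print(alc_ratio)
print(tob_ratio)
```

Marginal ratios like these do not adjust for age, so they are a rough check rather than a substitute for a model.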
The Young People Survey is a data set that contains 1010 rows and 150 columns. Each row holds the answers of one respondent. This study analyzes a subset of the Young People Survey data set that contains the variables in the Hobbies & Interests category. The data set is available on Kaggle.
The data was imported from GitHub; the repository URL appears in the import call below.
data <- rio::import("https://github.com/pjournal/mef04-baykano/blob/gh-pages/responses.csv?raw=True")
glimpse(data)
## Rows: 1,010
## Columns: 150
## $ Music <int> 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ `Slow songs or fast songs` <int> 3, 4, 5, 3, 3, 3, 5, 3, 3, 3, 3, 3...
## $ Dance <int> 2, 2, 2, 2, 4, 2, 5, 3, 3, 2, 3, 1...
## $ Folk <int> 1, 1, 2, 1, 3, 3, 3, 2, 1, 5, 2, 1...
## $ Country <int> 2, 1, 3, 1, 2, 2, 1, 1, 1, 2, 1, 1...
## $ `Classical music` <int> 2, 1, 4, 1, 4, 3, 2, 2, 2, 2, 2, 4...
## $ Musical <int> 1, 2, 5, 1, 3, 3, 2, 2, 4, 5, 3, 1...
## $ Pop <int> 5, 3, 3, 2, 5, 2, 5, 4, 3, 3, 4, 2...
## $ Rock <int> 5, 5, 5, 2, 3, 5, 3, 5, 5, 5, 3, 5...
## $ `Metal or Hardrock` <int> 1, 4, 3, 1, 1, 5, 1, 1, 5, 2, 2, 1...
## $ Punk <int> 1, 4, 4, 4, 2, 3, 1, 2, 1, 3, 1, 1...
## $ `Hiphop, Rap` <int> 1, 1, 1, 2, 5, 4, 3, 3, 1, 2, 3, 1...
## $ `Reggae, Ska` <int> 1, 3, 4, 2, 3, 3, 1, 2, 2, 4, 2, 1...
## $ `Swing, Jazz` <int> 1, 1, 3, 1, 2, 4, 1, 2, 2, 4, 2, 2...
## $ `Rock n roll` <int> 3, 4, 5, 2, 1, 4, 2, 3, 2, 4, 3, 2...
## $ Alternative <int> 1, 4, 5, 5, 2, 5, 3, 1, NA, 4, 3, ...
## $ Latino <int> 1, 2, 5, 1, 4, 3, 3, 2, 1, 5, 3, 2...
## $ `Techno, Trance` <int> 1, 1, 1, 2, 2, 1, 5, 3, 1, 1, 4, 1...
## $ Opera <int> 1, 1, 3, 1, 2, 3, 2, 2, 1, 2, 2, 2...
## $ Movies <int> 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5...
## $ Horror <int> 4, 2, 3, 4, 4, 5, 2, 4, 1, 2, 5, 3...
## $ Thriller <int> 2, 2, 4, 4, 4, 5, 1, 4, 5, 1, 4, 4...
## $ Comedy <int> 5, 4, 4, 3, 5, 5, 5, 5, 5, 5, 5, 4...
## $ Romantic <int> 4, 3, 2, 3, 2, 2, 3, 2, 4, 5, 3, 3...
## $ `Sci-fi` <int> 4, 4, 4, 4, 3, 3, 1, 3, 4, 1, 3, 2...
## $ War <int> 1, 1, 2, 3, 3, 3, 3, 3, 5, 3, 2, 5...
## $ `Fantasy/Fairy tales` <int> 5, 3, 5, 1, 4, 4, 5, 4, 4, 4, 5, 5...
## $ Animated <int> 5, 5, 5, 2, 4, 3, 5, 4, 4, 4, 5, 5...
## $ Documentary <int> 3, 4, 2, 5, 3, 3, 3, 3, 5, 4, 3, 5...
## $ Western <int> 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1...
## $ Action <int> 2, 4, 1, 2, 4, 4, 2, 3, 1, 2, 3, 4...
## $ History <int> 1, 1, 1, 4, 3, 5, 3, 5, 3, 3, 3, 2...
## $ Psychology <int> 5, 3, 2, 4, 2, 3, 3, 2, 2, 2, 3, 2...
## $ Politics <int> 1, 4, 1, 5, 3, 4, 1, 3, 1, 3, 3, 5...
## $ Mathematics <int> 3, 5, 5, 4, 2, 2, 1, 1, 1, 3, 2, 1...
## $ Physics <int> 3, 2, 2, 1, 2, 3, 1, 1, 1, 1, 1, 1...
## $ Internet <int> 5, 4, 4, 3, 2, 4, 2, 5, 1, 5, 4, 5...
## $ PC <int> 3, 4, 2, 1, 2, 4, 1, 4, 1, 1, 5, 4...
## $ `Economy Management` <int> 5, 5, 4, 2, 2, 1, 3, 1, 1, 4, 3, 1...
## $ Biology <int> 3, 1, 1, 3, 3, 4, 5, 2, 3, 2, 2, 1...
## $ Chemistry <int> 3, 1, 1, 3, 3, 4, 5, 2, 1, 1, 1, 1...
## $ Reading <int> 3, 4, 5, 5, 5, 3, 3, 2, 5, 4, 3, 3...
## $ Geography <int> 3, 4, 2, 4, 2, 3, 3, 3, 1, 4, 3, 5...
## $ `Foreign languages` <int> 5, 5, 5, 4, 3, 4, 4, 4, 1, 5, 5, 2...
## $ Medicine <int> 3, 1, 2, 2, 3, 4, 5, 1, 1, 1, 2, 1...
## $ Law <int> 1, 2, 3, 5, 2, 3, 3, 2, 1, 1, 4, 3...
## $ Cars <int> 1, 2, 1, 1, 3, 5, 4, 1, 1, 1, 2, 1...
## $ `Art exhibitions` <int> 1, 2, 5, 5, 1, 2, 1, 1, 1, 4, 2, 5...
## $ Religion <int> 1, 1, 5, 4, 4, 2, 1, 2, 2, 4, 2, 1...
## $ `Countryside, outdoors` <int> 5, 1, 5, 1, 4, 5, 4, 2, 4, 4, 4, 5...
## $ Dancing <int> 3, 1, 5, 1, 1, 1, 3, 1, 1, 5, 1, 1...
## $ `Musical instruments` <int> 3, 1, 5, 1, 3, 5, 2, 1, 2, 3, 1, 1...
## $ Writing <int> 2, 1, 5, 3, 1, 1, 1, 1, 1, 1, 1, 1...
## $ `Passive sport` <int> 1, 1, 5, 1, 3, 5, 5, 4, 4, 4, 5, 5...
## $ `Active sport` <int> 5, 1, 2, 1, 1, 4, 3, 5, 1, 4, 1, 3...
## $ Gardening <int> 5, 1, 1, 1, 4, 2, 3, 1, 1, 1, 3, 1...
## $ Celebrities <int> 1, 2, 1, 2, 3, 1, 1, 3, 5, 2, 2, 2...
## $ Shopping <int> 4, 3, 4, 4, 3, 2, 3, 3, 2, 4, 5, 3...
## $ `Science and technology` <int> 4, 3, 2, 3, 3, 3, 4, 2, 1, 3, 4, 3...
## $ Theatre <int> 2, 2, 5, 1, 2, 1, 3, 2, 5, 5, 2, 1...
## $ `Fun with friends` <int> 5, 4, 5, 2, 4, 3, 5, 4, 4, 5, 4, 3...
## $ `Adrenaline sports` <int> 4, 2, 5, 1, 2, 3, 1, 2, 1, 2, 1, 1...
## $ Pets <int> 4, 5, 5, 1, 1, 2, 5, 5, 1, 2, 5, 1...
## $ Flying <int> 1, 1, 1, 2, 1, 3, 1, 3, 2, 4, 1, 4...
## $ Storm <int> 1, 1, 1, 1, 2, 2, 3, 2, 3, 5, 1, 1...
## $ Darkness <int> 1, 1, 1, 1, 1, 2, 2, 4, 1, 4, 2, 1...
## $ Heights <int> 1, 2, 1, 3, 1, 2, 1, 3, 5, 5, 2, 3...
## $ Spiders <int> 1, 1, 1, 5, 1, 1, 1, 1, 5, 3, 2, 5...
## $ Snakes <int> 5, 1, 1, 5, 1, 2, 5, 5, 5, 4, 1, 5...
## $ Rats <int> 3, 1, 1, 5, 2, 2, 1, 3, 2, 4, 1, 5...
## $ Ageing <int> 1, 3, 1, 4, 2, 1, 4, 1, 2, 3, 1, 5...
## $ `Dangerous dogs` <int> 3, 1, 1, 5, 4, 1, 1, 2, 3, 5, 4, 5...
## $ `Fear of public speaking` <int> 2, 4, 2, 5, 3, 3, 1, 4, 4, 3, 2, 5...
## $ Smoking <chr> "never smoked", "never smoked", "t...
## $ Alcohol <chr> "drink a lot", "drink a lot", "dri...
## $ `Healthy eating` <int> 4, 3, 3, 3, 4, 2, 4, 2, 1, 3, 3, 3...
## $ `Daily events` <int> 2, 3, 1, 4, 3, 2, 3, 3, 1, 4, 3, 3...
## $ `Prioritising workload` <int> 2, 2, 2, 4, 1, 2, 5, 1, 2, 2, 2, 1...
## $ `Writing notes` <int> 5, 4, 5, 4, 2, 3, 5, 3, 1, 2, 4, 5...
## $ Workaholism <int> 4, 5, 3, 5, 3, 3, 5, 2, 4, 3, 2, 3...
## $ `Thinking ahead` <int> 2, 4, 5, 3, 5, 3, 3, 4, 2, 3, 3, 1...
## $ `Final judgement` <int> 5, 1, 3, 1, 5, 1, 3, 3, 5, 5, 3, 1...
## $ Reliability <int> 4, 4, 4, 3, 5, 3, 4, 3, 5, 4, 4, 3...
## $ `Keeping promises` <int> 4, 4, 5, 4, 4, 4, 5, 3, 4, 5, 4, 3...
## $ `Loss of interest` <int> 1, 3, 1, 5, 2, 3, 3, 1, 1, 3, 1, 3...
## $ `Friends versus money` <int> 3, 4, 5, 2, 3, 2, 4, 4, 4, 4, 3, 3...
## $ Funniness <int> 5, 3, 2, 1, 3, 3, 4, 4, 2, 3, 2, 5...
## $ Fake <int> 1, 2, 4, 1, 2, 1, 1, 2, 2, 1, 1, 3...
## $ `Criminal damage` <int> 1, 1, 1, 5, 1, 4, 2, 1, 1, 2, 1, 5...
## $ `Decision making` <int> 3, 2, 3, 5, 3, 2, 2, 3, 4, 5, 5, 3...
## $ Elections <int> 4, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 1...
## $ `Self-criticism` <int> 1, 4, 4, 5, 5, 4, 3, 3, 3, 4, 4, 5...
## $ `Judgment calls` <int> 3, 4, 4, 4, 5, 4, 5, 5, 2, 5, 5, 3...
## $ Hypochondria <int> 1, 1, 1, 3, 1, 1, 1, 2, 2, 1, 2, 5...
## $ Empathy <int> 3, 2, 5, 3, 3, 4, 4, 1, 5, 4, 5, 5...
## $ `Eating to survive` <int> 1, 1, 5, 1, 1, 2, 1, 2, 1, 1, 2, 1...
## $ Giving <int> 4, 2, 5, 1, 3, 3, 5, 3, 1, 4, 3, 1...
## $ `Compassion to animals` <int> 5, 4, 4, 2, 3, 5, 5, 5, 4, 5, 5, 2...
## $ `Borrowed stuff` <int> 4, 3, 2, 5, 4, 5, 5, 2, 5, 4, 4, 2...
## $ Loneliness <int> 3, 2, 5, 5, 3, 2, 3, 2, 4, 2, 2, 4...
## $ `Cheating in school` <int> 2, 4, 3, 5, 5, 4, 2, 5, 5, 3, 3, 5...
## $ Health <int> 1, 4, 2, 1, 3, 3, 3, 3, 4, 4, 3, 2...
## $ `Changing the past` <int> 1, 4, 5, 5, 4, 3, 1, 2, 5, 2, 3, 3...
## $ God <int> 1, 1, 5, 4, 5, 3, 5, 4, 5, 5, 4, 1...
## $ Dreams <int> 4, 3, 1, 3, 3, 3, 3, 4, 4, 3, 3, 3...
## $ Charity <int> 2, 1, 3, 3, 3, 2, 3, 1, 1, 2, 1, 3...
## $ `Number of friends` <int> 3, 3, 3, 1, 3, 3, 3, 4, 2, 3, 3, 4...
## $ Punctuality <chr> "i am always on time", "i am often...
## $ Lying <chr> "never", "sometimes", "sometimes",...
## $ Waiting <int> 3, 3, 2, 1, 3, 3, 4, 1, 2, 1, 3, 3...
## $ `New environment` <int> 4, 4, 3, 1, 4, 4, 5, 4, 2, 4, 3, 5...
## $ `Mood swings` <int> 3, 4, 4, 5, 2, 3, 5, 3, 3, 4, 3, 5...
## $ `Appearence and gestures` <int> 4, 4, 3, 3, 3, 3, 4, 4, 4, 3, 4, 2...
## $ Socializing <int> 3, 4, 5, 1, 3, 4, 5, 2, 4, 4, 3, 5...
## $ Achievements <int> 4, 2, 3, 3, 3, 2, 4, 4, 2, 4, 3, 3...
## $ `Responding to a serious letter` <int> 3, 4, 4, 3, 3, 2, 3, 3, 2, 4, 4, 3...
## $ Children <int> 5, 2, 4, 2, 5, 3, 2, 4, 4, 3, 5, 5...
## $ Assertiveness <int> 1, 2, 3, 5, 4, 4, 3, 3, 1, 4, 2, 4...
## $ `Getting angry` <int> 1, 5, 4, 5, 2, 3, 3, 1, 3, 3, 1, 3...
## $ `Knowing the right people` <int> 3, 4, 3, 4, 3, 4, 4, 4, 3, 4, 3, 5...
## $ `Public speaking` <int> 5, 4, 2, 5, 5, 4, 3, 5, 4, 5, 3, 5...
## $ Unpopularity <int> 5, 4, 4, 3, 5, 4, 3, 2, 5, 3, 3, 2...
## $ `Life struggles` <int> 1, 1, 4, 3, 2, 3, 5, 2, 4, 5, 5, 4...
## $ `Happiness in life` <int> 4, 4, 4, 2, 3, 3, 5, 4, 3, 4, 4, 3...
## $ `Energy levels` <int> 5, 3, 4, 2, 5, 4, 4, 4, 1, 4, 3, 3...
## $ `Small - big dogs` <int> 1, 5, 3, 1, 3, 4, 3, 3, 5, 1, 2, 1...
## $ Personality <int> 4, 3, 3, 2, 3, 3, 3, 4, 3, 3, 3, 3...
## $ `Finding lost valuables` <int> 3, 4, 3, 1, 2, 3, 2, 2, 5, 3, 2, 3...
## $ `Getting up` <int> 2, 5, 4, 1, 4, 3, 2, 5, 5, 4, 4, 5...
## $ `Interests or hobbies` <int> 3, 3, 5, NA, 3, 5, 4, 4, 1, 3, 3, ...
## $ `Parents' advice` <int> 4, 2, 3, 2, 3, 3, 4, 3, 4, 3, 4, 4...
## $ `Questionnaires or polls` <int> 3, 3, 1, 4, 3, 4, 5, 3, 3, 3, 4, 4...
## $ `Internet usage` <chr> "few hours a day", "few hours a da...
## $ Finances <int> 3, 3, 2, 2, 4, 2, 4, 3, 2, 4, 2, 2...
## $ `Shopping centres` <int> 4, 4, 4, 4, 3, 3, 3, 4, 1, 4, 4, 2...
## $ `Branded clothing` <int> 5, 1, 1, 3, 4, 3, 1, 4, 3, 4, 2, 1...
## $ `Entertainment spending` <int> 3, 4, 4, 3, 3, 3, 3, 4, 2, 2, 3, 3...
## $ `Spending on looks` <int> 3, 2, 3, 4, 3, 1, 4, 4, 1, 3, 4, 1...
## $ `Spending on gadgets` <int> 1, 5, 4, 4, 2, 4, 1, 3, 3, 2, 2, 1...
## $ `Spending on healthy eating` <int> 3, 2, 2, 1, 4, 4, 5, 2, 4, 4, 2, 2...
## $ Age <int> 20, 19, 20, 22, 20, 20, 20, 19, 18...
## $ Height <int> 163, 163, 176, 172, 170, 186, 177,...
## $ Weight <int> 48, 58, 67, 59, 59, 77, 50, 90, 55...
## $ `Number of siblings` <int> 1, 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1...
## $ Gender <chr> "female", "female", "female", "fem...
## $ `Left - right handed` <chr> "right handed", "right handed", "r...
## $ Education <chr> "college/bachelor degree", "colleg...
## $ `Only child` <chr> "no", "no", "no", "yes", "no", "no...
## $ `Village - town` <chr> "village", "city", "city", "city",...
## $ `House - block of flats` <chr> "block of flats", "block of flats"...
Rows containing NA values were removed, and a new subset data frame was created that holds the main variables of the Young People Survey's Hobbies & Interests category. The entire data set is not easy to read, so only the summary of the subset is shown here.
young_survey <- na.omit(data)
s_df = as.data.frame(young_survey)
s_df = s_df[32:63]
summary(s_df)
## History Psychology Politics Mathematics Physics
## Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.00 Median :3.000 Median :2.000 Median :2.000
## Mean :3.226 Mean :3.14 Mean :2.627 Mean :2.401 Mean :2.096
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.000
## Internet PC Economy Management Biology
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :4.000 Median :3.000 Median :2.000 Median :2.000
## Mean :4.188 Mean :3.136 Mean :2.662 Mean :2.621
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Chemistry Reading Geography Foreign languages
## Min. :1.000 Min. :1.0 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.0 1st Qu.:2.000 1st Qu.:3.000
## Median :2.000 Median :3.0 Median :3.000 Median :4.000
## Mean :2.121 Mean :3.2 Mean :3.109 Mean :3.813
## 3rd Qu.:3.000 3rd Qu.:5.0 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.0 Max. :5.000 Max. :5.000
## Medicine Law Cars Art exhibitions
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :2.000 Median :2.000
## Mean :2.475 Mean :2.224 Mean :2.634 Mean :2.617
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Religion Countryside, outdoors Dancing Musical instruments
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :4.000 Median :2.000 Median :2.000
## Mean :2.229 Mean :3.614 Mean :2.399 Mean :2.302
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Writing Passive sport Active sport Gardening
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000
## Median :1.000 Median :4.000 Median :3.000 Median :1.000
## Mean :1.866 Mean :3.394 Mean :3.236 Mean :1.872
## 3rd Qu.:2.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Celebrities Shopping Science and technology Theatre
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.319 Mean :3.257 Mean :3.271 Mean :3.023
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Fun with friends Adrenaline sports Pets
## Min. :2.000 Min. :1.00 Min. :1.000
## 1st Qu.:4.000 1st Qu.:2.00 1st Qu.:2.000
## Median :5.000 Median :3.00 Median :4.000
## Mean :4.552 Mean :2.88 Mean :3.324
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:5.000
## Max. :5.000 Max. :5.00 Max. :5.000
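The subset above is selected by column position (`32:63`), which depends on the column order of the imported file. A possible alternative is to select the same block by column name. The sketch below uses a small made-up data frame (`toy`) as a stand-in for the survey data; only the column names are borrowed from the real file.

```r
# Toy stand-in for the survey data; column names mimic the real file,
# the values are made up for illustration only.
toy <- data.frame(
  Music   = c(5, 4, NA, 5),
  History = c(1, 3, 2, 4),
  Cars    = c(2, NA, 1, 5),
  Pets    = c(4, 5, 3, 1),
  Age     = c(20, 19, 22, 21),
  check.names = FALSE
)

# Drop every row that has at least one missing value.
toy_complete <- na.omit(toy)

# Select the block of columns from "History" to "Pets" by name.
first <- match("History", names(toy_complete))
last  <- match("Pets", names(toy_complete))
toy_subset <- toy_complete[, first:last]

print(toy_subset)
```

Note that `na.omit()` drops a row if any column is missing, including columns outside the subset, which is also how the original code behaves.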
For readability, the correlation matrix is shown only for the first ten columns.
cor(s_df[1:10])
## History Psychology Politics Mathematics Physics
## History 1.000000000 0.277589253 0.40042159 0.006775058 0.07072863
## Psychology 0.277589253 1.000000000 0.18478753 0.033418822 0.06437929
## Politics 0.400421587 0.184787526 1.00000000 0.094843697 0.10597994
## Mathematics 0.006775058 0.033418822 0.09484370 1.000000000 0.61445068
## Physics 0.070728629 0.064379294 0.10597994 0.614450677 1.00000000
## Internet -0.000188671 -0.001358391 0.07408342 0.164590446 0.09649070
## PC 0.027415190 -0.073081652 0.12169054 0.328369470 0.33765152
## Economy Management 0.042573266 0.090542483 0.30657473 0.234618951 0.02414031
## Biology 0.014251442 0.184610980 -0.08653935 0.089351976 0.23540484
## Chemistry 0.020638132 0.043194424 -0.06991511 0.182141475 0.31794058
## Internet PC Economy Management Biology
## History -0.000188671 0.02741519 0.04257327 0.01425144
## Psychology -0.001358391 -0.07308165 0.09054248 0.18461098
## Politics 0.074083423 0.12169054 0.30657473 -0.08653935
## Mathematics 0.164590446 0.32836947 0.23461895 0.08935198
## Physics 0.096490702 0.33765152 0.02414031 0.23540484
## Internet 1.000000000 0.45741969 0.15937134 -0.11974364
## PC 0.457419695 1.00000000 0.16911260 -0.10609323
## Economy Management 0.159371341 0.16911260 1.00000000 -0.17240504
## Biology -0.119743644 -0.10609323 -0.17240504 1.00000000
## Chemistry -0.102591240 -0.06726759 -0.18648820 0.67845540
## Chemistry
## History 0.02063813
## Psychology 0.04319442
## Politics -0.06991511
## Mathematics 0.18214148
## Physics 0.31794058
## Internet -0.10259124
## PC -0.06726759
## Economy Management -0.18648820
## Biology 0.67845540
## Chemistry 1.00000000
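In the matrix above, the strongest off-diagonal correlations are Biology with Chemistry (0.68) and Mathematics with Physics (0.61). For larger matrices, such pairs can be extracted programmatically. The sketch below shows one way to do this on synthetic ratings; the variables `x1`, `x2`, `x3` and the object `cor_pairs` are assumptions made for the example.

```r
set.seed(1)
# Synthetic 1-5 ratings: x2 is built to follow x1, x3 is independent.
x1 <- sample(1:5, 200, replace = TRUE)
x2 <- pmin(pmax(x1 + sample(-1:1, 200, replace = TRUE), 1), 5)
x3 <- sample(1:5, 200, replace = TRUE)
toy <- data.frame(x1 = x1, x2 = x2, x3 = x3)

cm <- cor(toy)

# Keep each pair once by looking only at the upper triangle.
idx <- which(upper.tri(cm), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute correlation, strongest first.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

Applied to the survey subset, the same steps would start from `cor(s_df[1:10])` instead of `cor(toy)`.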
The importance of the components and their loadings were tabulated to understand which variables contribute most to explaining the variance. The first 3 principal components account for 31% of the variance, and the first 16 explain 77%.
pca <- princomp(as.matrix(s_df[1:32]),cor=TRUE)
summary(pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.0374787 1.8233387 1.60327816 1.48530949 1.26023134
## Proportion of Variance 0.1297287 0.1038926 0.08032815 0.06894201 0.04963072
## Cumulative Proportion 0.1297287 0.2336214 0.31394951 0.38289152 0.43252224
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 1.20132035 1.07959740 1.06930606 1.05342776 0.98848494
## Proportion of Variance 0.04509908 0.03642283 0.03573173 0.03467844 0.03053445
## Cumulative Proportion 0.47762132 0.51404415 0.54977588 0.58445432 0.61498878
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.96843882 0.94342106 0.92688407 0.89350216 0.87546384
## Proportion of Variance 0.02930855 0.02781385 0.02684731 0.02494832 0.02395115
## Cumulative Proportion 0.64429733 0.67211118 0.69895850 0.72390681 0.74785797
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.85981220 0.8379842 0.81377184 0.7619490 0.74936288
## Proportion of Variance 0.02310241 0.0219443 0.02069452 0.0181427 0.01754827
## Cumulative Proportion 0.77096037 0.7929047 0.81359919 0.8317419 0.84929016
## Comp.21 Comp.22 Comp.23 Comp.24 Comp.25
## Standard deviation 0.73524390 0.72622112 0.70400198 0.70113254 0.67137253
## Proportion of Variance 0.01689324 0.01648116 0.01548809 0.01536209 0.01408566
## Cumulative Proportion 0.86618340 0.88266456 0.89815265 0.91351474 0.92760039
## Comp.26 Comp.27 Comp.28 Comp.29 Comp.30
## Standard deviation 0.64706583 0.62748835 0.58745885 0.58160258 0.553117299
## Proportion of Variance 0.01308419 0.01230443 0.01078462 0.01057067 0.009560586
## Cumulative Proportion 0.94068459 0.95298901 0.96377363 0.97434431 0.983904895
## Comp.31 Comp.32
## Standard deviation 0.53278768 0.480812500
## Proportion of Variance 0.00887071 0.007224396
## Cumulative Proportion 0.99277560 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## History 0.184 0.117 0.196 0.225 0.146 0.222 0.252
## Psychology 0.239 0.108 0.168 -0.127 0.201
## Politics 0.220 0.289 0.143 0.280 0.137 0.111
## Mathematics 0.317 -0.119 0.101 -0.292 -0.287
## Physics 0.355 -0.231 0.104 -0.135 -0.188 0.127
## Internet 0.210 0.120 -0.435 0.213 -0.228
## PC 0.363 -0.176 -0.287
## Economy Management 0.166 0.310 0.206 -0.196 -0.255
## Biology 0.273 -0.371 -0.159 0.211
## Chemistry 0.192 -0.414 -0.108 0.240
## Reading 0.279 -0.170 0.195 -0.124
## Geography 0.140 0.160 0.152 0.154 0.337 -0.358
## Foreign languages 0.204 0.199 -0.145 -0.108 -0.470
## Medicine 0.258 -0.316 -0.132 0.280 -0.122
## Law 0.123 0.143 0.283 0.395 0.178
## Cars 0.333 -0.199
## Art exhibitions 0.324 -0.200
## Religion 0.222 0.176
## Countryside, outdoors 0.203 -0.369 0.165 -0.143
## Dancing 0.254 -0.216 -0.237
## Musical instruments 0.213 -0.315
## Writing 0.230 0.186 -0.147 -0.107 0.319
## Passive sport 0.155 -0.215 -0.119 0.245 -0.167
## Active sport 0.178 -0.258 -0.144 0.364 -0.115 0.203
## Gardening 0.191 -0.144 -0.160 0.333 0.334
## Celebrities -0.104 0.167 -0.356 0.119 -0.323 0.136 0.119
## Shopping -0.168 0.144 -0.410 -0.228
## Science and technology 0.364
## Theatre 0.316 -0.112 -0.126 -0.162
## Fun with friends 0.129 -0.270 -0.102 -0.308 -0.266
## Adrenaline sports 0.225 -0.230 -0.181 0.322 -0.195
## Pets -0.249 0.274 0.205
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## History 0.170 0.224 0.163
## Psychology 0.130 -0.272 0.123 0.447 -0.226 0.218 -0.260
## Politics -0.202 0.132 0.158
## Mathematics -0.149 0.332 -0.216
## Physics 0.221 -0.114
## Internet 0.107 -0.241 0.184 0.266 0.195
## PC 0.103 -0.172 0.119 0.121 0.133
## Economy Management -0.263 0.176 -0.192
## Biology
## Chemistry -0.113 0.129
## Reading 0.167 0.184 0.143
## Geography -0.177 0.205 -0.346 -0.196
## Foreign languages 0.171 -0.170 -0.407 0.101
## Medicine -0.231
## Law -0.158 0.140 -0.178
## Cars 0.320 -0.163 -0.177
## Art exhibitions 0.141 -0.172 0.105 -0.308
## Religion -0.358 -0.135 0.362 0.352 0.162
## Countryside, outdoors -0.298 0.275 0.197 0.279 -0.124
## Dancing -0.311 0.147 0.263
## Musical instruments -0.308 -0.333 -0.513 0.210
## Writing 0.164 -0.290 -0.228 -0.331
## Passive sport -0.647 0.105 -0.351 0.111 -0.306
## Active sport -0.137 0.215 -0.128 -0.118 0.101
## Gardening -0.171 0.195 0.161 0.322
## Celebrities -0.255 0.108 -0.103
## Shopping -0.103 0.157 -0.169
## Science and technology 0.281 0.206 0.199 -0.101
## Theatre 0.219 0.203 -0.131 -0.116 0.212 -0.268
## Fun with friends 0.336 0.104 -0.247 0.133 0.272 0.392
## Adrenaline sports 0.118 0.201 -0.228
## Pets 0.319 0.354 0.102 0.419 -0.167 -0.375
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 Comp.21 Comp.22
## History 0.289 0.426 0.149
## Psychology 0.167 0.280 0.217 -0.109 -0.107
## Politics -0.124 0.156 0.157 0.191
## Mathematics 0.236 -0.135 -0.186 0.106
## Physics 0.171 -0.151 -0.240
## Internet 0.168 0.118 -0.285 -0.116
## PC 0.208 -0.119 0.278
## Economy Management -0.303 0.133 -0.202 -0.229 0.346
## Biology -0.144 0.155
## Chemistry
## Reading 0.219 -0.348 0.110 -0.216 0.332
## Geography 0.138 -0.326 0.104 -0.306 -0.131
## Foreign languages -0.163 0.150 -0.265 0.280 -0.178
## Medicine -0.174 -0.103
## Law 0.119 -0.344 -0.287
## Cars 0.144 -0.155 0.115 0.115 -0.192 -0.436
## Art exhibitions 0.120 -0.113 -0.442
## Religion -0.354 -0.505 -0.230
## Countryside, outdoors 0.357 0.212 0.298
## Dancing 0.244 0.183 0.245 -0.349 -0.254
## Musical instruments 0.235 -0.141 0.175 0.119 0.147 0.124
## Writing -0.199 -0.217 -0.236 0.139 -0.189
## Passive sport 0.130 -0.125 -0.146 0.107 0.218
## Active sport 0.228 0.308 -0.335 -0.110 0.227
## Gardening -0.466 0.128 0.237 0.113 -0.146 -0.116
## Celebrities 0.216 -0.232 -0.102 0.322 0.184 0.265
## Shopping -0.142 -0.127 0.327 0.180 -0.135
## Science and technology -0.266 0.108 -0.358 0.172 0.285
## Theatre 0.155 -0.148 -0.157 0.115
## Fun with friends -0.158 -0.286 -0.131
## Adrenaline sports -0.162 -0.263 0.261 -0.316 0.175
## Pets 0.195 -0.167 -0.272 -0.120 -0.168
## Comp.23 Comp.24 Comp.25 Comp.26 Comp.27 Comp.28 Comp.29
## History 0.227 0.281 0.236 0.246 0.108 0.131
## Psychology -0.241 -0.139 -0.145 0.113 -0.165 -0.104
## Politics -0.250 -0.302 0.158 -0.214 -0.460 0.165 -0.187
## Mathematics 0.169
## Physics -0.141 0.265 -0.183
## Internet 0.177 0.185 -0.253 0.219 0.272
## PC -0.166 0.321 -0.145 -0.569
## Economy Management 0.365 0.201
## Biology -0.128
## Chemistry 0.101 0.159 0.251 -0.102 -0.145 -0.126
## Reading -0.162 0.209 0.278 -0.234 -0.329 -0.160
## Geography -0.116 -0.157 0.201 -0.148 -0.150
## Foreign languages -0.185 -0.243 0.133 0.226
## Medicine 0.107 -0.121 0.218
## Law 0.376 -0.267 0.296 -0.173
## Cars -0.277 0.430 0.104 -0.211
## Art exhibitions 0.206 -0.121 0.154 -0.405 -0.223 -0.253 0.187
## Religion
## Countryside, outdoors 0.142 0.182 -0.282 -0.185 -0.131
## Dancing -0.304 0.244 -0.171 0.172 -0.162
## Musical instruments 0.348 -0.116
## Writing 0.362 -0.133 -0.250 0.210
## Passive sport 0.136
## Active sport -0.159 -0.338 -0.212 0.183
## Gardening -0.224 0.110 0.181 0.109
## Celebrities -0.338 -0.159 0.186 0.207
## Shopping 0.406 -0.142 0.186 -0.214 -0.344
## Science and technology -0.164 -0.191 -0.112 -0.307 0.399
## Theatre -0.142 0.308 0.223 0.459
## Fun with friends -0.331 -0.125
## Adrenaline sports 0.124 0.483 0.128
## Pets
## Comp.30 Comp.31 Comp.32
## History 0.177
## Psychology -0.165
## Politics 0.106
## Mathematics 0.592
## Physics 0.169 -0.609
## Internet
## PC 0.115 -0.115
## Economy Management -0.247
## Biology 0.777
## Chemistry -0.627 -0.108 -0.295
## Reading 0.167 -0.134
## Geography -0.102 -0.107
## Foreign languages
## Medicine 0.508 0.136 -0.475
## Law -0.152
## Cars
## Art exhibitions -0.132
## Religion -0.127
## Countryside, outdoors
## Dancing 0.163
## Musical instruments
## Writing 0.133
## Passive sport
## Active sport -0.120
## Gardening
## Celebrities -0.116
## Shopping 0.160 0.118
## Science and technology
## Theatre -0.250 0.192
## Fun with friends
## Adrenaline sports
## Pets
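The loadings table is long, and its blank cells are small loadings that the print method suppresses. To see which variables weigh most on a given component, the loadings can be sorted by absolute value. The following sketch uses the built-in `USArrests` data as a stand-in for the survey subset; the object names are assumptions for the example.

```r
# Built-in data used purely as a stand-in for the survey subset.
pca_demo <- princomp(USArrests, cor = TRUE)

# loadings() returns a special print class; unclass() gives a plain numeric matrix.
load_mat <- unclass(loadings(pca_demo))

# Loadings of the first component, ordered from strongest to weakest.
comp1 <- load_mat[, "Comp.1"]
top <- comp1[order(-abs(comp1))]

print(round(top, 3))
```

With the survey fit, replacing `pca_demo` by `pca` would list the hobbies that contribute most to the first component.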
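Next, the PCA was repeated on the first 16 variables of the subset (History to Cars). Note that the code selects the first 16 original columns, not the 16 principal components that explained 77% of the variance in the previous step.
pca <- princomp(as.matrix(s_df[1:16]),cor=TRUE)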
summary(pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.6552342 1.6248718 1.4967461 1.08102708 1.03100086
## Proportion of Variance 0.1712375 0.1650130 0.1400156 0.07303872 0.06643517
## Cumulative Proportion 0.1712375 0.3362505 0.4762661 0.54930482 0.61574000
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 1.01792860 0.95720869 0.83035903 0.81731697 0.73044156
## Proportion of Variance 0.06476116 0.05726553 0.04309351 0.04175044 0.03334655
## Cumulative Proportion 0.68050116 0.73776669 0.78086020 0.82261064 0.85595719
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.71950819 0.67698546 0.66000951 0.57274280 0.56040481
## Proportion of Variance 0.03235575 0.02864433 0.02722578 0.02050214 0.01962835
## Cumulative Proportion 0.88831294 0.91695728 0.94418306 0.96468520 0.98431355
## Comp.16
## Standard deviation 0.50098221
## Proportion of Variance 0.01568645
## Cumulative Proportion 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## History 0.193 0.207 0.311 0.266 0.416 0.247
## Psychology 0.249 0.239 -0.145 -0.316 0.571 -0.357
## Politics 0.125 0.366 0.264 0.280
## Mathematics 0.167 0.290 -0.295 -0.280 -0.325 0.286 -0.258
## Physics 0.258 0.242 -0.356 -0.128 0.379
## Internet 0.297 -0.138 -0.359 0.258 -0.346 0.374 -0.129
## PC 0.376 -0.283 -0.224 0.167 0.238 0.164
## Economy Management 0.352 0.119 -0.436 -0.322 -0.294 -0.264
## Biology 0.480 -0.176 -0.159 -0.189 -0.127
## Chemistry 0.442 -0.145 -0.240
## Reading 0.226 -0.119 0.338 -0.389 0.181 0.466
## Geography 0.157 0.207 0.173 0.573 -0.417 -0.472
## Foreign languages 0.170 0.305 -0.427 -0.382 -0.272 0.280
## Medicine 0.475 -0.127 -0.114 0.141 -0.294
## Law 0.155 0.292 0.263 0.386 -0.206 -0.163 0.246
## Cars 0.327 -0.219 0.335 0.206 -0.160 0.393
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## History 0.483 0.259 0.383 0.223
## Psychology -0.457 -0.134 -0.217 -0.113
## Politics 0.279 -0.546 -0.281 -0.471
## Mathematics 0.164 -0.123 0.261 -0.585
## Physics -0.101 -0.207 -0.136 0.698
## Internet 0.401 -0.243 0.183 0.235 -0.301 0.114
## PC 0.290 -0.238 -0.376 0.535 -0.106 -0.156
## Economy Management 0.395 0.394 -0.174 0.248
## Biology 0.107 -0.106
## Chemistry 0.216 0.249 0.171 -0.175 -0.644 -0.131
## Reading 0.173 -0.154 0.340 -0.483 -0.101
## Geography -0.132 -0.256 -0.147 0.130 -0.142
## Foreign languages -0.321 0.123 -0.373 0.355
## Medicine -0.135 0.595
## Law 0.225 -0.456 0.235 0.443 -0.132
## Cars -0.535 0.297 -0.112 -0.330
## Comp.16
## History
## Psychology
## Politics
## Mathematics
## Physics
## Internet
## PC
## Economy Management
## Biology 0.782
## Chemistry -0.320
## Reading
## Geography
## Foreign languages
## Medicine -0.497
## Law
## Cars
According to the graph below, the first 3 principal components explain 48% of the variance, which is almost half, and the first 8 explain 78%.
ggplot(data.frame(pc=1:16,cum_var=c(0.1712375, 0.3362505, 0.4762661, 0.54930482, 0.61574000, 0.68050116, 0.73776669, 0.78086020, 0.82261064, 0.85595719, 0.88831294, 0.91695728, 0.94418306, 0.96468520, 0.98431355, 1.00000000)),aes(x=pc,y=cum_var)) +
geom_point() +
geom_line()
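The cumulative proportions in the plot are typed in by hand from the summary output. They can also be computed from the fitted object, which avoids copy errors if the input changes. The sketch below is one way to do it, again using `USArrests` as a stand-in; `scree_df`, `threshold` and `n_needed` are names introduced only for this example.

```r
# Stand-in fit on a built-in data set; in the document the object would be `pca`.
pca_demo <- princomp(USArrests, cor = TRUE)

# Variance share of each component and its running total.
var_share <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
scree_df <- data.frame(pc = seq_along(var_share), cum_var = cumsum(var_share))

# Smallest number of components whose cumulative share reaches a chosen threshold.
threshold <- 0.78
n_needed <- which(scree_df$cum_var >= threshold)[1]

print(scree_df)
print(n_needed)
```

A data frame built this way has the same `pc` and `cum_var` columns as the one passed to `ggplot()` in the plotting code.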
Multidimensional scaling (MDS) was applied to the same subset of the Young People Survey data set to investigate the relationships among hobbies and interests.
A total of 32 categories, from History to Pets, were selected, and the distances between them were calculated by subtracting the correlation values from 1. Each category was then placed on a two-dimensional plane with x and y coordinates centered on the origin (0,0).
s_df_mds <- s_df[,sapply(s_df,class)=="integer"] %>%
dplyr::select(History:Pets)
set.seed(42)
s_df_mds_dist <- 1 - cor(s_df_mds)
s_df_mds <- cmdscale(s_df_mds_dist,k=2)
colnames(s_df_mds) <- c("x","y")
print(s_df_mds)
## x y
## History 0.031306226 -0.057585867
## Psychology 0.221673919 -0.034255291
## Politics -0.210706395 0.084207670
## Mathematics -0.366106787 -0.301011922
## Physics -0.338602615 -0.488946202
## Internet -0.435984609 0.193575251
## PC -0.602886441 -0.104205122
## Economy Management -0.313464728 0.369369369
## Biology 0.249496019 -0.352229723
## Chemistry 0.111150203 -0.427188341
## Reading 0.525002226 -0.036403993
## Geography -0.082496760 -0.010735122
## Foreign languages 0.197742370 0.164499851
## Medicine 0.190107055 -0.319538506
## Law -0.091720302 0.234211611
## Cars -0.593017958 0.091655704
## Art exhibitions 0.380132671 -0.025937391
## Religion 0.177085206 -0.229380678
## Countryside, outdoors 0.159766147 -0.101417980
## Dancing 0.326087243 0.151882648
## Musical instruments 0.165516710 -0.156630640
## Writing 0.291051922 -0.065898705
## Passive sport -0.288889494 0.153049302
## Active sport -0.180541543 0.077474365
## Gardening 0.203069385 -0.051532921
## Celebrities 0.059109759 0.515094714
## Shopping 0.248971459 0.505415685
## Science and technology -0.368579088 -0.307211932
## Theatre 0.452232345 0.006680821
## Fun with friends 0.002109219 0.282818197
## Adrenaline sports -0.251016420 0.072054588
## Pets 0.132403055 0.168120560
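A two-dimensional map always loses some of the information in the distance matrix. As a rough check of how much is retained, `cmdscale()` can be called with `eig = TRUE`, which returns goodness-of-fit values alongside the coordinates. The sketch below assumes `mtcars` as stand-in data and uses the same `1 - cor()` distance as the text.

```r
# Stand-in data; the document uses the 32 hobby columns instead.
d <- 1 - cor(mtcars)

# eig = TRUE returns a list: $points holds the coordinates,
# $GOF holds two goodness-of-fit values for the chosen k.
mds_demo <- cmdscale(as.dist(d), k = 2, eig = TRUE)

coords <- mds_demo$points
colnames(coords) <- c("x", "y")

print(round(coords, 3))
print(mds_demo$GOF)
```

Values of `GOF` close to 1 suggest that two dimensions capture most of the structure; lower values mean that nearby points on the map should be interpreted with more caution.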
ggplot(data.frame(s_df_mds),aes(x=x,y=y)) +
geom_text(label=rownames(s_df_mds),angle= 0, size=3) +
labs(x="x",y="y", title="Multidimensional Scale of Hobbies and Interests")
According to the graph above, the sports-related variables form a group, and Politics also falls in this group. This suggests that respondents who enjoy sports also tend to be interested in politics, which may be an interesting result. Chemistry, Biology and Medicine form another group, which may reflect the respondents' professions. Science and technology and Mathematics lie very close together, indicating a strong relationship. Psychology and Gardening are also very close, and the reasons for this could be investigated further.
K-means clustering was applied to the MDS output, grouping the variables into 8 clusters.
set.seed(42)
categories_cluster<-kmeans(s_df_mds,centers=8)
mds_clusters<-data.frame(categories=names(categories_cluster$cluster),cluster_mds=categories_cluster$cluster) %>% arrange(cluster_mds,categories)
mds_clusters
## categories cluster_mds
## Art exhibitions Art exhibitions 1
## Dancing Dancing 1
## Reading Reading 1
## Theatre Theatre 1
## Mathematics Mathematics 2
## Physics Physics 2
## Science and technology Science and technology 2
## Active sport Active sport 3
## Adrenaline sports Adrenaline sports 3
## Economy Management Economy Management 3
## Geography Geography 3
## Law Law 3
## Passive sport Passive sport 3
## Politics Politics 3
## Foreign languages Foreign languages 4
## Fun with friends Fun with friends 4
## Pets Pets 4
## Biology Biology 5
## Chemistry Chemistry 5
## Medicine Medicine 5
## Religion Religion 5
## Cars Cars 6
## Internet Internet 6
## PC PC 6
## Countryside, outdoors Countryside, outdoors 7
## Gardening Gardening 7
## History History 7
## Musical instruments Musical instruments 7
## Psychology Psychology 7
## Writing Writing 7
## Celebrities Celebrities 8
## Shopping Shopping 8
# Plot the output
ggplot(data.frame(s_df_mds) %>% mutate(clusters=as.factor(categories_cluster$cluster),category=rownames(s_df_mds)),aes(x=x,y=y)) +
geom_text(aes(label=category,color=clusters),angle=45,size=3) +
geom_point(data=as.data.frame(categories_cluster$centers),aes(x=x,y=y))
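The number of clusters (8) is a choice made by the analysis. One common way to compare candidate values of k is to look at the total within-cluster sum of squares for each. The sketch below illustrates this on synthetic two-dimensional points that stand in for the MDS coordinates; the data and the range of k are assumptions for the example.

```r
set.seed(42)
# Synthetic 2-D points standing in for the MDS coordinates: three loose blobs.
pts <- rbind(
  matrix(rnorm(40, mean = -1, sd = 0.2), ncol = 2),
  matrix(rnorm(40, mean =  0, sd = 0.2), ncol = 2),
  matrix(rnorm(40, mean =  1, sd = 0.2), ncol = 2)
)
colnames(pts) <- c("x", "y")

# Total within-cluster sum of squares for k = 1..6.
# nstart = 20 reruns k-means from several random starts to avoid poor local optima.
wss <- sapply(1:6, function(k) kmeans(pts, centers = k, nstart = 20)$tot.withinss)

print(round(wss, 2))
```

The value of k after which `wss` stops dropping sharply is often taken as a reasonable choice. With only 32 points, as in the survey subset, the curve may be less clear-cut.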
The graphs below are cluster dendrograms created from the Young People Survey distance matrix. Different dendrograms can be obtained by changing the linkage method: the first graph uses complete linkage and the second uses average linkage. Complete linkage merges clusters based on the largest pairwise dissimilarity between their members, whereas average linkage uses the mean pairwise dissimilarity, so the heights and positions of the nodes differ between the two methods.
s_hc<-hclust(as.dist(s_df_mds_dist),method="complete")
plot(s_hc,hang=-1)
s_hc<-hclust(as.dist(s_df_mds_dist),method="average")
plot(s_hc,hang=-1)
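To compare the two dendrograms beyond visual inspection, both trees can be cut into the same number of groups and the memberships cross-tabulated. This is a minimal sketch with `mtcars` as stand-in data; the choice of `k = 3` is arbitrary and only for illustration.

```r
# Stand-in distance matrix; the document uses s_df_mds_dist instead.
d <- as.dist(1 - cor(mtcars))

hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Cut both trees into the same number of groups and cross-tabulate the memberships.
k <- 3
grp_complete <- cutree(hc_complete, k = k)
grp_average  <- cutree(hc_average,  k = k)

print(table(complete = grp_complete, average = grp_average))
```

If most counts fall in one cell per row and column, the two linkage methods largely agree on the grouping.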