library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#Prepare data
responses <-
read.csv("/home/idil/İndirilenler/responses.csv",sep=",") %>%
filter(complete.cases(.)) %>%
# mutate(id=row_number()) %>%
tbl_df()
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
#Prepare PCA data
yr_pca<-
responses[,sapply(responses,class)=="integer"] %>%
select(Dance:Musical)
#Run PCA analysis
yr_pca_result<-princomp(yr_pca,cor=T)
summary(yr_pca_result,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.386322 1.0302560 0.9274961 0.8012219 0.7172707
## Proportion of Variance 0.384378 0.2122855 0.1720498 0.1283913 0.1028954
## Cumulative Proportion 0.384378 0.5966635 0.7687133 0.8971046 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Dance 0.921 0.135 0.208 0.289
## Folk 0.540 -0.322 0.549 -0.549
## Country 0.468 -0.611 -0.580 0.247
## Classical.music 0.514 -0.365 0.279 0.332 0.644
## Musical 0.468 0.653 -0.457 -0.374
#See the PCA output
ggplot(data=data.frame(PC=1:length(yr_pca_result$sdev),var_exp=cumsum(yr_pca_result$sdev^2/sum(yr_pca_result$sdev^2))),
aes(x=PC,y=var_exp)) + geom_line() + geom_point() + scale_y_continuous(labels = scales::percent,breaks=seq(0,1,length.out=11)) + scale_x_continuous(breaks=seq(0,135,by=5))
When we examine 5 variables (Dance,Folk,Country,Classical.music,Musical) we can see that they are highly correlated. With the fourth PC, 89% of the variation is explained.
yr_pca2<-
responses[,sapply(responses,class)=="integer"] %>%
select(Psychology:Economy.Management)
#Run PCA analysis
yr_pca_result2<-princomp(yr_pca2,cor=T)
summary(yr_pca_result2,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.4737965 1.1336980 1.0677091 0.9348659 0.8425667
## Proportion of Variance 0.3102966 0.1836102 0.1628575 0.1248535 0.1014169
## Cumulative Proportion 0.3102966 0.4939068 0.6567643 0.7816178 0.8830347
## Comp.6 Comp.7
## Standard deviation 0.69720631 0.5767671
## Proportion of Variance 0.06944238 0.0475229
## Cumulative Proportion 0.95247710 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Psychology 0.521 0.330 0.726 0.239 0.163
## Politics 0.247 0.599 -0.181 -0.712 -0.168 0.109
## Mathematics 0.514 -0.161 0.369 -0.172 0.236 -0.262 0.648
## Physics 0.470 -0.244 0.491 -0.153 -0.672
## Internet 0.355 -0.594 0.409 -0.571 -0.117
## PC 0.482 -0.209 -0.339 0.141 -0.183 0.729 0.157
## Economy.Management 0.300 0.482 -0.215 -0.472 0.563 0.131 -0.273
#See the PCA output
ggplot(data=data.frame(PC=1:length(yr_pca_result2$sdev),var_exp=cumsum(yr_pca_result2$sdev^2/sum(yr_pca_result2$sdev^2))),
aes(x=PC,y=var_exp)) + geom_line() + geom_point() + scale_y_continuous(labels = scales::percent,breaks=seq(0,1,length.out=11)) + scale_x_continuous(breaks=seq(0,135,by=5))
When we examine Psychology, Politics, Mathematics, Physics, Internet, PC, Economy.Management they are not much correlated. With the sixth PC, 95% of the variation is explained.
#Prepare PCA data
dataset<-esoph
dataset_pca <- princomp(as.matrix(dataset[4:5],cor=T))
summary(dataset_pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 12.6811764 2.59023762
## Proportion of Variance 0.9599494 0.04005055
## Cumulative Proportion 0.9599494 1.00000000
##
## Loadings:
## Comp.1 Comp.2
## ncases 0.997
## ncontrols 0.997
#PCA output
ggplot(data.frame(pc=1:2,cum_var=c(0.9599494,1.00000000)),aes(x=pc,y=cum_var)) + geom_point() + geom_line()
Since we have two variables we have two PC. From the graphic we can say that our variables have a good correlation. Because contents of the PC’s 0.997. We can say that (o)esophageal cancer is related to alcohol consumption and tobacco consumption per day.