Assignment 3

Young People Survey PCA Analysis 1

library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#Prepare data

responses <-
  read.csv("/home/idil/İndirilenler/responses.csv",sep=",") %>%
  filter(complete.cases(.)) %>%
  # mutate(id=row_number()) %>%
  tbl_df()
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
#Prepare PCA data

yr_pca<-
  responses[,sapply(responses,class)=="integer"] %>%
  select(Dance:Musical)
#Run PCA analysis

yr_pca_result<-princomp(yr_pca,cor=T)
summary(yr_pca_result,loadings=TRUE)
## Importance of components:
##                          Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.386322 1.0302560 0.9274961 0.8012219 0.7172707
## Proportion of Variance 0.384378 0.2122855 0.1720498 0.1283913 0.1028954
## Cumulative Proportion  0.384378 0.5966635 0.7687133 0.8971046 1.0000000
## 
## Loadings:
##                 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Dance                   0.921  0.135  0.208  0.289
## Folk             0.540        -0.322  0.549 -0.549
## Country          0.468        -0.611 -0.580  0.247
## Classical.music  0.514 -0.365  0.279  0.332  0.644
## Musical          0.468         0.653 -0.457 -0.374
#See the PCA output

ggplot(data=data.frame(PC=1:length(yr_pca_result$sdev),var_exp=cumsum(yr_pca_result$sdev^2/sum(yr_pca_result$sdev^2))),
       aes(x=PC,y=var_exp)) + geom_line() + geom_point() + scale_y_continuous(labels = scales::percent,breaks=seq(0,1,length.out=11)) + scale_x_continuous(breaks=seq(0,135,by=5))

When we examine 5 variables (Dance,Folk,Country,Classical.music,Musical) we can see that they are highly correlated. With the fourth PC, 89% of the variation is explained.

Young People Survey PCA Analysis 2

yr_pca2<-
  responses[,sapply(responses,class)=="integer"] %>%
  select(Psychology:Economy.Management)
#Run PCA analysis

yr_pca_result2<-princomp(yr_pca2,cor=T)
summary(yr_pca_result2,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.4737965 1.1336980 1.0677091 0.9348659 0.8425667
## Proportion of Variance 0.3102966 0.1836102 0.1628575 0.1248535 0.1014169
## Cumulative Proportion  0.3102966 0.4939068 0.6567643 0.7816178 0.8830347
##                            Comp.6    Comp.7
## Standard deviation     0.69720631 0.5767671
## Proportion of Variance 0.06944238 0.0475229
## Cumulative Proportion  0.95247710 1.0000000
## 
## Loadings:
##                    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Psychology                 0.521  0.330  0.726  0.239  0.163       
## Politics            0.247  0.599        -0.181 -0.712 -0.168  0.109
## Mathematics         0.514 -0.161  0.369 -0.172  0.236 -0.262  0.648
## Physics             0.470 -0.244  0.491        -0.153        -0.672
## Internet            0.355        -0.594  0.409        -0.571 -0.117
## PC                  0.482 -0.209 -0.339  0.141 -0.183  0.729  0.157
## Economy.Management  0.300  0.482 -0.215 -0.472  0.563  0.131 -0.273
#See the PCA output
ggplot(data=data.frame(PC=1:length(yr_pca_result2$sdev),var_exp=cumsum(yr_pca_result2$sdev^2/sum(yr_pca_result2$sdev^2))),
       aes(x=PC,y=var_exp)) + geom_line() + geom_point() + scale_y_continuous(labels = scales::percent,breaks=seq(0,1,length.out=11)) + scale_x_continuous(breaks=seq(0,135,by=5))

When we examine Psychology, Politics, Mathematics, Physics, Internet, PC, Economy.Management they are not much correlated. With the sixth PC, 95% of the variation is explained.

Esoph PCA Analysis

#Prepare PCA data
dataset<-esoph
dataset_pca <- princomp(as.matrix(dataset[4:5],cor=T))
summary(dataset_pca,loadings=TRUE)
## Importance of components:
##                            Comp.1     Comp.2
## Standard deviation     12.6811764 2.59023762
## Proportion of Variance  0.9599494 0.04005055
## Cumulative Proportion   0.9599494 1.00000000
## 
## Loadings:
##           Comp.1 Comp.2
## ncases            0.997
## ncontrols  0.997
#PCA output

ggplot(data.frame(pc=1:2,cum_var=c(0.9599494,1.00000000)),aes(x=pc,y=cum_var)) + geom_point() + geom_line()

Since we have two variables we have two PC. From the graphic we can say that our variables have a good correlation. Because contents of the PC’s 0.997. We can say that (o)esophageal cancer is related to alcohol consumption and tobacco consumption per day.