First Dataset of the Assignment 3: `esoph`

Introduction

Required Libraries

To examine this dataset, the following libraries are loaded. The dataset ships with base R under the name esoph, so it does not need to be read from a file.

library(tidyverse)
library(ggplot2)
library(knitr)

Overview of Dataset

The data come from a case-control study of esophageal cancer in France.
Data were collected from 1175 patients with various age / alcohol / tobacco use combinations.
Of these 1175 patients, 200 are cancer “cases”; the remaining 975 non-cases were sampled from comparable hospital populations.
The data frame has one record for each of 88 age / alcohol / tobacco combinations.

glimpse(esoph)

## Rows: 88
## Columns: 5
## $ agegp     <ord> 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 2...
## $ alcgp     <ord> 0-39g/day, 0-39g/day, 0-39g/day, 0-39g/day, 40-79, 40-79,...
## $ tobgp     <ord> 0-9g/day, 10-19, 20-29, 30+, 0-9g/day, 10-19, 20-29, 30+,...
## $ ncases    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...
## $ ncontrols <dbl> 40, 10, 6, 5, 27, 7, 4, 7, 2, 1, 2, 1, 1, 1, 2, 60, 14, 7...

summary(esoph)

##    agegp          alcgp         tobgp        ncases         ncontrols    
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   :11.08  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00  
##  75+  :11                                Max.   :17.000   Max.   :60.00
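The totals quoted in the overview can be checked directly against the data frame. The following is a small sketch using base R only; the variable names are illustrative, and the control total is an assumption tied to the version of esoph used in this report, so it may differ in other R releases.

```r
# Check the size of the data frame and the case/control totals.
n_combinations <- nrow(esoph)            # one row per age / alcohol / tobacco combination
total_cases    <- sum(esoph$ncases)
total_controls <- sum(esoph$ncontrols)   # may differ between R versions

c(combinations = n_combinations, cases = total_cases, controls = total_controls)
```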

Objectives

Exploring the esoph dataset, which ships with base R.
Visualising the relationship between case occurrence and age / alcohol / tobacco profile.
Identifying the groups at risk via useful analyses and graphs.
Building a well-developed generalized linear model.
Predicting cancer percentages among the groups.
Testing the robustness of the model via leave-one-out cross validation.

Analyses and Visualizations

Overview

If the data set is small, a boxplot can be misleading, because the quartiles are poorly estimated from the data and may look falsely inflated or deflated. In those cases, plotting the raw data with a strip chart is often preferable.

Stripchart of Age Groups based on Number of Cases

stripchart(ncases ~ agegp, data=esoph)

The number of cases differs across age groups.
The 25-34 age group has at most 1 case in any combination, and the counts rise with age.
Combinations with 5 or more cases were seen in the 45-74 age range.
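The observations above can also be checked numerically rather than read off the chart. A minimal sketch, using base R only, that reports the largest case count found in any single row of each age group (the object name is illustrative):

```r
# Largest number of cases in a single age / alcohol / tobacco combination, per age group.
max_cases_by_age <- tapply(esoph$ncases, esoph$agegp, max)
max_cases_by_age
```

Because many combinations share the same count, passing method = "jitter" to stripchart() can also help by spreading out points that would otherwise overlap.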

Stripchart of Tobacco Consumption based on Number of Cases

stripchart(ncases ~ tobgp, data=esoph)

Tobacco use does not appear to have a strong effect on the raw number of cases.
The number of cases did not rise as tobacco use increased.
The highest number of cases was seen in the group with the least tobacco use.

Stripchart of Alcohol Consumption based on Number of Cases

stripchart(ncases ~ alcgp, data=esoph)

Alcohol use appears to have an effect on the number of cases.
Combinations with more than 5 cases were seen in the groups consuming more than 40 g of alcohol per day.

The stripcharts alone do not support every inference about the dataset. A more detailed analysis follows.

Cancer Proportion of Age Groups

esoph %>% 
  group_by(agegp) %>%
  summarise(total_cases = sum(ncases), 
            total_controls = sum(ncontrols),
            percentage = 100 * total_cases / (total_cases+total_controls)) %>%
  ggplot(., aes(x = agegp, y = percentage, fill = agegp)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Cancer Cases over Age Groups", subtitle = "Data Source: `esoph`", x = 'Age Groups', y = "% of Cancer Cases") +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = paste(format(percentage,digits=1), "%")), size=4.5, position = position_stack(vjust = 0.5))

As age increases, the proportion of cases increases significantly.
The highest risk is seen in the 65-74 age group.
In the 25-44 age range, the risk is below 5 percent.
The risk rises sharply from age 45 onward.

Cancer Proportion of Alcohol Consumption Groups

esoph %>% 
  group_by(alcgp) %>%
  summarise(total_cases = sum(ncases), 
            total_controls = sum(ncontrols),
            percentage = 100 * total_cases / (total_cases+total_controls)) %>%
  ggplot(., aes(x = alcgp, y = percentage, fill = alcgp)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Cancer Cases over Alcohol Consumption", subtitle = "Data Source: `esoph`", x = 'Alcohol Consumption', y = "% of Cancer Cases") +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = paste(format(percentage,digits=1), "%")), size=4.5, position = position_stack(vjust = 0.5))

Alcohol use has a very strong effect on the cancer proportion.
The risk rises steadily as alcohol use increases.
The risk is only 7 percent in the group with the lowest alcohol consumption, but 40 percent in the group with the highest.

Cancer Proportion of Tobacco Consumption Groups

esoph %>% 
  group_by(tobgp) %>%
  summarise(total_cases = sum(ncases), 
            total_controls = sum(ncontrols),
            percentage = 100 * total_cases / (total_cases+total_controls)) %>%
  ggplot(., aes(x = tobgp, y = percentage, fill = tobgp)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Cancer Cases over Tobacco Consumption", subtitle = "Data Source: `esoph`", x = 'Tobacco Consumption', y = "% of Cancer Cases") +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = paste(format(percentage,digits=1), "%")), size=4.5, position = position_stack(vjust = 0.5))

Tobacco use has an effect on cancer percentages, but a less pronounced one.
The risk was 13 percent in the group with the lowest tobacco consumption and 27 percent in the group with the highest.
The risk is the same, 20 percent, in the groups consuming 10-19 and 20-29 grams of tobacco per day.
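The three bar charts above repeat the same pipeline with a different grouping variable. As a sketch of how that repetition could be factored out, the two helpers below (case_share and plot_case_share are hypothetical names, not part of the original analysis) compute the percentages and draw the chart for any one grouping column.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: percentage of cases within each level of one grouping variable.
case_share <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(total_cases = sum(ncases),
              total_controls = sum(ncontrols),
              percentage = 100 * total_cases / (total_cases + total_controls),
              .groups = "drop")
}

# Hypothetical helper: bar chart of the percentages returned by case_share().
plot_case_share <- function(data, group_var, x_label) {
  case_share(data, {{ group_var }}) %>%
    ggplot(aes(x = {{ group_var }}, y = percentage, fill = {{ group_var }})) +
    geom_col() +
    geom_text(aes(label = paste0(round(percentage), "%")),
              position = position_stack(vjust = 0.5)) +
    labs(x = x_label, y = "% of Cancer Cases") +
    theme_minimal() +
    theme(legend.position = "none")
}

alcohol_share <- case_share(esoph, alcgp)
p_alcohol <- plot_case_share(esoph, alcgp, "Alcohol Consumption")
p_alcohol
```

The same two calls with agegp or tobgp in place of alcgp would reproduce the other two charts, apart from titles.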

Cancer Case Distribution of Alcohol Consumption by Age Group

esoph %>% 
  select(-tobgp, -ncontrols) %>%
  group_by(agegp, alcgp) %>%
  summarize(total_cases = sum(ncases)) %>%
  group_by(agegp) %>%
  mutate(percentage = 100 * total_cases / sum(total_cases)) %>%
  filter(!is.nan(percentage), percentage != 0) %>% # drop empty and zero-share groups
  ggplot(., aes(x = agegp, y = percentage, fill = alcgp)) +
  geom_col(position = "fill") + # geom_col() already uses stat = "identity"
  theme_minimal() +
  geom_text(aes(label = paste(format(percentage,digits=1), "%")), size=4, position = "fill", hjust = 0.5, vjust = 1.1) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Stacked Bar Chart of Case Distribution of Alcohol Consumption by Age Groups", subtitle = "Data Source: `esoph`", x = "Age Groups", y = "% of Cancer Cases", fill = "Alcohol Consumption")

Considering alcohol consumption together with age allows meaningful inferences about cancer cases.
In every age group, groups that consume less alcohol have a smaller share of cases, while groups that consume the most alcohol generally have the largest share.
In the 25-34 age range, cases were observed only in the group consuming 120 grams or more of alcohol per day.
Those who consume more than 80 grams of alcohol per day account for a significant share in all age groups.

Cancer Case Distribution of Tobacco Consumption by Age Group

esoph %>% 
  select(-alcgp, -ncontrols) %>%
  group_by(agegp, tobgp) %>%
  summarize(total_cases = sum(ncases)) %>%
  group_by(agegp) %>%
  mutate(percentage = 100 * total_cases / sum(total_cases)) %>%
  filter(!is.nan(percentage), percentage != 0) %>% # drop empty and zero-share groups
  ggplot(., aes(x = agegp, y = percentage, fill = tobgp)) +
  geom_col(position = "fill") + # geom_col() already uses stat = "identity"
  theme_minimal() +
  geom_text(aes(label = paste(format(percentage,digits=1), "%")), size=4, position = "fill", hjust = 0.5, vjust = 1.1) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Stacked Bar Chart of Case Distribution of Tobacco Consumption by Age Groups", subtitle = "Data Source: `esoph`", x = "Age Groups", y = "% of Cancer Cases", fill = "Tobacco Consumption")

Across the age groups, groups that consume less tobacco do not have a smaller share, while groups that consume the most tobacco generally have the smallest share. For this reason, considering tobacco consumption together with age does not support strong inferences about cancer cases.
In the 25-34 age range, cases were observed only in the group consuming 10-19 grams of tobacco per day.
Those who consume 19 grams of tobacco or less per day account for a significant share in all age groups.
Judging by the stacked bar chart alone, heavy tobacco consumption does not appear to pose a serious risk.

Cancer Case Distribution of Alcohol Consumption by Tobacco Consumption

esoph %>% 
  select(-agegp, -ncontrols) %>%
  group_by(tobgp, alcgp) %>%
  summarize(total_cases = sum(ncases)) %>%
  group_by(tobgp) %>%
  mutate(percentage = 100 * total_cases / sum(total_cases)) %>%
  filter(!is.nan(percentage), percentage != 0) %>% # drop empty and zero-share groups
  ggplot(., aes(x = tobgp, y = percentage, fill = alcgp)) +
  geom_col(position = "fill") + # geom_col() already uses stat = "identity"
  theme_minimal() +
  geom_text(aes(label = paste(format(percentage,digits=1), "%")), size=4, position = "fill", hjust = 0.5, vjust = 1.1) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Stacked Bar Chart of Case Distribution of Alcohol Consumption by Tobacco Groups", subtitle = "Data Source: `esoph`", x = "Tobacco Consumption", y = "% of Cancer Cases", fill = "Alcohol Consumption")

Within each tobacco consumption group, the percentage of cases was lowest for the group with the least alcohol consumption (12 to 17 percent).
Among those who consumed the least tobacco, the group consuming 40-79 grams of alcohol per day had the highest share.
Among those who consumed the most tobacco, the group consuming 120+ grams of alcohol per day had the highest share.

Cancer Case Distribution of Tobacco Consumption by Alcohol Consumption

esoph %>% 
  select(-agegp, -ncontrols) %>%
  group_by(alcgp, tobgp) %>%
  summarize(total_cases = sum(ncases)) %>%
  group_by(alcgp) %>%
  mutate(percentage = 100 * total_cases / sum(total_cases)) %>%
  filter(!is.nan(percentage), percentage != 0) %>% # drop empty and zero-share groups
  ggplot(., aes(x = alcgp, y = percentage, fill = tobgp)) +
  geom_col(position = "fill") + # geom_col() already uses stat = "identity"
  theme_minimal() +
  geom_text(aes(label = paste(format(percentage,digits=1), "%")), size=4, position = "fill", hjust = 0.5, vjust = 1.1) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Stacked Bar Chart of Case Distribution of Tobacco Consumption by Alcohol Groups", subtitle = "Data Source: `esoph`", x = "Alcohol Consumption", y = "% of Cancer Cases", fill = "Tobacco Consumption")

Within each alcohol consumption group, the percentage of cases was lowest for the group with the highest tobacco consumption (12 to 22 percent).
Moreover, the percentage of cases was highest for the group with the lowest tobacco consumption (31 to 45 percent).
Judging by this graph, tobacco use does not appear to pose a serious risk.
Among those who consumed the least alcohol, the group consuming 10-19 grams of tobacco per day had the highest share.
Among those who consumed the most alcohol, the group consuming 0-9 grams of tobacco per day had the highest share.

Heatmap of Cancer Case Distribution

esoph %>% 
  group_by(agegp) %>%
  # percentage is aggregated per age group, so every tile in a facet gets the same fill
  mutate(total_cases = sum(ncases),
         total_controls = sum(ncontrols),
         percentage = 100 * total_cases / (total_cases + total_controls)) %>%
  ggplot(., aes(x = alcgp, y = tobgp, fill = percentage)) +
  geom_tile() +
  facet_wrap(~agegp) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_gradient2(low="white", high="red3", guide="colorbar") +
  labs(title = "Heatmap of Cancer Cases", x = "Alcohol Consumption", subtitle = "Data Source: `esoph`", y = "Tobacco Consumption", fill = "Cancer Cases (%)")

The heatmap of cancer case distribution makes the highest- and lowest-risk age groups easy to see. Because the percentage is aggregated per age group, every tile within a facet has the same colour, so the plot compares age groups rather than alcohol or tobacco levels within them.
The risk is lowest in the 25-34 age group.
In the 35-44 age group, the risk is around 5 percent.
The risk increases significantly in groups over the age of 45.
The highest risk is in the 65-74 age group, at around 25 percent.
Age range appears to be an important criterion. A generalized linear model is fitted below to examine it together with alcohol and tobacco use.
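The heatmap above fills every tile of a facet with the percentage for the whole age group. A variant that computes the percentage separately for each age / alcohol / tobacco cell is sketched below; cell_percentage and the other object names are illustrative, and because many cells contain only a handful of patients, the individual tiles should be read as noisy.

```r
library(dplyr)
library(ggplot2)

# Percentage of cases in each individual age / alcohol / tobacco cell.
cell_pct <- esoph %>%
  mutate(cell_percentage = 100 * ncases / (ncases + ncontrols))

p_cells <- ggplot(cell_pct, aes(x = alcgp, y = tobgp, fill = cell_percentage)) +
  geom_tile() +
  facet_wrap(~agegp) +
  scale_fill_gradient(low = "white", high = "red3") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Alcohol Consumption", y = "Tobacco Consumption", fill = "Cancer Cases (%)")
p_cells
```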

Jitterplot of Cancer Cases by Alcohol and Age Groups

ggplot(esoph, aes(x = as.factor(agegp), y = ncases, color = alcgp)) +
  geom_jitter(size=1.5, position = position_jitter(width = 0.4)) +
  theme_minimal() + 
  scale_color_manual(values=c("green2", "blue2", "yellow2", "red2")) +
  labs(title = "Jitter Plot of 'ncases' by Alcohol and Age Groups", x = "Age Groups", y = "Number of Cases", color = "Alcohol Consumption")

The number of cases is lowest in the 25-34 age range.
The highest number of cases occurred in the 65-74 age range, in the group consuming 40-79 grams of alcohol per day (blue points).
In all age groups, the lowest numbers of cases were generally observed in the groups with the least alcohol consumption (green points).

Mosaicplot of Cancer Case Distribution

require(graphics) # for mosaicplot
## Re-arrange data for a mosaic plot
ttt <- table(esoph$agegp, esoph$alcgp, esoph$tobgp)
o <- with(esoph, order(tobgp, alcgp, agegp))
ttt[ttt == 1] <- esoph$ncases[o]
tt1 <- table(esoph$agegp, esoph$alcgp, esoph$tobgp)
tt1[tt1 == 1] <- esoph$ncontrols[o]
tt <- array(c(ttt, tt1), c(dim(ttt),2),
            c(dimnames(ttt), list(c("Cancer", "Control"))))
mosaicplot(tt, main = "Mosaicplot of Cancer Case Distribution", color = TRUE)

The mosaic plot shows all the values together: the age groups from left to right, the alcohol consumption from bottom to top, the tobacco consumption within each block, and the split between cancer (dark color) and control (light color) counts.
The only case in the 25-34 age range occurred in the highest alcohol consumption group.
Few cases were seen between the ages of 35 and 44, mostly with high alcohol consumption.
The number of cases increases sharply from age 45 onward.
In the 45-54 age range and the 40-79 g / day alcohol consumption group, the number of cases rose with tobacco use. In the 120+ g / day alcohol consumption group, the case rates are high and are distributed independently of tobacco use.
Tobacco use does not appear to have a serious impact on the number of cases, because more cases were generally observed in groups using less tobacco.
By contrast, the number of cases increases markedly as alcohol use increases, and likewise as age increases.

Generalized Linear Model

ANOVA Test

esoph$percentage <- esoph$ncases / (esoph$ncontrols+esoph$ncases) #The new column is created to show the cancer percentages
model <- lm(percentage ~ agegp + tobgp + alcgp, data = esoph) #Linear model is created in order to apply anova test
anova(model)

## Analysis of Variance Table
## 
## Response: percentage
##           Df  Sum Sq  Mean Sq F value    Pr(>F)    
## agegp      5 1.23742 0.247484 19.4982 1.918e-12 ***
## tobgp      3 0.03361 0.011205  0.8828     0.454    
## alcgp      3 0.86514 0.288381 22.7203 1.333e-10 ***
## Residuals 76 0.96464 0.012693                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the ANOVA results, age and alcohol groups have a strong effect, while tobacco groups do not make a significant difference (p-value of about 0.45).
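anova() on an lm object reports sequential sums of squares, so each term is tested after the terms listed before it in the formula. As a cross-check, the sketch below uses drop1() to test each term with the other two already in the model. The object names are illustrative, and the data frame is rebuilt here only to keep the example self-contained.

```r
# Proportion of cases per row, as used for the linear model above.
dat <- transform(esoph, percentage = ncases / (ncases + ncontrols))
fit <- lm(percentage ~ agegp + tobgp + alcgp, data = dat)

# F test for dropping each term while keeping the other two in the model.
marginal_tests <- drop1(fit, test = "F")
marginal_tests
```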

Akaike’s Information Criterion

To build a better model, we can decide whether we should remove tobacco groups from our model by looking at the AIC values.

AIC(glm(percentage ~ agegp + tobgp + alcgp, data = esoph, family = binomial(link = "logit"))) #With tobacco groups

## [1] 69.62955

AIC(glm(percentage ~ agegp + alcgp, data = esoph, family = binomial(link = "logit"))) #Without tobacco groups

## [1] 63.43588

AIC (Akaike’s Information Criterion) allows us to compare models with different distributions and different numbers of parameters. The best-fitting model is the one with the smallest AIC value. Removing the tobacco groups lowers the AIC from 69.63 to 63.44, so we proceed without tobacco groups in the model.
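Fitting a binomial glm() to a bare proportion, as above, makes R warn about non-integer successes because the group sizes are not supplied. A common alternative, sketched here on the assumption that the counts themselves should drive the fit, passes cases and controls as a two-column response. Its AIC values will differ from those reported above, so it serves as a cross-check rather than a replacement; the object names are illustrative.

```r
# Grouped binomial fit: successes = cases, failures = controls.
with_tobacco <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
                    data = esoph, family = binomial(link = "logit"))
without_tobacco <- glm(cbind(ncases, ncontrols) ~ agegp + alcgp,
                       data = esoph, family = binomial(link = "logit"))

# A smaller AIC indicates the preferred model.
aic_comparison <- AIC(with_tobacco, without_tobacco)
aic_comparison
```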

Logistic Regression

model <- glm(percentage ~ agegp + alcgp, data = esoph, family = binomial(link = "logit")) #Logistic regression
summary(model)

## 
## Call:
## glm(formula = percentage ~ agegp + alcgp, family = binomial(link = "logit"), 
##     data = esoph)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8486  -0.2650  -0.1225   0.1672   1.1824  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.69863    0.37820  -4.491 7.08e-06 ***
## agegp.L      2.45976    1.06194   2.316   0.0205 *  
## agegp.Q     -0.94555    0.95610  -0.989   0.3227    
## agegp.C     -0.01623    0.90430  -0.018   0.9857    
## agegp^4      0.44297    0.81404   0.544   0.5863    
## agegp^5     -0.17601    0.65825  -0.267   0.7892    
## alcgp.L      1.41043    0.64271   2.195   0.0282 *  
## alcgp.Q     -0.09357    0.60333  -0.155   0.8768    
## alcgp.C      0.16453    0.56916   0.289   0.7725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23.2729  on 87  degrees of freedom
## Residual deviance:  8.5657  on 79  degrees of freedom
## AIC: 63.436
## 
## Number of Fisher Scoring iterations: 6

The output above summarizes the logistic regression.
The linear contrasts for age (agegp.L) and alcohol (alcgp.L) have p-values below 5 percent, so we can proceed with this model.
We can predict the cancer percentages with this model and then draw the error histogram by comparing the predictions with the actual values.
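The coefficient names ending in .L, .Q and .C appear because agegp and alcgp are ordered factors, for which R uses orthogonal polynomial contrasts by default. As a short illustration, the contrast matrix for a four-level ordered factor such as alcgp can be printed as follows (poly_contrasts is an illustrative name):

```r
# Polynomial contrasts for a 4-level ordered factor: linear, quadratic, cubic.
poly_contrasts <- contr.poly(4)
rownames(poly_contrasts) <- levels(esoph$alcgp)
round(poly_contrasts, 3)
```

The .L column increases steadily across the levels, so a positive alcgp.L estimate corresponds to log-odds that rise with alcohol consumption.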

Predicted Cancer Risk Percentages among Alcohol and Age Groups

predict_cancer_percentages <- data.frame()
for (i in 1:6) {
  for (j in 1:4) {
    predict_cancer_percentages[i,j] <- plogis(predict(model, data.frame(agegp = unique(esoph$agegp)[i], alcgp = unique(esoph$alcgp)[j]))) #Prediction
  }
}
pivot_longer(predict_cancer_percentages, cols=everything(), names_to = "Alcohol_Consumption", values_to = "Cancer_Percentage") %>%
  add_column(.before="Alcohol_Consumption", Age_Group = c(rep("25-34",4),rep("35-44",4),rep("45-54",4),rep("55-64",4),rep("65-74",4),rep("75+",4))) %>%
  ggplot(.,aes(x=Age_Group, y=Cancer_Percentage, fill = Alcohol_Consumption)) +
  geom_bar(stat = "identity", position = "dodge") + 
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_discrete(name = "Alcohol Consumption (gm/day)", labels = c("0-39", "40-79", "80-119", "120+")) +
  labs(title = "Predicted Cancer Risk Percentages among Alcohol and Age Groups", subtitle = "Prediction Based on Logistic Regression", x = "Age Groups", y = "Predicted Percentage")

The chart above shows the estimated cancer percentages by age and alcohol group; the table below lists the values.
Higher alcohol consumption is associated with a higher predicted risk in every age group, and the predicted risk also rises with age.
The risk is very low in the 25-34 and 35-44 age groups.
From age 45 onward, the risk increases significantly.
The risk levels in the 55-64 and 65-74 age groups are very similar.
The highest predicted risks are in the 75+ age group.
With low alcohol consumption, the predicted risk is at most 15 percent (in the 75+ group). With the highest alcohol consumption, it rises to as much as 56 percent. Lower alcohol consumption is therefore associated with a markedly lower predicted risk of cancer.

predict_cancer_percentages <- data.frame(row.names = c("25-34 years","35-44 years","45-54 years","55-64 years","65-74 years","75+ years"))
for (i in 1:6) {
  for (j in 1:4) {
    predict_cancer_percentages[i,j] <- paste(round(100*(plogis(predict(model, data.frame(agegp = unique(esoph$agegp)[i], alcgp = unique(esoph$alcgp)[j])))),0),"%",sep="")
  }
}
colnames(predict_cancer_percentages) <- c("0-39 gm/day","40-79 gm/day","80-119 gm/day","120+ gm/day")
kable(predict_cancer_percentages, caption = "Predicted Cancer Percentages corresp. to Age and Alcohol Groups")

Predicted Cancer Percentages Corresponding to Age and Alcohol Groups

Age group     0-39 gm/day   40-79 gm/day   80-119 gm/day   120+ gm/day
25-34 years   1%            2%             3%              7%
35-44 years   2%            5%             7%              14%
45-54 years   9%            19%            26%             41%
55-64 years   12%           25%            34%             50%
65-74 years   13%           26%            34%             51%
75+ years     15%           30%            40%             56%
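The nested loops used for the predictions can be replaced by a single predict() call over a grid of all age and alcohol combinations. The sketch below refits the same model to stay self-contained; dat, fit, grid and predicted_table are illustrative names, and the warning about non-integer successes is suppressed on the assumption that the proportion-based fit is kept as is.

```r
# Refit the model from the section above (proportion response).
dat <- transform(esoph, percentage = ncases / (ncases + ncontrols))
fit <- suppressWarnings(
  glm(percentage ~ agegp + alcgp, data = dat, family = binomial(link = "logit"))
)

# All age x alcohol combinations, keeping the ordered-factor class of both columns.
grid <- expand.grid(agegp = sort(unique(dat$agegp)),
                    alcgp = sort(unique(dat$alcgp)))

# type = "response" returns probabilities directly, so plogis() is not needed.
grid$predicted <- predict(fit, newdata = grid, type = "response")

# Rows = age groups, columns = alcohol groups, values in percent.
predicted_table <- round(100 * xtabs(predicted ~ agegp + alcgp, data = grid))
predicted_table
```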

Leave-one-out Cross Validation

pred_length <- nrow(esoph)
fit_glm_error <- numeric(pred_length)
fit_glm_sq_error <- numeric(pred_length)
for (i in seq_len(pred_length)) {
  fit_glm <- glm(percentage ~ agegp + alcgp, family = binomial(link = "logit"), data = esoph[-i,]) #Leave-one-out Cross Validation
  # predict() returns the linear predictor (log-odds) by default; it is squared here
  fit_glm_pred <- (predict(fit_glm, esoph[i,]))^2
  fit_glm_error[i] <- esoph$percentage[i] - fit_glm_pred
  fit_glm_sq_error[i] <- (esoph$percentage[i] - fit_glm_pred)^2
}
# hist() takes its title via `main`, and xlim as a two-element vector
hist(fit_glm_error, breaks = 50, xlim = c(-50, 50), main = "Histogram of Errors", xlab = "Fitted GLM Errors")

When the histogram of the errors is drawn, the errors look approximately normally distributed. A few errors exceed 30 in magnitude. Apart from those, the errors are mostly close to zero, which is what we want from a good model. Note that predict() returns the linear predictor (log-odds) by default and the code squares it, so these errors are not on the 0-1 scale of the observed proportions. To evaluate the model further, the RMSE is calculated as follows:

rmse_fit_glm <- sqrt(mean(fit_glm_sq_error))
rmse_fit_glm #Root Mean Square Error

## [1] 42.17921
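The loop above compares observed proportions with squared link-scale predictions. A variant that keeps both sides on the 0-1 proportion scale is sketched below, assuming type = "response" is the intended prediction scale. The names dat, loo_error and rmse_response are illustrative, and the resulting RMSE is not comparable with the value printed above.

```r
# Leave-one-out cross validation on the probability (response) scale.
dat <- transform(esoph, percentage = ncases / (ncases + ncontrols))

loo_error <- vapply(seq_len(nrow(dat)), function(i) {
  # Fit without row i; warnings about non-integer successes are suppressed.
  fit <- suppressWarnings(
    glm(percentage ~ agegp + alcgp, family = binomial(link = "logit"), data = dat[-i, ])
  )
  # Predicted proportion for the held-out row.
  pred <- predict(fit, newdata = dat[i, ], type = "response")
  unname(dat$percentage[i] - pred)
}, numeric(1))

rmse_response <- sqrt(mean(loo_error^2))
rmse_response
```

Since both the observed and the predicted values lie between 0 and 1, every error in this version is bounded by 1 in magnitude.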


Second Part of the Assignment: `Young People Survey`

The second part of the assignment reviews another dataset, the Young People Survey.

Assignment 3: Part 1: Esoph

Can Aytöre / 2019702009

13/09/2020
