First Dataset of Assignment 3: esoph

Introduction

Required Libraries

To examine this dataset, the required libraries are loaded first. The dataset ships with base R under the name esoph, so there is no need to read it from a file.

Overview of Dataset

  • The data come from a case-control study of esophageal cancer in France.
  • Data were collected from 1175 patients with various age / alcohol / tobacco use combinations.
  • Of the 1175 patients, 200 are cancer “cases”; the remaining 975 non-cases were sampled from comparable hospital populations.
  • Data frame with records for 88 age / alcohol / tobacco combinations.
## Rows: 88
## Columns: 5
## $ agegp     <ord> 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 2...
## $ alcgp     <ord> 0-39g/day, 0-39g/day, 0-39g/day, 0-39g/day, 40-79, 40-79,...
## $ tobgp     <ord> 0-9g/day, 10-19, 20-29, 30+, 0-9g/day, 10-19, 20-29, 30+,...
## $ ncases    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...
## $ ncontrols <dbl> 40, 10, 6, 5, 27, 7, 4, 7, 2, 1, 2, 1, 1, 1, 2, 60, 14, 7...
##    agegp          alcgp         tobgp        ncases         ncontrols    
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   :11.08  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00  
##  75+  :11                                Max.   :17.000   Max.   :60.00
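
The overview above can be reproduced with a few base R calls. This is a minimal sketch: the original output looks like it came from a glimpse-style function, and str() is used here as a base R stand-in. Exact figures may differ from the output shown depending on the R version, since the bundled data may have been revised.

```r
# esoph ships with the 'datasets' package, which R attaches by default.
data(esoph)

# Compact structure overview (a base R stand-in for a glimpse-style printout).
str(esoph)

# Per-column summary: level counts for the factors, quantiles for the counts.
summary(esoph)

# Total number of cases and controls across all rows.
totals <- colSums(esoph[, c("ncases", "ncontrols")])
print(totals)
```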

Objectives

  • Exploring the dataset esoph, which ships with base R.
  • Visualising the relationship between case occurrence and age / alcohol / tobacco profile.
  • Identifying the groups at risk via useful analyses and graphs.
  • Building a well-developed generalized linear model.
  • Predicting cancer percentages among the groups.
  • Testing the robustness of the model via leave-one-out cross validation.

Analyses and Visualizations

Overview

If the data set is small, sometimes a boxplot may not be very accurate, as the quartiles are not well estimated from the data and may give a falsely inflated or deflated figure. In those cases, plotting the raw data may be more desirable. This can be done using a strip chart.

Stripchart of Age Groups based on Number of Cases

  • We can say that age groups have an effect on the number of cases.
  • While there is at most 1 case in the 25-34 age group, that risk increases with age.
  • Counts of 5 or more cases were seen only in the 45-74 age range.
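
A strip chart like the one described above can be drawn with base graphics. The sketch below is one possible version; the jitter amount, labels, and plotting symbol are illustrative choices, not necessarily those used for the original figure.

```r
data(esoph)

# One point per row of the data frame, grouped by age group.
# Jitter spreads out rows that share the same number of cases.
stripchart(ncases ~ agegp, data = esoph,
           method = "jitter", jitter = 0.15,
           vertical = TRUE, pch = 16,
           xlab = "Age group", ylab = "Number of cases",
           main = "Cases by age group")

# Largest case count observed in any single row, per age group.
max_by_age <- tapply(esoph$ncases, esoph$agegp, max)
print(max_by_age)
```

Replacing agegp with tobgp or alcgp in the formula gives the corresponding charts for tobacco and alcohol.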

Stripchart of Tobacco Consumption based on Number of Cases

  • It cannot be said that tobacco use has a serious impact on the number of cases.
  • As tobacco use increased, the number of cases did not increase.
  • The highest number of cases was seen in the group with the least tobacco use.

Stripchart of Alcohol Consumption based on Number of Cases

  • We can say that alcohol use has an effect on the number of cases.
  • Counts above 5 cases were seen only in the groups consuming more than 40 g of alcohol per day.

We cannot get all inferences about the dataset just by looking at the stripcharts. More detailed analysis continues below.

Cancer Case Distribution of Alcohol Consumption by Age Group

  • Considering alcohol consumption together with age range allows significant inferences about cancer cases.
  • In all age groups, the groups that consume less alcohol have a smaller share of cases, while the groups that consume the most alcohol generally have the largest share.
  • In the 25-34 age range, cases were observed only in the group consuming more than 120 grams of alcohol per day.
  • Those who consume more than 80 grams of alcohol per day account for a significant share of cases in all age groups.
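
The shares discussed above can be computed by cross-tabulating cases and normalizing within each age group. The following is a sketch with base R; the original chart may have been produced with a different plotting package.

```r
data(esoph)

# Cases cross-tabulated by alcohol group (rows) and age group (columns).
case_tab <- xtabs(ncases ~ alcgp + agegp, data = esoph)

# Share of each alcohol group within every age group (columns sum to 1).
case_share <- prop.table(case_tab, margin = 2)
print(round(100 * case_share))

# Stacked bars: one bar per age group, segments are alcohol groups.
barplot(case_share, legend.text = TRUE,
        xlab = "Age group", ylab = "Share of cases",
        main = "Case distribution of alcohol consumption by age group")
```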

Cancer Case Distribution of Tobacco Consumption by Age Group

  • Groups that consume less tobacco do not have a smaller share in any age group, while groups that consume the most tobacco generally have the smallest share. That’s why we cannot make significant inferences about cancer cases by considering tobacco consumption together with age range.
  • In the 25-34 age range, cases were observed only in the group consuming 10-19 grams of tobacco per day.
  • Those who consume 19 grams of tobacco per day or less account for a significant share of cases in all age groups.
  • Judging by the stacked bar chart, heavy tobacco consumption does not appear to pose a serious risk.

Cancer Case Distribution of Tobacco Consumption by Alcohol Consumption

  • Within each alcohol consumption group, the percentage of cases was lowest in the groups with the highest tobacco consumption (12 to 22 percent).
  • Moreover, the percentage of cases was highest in the groups with the lowest tobacco consumption (31 to 45 percent).
  • Based on this graph, tobacco use does not appear to pose a serious risk.
  • Among those who consumed the least alcohol, the highest risk was the group consuming 10-19 grams of tobacco per day.
  • Among those who consumed the most alcohol, the highest risk was the group consuming 0-9 grams of tobacco per day.

Heatmap of Cancer Case Distribution

  • Heatmap of cancer case distribution allows us to clearly see the highest and lowest risk groups.
  • The risk is the lowest in the 25-34 age group, regardless of the amount of alcohol and tobacco use.
  • Likewise, in the 35-44 age group, the risk can be said to be around 5 percent.
  • However, the risk increases significantly in groups over the age of 45.
  • The highest risk is in the 65-74 age group with around 25 percent.
  • Tobacco or alcohol use does not show a serious risk difference between age groups. It can be said that the most important criterion is the age range. A generalized linear model will be fitted to support this.
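
A heatmap of this kind needs a risk value per cell. The sketch below assumes that risk is defined as ncases / (ncases + ncontrols); this definition is an assumption, since the source does not show how the percentages were computed. Age and alcohol group are used as the two axes for illustration.

```r
data(esoph)

# Cases and totals per age group (rows) and alcohol group (columns).
cases <- with(esoph, tapply(ncases, list(agegp, alcgp), sum))
total <- with(esoph, tapply(ncases + ncontrols, list(agegp, alcgp), sum))

# Assumed risk definition: cases / (cases + controls).
risk <- cases / total
print(round(100 * risk))

# Base-graphics heatmap: x axis = age group, y axis = alcohol group.
image(x = seq_len(nrow(risk)), y = seq_len(ncol(risk)), z = risk,
      axes = FALSE, xlab = "Age group", ylab = "Alcohol group",
      main = "Share of cases per cell")
axis(1, at = seq_len(nrow(risk)), labels = rownames(risk))
axis(2, at = seq_len(ncol(risk)), labels = colnames(risk))
```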

Jitterplot of Cancer Cases by Alcohol and Age Groups

  • The number of cases is at the lowest levels in the age range of 25-34.
  • The highest number of cases occurred in the group consuming 40-79 grams of alcohol per day (blue points) in the 65-74 age range.
  • In all age groups, the lowest number of cases was generally observed in the groups with the least alcohol consumption (green points).

Mosaicplot of Cancer Case Distribution

  • The mosaic plot shows all the values together. Age groups run from left to right, alcohol consumption from bottom to top, and tobacco consumption within each block; cancer cases are shown in the dark color and controls in the light color.
  • The only case in the 25-34 age range occurred in the highest alcohol consumption group.
  • Few cases were seen between the ages of 35-44 (mostly in cases of high alcohol consumption).
  • There is a serious increase in the number of cases from age 45 onward.
  • In the 45-54 age range and the 40-79 g / day alcohol consumption group, an increase in the number of cases was observed with the increase in tobacco use. In the 120+ g / day alcohol consumption group, the case rates are high and they are distributed independently of tobacco use.
  • We cannot say that tobacco use has a serious impact on the number of cases, because more cases were generally observed in groups using less tobacco.
  • However, as alcohol use increases, there is a serious increase in the number of cases. Likewise, the number of cases increases with age.
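
A mosaic plot of cases and controls can be built from a contingency table. The sketch below uses age and alcohol group only to keep the plot readable; adding tobgp to the formula would give the three-factor layout described above. The title and axis-label orientation are illustrative choices.

```r
data(esoph)

# Three-way table: age group x alcohol group x (cases, controls).
# A matrix on the left-hand side of xtabs() adds one dimension per column.
status_tab <- xtabs(cbind(ncases, ncontrols) ~ agegp + alcgp, data = esoph)

# Tile areas are proportional to the counts in each cell.
mosaicplot(status_tab, color = TRUE, las = 1,
           main = "Cases and controls by age and alcohol group")
```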

Generalized Linear Model

ANOVA Test

## Analysis of Variance Table
## 
## Response: percentage
##           Df  Sum Sq  Mean Sq F value    Pr(>F)    
## agegp      5 1.23742 0.247484 19.4982 1.918e-12 ***
## tobgp      3 0.03361 0.011205  0.8828     0.454    
## alcgp      3 0.86514 0.288381 22.7203 1.333e-10 ***
## Residuals 76 0.96464 0.012693                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the ANOVA results, age group and alcohol group have a statistically significant effect (p < 0.001 for both), whereas tobacco group does not make a significant difference (p = 0.454).
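
A table of this form can be obtained by fitting a linear model and passing it to anova(). In the sketch below, the response percentage is assumed to be ncases / (ncases + ncontrols); the source does not show how it was defined, so the F values may not match the table above exactly.

```r
data(esoph)

# Assumed definition of the response (not shown in the source).
d <- transform(esoph, percentage = ncases / (ncases + ncontrols))

# Linear model with the three grouping factors entered in the same order
# as in the table above; anova() reports sequential (Type I) sums of squares.
fit_lm <- lm(percentage ~ agegp + tobgp + alcgp, data = d)
aov_tab <- anova(fit_lm)
print(aov_tab)
```

Because the sums of squares are sequential, changing the order of the terms in the formula can change the individual F values.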

Akaike’s Information Criterion

To build a better model, we can decide whether we should remove tobacco groups from our model by looking at the AIC values.

## [1] 69.62955
## [1] 63.43588

AIC (Akaike’s Information Criterion) allows us to compare models with different distributions and different numbers of parameters. The best-fitting model is the one with the smallest AIC value. When we remove the tobacco group from our model, the AIC drops from 69.63 to 63.44. Therefore, we will proceed without tobacco groups in our model.
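
The comparison can be sketched as two binomial fits that differ only in the tobacco term. The response definition is again an assumption, and R warns about non-integer successes when a proportion is passed to a binomial glm without weights; the warning is suppressed here only to keep the output short.

```r
data(esoph)

# Assumed definition of the response (not shown in the source).
d <- transform(esoph, percentage = ncases / (ncases + ncontrols))

# Model with and without the tobacco group.
fit_with_tob <- suppressWarnings(
  glm(percentage ~ agegp + alcgp + tobgp, family = binomial, data = d))
fit_without_tob <- suppressWarnings(
  glm(percentage ~ agegp + alcgp, family = binomial, data = d))

# Smaller AIC indicates the preferred model.
aic_values <- c(with_tobacco = AIC(fit_with_tob),
                without_tobacco = AIC(fit_without_tob))
print(aic_values)
```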

Logistic Regression

## 
## Call:
## glm(formula = percentage ~ agegp + alcgp, family = binomial(link = "logit"), 
##     data = esoph)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8486  -0.2650  -0.1225   0.1672   1.1824  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.69863    0.37820  -4.491 7.08e-06 ***
## agegp.L      2.45976    1.06194   2.316   0.0205 *  
## agegp.Q     -0.94555    0.95610  -0.989   0.3227    
## agegp.C     -0.01623    0.90430  -0.018   0.9857    
## agegp^4      0.44297    0.81404   0.544   0.5863    
## agegp^5     -0.17601    0.65825  -0.267   0.7892    
## alcgp.L      1.41043    0.64271   2.195   0.0282 *  
## alcgp.Q     -0.09357    0.60333  -0.155   0.8768    
## alcgp.C      0.16453    0.56916   0.289   0.7725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23.2729  on 87  degrees of freedom
## Residual deviance:  8.5657  on 79  degrees of freedom
## AIC: 63.436
## 
## Number of Fisher Scoring iterations: 6
  • The summary of the logistic regression is shown above.
  • The linear contrasts for age (agegp.L, p = 0.0205) and alcohol (alcgp.L, p = 0.0282) are significant at the 5 percent level, so we can proceed with this model.
  • We can predict the cancer percentages with this model and then draw the error histogram by comparing the predictions with the actual values.

Predicted Cancer Risk Percentages among Alcohol and Age Groups

  • Above, you can graphically see the estimated cancer percentages by age and alcohol groups. You can see the percentage values in the table below.
  • The increase in alcohol consumption has increased the risk level in all age groups. Likewise, risk levels increased with increasing age.
  • The risk is very low in the 25-34 and 35-44 age groups.
  • From age 45 onward, the risk increases significantly.
  • The risk levels in the 55-64 and 65-74 age groups are very similar.
  • The highest risks were observed in the 75+ age group.
  • In case of low consumption of alcohol, there is a risk of at most 15 percent (group over 75 years old). In case of excessive consumption of alcohol, this risk increases up to 56 percent. Therefore, reducing alcohol consumption significantly reduces the risk of cancer.

Predicted Cancer Percentages by Age and Alcohol Group

Age group      0-39 g/day   40-79 g/day   80-119 g/day   120+ g/day
25-34 years        1%            2%             3%            7%
35-44 years        2%            5%             7%           14%
45-54 years        9%           19%            26%           41%
55-64 years       12%           25%            34%           50%
65-74 years       13%           26%            34%           51%
75+ years         15%           30%            40%           56%
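
A prediction table like this can be produced by fitting the model and predicting on every age / alcohol combination. The sketch below assumes percentage = ncases / (ncases + ncontrols), so the numbers may differ from those reported here.

```r
data(esoph)

# Assumed definition of the response (not shown in the source).
d <- transform(esoph, percentage = ncases / (ncases + ncontrols))

# Logistic regression on age and alcohol group, as in the summary above.
fit <- suppressWarnings(
  glm(percentage ~ agegp + alcgp, family = binomial(link = "logit"), data = d))

# One row per observed age / alcohol combination; taking the rows from the
# data keeps the ordered-factor coding the model was fitted with.
grid <- unique(d[, c("agegp", "alcgp")])
grid$predicted <- predict(fit, newdata = grid, type = "response")

# Reshape into an age x alcohol table of predicted percentages.
pred_table <- with(grid, tapply(predicted, list(agegp, alcgp), mean))
print(round(100 * pred_table))
```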

Leave-one-out Cross Validation

The histogram of the errors appears approximately normally distributed. A few errors are larger than 30. Apart from these, most errors are close to zero, which suggests that the model fits well. To evaluate the model further, the RMSE is calculated as follows:

## [1] 42.17921
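
Leave-one-out cross validation refits the model once per row, each time predicting the row that was held out. The sketch below is one way to do this; the response definition and the percentage-point error scale are assumptions, and the scale of the RMSE reported above is not stated in the source, so the value computed here may not match it.

```r
data(esoph)

# Assumed definition of the response (not shown in the source).
d <- transform(esoph, percentage = ncases / (ncases + ncontrols))

n <- nrow(d)
loo_pred <- numeric(n)

for (i in seq_len(n)) {
  # Fit on all rows except row i.
  fit_i <- suppressWarnings(
    glm(percentage ~ agegp + alcgp, family = binomial, data = d[-i, ]))
  # Predict the held-out row on the probability scale.
  loo_pred[i] <- predict(fit_i, newdata = d[i, ], type = "response")
}

# Errors in percentage points (assumed scale).
errors <- 100 * (d$percentage - loo_pred)
hist(errors, xlab = "Prediction error (percentage points)",
     main = "Leave-one-out prediction errors")

# Root mean squared error of the held-out predictions.
rmse <- sqrt(mean(errors^2))
print(rmse)
```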