Preprocess and Libraries

The libraries and data preprocessing steps are below. The code chunk downloads and loads the data and creates column names for it.

# Note: dplyr, ggplot2, readr, and tidyr are also attached by tidyverse
library(dplyr)
library(ggplot2)
library(tidyverse)
library(corrplot)
library(readr)
library(tidyr)
library(GGally)
library(caret)

# Download and unzip the data set if it is not already present locally
if (!file.exists("C:/Users/erenm/Downloads/spambase.zip")) {
  download.file(url = "http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.zip",
                destfile = "C:/Users/erenm/Downloads/spambase.zip")

  unzip("C:/Users/erenm/Downloads/spambase.zip", exdir = "C:/Users/erenm/Downloads")
}


# Read the raw data; the file ships without a header row
data_raw <- read.csv("C:/Users/erenm/Downloads/spambase.data", header = FALSE)


# Parse the variable names from the .names file: drop the documentation
# header lines, then split each "name: type" line at the colon
data_raw_names <- read.delim("C:/Users/erenm/Downloads/spambase.names", header = FALSE)
data_raw_names <- data_raw_names[-(1:30),]
data_raw_names <- as.data.frame(data_raw_names)
data_raw_names <- data_raw_names %>%
  separate(data_raw_names, c("Variable", "Type"), sep = ":")
names(data_raw) <- data_raw_names$Variable
# The target column has no entry in the .names file, so its name is NA
names(data_raw)[is.na(names(data_raw))] <- "classes"

data <- data_raw

Introduction

The aim of this document is to explain the process of predicting email spam. The Spambase data can be found in the UCI Machine Learning Repository. The data set has 58 columns (57 word and character frequency features plus the class label) covering 4601 emails.

Correlation and Classes

Below you can find a correlation matrix of the most correlated variables. If we were to use a logistic regression model, we could try adding interaction terms based on these variables.

# Correlation matrix of a block of related word-frequency columns
corrplot(cor(data[c(28:32,34:36,40)]))
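As a minimal sketch (not the model used later in this document), such an interaction could enter a logistic regression as shown below. The pair word_freq_857 and word_freq_415 is picked from the correlated block purely for illustration.

# Illustrative only: logistic regression with a single interaction term
# between two of the highly correlated word-frequency variables
glm_fit <- glm(classes ~ word_freq_857 * word_freq_415,
               data = data, family = binomial)
summary(glm_fit)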

The classes below look a bit imbalanced, but not severely so; if needed, we could adjust the classification threshold.

table(data$classes)
## 
##    0    1 
## 2788 1813
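To quantify the imbalance, we can look at the proportions directly:

# Roughly 61% non-spam (0) versus 39% spam (1)
prop.table(table(data$classes))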

The code chunk below sets our target variable as a factor and creates the train and test sets used for prediction.

# Convert the target to a factor so caret treats this as classification
data$classes <- as.factor(data$classes)

# Stratified 80/20 train/test split
set.seed(125)
partition <- createDataPartition(data$classes, p = 0.8, list = FALSE)
train <- data[partition, ]
test <- data[-partition, ]
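Because createDataPartition samples within each level of the outcome, the split is stratified, so train and test keep roughly the 61/39 class ratio. A quick check:

# Both should be close to the overall class proportions
prop.table(table(train$classes))
prop.table(table(test$classes))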

Modeling

The code chunk below builds random forest, GBM, and XGBoost models and presents summary statistics for each. (Despite the "Rfe" label, the first model is a random forest fit with ranger.) As the plots and statistics show, GBM has the most symmetric resampling distribution, but XGBoost's distribution also looks fine and its accuracy is slightly higher. I would continue with XGBoost to improve the modeling process.

# 5-fold cross-validation, repeated twice
control <- trainControl(method="repeatedcv", number=5, repeats=2)
# train the random forest model (method "ranger")
set.seed(125)
modelRfe <- train(classes~., data=train, method="ranger", trControl=control)
# train the GBM model
set.seed(125)
modelGbm <- train(classes~., data=train, method="gbm", trControl=control, verbose=FALSE)
# train the XGBoost model
set.seed(125)
modelXgb <- train(classes~., data=train, method="xgbTree", trControl=control)

results <- resamples(list(RFE=modelRfe, GBM=modelGbm, XGB=modelXgb))
# summarize the distributions
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RFE, GBM, XGB 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RFE 0.9470828 0.9514925 0.9565217 0.9550545 0.9588995 0.9605978    0
## GBM 0.9415761 0.9460651 0.9483696 0.9478539 0.9497794 0.9525102    0
## XGB 0.9497965 0.9524457 0.9551944 0.9553247 0.9575555 0.9619565    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RFE 0.8887978 0.8982346 0.9086130 0.9056811 0.9138683 0.9173392    0
## GBM 0.8760865 0.8869178 0.8911595 0.8902672 0.8944308 0.9000817    0
## XGB 0.8946284 0.8998123 0.9060960 0.9063985 0.9114642 0.9205257    0
# boxplots of results
bwplot(results)

# dot plots of results
dotplot(results)
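Since I would continue with XGBoost, one straightforward next step in caret is an explicit tuning grid instead of the default one. A minimal sketch; the grid values below are illustrative assumptions, not tuned results:

# Illustrative grid over xgbTree's tuning parameters; values are assumptions
xgbGrid <- expand.grid(nrounds = c(100, 200),
                       max_depth = c(3, 6),
                       eta = c(0.05, 0.1),
                       gamma = 0,
                       colsample_bytree = 0.8,
                       min_child_weight = 1,
                       subsample = 0.8)
set.seed(125)
modelXgbTuned <- train(classes~., data=train, method="xgbTree",
                       trControl=control, tuneGrid=xgbGrid)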

Model Evaluation

We can see that accuracy is good and that precision (Pos Pred Value) and recall (Sensitivity) are at similarly high levels. The class imbalance did not turn into a problem.

# Predict on the held-out test set and evaluate against the true classes
pred <- predict(modelXgb, test)

confusionMatrix(pred, test$classes)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 540  27
##          1  17 335
##                                          
##                Accuracy : 0.9521         
##                  95% CI : (0.9363, 0.965)
##     No Information Rate : 0.6061         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8992         
##                                          
##  Mcnemar's Test P-Value : 0.1748         
##                                          
##             Sensitivity : 0.9695         
##             Specificity : 0.9254         
##          Pos Pred Value : 0.9524         
##          Neg Pred Value : 0.9517         
##              Prevalence : 0.6061         
##          Detection Rate : 0.5876         
##    Detection Prevalence : 0.6170         
##       Balanced Accuracy : 0.9474         
##                                          
##        'Positive' Class : 0              
##
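If the imbalance had been an issue, the threshold mentioned earlier could be adjusted by predicting class probabilities instead of labels. A minimal sketch, with 0.4 as an arbitrary illustrative cutoff for the spam class:

# type = "prob" returns per-class probabilities; columns are named after
# the factor levels ("0" and "1")
probs <- predict(modelXgb, test, type = "prob")
# A cutoff below 0.5 flags more emails as spam (class "1")
pred_custom <- factor(ifelse(probs[["1"]] > 0.4, "1", "0"), levels = c("0", "1"))
confusionMatrix(pred_custom, test$classes)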