PS: Updated on 9/15/2020.
The topic of this study is a clear and unfortunately very common thing, right? Spam mails… You probably have more than one email account because (at least) one of them is just for subscriptions that may forward spam to you. That is one way to cope with spam mails; however, I don’t think it’s a good fight. I’ve found a cleverer way to fight spam, or at least to make it a more enjoyable process. Watch the video below, it’s pretty short :)
Quite funny, right? I really loved it :)
Spam and promotional emails are two separate things. The word “spam” in the mail context means “unsolicited bulk mail”. Unsolicited is the key word there: the recipient has not granted verifiable permission for the message to be sent. Bulk means that the message is sent as part of a larger collection of messages, all having substantively identical content.
Service providers like Google also try to detect spam algorithmically. In this study, I will practice classification trees, trying to predict spam mails in the given example dataset.
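Before diving in, here is a minimal setup sketch, since the original loading chunk isn’t shown. The file name spam.csv is my assumption; any data frame with the columns listed below would work the same way.
library(dplyr)       # glimpse(), filter(), select(), mutate()
library(purrr)       # cross_df(), pmap(), map_dbl() for the parameter grid
library(rpart)       # classification trees
library(rpart.plot)  # prp() for drawing the tree
library(caret)       # confusionMatrix()
library(ggplot2)     # feature-importance plot

data <- read.csv("spam.csv")  # assumed local copy of the spambase data, incl. a testid column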
glimpse(data)
## Rows: 4,601
## Columns: 59
## $ spam <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ testid <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
## $ make <dbl> 0.00, 0.21, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15, 0.06…
## $ address <dbl> 0.64, 0.28, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.12…
## $ all <dbl> 0.64, 0.50, 0.71, 0.00, 0.00, 0.00, 0.00, 0.00, 0.46, 0.77…
## $ X3d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ our <dbl> 0.32, 0.14, 1.23, 0.63, 0.63, 1.85, 1.92, 1.88, 0.61, 0.19…
## $ over <dbl> 0.00, 0.28, 0.19, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.32…
## $ remove <dbl> 0.00, 0.21, 0.19, 0.31, 0.31, 0.00, 0.00, 0.00, 0.30, 0.38…
## $ internet <dbl> 0.00, 0.07, 0.12, 0.63, 0.63, 1.85, 0.00, 1.88, 0.00, 0.00…
## $ order <dbl> 0.00, 0.00, 0.64, 0.31, 0.31, 0.00, 0.00, 0.00, 0.92, 0.06…
## $ mail <dbl> 0.00, 0.94, 0.25, 0.63, 0.63, 0.00, 0.64, 0.00, 0.76, 0.00…
## $ receive <dbl> 0.00, 0.21, 0.38, 0.31, 0.31, 0.00, 0.96, 0.00, 0.76, 0.00…
## $ will <dbl> 0.64, 0.79, 0.45, 0.31, 0.31, 0.00, 1.28, 0.00, 0.92, 0.64…
## $ people <dbl> 0.00, 0.65, 0.12, 0.31, 0.31, 0.00, 0.00, 0.00, 0.00, 0.25…
## $ report <dbl> 0.00, 0.21, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ addresses <dbl> 0.00, 0.14, 1.75, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.12…
## $ free <dbl> 0.32, 0.14, 0.06, 0.31, 0.31, 0.00, 0.96, 0.00, 0.00, 0.00…
## $ business <dbl> 0.00, 0.07, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ email <dbl> 1.29, 0.28, 1.03, 0.00, 0.00, 0.00, 0.32, 0.00, 0.15, 0.12…
## $ you <dbl> 1.93, 3.47, 1.36, 3.18, 3.18, 0.00, 3.85, 0.00, 1.23, 1.67…
## $ credit <dbl> 0.00, 0.00, 0.32, 0.00, 0.00, 0.00, 0.00, 0.00, 3.53, 0.06…
## $ your <dbl> 0.96, 1.59, 0.51, 0.31, 0.31, 0.00, 0.64, 0.00, 2.00, 0.71…
## $ font <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X000 <dbl> 0.00, 0.43, 1.16, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.19…
## $ money <dbl> 0.00, 0.43, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15, 0.00…
## $ hp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hpl <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ george <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X650 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ lab <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ labs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ telnet <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X857 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ data <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15, 0.00…
## $ X415 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X85 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ technology <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ X1999 <dbl> 0.00, 0.07, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ parts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ direct <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ cs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ meeting <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ original <dbl> 0.00, 0.00, 0.12, 0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.00…
## $ project <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.06…
## $ re <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ edu <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ table <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ conference <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ch. <dbl> 0.000, 0.000, 0.010, 0.000, 0.000, 0.000, 0.000, 0.000, 0.…
## $ ch..1 <dbl> 0.000, 0.132, 0.143, 0.137, 0.135, 0.223, 0.054, 0.206, 0.…
## $ ch..2 <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.…
## $ ch..3 <dbl> 0.778, 0.372, 0.276, 0.137, 0.135, 0.000, 0.164, 0.000, 0.…
## $ ch..4 <dbl> 0.000, 0.180, 0.184, 0.000, 0.000, 0.000, 0.054, 0.000, 0.…
## $ ch..5 <dbl> 0.000, 0.048, 0.010, 0.000, 0.000, 0.000, 0.000, 0.000, 0.…
## $ crl.ave <dbl> 3.756, 5.114, 9.821, 3.537, 3.537, 3.000, 1.671, 2.450, 9.…
## $ crl.long <int> 61, 101, 485, 40, 40, 15, 4, 11, 445, 43, 6, 11, 61, 7, 24…
## $ crl.tot <int> 278, 1028, 2259, 191, 191, 54, 112, 49, 1257, 749, 21, 184…
As seen above, the dataset includes:

- 48 continuous real attributes of word frequencies (per the original spambase documentation, the percentage of words in the email matching the given word),
- 6 continuous real attributes of character frequencies,
- 1 continuous real attribute for the average length of uninterrupted sequences of capital letters,
- 1 continuous integer attribute for the length of the longest uninterrupted sequence of capital letters,
- 1 continuous integer attribute for the total length of uninterrupted sequences of capital letters.
colnames(data)
## [1] "spam" "testid" "make" "address" "all"
## [6] "X3d" "our" "over" "remove" "internet"
## [11] "order" "mail" "receive" "will" "people"
## [16] "report" "addresses" "free" "business" "email"
## [21] "you" "credit" "your" "font" "X000"
## [26] "money" "hp" "hpl" "george" "X650"
## [31] "lab" "labs" "telnet" "X857" "data"
## [36] "X415" "X85" "technology" "X1999" "parts"
## [41] "pm" "direct" "cs" "meeting" "original"
## [46] "project" "re" "edu" "table" "conference"
## [51] "ch." "ch..1" "ch..2" "ch..3" "ch..4"
## [56] "ch..5" "crl.ave" "crl.long" "crl.tot"
There are also two columns provided for the train/test split (testid) and the target (spam or not). There seem to be no missing values in these columns.
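That claim is easy to verify (a one-liner sketch on the loaded data):
sum(is.na(data))  # 0 means no missing values anywhere in the frame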
Let’s see the distribution of the target variable, which indicates whether an email is spam or not.
print(paste0("Percentage: ", round((nrow(filter(data, spam == TRUE))/nrow(data)) * 100, 2), "%"))
## [1] "Percentage: 39.4%"
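Equivalently, table() gives the counts and proportions in one shot:
table(data$spam)              # counts of FALSE/TRUE
prop.table(table(data$spam))  # proportions; TRUE is ~0.394, as printed above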
The ratio is quite close to a balanced dataset. In classification tasks, imbalance is a very important problem that can be attacked from different perspectives with different methods; some of the well-known ones are random oversampling, undersampling, and SMOTE. But I think we don’t need them here, given the dataset size and target distribution.
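Just for illustration (we won’t use it below), a naive random-oversampling sketch could look like this; the object names are mine:
set.seed(42)                            # illustration only, not used in the rest of the post
minority <- filter(data, spam == TRUE)  # the smaller class (~39%)
majority <- filter(data, spam == FALSE)
extra <- minority[sample(nrow(minority), nrow(majority) - nrow(minority), replace = TRUE), ]
balanced <- bind_rows(data, extra)      # duplicate minority rows until classes are ~50/50
prop.table(table(balanced$spam))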
data$spam <- as.factor(data$spam)       # rpart wants a factor target for classification
train <- subset(data, testid == FALSE)  # use the given split indicator
test <- subset(data, testid == TRUE)
train <- train[, -2]                    # drop the testid column (2nd column) from both sets
test <- test[, -2]
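A quick sanity check on the split sizes (the counts below are derived from the outputs elsewhere in this post):
nrow(train)  # 3065
nrow(test)   # 1536, matching the confusion-matrix total later on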
print(paste0("Train > Percentage: ", round((nrow(filter(train, spam == TRUE))/nrow(data)) * 100, 2), "%"))
## [1] "Train > Percentage: 26.47%"
print(paste0("Test > Percentage: ", round((nrow(filter(test, spam == TRUE))/nrow(data)) * 100, 2), "%"))
## [1] "Test > Percentage: 12.93%"
After splitting the data into train and test sets, the two percentages above look quite different, but note that both use the full dataset as the denominator, so they mostly reflect the relative sizes of the two sets rather than the spam rate within each one. Measured against each set’s own size, the spam rate is actually similar in both, around 39% (compare the Prevalence of 0.3874 in the confusion matrix below). Had the given split skewed the target distribution, I would re-split stratifying on the target to get close ratios in both sets; however, the train-test split is already provided in this task, so it’s used as is.
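To see the within-set rates directly, divide by each set’s own size instead (a small sketch; the approximate values are derived from the counts reported in this post):
round(mean(train$spam == "TRUE") * 100, 2)  # ~39.74: spam share inside train
round(mean(test$spam == "TRUE") * 100, 2)   # ~38.74: spam share inside test (cf. Prevalence below)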
As an introduction, let’s quickly fit a decision tree with maxdepth = 2, without playing with any other parameters. We fit on the train data and then compute accuracy on the test set to report performance.
train_features <- train %>% select(-spam)
train_labels <- train$spam
test_features <- test %>% select(-spam)
test_labels <- test$spam
compute_accuracy <- function(fit, test_features, test_labels) {
  # Predict classes on the held-out features; return the share of correct predictions
  predicted <- predict(fit, test_features, type = "class")
  mean(predicted == test_labels)
}
reg_tree <- rpart(spam~., data = train, control = rpart.control(maxdepth = 2))
compute_accuracy(reg_tree, test_features, test_labels)
## [1] 0.859375
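For context, a trivial classifier that always predicts the majority class (not spam) would score the so-called no-information rate:
mean(test_labels == "FALSE")  # ~0.6126, the NIR reported in the confusion matrix later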
Compared to that baseline, the accuracy is already not bad. Now let’s try different control parameters to go further; we can test the parameter settings over a grid in a loop.
gs <- list(minsplit = c(2, 5, 10, 20, 30),
           maxdepth = c(2, 5, 10, 20, 30)) %>%
  cross_df()  # all 25 parameter combinations as a data frame grid

# Fit one tree per row of the grid, forwarding the parameters to rpart.control
mod <- function(...) {
  rpart(spam ~ ., data = train, control = rpart.control(...))
}
gs <- gs %>% mutate(fit = pmap(gs, mod))

# Score every fitted tree on the held-out test set
gs <- gs %>%
  mutate(test_accuracy = map_dbl(fit, compute_accuracy,
                                 test_features, test_labels))
print.data.frame(gs[, c("minsplit", "maxdepth", "test_accuracy")])
## minsplit maxdepth test_accuracy
## 1 2 2 0.8593750
## 2 5 2 0.8593750
## 3 10 2 0.8593750
## 4 20 2 0.8593750
## 5 30 2 0.8593750
## 6 2 5 0.8984375
## 7 5 5 0.8984375
## 8 10 5 0.8984375
## 9 20 5 0.8984375
## 10 30 5 0.8984375
## 11 2 10 0.8977865
## 12 5 10 0.8977865
## 13 10 10 0.8977865
## 14 20 10 0.8977865
## 15 30 10 0.8977865
## 16 2 20 0.8977865
## 17 5 20 0.8977865
## 18 10 20 0.8977865
## 19 20 20 0.8977865
## 20 30 20 0.8977865
## 21 2 30 0.8977865
## 22 5 30 0.8977865
## 23 10 30 0.8977865
## 24 20 30 0.8977865
## 25 30 30 0.8977865
It seems that minsplit = 2 and maxdepth = 10 make the classifier quite strong, right? (Strictly speaking, maxdepth = 5 scores marginally higher, and minsplit has no effect in this grid, but the differences are tiny.) Let’s construct the tree with these settings now and then browse it to investigate the nodes.
reg_tree2 <- rpart(spam~., data = train, control = rpart.control(minsplit=2, maxdepth=10))
prp(reg_tree2)
The tree is simpler than I expected; I was anticipating a more complicated diagram than this. The frequency of the character “$” is the first split used to identify spam in this tree. It makes sense, right? That character is strongly associated with fraudulent behaviour.
PS: In the column names, ch..4 corresponds to “$” and ch..3 to “!”.
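As an aside, instead of (or on top of) grid-searching minsplit and maxdepth, rpart also supports cost-complexity pruning with built-in cross-validation. This is not part of the workflow above, just a sketch of the more canonical way to size a tree:
full_tree <- rpart(spam ~ ., data = train)  # default control; CV results land in the cp table
printcp(full_tree)                          # cross-validated error for each complexity value
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned <- prune(full_tree, cp = best_cp)    # cut the tree back at the best cp
compute_accuracy(pruned, test_features, test_labels)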
Let’s see the importance of the features in classifying the mails. We already have some intuition about it but let’s compare feature importances via ggplot.
df <- data.frame(imp = reg_tree2$variable.importance)  # importance scores from the tuned tree
df2 <- df %>%
  tibble::rownames_to_column() %>%
  dplyr::rename("variable" = rowname) %>%
  dplyr::arrange(imp) %>%
  dplyr::mutate(variable = forcats::fct_inorder(variable))  # preserve sorted order on the axis
ggplot2::ggplot(df2) +
geom_col(aes(x = variable, y = imp),
col = "black", show.legend = F) +
coord_flip() +
scale_fill_grey() +
theme_bw()
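If you prefer numbers to bars, the raw scores behind the plot are available directly:
head(sort(reg_tree2$variable.importance, decreasing = TRUE), 10)  # top 10 features by importance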
“money” and “credit” ranking high are very much expected, right? ch..5 here corresponds to the symbol “#”.
We already know that accuracy is around 90% with this setting, but let’s report the accuracy and the full confusion matrix.
confusionMatrix(predict(reg_tree2, test_features, type = "class"), test$spam, positive = 'TRUE')
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 878 94
## TRUE 63 501
##
## Accuracy : 0.8978
## 95% CI : (0.8815, 0.9125)
## No Information Rate : 0.6126
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7826
##
## Mcnemar's Test P-Value : 0.01665
##
## Sensitivity : 0.8420
## Specificity : 0.9330
## Pos Pred Value : 0.8883
## Neg Pred Value : 0.9033
## Prevalence : 0.3874
## Detection Rate : 0.3262
## Detection Prevalence : 0.3672
## Balanced Accuracy : 0.8875
##
## 'Positive' Class : TRUE
##
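To connect the numbers, the headline rates fall straight out of the matrix, with TRUE as the positive class:
tp <- 501; fn <- 94; fp <- 63; tn <- 878  # read off the matrix above
tp / (tp + fn)  # Sensitivity (recall): 0.8420
tn / (tn + fp)  # Specificity: 0.9330
tp / (tp + fp)  # Pos Pred Value: 0.8883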
It was really fun!