Algorithm Optimization & Selection for Spam Detection

Matheus Schmitz
LinkedIn

Github Portfolio

Dataset

Source:
http://archive.ics.uci.edu/ml/datasets/Spambase

Creators:
Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304

Donor:
George Forman (gforman at nospam hpl.hp.com) 650-857-7835

Imports

1. Loading the Dataset
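A minimal loading sketch, assuming the raw spambase.data file from the UCI repository (the exact file URL and column handling are assumptions, not the notebook's actual code):

```python
import pandas as pd

# Spambase ships without a header row: 57 feature columns plus a final 0/1 spam label
DATA_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
df = pd.read_csv(DATA_URL, header=None)

X = df.iloc[:, :-1]   # word/char frequencies and capital-run-length statistics
y = df.iloc[:, -1]    # 1 = spam, 0 = not spam
print(X.shape, y.value_counts(normalize=True), sep="\n")
```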

1.1. Train-Test Split
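A sketch of the split, assuming an 80/20 stratified hold-out (the exact test size and random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# Stratify on the label so the spam/ham ratio is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```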

1.2. Exploratory Analysis

Considering only the training data so as to leave testing data untouched

The skewness in this boxplot indicates the need to include a scaling step (e.g., MinMaxScaler) in the model training pipeline.

I don't see strong evidence of correlation among the variables, which would be the main motivation for using PCA, SVD, or another dimensionality reduction method, so I'll skip those and perhaps instead use a classifier with some dimensionality reduction built in, such as Linear Discriminant Analysis.
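For reference, one quick way to eyeball pairwise correlations on the training data only; seaborn is assumed here, and the notebook's actual plot may differ:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation between features, computed on the training split only
corr = X_train.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0, cbar_kws={"label": "Pearson r"})
plt.title("Feature correlations (training data)")
plt.show()
```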

Some of these features have a nearly linearly separable boundary, while others don't, which indicates the need to test algorithms with different assumptions about the data.

One thing is clear though: there are a lot more blue dots than red dots, so SMOTE could be of use here.
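If SMOTE is brought in, a minimal sketch could look like the following, applied to the training data only (the `imblearn` dependency and the resampled variable names are assumptions):

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training data only, never the test data
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print(y_train.value_counts(), y_train_res.value_counts(), sep="\n\n")
```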

Those boxplots are just more evidence of the need for scaling the variables.

2. Building the Classifier

Given the skewness seen in the boxplots, I'll include a scaling step (StandardScaler) in the pipeline for every classifier to be tested.

Additionally, before deciding on a final classifier, I'll first optimize each candidate using a GridSearch approach.
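A sketch of the shared pipeline structure, using the Support Vector Classifier as an example (the step names and estimator defaults are illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Every candidate shares the same two-step structure: scale first, then classify
svc_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(probability=True, random_state=42)),
])
```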

2.1. Defining Grid Searches to Optimize the Classifiers

Support Vector Classifier

Random Forest Classifier

KNN Classifier

Gaussian Process Classifier

Linear Discriminant Analysis
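As an illustration, hypothetical search spaces for a few of these classifiers might look like the grids below; the actual grids in the notebook may differ. Keys use the `clf__` prefix because each estimator sits inside the scaling pipeline.

```python
# Hypothetical hyperparameter grids; keys target the "clf" step of each pipeline
param_grids = {
    "SVC": {
        "clf__C": [0.1, 1, 10, 100],
        "clf__kernel": ["linear", "rbf"],
        "clf__gamma": ["scale", "auto"],
    },
    "RandomForest": {
        "clf__n_estimators": [100, 300, 500],
        "clf__max_depth": [None, 10, 30],
        "clf__min_samples_split": [2, 5, 10],
    },
    "KNN": {
        "clf__n_neighbors": [3, 5, 7, 11],
        "clf__weights": ["uniform", "distance"],
    },
}
```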

2.2. Finding Optimal Hyperparameters with Grid Search & K-Fold Cross-Validation

Considering the validation data, the Random Forest Classifier is ahead, but none of the other algorithms are too far behind! Let me now use the best hyperparameters from each K-Fold iteration to build a final version of each model to be evaluated on the test dataset, so that I can select the best model.
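A sketch of the search for one model (Random Forest shown; the fold count, scoring metric, and grid values are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])
rf_grid = {"clf__n_estimators": [100, 300, 500], "clf__max_depth": [None, 10, 30]}

# Exhaustive search over the grid, scored with stratified 5-fold cross-validation
search = GridSearchCV(rf_pipeline, rf_grid, cv=cv, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, f"CV accuracy: {search.best_score_:.4f}")
```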

2.3. Comparing Results of the Optimized Classifiers & Selecting the Best

Among the optimized models there were many ties on the training set, but when looking at the testing set, the Random Forest Classifier is the clear winner!
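Assuming the fitted searches are kept in a dict (call it `best_models`, a name invented here for illustration), the comparison could be as simple as:

```python
from sklearn.metrics import accuracy_score

# best_models: {"SVC": fitted_search, "RandomForest": fitted_search, ...}
for name, model in best_models.items():
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name:>25}  train acc = {train_acc:.4f}  test acc = {test_acc:.4f}")
```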

But before I move ahead to calculating the performance metrics for our winning algorithm, let's introduce one final challenger: TPOT's AutoML.

Bonus: AutoML

Attempting to improve on the performance of the best classifier by using TPOT's AutoML tool to search for an ideal classifier pipeline.

TPOT AutoML Documentation: http://epistasislab.github.io/tpot/
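A minimal TPOT run could look like this; the search budget (generations, population size) is an assumption, and the real run likely uses larger values:

```python
from tpot import TPOTClassifier

# Genetic-programming search over preprocessing + model pipelines
tpot = TPOTClassifier(generations=5, population_size=50, cv=5, scoring="accuracy",
                      random_state=42, verbosity=2, n_jobs=-1)
tpot.fit(X_train, y_train)

print(f"Test accuracy: {tpot.score(X_test, y_test):.4f}")
tpot.export("tpot_spam_pipeline.py")  # exports the best found pipeline as a Python script
```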

Looks like TPOT proved no match for our Random Forest Classifier, so let's move ahead with extracting performance metrics for our champion.

3. Evaluating the Results

3.1. K-Fold Cross-Validation for the Best Model

Now that the best model has been identified, I'll rerun the K-Fold cross-validation without SMOTE, so that more representative FPR, FNR, and error values can be obtained.

Note: It is possible to obtain all the metrics (error, FPR, FNR) while comparing all optimized models.

I've chosen to first compare all models (2.3) and then later (3.1) retrain the winning model and calculate the metrics only for that model, as I believe this approach makes the step-by-step flow of the pipeline I've implemented clearer.
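A sketch of that re-run, assuming the winning pipeline is stored in a variable called `best_pipeline` (a hypothetical name) and that the data are pandas objects:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
errors, fprs, fnrs = [], [], []

for train_idx, val_idx in cv.split(X_train, y_train):
    model = clone(best_pipeline)  # best_pipeline: winning scaler + Random Forest pipeline
    model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    preds = model.predict(X_train.iloc[val_idx])

    tn, fp, fn, tp = confusion_matrix(y_train.iloc[val_idx], preds).ravel()
    errors.append((fp + fn) / (tn + fp + fn + tp))
    fprs.append(fp / (fp + tn))
    fnrs.append(fn / (fn + tp))

print(f"Error: {np.mean(errors):.4f}  FPR: {np.mean(fprs):.4f}  FNR: {np.mean(fnrs):.4f}")
```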

3.2. Confusion Matrix and ROC-AUC Curve for the Best Model
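A sketch of both plots using scikit-learn's display helpers, again assuming a fitted `best_pipeline`:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
ConfusionMatrixDisplay.from_estimator(best_pipeline, X_test, y_test, ax=axes[0])
RocCurveDisplay.from_estimator(best_pipeline, X_test, y_test, ax=axes[1])
axes[0].set_title("Confusion Matrix (test set)")
axes[1].set_title("ROC Curve (test set)")
plt.tight_layout()
plt.show()
```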

End

Matheus Schmitz
LinkedIn

Github Portfolio