Exploring the NSL-KDD dataset from the Canadian Institute for Cybersecurity. (https://www.unb.ca/cic/datasets/nsl.html)
Using the older 1999 dataset since it's smaller and allows for faster iterations.
Download source I used: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Using this dataset to train an Autoencoder for Hack Detection.
References: [1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” Submitted to Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.
# Imports
from tqdm import tqdm
import random
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt
import datetime
import tensorflow.keras.backend as K
from mpl_toolkits import mplot3d
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler, OneHotEncoder
from tensorflow.keras.layers import Input, Dense, LSTM, TimeDistributed, RepeatVector, AlphaDropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.models import Model
from scipy.spatial.distance import euclidean
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, f1_score
from sklearn.metrics import precision_score, recall_score, accuracy_score
pd.options.mode.chained_assignment = None
%reload_ext watermark
%watermark -v --iv
# File paths
path_train = 'https://raw.githubusercontent.com/Matheus-Schmitz/Autoencoder_Hack_Detection/master/data/KDDTrain.csv'
path_test = 'https://raw.githubusercontent.com/Matheus-Schmitz/Autoencoder_Hack_Detection/master/data/KDDTest.csv'
path_colnames = 'https://raw.githubusercontent.com/Matheus-Schmitz/Autoencoder_Hack_Detection/master/data/kddcup-columns-names.txt'
# Load train dataset
df_train = pd.read_csv(path_train, header = None)
df_train.head()
# Load column names from txt file
col_names = pd.read_csv(path_colnames, header = None)
col_names = col_names.squeeze()
col_names
# Clean column names
col_names_cleaned = [i.split(':')[0] for i in col_names]
col_names_cleaned
# Add a column for the traffic type (the target variable)
col_names_cleaned.extend(['result'])
# Set the column names on the training dataset
df_train.columns = col_names_cleaned
df_train.head()
# Data types
df_train.dtypes
# Check all possible traffic classifications
# 'normal' is benign traffic, all alternatives are anomalous activity
df_train.result.unique()
# Check all network service classifications
df_train.service.unique()
There are many network service types for many purposes, and those can have confounding attributes when it comes to identifying malicious/anomalous activity.
I'll keep only http services and focus on web activity.
# Extracting only samples with http service
df_train_http = df_train[df_train['service'] == 'http']
df_train_http.head()
# Check amount of normal and anomalous samples
normal_samples = df_train_http[df_train_http['result'] == 'normal'].shape[0]
anomalous_samples = df_train_http[df_train_http['result'] != 'normal'].shape[0]
print(f'Share of Normal Samples: {100*(normal_samples/len(df_train_http)):.2f} %')
print(f'Share of Anomalous Samples: {100*(anomalous_samples/len(df_train_http)):.2f} %')
The dataset is highly unbalanced. One alternative would be using SMOTE to balance it, but here I'll take another route which I was taught can work well in such scenarios:
I'll create a single-class autoencoder, which learns the 'signature' of a class and then classifies a test sample as being or not being part of that class based on its distance (in standard deviations) from the class signature. This is done with SELUs (a 'self-normalizing' variation of ReLUs).
This model will essentially return the probability of a test sample being or not being 'normal' traffic.
# Extracting only the samples with normal activity
df_train_http_normal = df_train_http[df_train_http['result'] == 'normal']
df_train_http_normal.shape
# Describe
df_train_http_normal.describe()
Since the model works based on standard deviations, some columns will have to be dropped, namely the string columns and those with a standard deviation of zero.
Note: This means dropping the target variable (result) too, since this will be unsupervised learning.
# List of all columns to remove
zero_std_cols = []
# Columns with string values
mask = (df_train_http_normal.dtypes == 'object').values
string_cols = df_train_http_normal.columns[mask]
zero_std_cols.extend(string_cols)
zero_std_cols
# Columns with standard deviation of zero
mask = df_train_http_normal.std(axis = 0, numeric_only = True) == 0
no_std_cols = mask.index[mask.values == True]
zero_std_cols.extend(no_std_cols)
zero_std_cols
# Total columns
len(df_train_http_normal.columns)
# Columns to be dropped
len(zero_std_cols)
# Removing columns with a standard deviation of zero
df_train_std = df_train_http_normal.drop(zero_std_cols, axis = 1)
# Describe
df_train_std.describe()
# Boxplot
df_train_std.boxplot(figsize = (35, 10))
The 'dst_bytes' column has a much greater range than the others, and the remaining columns also have widely varying ranges. This clearly calls for scaling the dataset.
Here Standardizing (mean = 0, std = 1) is much preferred over Normalizing to [0, 1], due to the approach being used.
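For reference, a minimal sketch of the two transformations being compared (illustrative only; the actual scaling below uses sklearn's StandardScaler, which standardizes with the population std):
# What the two scalers do, per column (using 'dst_bytes' as an example)
col = df_train_std['dst_bytes']
standardized = (col - col.mean()) / col.std(ddof = 0)       # mean 0, std 1 (StandardScaler)
normalized = (col - col.min()) / (col.max() - col.min())    # range [0, 1] (MinMaxScaler)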
# Standardizing data
scaler = StandardScaler()
df_train_scaled = pd.DataFrame(scaler.fit_transform(df_train_std),
columns = df_train_std.columns)
df_train_scaled.head()
# Correlation plot
plt.figure(figsize = (15, 10))
sns.heatmap(df_train_scaled.corr(), cmap = 'viridis')
plt.show()
Although most features are not correlated, there are a few pairs of features with high correlation. Moreover, the dataset has 28 dimensions, a bit too many to train a model efficiently.
Hence I'll employ PCA to reduce the feature space while simultaneously removing the correlations present in the dataset.
In line with the Pareto principle I'll aim to keep 80% of the variation present in the original feature space (in hopes of needing only ~20% of the features).
# Reduce dataset dimensionality with PCA
# Aiming to keep 80% of the explained variance present in the original dataset
pca = PCA(n_components = 0.8)
pca.fit(df_train_scaled)
# Extracting the Principal Components resultant from PCA
pca_cols = ['PCA_' + str(i) for i in range(pca.n_components_)]
df_train_pca = pd.DataFrame(pca.transform(df_train_scaled), columns = pca_cols)
df_train_pca.head()
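As a quick sanity check on the 80% variance target (optional; uses the pca object fitted above):
# Number of components kept and the total variance they explain
print(f'Components kept: {pca.n_components_}')
print(f'Explained variance kept: {pca.explained_variance_ratio_.sum():.4f}')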
From the original Task description:
"The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.
A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes."
In other words, this is a time series dataset!
Therefore... (drums beating) ... we are going with LSTMs!
Source: https://kdd.ics.uci.edu/databases/kddcup99/task.html
Because LSTMs expect 3D input of shape (samples, timesteps, features), the dataset will need to be transformed. Using a function to extract 2D windows (timesteps x features), which are then stacked into a 3D array.
After some testing, a window_size of 10 seems to be a good value. Also using a stride of 10, so there is no overlap between windows.
# Function to get windows
def get_windows(df, window_size = 10, stride = 10):
windows_arr = []
for i in tqdm(range(0, len(df)-stride, stride)):
windows_arr.append(df.iloc[i:i+window_size, :].to_numpy())
return np.array(windows_arr)
# Get windows
window_size, stride = 10, 10
windows_arr = get_windows(df_train_pca, window_size = window_size, stride = stride)
df_train_pca.shape
# 3804 samples (38049 rows divided by stride of 10, keeping only the integer part)
# Each sample contains 10 timesteps (from window_size)
# Each timestep contains 14 features (from pca.n_components_)
windows_arr.shape
# Shuffle the samples to avoid biasing the model during training
indices = np.arange(windows_arr.shape[0])
np.random.shuffle(indices)
windows_shuffled = windows_arr[indices]
The activation function used will be SELU (Scaled Exponential Linear Unit) instead of the usual ReLU.
From TF: https://www.tensorflow.org/api_docs/python/tf/keras/activations/selu
TF recommends using AlphaDropout along with SELU: https://www.tensorflow.org/api_docs/python/tf/keras/layers/AlphaDropout
The SELU activation auto-normalizes the neural network.
The reasoning behind this choice is that, assuming the network's weights are initialized from a normal distribution, SELU keeps the activations approximately gaussian as they propagate through the network (forward/backward steps), which in turn implies the network's output will also be approximately normally distributed.
This architecture choice supposedly results in better performance for this type of problem (unbalanced classes on an autoencoder).
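For reference, SELU is defined by two fixed constants; a minimal NumPy sketch of the activation as defined in Klambauer et al. (2017) and used by tf.keras.activations.selu:
# SELU: scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise
selu_alpha = 1.6732632423543772
selu_scale = 1.0507009873554805
def selu_reference(x):
    return selu_scale * np.where(x > 0, x, selu_alpha * (np.exp(x) - 1.0))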
# Clean Keras session
K.clear_session()
# Encoder with Stacked LSTM
encoder = Sequential([LSTM(80, return_sequences = True,
activation = 'selu',
input_shape = (window_size, pca.n_components_)),
AlphaDropout(rate = 0.2),
LSTM(50, activation = 'selu', return_sequences = True),
LSTM(20, activation = 'selu')],
name = 'encoder')
The last LSTM in the encoder does not return sequences, which compresses each window into a single latent vector; hence the decoder needs to start with a RepeatVector of window_size to get the data back to the shape of windows_arr.
TF RepeatVector: https://www.tensorflow.org/api_docs/python/tf/keras/layers/RepeatVector
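To make the encoder/decoder symmetry explicit, here is the expected shape of the data as it flows through the encoder above and the decoder defined just below (batch dimension omitted; 10 and 14 come from window_size and pca.n_components_):
# Input window                                -> (10, 14)
# LSTM(80, return_sequences = True)           -> (10, 80)
# AlphaDropout(0.2)                           -> (10, 80)  shape unchanged
# LSTM(50, return_sequences = True)           -> (10, 50)
# LSTM(20)                                    -> (20,)     latent vector, sequence compressed
# RepeatVector(window_size)                   -> (10, 20)
# LSTM(50, return_sequences = True)           -> (10, 50)
# LSTM(80, return_sequences = True)           -> (10, 80)
# TimeDistributed(Dense(pca.n_components_))   -> (10, 14)  reconstruction, same shape as the input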
# Decoder with output dimension equals to input dimension
decoder = Sequential([RepeatVector(window_size),
LSTM(50, activation = 'selu', return_sequences = True),
LSTM(80, activation = 'selu', return_sequences = True),
TimeDistributed(Dense(pca.n_components_, activation = 'linear'))],
name = 'decoder')
# Sequential Autoencoder
autoencoder = Sequential([encoder, decoder], name = 'autoencoder')
When to use Huber loss:
Huber loss combines properties of MSE and MAE: it is quadratic for small errors and linear for large ones, so outliers are penalized less heavily than with MSE. When outliers should not be given a high weight, Huber is a good choice.
https://medium.com/@gobiviswaml/huber-error-loss-functions-3f2ac015cd45
Wikipedia: https://en.wikipedia.org/wiki/Huber_loss
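A minimal NumPy sketch of the definition, for reference (delta is the point where the loss switches from quadratic to linear; the compile call below passes delta = 100):
def huber_reference(error, delta = 1.0):
    # 0.5 * e^2 for |e| <= delta, delta * (|e| - 0.5 * delta) otherwise
    abs_error = np.abs(error)
    return np.where(abs_error <= delta, 0.5 * abs_error**2, delta * (abs_error - 0.5 * delta))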
# Compile the autoencoder with Huber loss and the Adam optimizer
autoencoder.compile(optimizer = 'adam', loss = tf.keras.losses.Huber(100.))
# Model summary
autoencoder.summary()
# Encoder and Decoder summary
encoder.summary(), decoder.summary()
# Model checkpoint (save model whenever validation loss improves)
now = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M")
check_point = tf.keras.callbacks.ModelCheckpoint(f'models/autoencoder_{now}.h5',
monitor = 'val_loss',
save_best_only = True,
mode = 'min',
verbose = 1)
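Note: the checkpoint path above assumes a local models/ directory. If it doesn't exist yet, creating it first avoids a write error when the callback tries to save (a small precaution, not part of the original pipeline):
import os
os.makedirs('models', exist_ok = True)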
%%time
# Note: test y as windows_shuffled[:, :, ::-1]
# This would be a flipped autoencoder
# Train model
train_hist = autoencoder.fit(windows_shuffled,
windows_shuffled,
batch_size = 64,
validation_split = 0.2,
epochs = 100,
callbacks = [check_point])
# Load best model
autoencoder_loaded = tf.keras.models.load_model(f'models/autoencoder_{now}.h5')
# Load test dataset
df_test = pd.read_csv(path_test, header = None, names = col_names_cleaned)
df_test.head(3)
# Get only samples with service == http
df_test_http = df_test[df_test['service'] == 'http']
A sample is anomalous if its result is != normal. A time window is anomalous if it contains at least one sample whose result is != normal.
# Binary label to represent anomalies
status = pd.Series([0 if i == 'normal' else 1 for i in df_test_http['result']])
test_labels = [1 if np.sum(status[i:i+window_size])>0 else 0 for i in range(0, len(status)-stride, stride)]
print(f'Test samples: {len(status)}')
print(f'Test windows: {len(test_labels)}')
# Removing columns with a standard deviation of zero
df_test_std = df_test_http.drop(zero_std_cols, axis = 1)
# Scale test dataframe
df_test_scaled = pd.DataFrame(scaler.transform(df_test_std),
columns = df_test_std.columns)
# PCA
df_test_pca = pd.DataFrame(pca.transform(df_test_scaled),
columns = pca_cols)
df_test_pca.head(3)
# Create time windows
test_windows = get_windows(df_test_pca, window_size = window_size, stride = stride)
test_windows.shape
# Predict
test_windows_pred = autoencoder_loaded.predict(test_windows)
The reconstruction error is used as a metric to predict the probability of a sample being anomalous.
The reasoning is that the autoencoder was trained only on normal data. Thus, during inference (prediction), if the model receives an anomalous sample, the encoder will compress it to a latent representation similar to that of normal samples (because it was trained to compress only normal samples).
Therefore this latent representation loses the information related to the sample's anomalous characteristics, and the decoder reconstructs it as if it were a normal sample. When the input and the output are compared there will be a large error, as the reconstruction will look nothing like the anomalous input. This large reconstruction error is what signals that a sample is anomalous.
After calculating the reconstruction errors for all samples in the test dataset, they can be scaled to [0, 1], which is known as the Anomaly Score. A threshold can then be defined on this Anomaly Score so that all samples above it are considered anomalous.
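As a compact summary of that decision rule, a hypothetical helper (the function name, the fitted MinMaxScaler and the threshold are assumptions here; the notebook computes the same thing step by step below):
def is_anomalous(window, model, score_scaler, threshold):
    # Reconstruction error: mean euclidean distance between the window and its reconstruction
    reconstruction = model.predict(window[np.newaxis, ...])[0]
    error = np.linalg.norm(window - reconstruction, axis = 1).mean()
    # Rescale the error to [0, 1] with an already-fitted MinMaxScaler and compare to the threshold
    score = score_scaler.transform([[error]])[0, 0]
    return score > threshold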
# Reconstruction error for each sample (window)
# The reconstruction error I'm using is the euclidean distance between the y_true and y_pred tensors
# Euclidean distance = the hypotenuse from the Pythagorean theorem, aka the shortest distance between two points
# Since this involves lots of matrix (tensor) math, it's best done in TensorFlow
# List with errors
recon_errors = []
# Condition to continue the loop:
# While i is less than iters, continue the loop, once i gets to the value of iters, then the loop ends
def cond(y_true, y_pred, i, iters):
return tf.less(i, iters)
def body(y_true, y_pred, i, iters):
# Tensor
# First index is the sample (window index),
# second is the timestep within the window, and third is the PCA components (14 components per timestep)
# Here I'm getting the difference (tf.math.subtract) between two entire windows (tf.slice), namely the y_true and y_pred windows
tensor_for_error = tf.math.subtract(tf.slice(y_true, [i, 0, 0], [1, -1, -1]),
tf.slice(y_pred, [i, 0, 0], [1, -1, -1]))
# Reshape
# Get the error back to the format of the batch used (number of samples in a timewindow x attributes per sample)
tensor_for_error = tf.reshape(tensor_for_error, [window_size, pca.n_components_])
# Reconstruction error
# Measured as the distance between the y_true and y_pred tensors
# The error will be the mean euclidean distance between each of the pca.n_components_ of y_true and y_pred
recon_error = tf.math.reduce_mean(tf.norm(tensor_for_error, ord = 'euclidean', axis = 1))
# Append error to the list
recon_errors.append(recon_error.numpy())
return [y_true, y_pred, tf.add(i, 1), iters]
Note on tf.norm(ord = 'euclidean'):
It treats each row of the tensor as a point in an n-dimensional space (one dimension per feature) and returns the length of the vector from the origin to that point. Kinda like measuring a point's distance from the origin on a 3D graph, except here it's a 14D graph.
From wikipedia: https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
In mathematics, a norm is a function from a vector space over the real or complex numbers to the nonnegative real numbers that satisfies certain properties pertaining to scalability and additivity, and takes the value zero only if the input vector is zero. A pseudonorm or seminorm satisfies the same properties, except that it may have a zero value for some nonzero vectors.
The Euclidean norm or 2-norm is a specific norm on a Euclidean vector space, that is strongly related to the Euclidean distance, and equals the square root of the inner product of a vector with itself.
A vector space on which a norm is defined is called a normed vector space. Similarly, a vector space with a seminorm is called a seminormed vector space.
On the n-dimensional Euclidean space ℝⁿ, the intuitive notion of length of the vector x = (x1, x2, ..., xn) is captured by the formula
$$\lVert \mathbf{x} \rVert_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
This is the Euclidean norm, which gives the ordinary distance from the origin to the point x, a consequence of the Pythagorean theorem. This operation may also be referred to as "SRSS", which is an acronym for the square root of the sum of squares.
# Iterations (which is the number of test_windows)
iters = tf.constant(len(test_windows))
# Loop
result = tf.while_loop(cond, body, [tf.constant(test_windows.astype(np.float32)),
tf.constant(test_windows_pred.astype(np.float32)), 0, iters])
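Optionally, the same reconstruction errors can be computed in a single vectorized NumPy step, which serves as a sanity check against the loop above (should match recon_errors up to float precision):
# Mean euclidean norm of the per-timestep difference, one value per window
recon_errors_np = np.linalg.norm(test_windows - test_windows_pred, axis = 2).mean(axis = 1)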
# Minmax scaler
mm_scaler = MinMaxScaler()
# Reshape
recon_errors = np.array(recon_errors).reshape(-1, 1)
# Apply scaler
anomaly_predictions = mm_scaler.fit_transform(recon_errors).flatten()
# Plot
plt.figure(figsize = (20, 10))
plt.plot(test_labels, c = 'blue', label = 'Original')
plt.plot(anomaly_predictions, c = 'red', label = 'Predicted')
plt.yticks(np.arange(0, 1.1, 0.1))
plt.xlabel('Samples')
plt.ylabel('Anomaly Scores')
plt.grid()
plt.legend()
plt.show()
# Putting y_true and y_pred into a dataframe so that it can be ordered
df_compare = pd.DataFrame(data = {'y_true': test_labels,
'y_pred': anomaly_predictions})
# Sort dataframe so the plot is more legible
df_compare = df_compare.sort_values(['y_pred']).reset_index(drop=True)
# Plot
plt.figure(figsize = (20, 10))
plt.plot(df_compare.y_true, c = 'blue', label = 'Original', marker='.', linestyle='None')
plt.plot(df_compare.y_pred, c = 'red', label = 'Predicted')
plt.yticks(np.arange(0, 1.1, 0.1))
plt.xlabel('Samples')
plt.ylabel('Anomaly Scores')
plt.title('Probability of Sample Being Anomalous', fontsize = 24)
plt.grid()
plt.legend()
plt.show()
# ROC Curve
fpr, tpr, thresholds = roc_curve(test_labels, anomaly_predictions)
# AUC Score
auc = roc_auc_score(test_labels, anomaly_predictions)
print(f'AUC Score: {auc}')
# Plot ROC Curve
plt.figure(figsize = (10,5))
plt.plot([0, 1], [0, 1], color = 'black', linestyle = '--')
plt.plot(fpr, tpr, label = f'AUC = {auc}')
plt.grid()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.title('ROC')
plt.show()
Need to find the threshold above which a sample will be considered an anomaly. Too low and the system will catch all anomalies at the cost of raising way too many false positives. Too high and the system will let many anomalies pass by undetected. From a visual analysis of the ROC curve it seems that the ideal value is in the range of [0.05, 0.15].
Using the F-1 Score to determine the ideal threshold, as it accounts for both Precision (impact of false positives) and Recall (impact of false negatives).
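For reference, the F-1 Score is the harmonic mean of Precision and Recall, which is what sklearn's f1_score computes:
# F1 = 2 * (precision * recall) / (precision + recall)
def f1_reference(precision, recall):
    return 2 * precision * recall / (precision + recall)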
# List the thresholds suggested by roc_curve and their F-1 scores
thresholds_anomalies = [(anomaly_predictions > i).astype(np.int32) for i in thresholds]
# thresholds_anomalies is a list of arrays, each containing 0 or 1 depending on whether a given threshold predicts a given sample as an anomaly
# Compare thresholds_anomalies to test_labels to calculate the F1 score (via precision and recall) of each proposed threshold value
f1_scores = [f1_score(test_labels, i) for i in thresholds_anomalies]
# Plot
plt.figure(figsize = (10, 5))
plt.plot(thresholds, f1_scores)
plt.grid()
plt.xlabel('Thresholds')
plt.ylabel('F-1 Score')
plt.title('F-1 Score vs Thresholds')
plt.show()
# Get the best threshold
max_f1_score = np.max(f1_scores)
best_threshold = thresholds[f1_scores.index(max_f1_score)]
print(f'Best Threshold = {best_threshold}')
# Create an anomaly indicator (a mask)
anomaly_indicator = (anomaly_predictions > best_threshold).astype(np.int32)
anomaly_indicator
# Confusion Matrix
confusion_matrix(test_labels, anomaly_indicator)
# Adjust labels
anomaly_indicator_final = ['normal' if i == 0 else 'anomaly' for i in anomaly_indicator]
# Plot
plt.figure(figsize = (20,10))
sns.scatterplot(x = np.arange(0, len(anomaly_predictions)), y = anomaly_predictions, hue = anomaly_indicator_final,
palette = ['red', 'blue'], legend = 'full')
plt.axhline(y = best_threshold, linestyle = '--', label = 'threshold')
plt.legend()
plt.grid()
plt.show()
# Metrics
precision = precision_score(test_labels, anomaly_indicator)
recall = recall_score(test_labels, anomaly_indicator)
f1_sc = f1_score(test_labels, anomaly_indicator)
accuracy_sc = accuracy_score(test_labels, anomaly_indicator)
print('Model Performance Metrics:')
print(f'Precision = {precision:.5f}')
print(f'Recall = {recall:.5f}')
print(f'F1 Score = {f1_sc:.5f}')
print(f'Accuracy = {accuracy_sc:.5f}')