Predicting Disasters by Analyzing Keywords in Texts on Social Media
Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones allows people to report an emergency they are observing in real time. For this reason, more and more agencies, such as disaster relief organizations and news agencies, are interested in programmatically monitoring Twitter.
But it is not always clear whether a person's words are really announcing a disaster.
This project contains a complete pipeline for the Natural Language Processing task of text classification. Specifically, it tries to classify whether or not a tweet describes a real disaster.
The pipeline consists of: text cleaning and tokenization, feature engineering (capitalization counts, word counts, and sentiment scores), embedding generation with pre-trained GloVe Twitter vectors, training of a bidirectional LSTM in PyTorch, and prediction on the test set.
Problem Definition
This project predicts whether a particular tweet is about a real disaster or not. If it is, the prediction should be 1; otherwise, 0.
Each sample in the training and test set has the following information: a unique id, the text of the tweet, and a keyword from the tweet (which may be blank); training samples additionally include the target label.
Dataset
This project uses a dataset based on the public Multilingual Disaster Response Messages dataset.
The data contains a set of messages related to disaster response, covering several languages, which makes it suitable for text categorization and other natural language processing tasks.
Details about the dataset can be obtained from the address below.
!pip install -q -U watermark
!pip install -q gensim
# Imports
import re
import gc
import nltk
import torch
import sklearn
import gensim
import numpy as np
import pandas as pd
import torch.nn as nn
import matplotlib.pyplot as plt
import seaborn as sns
import gensim.downloader as api
from collections import Counter
from copy import deepcopy
from nltk.tokenize import TweetTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
%matplotlib inline
# Google Colab Package Versions
%reload_ext watermark
%watermark -v -iv
# Set the device to run the model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device: {device}')
# Download NLTK POS tagger
nltk.download('averaged_perceptron_tagger')
# Download lexicon
nltk.download('vader_lexicon')
# Loading a model pretrained with twitter data
model_glove_twitter = api.load("glove-twitter-100")
# Model vector size
model_glove_twitter.vector_size
# Create a randomized vector to represent the <UNK> token (unseen word)
random_vec_for_unk = np.random.uniform(-1, 1, size = model_glove_twitter.vector_size).astype('float32')
random_vec_for_unk = random_vec_for_unk.reshape(1, model_glove_twitter.vector_size)
random_vec_for_unk
# Similarity test
model_glove_twitter.most_similar(random_vec_for_unk)
# Add the random vector to the model
model_glove_twitter.add(['<UNK>'], random_vec_for_unk, replace = True)
# Generate normalized vectors and substitute the original ones
model_glove_twitter.init_sims(replace = True)
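Note: the add() and init_sims() calls above use the gensim 3.x API. In case a newer gensim is installed, below is a minimal, hedged sketch of the 4.x equivalents (it is a no-op on gensim 3.x).
# gensim >= 4.0 renamed parts of this API (sketch only, assuming gensim 4.x is installed)
if int(gensim.__version__.split('.')[0]) >= 4:
    # 'add' became 'add_vectors'
    model_glove_twitter.add_vectors(['<UNK>'], random_vec_for_unk, replace = True)
    # normalized vectors are requested on demand instead of calling init_sims(replace = True)
    _ = model_glove_twitter.get_vector('<UNK>', norm = True)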
# Create a tokenizer that lowercases text, shortens repeated-character runs (reduce_len), and keeps user handles ('@user')
tokenizer = TweetTokenizer(preserve_case = False, reduce_len = True, strip_handles = False)
Below are the helper functions used in the text-cleaning process.
# Sample text to test functions
txt = 'ALLCAPS Capitalized 1234 #Hashtag @UserName ">.<* http://t.co/8kscqKfKkF'
def normalize_tweet(text):
# Change hyperlinks to '<url>' tokens
output = re.sub(r'http[s]{0,1}://t.co/[a-zA-Z0-9]+\b', '<url>', text)
# Split the '#' symbols from the ensuing word with a blank space
output = re.sub(r'#(\w+)', r'# \1', output)
return output
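A quick check of normalize_tweet() on the sample text defined above:
# The t.co link should be replaced by '<url>' and '#Hashtag' split into '# Hashtag'
normalize_tweet(txt)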
def tokenize(tokenizer, string):
    # Tokenize sentences while keeping hashtags (#) and user handles (@)
tokenized = tokenizer.tokenize(string)
return tokenized
# Function that returns the tokenized string (list) with numbers substituted by a numeric token
def number_tokens(tokenized_string, num_token = '<number>'):
# Create a list of tuples (word, POS tags)
pos_tagged = nltk.pos_tag(tokenized_string)
# Find all number indexes in the POS tags
num_indexes = [idx for idx in range(len(pos_tagged)) if pos_tagged[idx][1] == 'CD']
    # Replace numbers with the numeric token
for idx in num_indexes:
tokenized_string[idx] = num_token
return tokenized_string
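A small check of number_tokens() with a made-up token list (the tokens below are arbitrary examples):
# '7.1' should be POS-tagged as a cardinal number (CD) and replaced by '<number>'
number_tokens(['magnitude', '7.1', 'earthquake'])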
# Function which runs all text cleaning functions
def preprocess_text(tokenizer, string):
return number_tokens(tokenize(tokenizer, normalize_tweet(string)))
preprocess_text(tokenizer, txt)
# Return the tokenized and cleaned keyword
def preprocess_keyword(keyword):
    # Return None if the keyword is NaN (missing value)
    if isinstance(keyword, float) and np.isnan(keyword):
return
# Replace '%20' with space, lower case and tokenized
output = re.sub(r'%20', ' ', keyword)
output = output.lower()
output = output.split()
return output
preprocess_keyword(txt)
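Keywords in the raw data are URL-encoded (hence the '%20' handling above), so a more representative check uses a keyword-like string (the value below is just an illustration):
# Expected output: ['forest', 'fires']
preprocess_keyword('forest%20fires')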
# Function to tally words written in ALL CAPS
def count_all_caps(text):
return len([word for word in text.split() if word.isupper()])
# Function to tally words with the First Letter Capitalized
def count_capitalized(text):
return len([word for word in text.split() if word.istitle()])
# Function to tally number of words in the tweet
def count_words(text):
return len(text.split())
print(f'ALLCAPS: {count_all_caps(txt)}')
print(f'Capitalized: {count_capitalized(txt)}')
print(f'words: {count_words(txt)}')
# Function that appends 4 sentiment analysis score columns to a DataFrame
def sentiment_analyze_df(df, column):
    # Instantiate the sentiment intensity analyzer
sid = SentimentIntensityAnalyzer()
# Create a matrix and fill with scores from each of the df[column]
output_values = np.zeros((len(df), 4))
for tup in df.itertuples():
output_values[tup.Index, :] = list(sid.polarity_scores(' '.join(getattr(tup, column))).values())
# Append the column to the DataFrame
for idx, col in enumerate(['sent_neg', 'sent_neu', 'sent_pos', 'sent_compound']):
df[col] = output_values[:, idx]
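The order of the four columns above relies on VADER returning its scores as a dict with the keys 'neg', 'neu', 'pos' and 'compound'; a quick check on an arbitrary example sentence:
# polarity_scores() returns {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
sid = SentimentIntensityAnalyzer()
sid.polarity_scores('Forest fire near the town, residents evacuated')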
# Get the embedding vector of the input word
def get_word_vec(embedding_model, use_norm, word):
if word[0] == '@':
return embedding_model.word_vec('<user>', use_norm = use_norm)
elif word == '#':
return embedding_model.word_vec('<hashtag>', use_norm = use_norm)
elif word in embedding_model.vocab:
return embedding_model.word_vec(word, use_norm = use_norm)
else:
return embedding_model.word_vec('<UNK>', use_norm = use_norm)
get_word_vec(model_glove_twitter, True, 'car')
# Get embedding vectors of all words in a tweet
def text_to_vectors(embedding_model, use_norm, tokenized_text):
vectors = [get_word_vec(embedding_model, use_norm, word) for word in tokenized_text]
vectors = np.array(vectors)
return vectors
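A quick shape check with a hand-made token list (the tokens are chosen arbitrarily; unknown tokens fall back to the '<UNK>' vector):
# One 100-dimensional vector per token, so a 3-token input yields shape (3, 100)
text_to_vectors(model_glove_twitter, True, ['forest', 'fire', 'tonight']).shape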
# Return a matrix with the embedding vectors of the texts with dimensions (seq_len, embedding)
def trim_and_pad_vectors(text_vectors, embedding_dimension, seq_len):
    # Initialize the zeros matrix
output = np.zeros((seq_len, embedding_dimension))
# Adjust (cut) the tweets longer than seq_len
trimmed_vectors = text_vectors[:seq_len]
# Calculate the number of zeroes needed to pad the beginning of tweets shorter than seq_len
end_of_padding_index = seq_len - trimmed_vectors.shape[0]
# Alternative: Pad at the end of tweets
#tweet_len = len(trimmed_vectors)
# Output
output[end_of_padding_index:] = trimmed_vectors
return output
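A toy check of the front-padding behavior (the all-ones vectors below are just placeholders):
# A 3-token 'tweet' padded to seq_len = 5: the first two rows stay zero, the last three hold the token vectors
toy_vectors = np.ones((3, model_glove_twitter.vector_size))
trim_and_pad_vectors(toy_vectors, model_glove_twitter.vector_size, 5)[:, 0]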
# Return embedding representations from the tokenized input text
def embedding_preprocess(embedding_model, use_norm, seq_len, tokenized_text):
# Get the matrix with embedding vectors (tweet length, embedding_dimension)
text_vectors = text_to_vectors(embedding_model, use_norm, tokenized_text)
# Output
output = trim_and_pad_vectors(text_vectors, embedding_model.vector_size, seq_len)
return output
embedding_preprocess(model_glove_twitter, True, 30, 'car').shape
# Return embedding vectors from the keywords
def keyword_to_avg_vector(embedding_model, use_norm, tokenized_keyword):
# Return a zeros vector if tokenized_keyword is None
if tokenized_keyword is None:
return np.zeros((1, embedding_model.vector_size))
# If not, calculate the average embedding
vectors = [get_word_vec(embedding_model, use_norm, word) for word in tokenized_keyword]
vectors = np.array(vectors)
avg_vector = np.mean(vectors, axis = 0)
avg_vector = avg_vector.reshape((1, embedding_model.vector_size))
return avg_vector
keyword_to_avg_vector(model_glove_twitter, True, 'car')
# Load training data
data_train = pd.read_csv('https://raw.githubusercontent.com/Matheus-Schmitz/Disaster_Occurance_Twitter/master/dataset_train.csv')
data_train.head()
# Normalize and tokenize text
data_train['tok_norm_text'] = [preprocess_text(tokenizer, text) for text in data_train['text']]
# Normalize and tokenize keyword
data_train['keyword'] = data_train['keyword'].apply(preprocess_keyword)
# Check
data_train.head(3)
# Apply the functions to the data
data_train['num_all_caps'] = data_train['text'].apply(count_all_caps)
data_train['num_caps'] = data_train['text'].apply(count_capitalized)
data_train['num_words'] = data_train['text'].apply(count_words)
# Create a scaler to set all features to the [-1, 1] range
scaler = MinMaxScaler(feature_range=(-1, 1))
# Apply the scaler
columns_to_scale = ['num_all_caps', 'num_caps', 'num_words']
scaler.fit(data_train[columns_to_scale])
data_train[columns_to_scale] = scaler.transform(data_train[columns_to_scale])
# Create sentiment analysis features
sentiment_analyze_df(data_train, 'tok_norm_text')
# Visualize
data_train.head()
# Plot
sns.distplot([len(tok) for tok in data_train['tok_norm_text']])
Most texts have fewer than 30 tokens, so setting the maximum sequence length to 30 is a reasonable trade-off between data loss and computational cost.
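To back that up with a number, a quick check of how many tweets fit within the 30-token cutoff:
# Fraction of tweets with at most 30 tokens
token_counts = np.array([len(tok) for tok in data_train['tok_norm_text']])
print(f'Tweets with <= 30 tokens: {(token_counts <= 30).mean():.1%}')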
# Max sequence length
sequence_max_length = 30
# Generate the text embedding
data_train['text_embedding'] = [embedding_preprocess(embedding_model = model_glove_twitter,
use_norm = True,
seq_len = sequence_max_length,
tokenized_text = text)
for text in data_train['tok_norm_text']]
data_train['keyword_embedding'] = [keyword_to_avg_vector(embedding_model = model_glove_twitter,
use_norm = True,
tokenized_keyword = keyword)
for keyword in data_train['keyword']]
# Visualize
data_train.head()
data_train['text_embedding'][0]
Creating a vector representation that concatenates the text embeddings, the keyword embeddings, and the engineered features.
The most common approach is to simply embed the text, which would make this step unnecessary, but since I want to feed the model more than the GloVe word vectors, merging all features into a single vector is required.
# Function which returns a numpy array with the static single-value features repeated seq_len times
def _single_values_repeat(seq_len, static_single_values):
# Create a sequenced array with one position representing each of the engineered features
output = static_single_values.reshape((1, len(static_single_values)))
    # Repeat that array seq_len times since the engineered features are the same for all words in the same tweet
output = np.repeat(output, seq_len, axis = 0)
return output
static_singles_cols = ['num_all_caps', 'num_caps', 'num_words', 'sent_neg', 'sent_neu', 'sent_pos', 'sent_compound']
data_train[static_singles_cols].shape
data_train[static_singles_cols].head()
_single_values_repeat(30, data_train['num_all_caps'].values).shape
# Demo on the whole column at once: one row per word position (30) and one column per sample; in the pipeline each call receives a single sample's feature values
_single_values_repeat(30, data_train['num_all_caps'].values)
# Vector size used by the twitter glove model
model_glove_twitter.vector_size
# Return a numpy array of stacked embedding vectors
def _static_embedding_repeat(seq_len, static_embedding_values):
# Reshape the keyword embedding by stacking it horizontally
horizontally_stacked = np.hstack(static_embedding_values)
    # Repeat that array seq_len times since the keyword is the same for all words in the same tweet
output = np.repeat(horizontally_stacked, seq_len, axis = 0)
return output
_static_embedding_repeat(30, data_train['keyword_embedding']).shape
# Demo on the whole column at once: each keyword is a 100-dim vector (model_glove_twitter.vector_size), stacked for all samples and repeated for each of the 30 word positions
_static_embedding_repeat(30, data_train['keyword_embedding'])
# Function which returns the embedding representations of all features
def concatenate_embeddings(df,
embedding_model,
seq_len,
sequence_embedding_col,
static_embedding_cols,
static_singles_cols):
# Embedding dimensions
emb_dim = embedding_model.vector_size
# Output matrix
output = np.zeros((len(df), seq_len, len(static_singles_cols) + len(static_embedding_cols) * emb_dim + emb_dim))
# Loop
for idx, row in df.iterrows():
single_vals = _single_values_repeat(seq_len, row[static_singles_cols].values)
static_emb_vals = _static_embedding_repeat(seq_len, row[static_embedding_cols])
seq_emb_vals = row[sequence_embedding_col]
# Stack embeddings and features for each tweet
# AKA putting together the vectors for all text word embeddings + keyword embeddings + feature engineering embeddings + sentiment score embeddings
row_embedding = np.hstack((single_vals, static_emb_vals, seq_emb_vals))
output[idx, :, :] = row_embedding
return output
# Create a final embedding representation of all features selected for training
embedding_matrix = concatenate_embeddings(df = data_train,
embedding_model = model_glove_twitter,
seq_len = sequence_max_length,
sequence_embedding_col = 'text_embedding',
static_embedding_cols = ['keyword_embedding'],
static_singles_cols = ['num_all_caps',
'num_caps',
'num_words',
'sent_neg',
'sent_neu',
'sent_pos',
'sent_compound'])
# Shape
embedding_matrix.shape
The first 7 positions represent the engineered features for that tweet, and the next 100 represent the vector encoding of the keyword associated with that tweet. Both of these are repeated for every word in the tweet. The final 100 positions are the word embeddings for a specific word of the tweet.
That is, the first 107 positions are identical across all word-embedding slots (30 here) of a tweet, while the last 100 differ, since each is the representation of a specific word in that tweet.
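A quick sanity check of that layout on the matrix built above (sample 0 is an arbitrary choice):
# 207 columns = 7 engineered features + 100 keyword dimensions + 100 word-embedding dimensions
print(embedding_matrix.shape)
# The first 107 positions repeat across the 30 word slots of a tweet, the last 100 vary per word
print(bool((embedding_matrix[0, 0, :107] == embedding_matrix[0, -1, :107]).all()))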
PyTorch implementation of a Bidirectional LSTM model
class BiLSTM(nn.Module):
# Constructor Method
def __init__(self, embedding_dim, hidden_dim, num_layers, num_classes, batch_size, dropout, device):
super(BiLSTM, self).__init__()
# Initialize attributes
self.hidden_dim = hidden_dim
self.batch_size = batch_size
self.num_layers = num_layers
# Dropout to reduce overfitting
self.dropout = nn.Dropout(p = dropout)
# LSTM model
self.lstm = nn.LSTM(input_size = embedding_dim,
hidden_size = hidden_dim,
num_layers = num_layers,
batch_first = True,
dropout = dropout,
bidirectional = True)
# Fully connected layer
self.fc = nn.Linear(hidden_dim * 2, num_classes)
# Device
self.device = device
# Lists for evaluations and plots
self.train_loss = []
self.train_acc = []
self.val_loss = []
self.val_acc = []
# Attribute to store the best model weights (used for evaluating)
self.best_weights = deepcopy(self.state_dict())
    # Initialize the hidden and cell states of the LSTM
def _init_hidden(self, current_batch_size):
h0 = torch.zeros(self.num_layers * 2, current_batch_size, self.hidden_dim).to(self.device)
c0 = torch.zeros(self.num_layers * 2, current_batch_size, self.hidden_dim).to(self.device)
return h0, c0
# Forward step
def forward(self, x):
# Forward LSTM
h, c = self._init_hidden(current_batch_size = x.size(0))
out, _ = self.lstm(x, (h, c))
# Dropout
out = self.dropout(out)
# Decode the hidden state for the last time step
out = self.fc(out[:, -1, :])
return out
# Predictions
    def predict(self, x: torch.Tensor):
class_predictions = self(x).data
_, predicted = torch.max(class_predictions, dim = 1)
return predicted
# Training and evaluation with validation data
def _train_evaluate(self, X_train, y_train, X_val, y_val, criterion):
# Change the model to evaluation mode
self.eval()
# Calculate accuracy and loss on training data
epoch_train_acc = (self.predict(X_train) == y_train).sum().item() / y_train.shape[0]
epoch_train_loss = criterion(self(X_train), y_train).item()
self.train_acc.append(epoch_train_acc)
self.train_loss.append(epoch_train_loss)
        # Calculate accuracy and loss on validation data
if X_val is not None and y_val is not None:
epoch_val_acc = (self.predict(X_val) == y_val).sum().item() / y_val.shape[0]
            epoch_val_loss = criterion(self(X_val), y_val).item()
self.val_acc.append(epoch_val_acc)
self.val_loss.append(epoch_val_loss)
# Return the accuracy and loss values
return epoch_train_loss, epoch_train_acc, epoch_val_loss, epoch_val_acc
        # Return accuracy and loss values if there is no validation dataset
return epoch_train_loss, epoch_train_acc, None, None
# Return a dictionary with the best epochs
def best_epoch(self):
best_train_loss_epoch = np.argmin(np.array(self.train_loss)) + 1
best_train_acc_epoch = np.argmax(np.array(self.train_acc)) + 1
output = {'Epoch with lowest training loss': best_train_loss_epoch,
'Epoch with highest training accuracy': best_train_acc_epoch}
if len(self.val_loss) > 0:
            best_val_loss_epoch = np.argmin(np.array(self.val_loss)) + 1
            best_val_acc_epoch = np.argmax(np.array(self.val_acc)) + 1
output.update({'Epoch with lowest validation loss': best_val_loss_epoch,
'Epoch with highest validation accuracy': best_val_acc_epoch})
return output
# Return a dictionary with the total number of parameters
def get_num_parameters(self):
total_params = sum(p.numel() for p in self.parameters())
trainable_params = sum(p.numel() for p in self.parameters() if p.requires_grad)
return {'total_parameters': total_params, 'trainable_parameters': trainable_params}
@staticmethod
def _print_progress(epoch, train_loss, train_acc, val_loss, val_acc, improved, verbose=False):
output = f'Epoch {str(epoch + 1).zfill(3)}:'
output += f'\n\t Training Error: {str(train_loss)[:5]} | Accuracy: {str(train_acc)[:5]}'
if val_loss is not None and val_acc is not None:
            output += f'\n\t Validation Error: {str(val_loss)[:5]} | Accuracy: {str(val_acc)[:5]}'
if improved:
output += f'\n\t The model improved!'
if verbose:
print(output)
# Training method
def fit(self, X_train, y_train, X_val, y_val, epoch_num, criterion, optimizer, verbose = False):
# Variable to determine if the best weights should be updated (and report progress)
best_acc = 0.0
# Divide the dataset in batches
X_train_tensor_batches = torch.split(X_train, self.batch_size)
y_train_tensor_batches = torch.split(y_train, self.batch_size)
# Loop
for epoch in range(epoch_num):
# Set the model to training mode
# At the end of each epoch the _train_evaluate method is called and sets the model to evaluation mode
self.train()
for i, (X_batch, y_batch) in enumerate(zip(X_train_tensor_batches, y_train_tensor_batches)):
# Forward pass
outputs = self(X_batch)
loss = criterion(outputs, y_batch)
# Backward and optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Calculate accuracy and loss for train and validation
train_loss, train_acc, val_loss, val_acc = self._train_evaluate(X_train, y_train, X_val, y_val, criterion)
            # Pick which accuracy to track for progress (validation if available, otherwise training)
if X_val is not None and y_val is not None:
accuracy = val_acc
else:
accuracy = train_acc
# If the accuracy improves on the previous best, print it and update the best accuracy records and the model weights
if accuracy > best_acc:
self._print_progress(epoch,
train_loss,
train_acc,
val_loss,
val_acc,
improved = True,
verbose = verbose)
best_acc = accuracy
self.best_weights = deepcopy(self.state_dict())
            # Else, just print without updating
else:
self._print_progress(epoch,
train_loss,
train_acc,
val_loss,
val_acc,
improved = False,
verbose = verbose)
            # Garbage collector, to clean memory
gc.collect()
def plot_graphs(model):
plt.figure(figsize = (24, 12))
plt.subplot(311)
plt.title('Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.plot(range(1, len(model.train_acc)+1), model.train_acc, label = 'Train')
plt.xticks(np.arange(0, len(model.train_acc)+1, 5))
plt.legend()
plt.subplot(312)
plt.title('Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.plot(range(1, len(model.train_loss)+1), model.train_loss, label = 'Train')
plt.xticks(np.arange(0, len(model.train_acc)+1, 5))
plt.legend()
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
plt.show()
# Hyperparameters
embedding_dim = embedding_matrix.shape[2]
hidden_size = 50
num_layers = 2
num_classes = 2
batch_size = 256
dropout = 0.3
num_epochs = 300
learning_rate = 0.0005
weight_decay = 0.0005
# Load attributes
X_train = torch.from_numpy(embedding_matrix).float().to(device)
# Load label
y_train = torch.from_numpy(data_train['target'].values).long().to(device)
# Create model
model = BiLSTM(embedding_dim, hidden_size, num_layers, num_classes, batch_size, dropout, device).to(device)
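Since the class exposes a parameter-count helper, it can be used as a quick size check right after instantiation:
# Total and trainable parameters of the BiLSTM
model.get_num_parameters()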
# Loss function
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate, weight_decay = weight_decay)
%%time
# Train
model.fit(X_train = X_train,
y_train = y_train,
X_val = None,
y_val = None,
epoch_num = num_epochs,
criterion = criterion,
optimizer = optimizer,
verbose = True)
# Plot
plot_graphs(model)
Applying the same preprocessing pipeline to the test dataset
# Load test dataset
data_test = pd.read_csv('https://raw.githubusercontent.com/Matheus-Schmitz/Disaster_Occurance_Twitter/master/dataset_test.csv')
data_test.head()
# Normalize and tokenize text
data_test['tok_norm_text'] = [preprocess_text(tokenizer, text) for text in data_test['text']]
# Preprocess keywords
data_test['keyword'] = data_test['keyword'].apply(preprocess_keyword)
# Extract features
data_test['num_all_caps'] = data_test['text'].apply(count_all_caps)
data_test['num_caps'] = data_test['text'].apply(count_capitalized)
data_test['num_words'] = data_test['text'].apply(count_words)
# Scale
data_test[columns_to_scale] = scaler.transform(data_test[columns_to_scale])
# Sentiment Analyser
sentiment_analyze_df(data_test, 'tok_norm_text')
# Text embedding
data_test['text_embedding'] = [embedding_preprocess(embedding_model = model_glove_twitter,
use_norm = True,
seq_len = sequence_max_length,
tokenized_text = text)
for text in data_test['tok_norm_text']]
# Keyword embedding
data_test['keyword_embedding'] = [keyword_to_avg_vector(embedding_model = model_glove_twitter,
use_norm = True,
tokenized_keyword = keyword)
for keyword in data_test['keyword']]
# Visualize
data_test.head()
# Create a final embedding representation of all features selected for training
test_embedding_matrix = concatenate_embeddings(df = data_test,
embedding_model = model_glove_twitter,
seq_len = sequence_max_length,
sequence_embedding_col = 'text_embedding',
static_embedding_cols = ['keyword_embedding'],
static_singles_cols = ['num_all_caps',
'num_caps',
'num_words',
'sent_neg',
'sent_neu',
'sent_pos',
'sent_compound'])
# Create the tensor with the test set features
X_test = torch.from_numpy(test_embedding_matrix).float().to(device)
# Predictions
preds = model.predict(X_test)
# Concatenate predictions and ids for each test record into a DataFrame
final_preds = preds.cpu().numpy().reshape(-1,1)
ids = data_test['id'].values.reshape(-1,1)
data = np.hstack((ids, final_preds))
# Dataframe
predictions = pd.DataFrame(data = data, columns = ['id', 'target'])
# Visualize
predictions.head()
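If an output file with the predictions is needed, the DataFrame can be saved directly (the filename below is just an assumption):
# Save predictions to CSV (arbitrary filename)
predictions.to_csv('predictions.csv', index = False)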