Dataset

This notebook uses data from "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo, which is licensed under CC BY-NC-SA 4.0.

https://zenodo.org/record/1188976

The file used is Audio_Speech_Actors_01-24 (speech, audio-only).

Data Organization

Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): speech audio-only files (16-bit, 48 kHz .wav). The full dataset of speech and song, audio and video (24.8 GB), is available from Zenodo. The construction and perceptual validation of the RAVDESS are described in the authors' Open Access paper in PLoS ONE.

Files

This portion of the RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.

File naming convention

Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-02-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:

Filename identifiers

  • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).

  • Vocal channel (01 = speech, 02 = song).

  • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).

  • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.

  • Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").

  • Repetition (01 = 1st repetition, 02 = 2nd repetition).

  • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

Filename example: 03-02-06-01-02-01-12.wav

  • Audio-only (03)
  • Song (02)
  • Fearful (06)
  • Normal intensity (01)
  • Statement "dogs" (02)
  • 1st Repetition (01)
  • 12th Actor (12)
  • Female, as the actor ID number is even.
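
The naming convention above can be turned into a small parser. The following is a minimal sketch rather than part of the original notebook; the function name parse_ravdess_filename and the keys of the returned dictionary are illustrative choices.

```python
import os

# Lookup tables taken from the identifier list above
MODALITIES = {'01': 'full-AV', '02': 'video-only', '03': 'audio-only'}
CHANNELS = {'01': 'speech', '02': 'song'}
EMOTIONS = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
            '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}
INTENSITIES = {'01': 'normal', '02': 'strong'}

def parse_ravdess_filename(path):
    # Accept both '/' and '\' separators so the same code works on any OS
    name = path.replace('\\', '/').split('/')[-1]
    stem = os.path.splitext(name)[0]
    parts = stem.split('-')
    if len(parts) != 7:
        raise ValueError(f'Expected 7 identifiers, got {len(parts)}: {name}')
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        'modality': MODALITIES[modality],
        'channel': CHANNELS[channel],
        'emotion': EMOTIONS[emotion],
        'intensity': INTENSITIES[intensity],
        'statement': int(statement),
        'repetition': int(repetition),
        'actor': int(actor),
        # Odd-numbered actors are male, even-numbered actors are female
        'gender': 'male' if int(actor) % 2 == 1 else 'female',
    }

info = parse_ravdess_filename('03-02-06-01-02-01-12.wav')
print(info['channel'], info['emotion'], info['intensity'], info['actor'], info['gender'])
```

Splitting on the hyphens avoids relying on fixed character positions, so the parser does not depend on where the dataset folder lives.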

How to cite the RAVDESS

Academic citation

If you use the RAVDESS in an academic publication, please use the following citation: Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

All other attributions

If you use the RAVDESS in a form other than an academic publication, such as in a blog post, school project, or non-commercial product, please use the following attribution: "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo is licensed under CC BY-NC-SA 4.0.

Loading Packages

In [1]:
# Imports
import joblib
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import librosa as lr
import librosa.display
import IPython.display as ipd
import seaborn as sns
import xgboost
import sklearn
import h2o
from h2o.automl import H2OAutoML
from glob import glob
from joblib import dump, load
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
%matplotlib inline
In [2]:
# Package versions
%reload_ext watermark
%watermark --iversions
matplotlib 3.2.2
pandas     1.0.5
h2o        3.30.0.6
seaborn    0.11.0
librosa    0.7.2
sklearn    0.23.1
xgboost    0.90
numpy      1.18.5
joblib     0.15.1

Loading Data

In [3]:
%%time

# Looping through 24 folders, each with 60 samples, for a total of 1440 audio files.
# The folders are named 'Actor_01' through 'Actor_24'

# Defining the root directory name
root_dir = 'Audio_Speech_Actors_01-24'

# Dictionaries to receive the outputs
files = {}
sampling_rate = {}

# Loop through all actor directories
for actor_id in range(1, 25):
    
    # Define the folder path for that specific actor (ID zero-padded to two digits)
    audio_dir = f'{root_dir}/Actor_{actor_id:02d}'
    
    # Extract the path to all the directory's files
    audio_files = glob(audio_dir + '/*.wav')
    
    # Store the duration (in seconds) and the sampling rate for each '.wav' file in the folder
    for audio_file in audio_files:
        audio, sfreq = lr.load(audio_file, sr = None)
        files[audio_file] = len(audio) / sfreq
        sampling_rate[audio_file] = sfreq
Wall time: 59.3 s
In [4]:
# Check if all sampling rates are equal
all(value==48000 for value in sampling_rate.values())
Out[4]:
True
In [5]:
# Put everything into a dataframe indexed by file path
audio_df = pd.DataFrame.from_dict(files, orient = 'index', columns = ['file_length'])
    
audio_df 
Out[5]:
file_length
Audio_Speech_Actors_01-24/Actor_01\03-01-01-01-01-01-01.wav 3.303292
Audio_Speech_Actors_01-24/Actor_01\03-01-01-01-01-02-01.wav 3.336667
Audio_Speech_Actors_01-24/Actor_01\03-01-01-01-02-01-01.wav 3.269917
Audio_Speech_Actors_01-24/Actor_01\03-01-01-01-02-02-01.wav 3.169833
Audio_Speech_Actors_01-24/Actor_01\03-01-02-01-01-01-01.wav 3.536854
... ...
Audio_Speech_Actors_01-24/Actor_24\03-01-08-01-02-02-24.wav 3.403396
Audio_Speech_Actors_01-24/Actor_24\03-01-08-02-01-01-24.wav 3.937271
Audio_Speech_Actors_01-24/Actor_24\03-01-08-02-01-02-24.wav 3.970625
Audio_Speech_Actors_01-24/Actor_24\03-01-08-02-02-01-24.wav 3.670333
Audio_Speech_Actors_01-24/Actor_24\03-01-08-02-02-02-24.wav 3.636958

1440 rows × 1 columns

In [6]:
# Function to extract the audio files and their sampling rates
# This is necessary to obtain the mfcc from an audio file
def extract_audio_data(file):
    audio, sfreq = lr.load(file, sr = None)
    return audio, sfreq
In [7]:
# Checking the shape and sampling rate for an audio file
# Their ratio (number of samples / sampling rate) is the file_length in seconds
audio, sfreq = extract_audio_data(audio_df.index[1439])
print(f'\nShape of the series representing the audio (y): {audio.shape} \nSampling Rate (sr): {sfreq}')
Shape of the series representing the audio (y): (174574,) 
Sampling Rate (sr): 48000
In [8]:
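# Function to extract the MFCCs from a file
# Returns an array of shape (20, n_frames), since librosa computes 20 coefficients by default
def extract_mfcc(file):
    audio, sfreq = extract_audio_data(file)
    # Pass the signal by keyword: newer librosa versions no longer accept it positionally
    mfccs = lr.feature.mfcc(y = audio, sr = sfreq)
    return mfccs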
In [9]:
extract_mfcc(audio_df.index[0])
Out[9]:
array([[-861.5326, -861.5326, -861.5326, ..., -861.5326, -861.5326,
        -861.5326],
       [   0.    ,    0.    ,    0.    , ...,    0.    ,    0.    ,
           0.    ],
       [   0.    ,    0.    ,    0.    , ...,    0.    ,    0.    ,
           0.    ],
       ...,
       [   0.    ,    0.    ,    0.    , ...,    0.    ,    0.    ,
           0.    ],
       [   0.    ,    0.    ,    0.    , ...,    0.    ,    0.    ,
           0.    ],
       [   0.    ,    0.    ,    0.    , ...,    0.    ,    0.    ,
           0.    ]], dtype=float32)
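
The number of columns in an MFCC matrix follows from the length of the signal. As a rough worked example, assuming librosa's default hop length of 512 samples and centered framing (neither is set explicitly in this notebook, so both are assumptions), the file inspected in cell 7 has 174574 samples at 48000 Hz:

```python
n_samples = 174574   # shape printed in cell 7
sr_hz = 48000        # sampling rate printed in cell 7
hop_length = 512     # assumed librosa default

# Duration in seconds, which matches the file_length column
duration = n_samples / sr_hz

# With centered framing there is one frame per hop, plus one
n_frames = 1 + n_samples // hop_length

print(round(duration, 6))  # 3.636958
print(n_frames)            # 341
```

Each frame becomes one row once the matrix is transposed, which is how 1440 files end up as the 500286 rows reported further down. The constant first coefficient with zeros elsewhere in Out[9] is what one would expect for frames of digital silence at the start of a recording.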

In [10]:
%%time

# Create a list to store the identifiers for each audio file
list_data_frame = []

# Loop to extract the MFCCs and the identifiers from each file
for file_path in audio_df.index:
    
    # Call function to obtain the MFCCs
    data = extract_mfcc(file_path)
    
    # Transform the data to shape (n frames, n features)
    frame = pd.DataFrame(data.T, columns = ['mfcc' + str(x) for x in range(0, 20)])
    
    # Extract the actor ID (its parity encodes the gender)
    # NOTE: the fixed character offsets below assume the exact path layout shown in Out[5]
    frame['ID_ACTOR_GENDER'] = file_path[53:55]
    
    # Extract emotion using the encoding
    frame['ID_EMOTION'] = file_path[41:43]
    
    # Extract emotion intensity
    frame['ID_EMOTION_INTENSITY'] = file_path[44:46]
    
    # Append to the list
    list_data_frame.append(frame)
Wall time: 48.2 s
In [11]:
# Concatenate the list elements into a df
identifiers_df = pd.concat(list_data_frame, ignore_index = True)
In [12]:
# Check result
identifiers_df
Out[12]:
mfcc0 mfcc1 mfcc2 mfcc3 mfcc4 mfcc5 mfcc6 mfcc7 mfcc8 mfcc9 ... mfcc13 mfcc14 mfcc15 mfcc16 mfcc17 mfcc18 mfcc19 ID_ACTOR_GENDER ID_EMOTION ID_EMOTION_INTENSITY
0 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 01 01 01
1 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 01 01 01
2 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 01 01 01
3 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 01 01 01
4 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 01 01 01
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
500281 -693.609009 3.021235 3.011289 2.994891 2.972303 2.943882 2.910077 2.871416 2.828496 2.781949 ... 2.573641 2.519586 2.466096 2.413781 2.363167 2.314736 2.268862 24 08 02
500282 -691.884399 5.430296 5.332271 5.174513 4.964972 4.713866 4.432867 4.134216 3.829843 3.530532 ... 2.529897 2.345227 2.186139 2.050030 1.933901 1.835087 1.751647 24 08 02
500283 -692.399963 4.718255 4.670175 4.591483 4.484293 4.351463 4.196495 4.023423 3.836677 3.640923 ... 2.861344 2.688585 2.531607 2.392811 2.273795 2.175476 2.097934 24 08 02
500284 -694.268372 2.090834 2.086926 2.080427 2.071359 2.059744 2.045627 2.029049 2.010061 1.988727 ... 1.881333 1.849386 1.815585 1.780022 1.742835 1.704102 1.663948 24 08 02
500285 -692.471741 4.618038 4.574526 4.505244 4.414803 4.309120 4.195020 4.079656 3.969994 3.872242 ... 3.675385 3.677577 3.694948 3.722036 3.752502 3.779414 3.796019 24 08 02

500286 rows × 23 columns

In [13]:
# Using ID_ACTOR_GENDER to define LABEL_GENDER
identifiers_df['LABEL_GENDER'] = list(map(lambda x: 'male' if int(x)%2 == 1 else 'female',
                                         identifiers_df.ID_ACTOR_GENDER))
In [14]:
# Using ID_EMOTION to define LABEL_EMOTION
identifiers_df['LABEL_EMOTION'] = identifiers_df.ID_EMOTION.map({'01':'neutral',
                                                                 '02':'calm',
                                                                 '03':'happy',
                                                                 '04':'sad',
                                                                 '05':'angry',
                                                                 '06':'fearful',
                                                                 '07':'disgust',
                                                                 '08':'surprised'})
In [15]:
# Using ID_EMOTION_INTENSITY to define LABEL_INTENSITY
identifiers_df['LABEL_INTENSITY'] = identifiers_df.ID_EMOTION_INTENSITY.map({'01':'normal',
                                                                             '02':'strong'})
In [16]:
# Creating a GENDER + EMOTION label
identifiers_df['LABEL_GENDER_EMOTION'] = identifiers_df['LABEL_GENDER'] + '_' + identifiers_df['LABEL_EMOTION']
In [17]:
# Copying it as df_final
df_final = identifiers_df.copy()
In [18]:
df_final
Out[18]:
mfcc0 mfcc1 mfcc2 mfcc3 mfcc4 mfcc5 mfcc6 mfcc7 mfcc8 mfcc9 ... mfcc17 mfcc18 mfcc19 ID_ACTOR_GENDER ID_EMOTION ID_EMOTION_INTENSITY LABEL_GENDER LABEL_EMOTION LABEL_INTENSITY LABEL_GENDER_EMOTION
0 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 01 01 01 male neutral normal male_neutral
1 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 01 01 01 male neutral normal male_neutral
2 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 01 01 01 male neutral normal male_neutral
3 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 01 01 01 male neutral normal male_neutral
4 -861.532593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 01 01 01 male neutral normal male_neutral
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
500281 -693.609009 3.021235 3.011289 2.994891 2.972303 2.943882 2.910077 2.871416 2.828496 2.781949 ... 2.363167 2.314736 2.268862 24 08 02 female surprised strong female_surprised
500282 -691.884399 5.430296 5.332271 5.174513 4.964972 4.713866 4.432867 4.134216 3.829843 3.530532 ... 1.933901 1.835087 1.751647 24 08 02 female surprised strong female_surprised
500283 -692.399963 4.718255 4.670175 4.591483 4.484293 4.351463 4.196495 4.023423 3.836677 3.640923 ... 2.273795 2.175476 2.097934 24 08 02 female surprised strong female_surprised
500284 -694.268372 2.090834 2.086926 2.080427 2.071359 2.059744 2.045627 2.029049 2.010061 1.988727 ... 1.742835 1.704102 1.663948 24 08 02 female surprised strong female_surprised
500285 -692.471741 4.618038 4.574526 4.505244 4.414803 4.309120 4.195020 4.079656 3.969994 3.872242 ... 3.752502 3.779414 3.796019 24 08 02 female surprised strong female_surprised

500286 rows × 27 columns

In [19]:
df_final.dtypes
Out[19]:
mfcc0                   float32
mfcc1                   float32
mfcc2                   float32
mfcc3                   float32
mfcc4                   float32
mfcc5                   float32
mfcc6                   float32
mfcc7                   float32
mfcc8                   float32
mfcc9                   float32
mfcc10                  float32
mfcc11                  float32
mfcc12                  float32
mfcc13                  float32
mfcc14                  float32
mfcc15                  float32
mfcc16                  float32
mfcc17                  float32
mfcc18                  float32
mfcc19                  float32
ID_ACTOR_GENDER          object
ID_EMOTION               object
ID_EMOTION_INTENSITY     object
LABEL_GENDER             object
LABEL_EMOTION            object
LABEL_INTENSITY          object
LABEL_GENDER_EMOTION     object
dtype: object

Exploratory Analysis

In [20]:
# Dataset balance by gender
sns.countplot(x = 'LABEL_GENDER', data = df_final)
Out[20]:
[Figure: count plot of LABEL_GENDER]
In [21]:
# Dataset balance by emotion
sns.countplot(x = 'LABEL_EMOTION', data = df_final)
Out[21]:
[Figure: count plot of LABEL_EMOTION]
In [22]:
# Dataset balance by gender + emotion
plt.figure(figsize = (12,4))
sns.countplot(x = 'LABEL_GENDER_EMOTION', data = df_final)
plt.xticks(rotation = 45)
Out[22]:
[Figure: count plot of LABEL_GENDER_EMOTION (16 classes), x tick labels rotated 45 degrees]

Machine Learning

Model 1 - Predict Gender

In [23]:
# Getting X values
x1 = df_final[df_final.columns[:-7]].values
In [24]:
# Using Label Encoder to codify the genders
y1 = LabelEncoder().fit_transform(df_final['LABEL_GENDER'])
In [25]:
# Train test split
x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, test_size = 0.33, stratify = y1, shuffle = True)
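
Because each row of df_final is a single MFCC frame, a random row-level split places frames from the same recording in both the training and the test set, which may make test accuracy look better than it would be on unseen recordings. One possible alternative is to split by recording. The sketch below assumes a per-frame recording identifier is available; the notebook does not currently store one, so frame_file_ids and split_by_group are hypothetical names.

```python
import random

def split_by_group(group_ids, test_size=0.33, seed=0):
    # Unique groups (e.g. recordings), sorted so the shuffle is reproducible
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)

    # Hold out whole groups, never individual frames
    n_test = max(1, round(len(groups) * test_size))
    test_groups = set(groups[:n_test])

    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx

# Hypothetical example: 6 recordings with 4 frames each
frame_file_ids = [f'file_{k}' for k in range(6) for _ in range(4)]
train_idx, test_idx = split_by_group(frame_file_ids)

train_files = {frame_file_ids[i] for i in train_idx}
test_files = {frame_file_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_files & test_files)
```

scikit-learn's GroupShuffleSplit implements the same idea and could be used instead of this hand-written helper.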
In [26]:
# Scale the training data
z_scaler = StandardScaler()

# First fit the scaler on the train dataset
fitted = z_scaler.fit(x1_train)

# Then transform both train and test data
x1_train = fitted.transform(x1_train)
x1_test = fitted.transform(x1_test)
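
Fitting the scaler on the training split only keeps test-set statistics out of the preprocessing. For a single feature, the transformation is z = (x - mean) / std, with the mean and standard deviation taken from the training data. A minimal pure-Python sketch with made-up values (StandardScaler uses the population standard deviation, hence pstdev):

```python
from statistics import mean, pstdev

# Made-up values for one feature
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Statistics come from the training data only
mu = mean(train)
sigma = pstdev(train)

# The same mu and sigma are applied to both splits
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have zero mean and unit variance, while the test values generally do not, which is expected.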
In [27]:
# Creating a dictionary with ML algorithms to be used
classifiers = {'XGboost':XGBClassifier(),
               'DecisonTree':DecisionTreeClassifier(),
               'RandomForest':RandomForestClassifier()}
In [28]:
# Function to train each classifier and evaluate performance
def train_evaluate_gender(classifiers, x_train, x_test, y_train, y_test):
    
    # Loop through the classifiers dictionary
    for k, clf in classifiers.items():
        
        print("\nStarting training for model " + k + '...')
        
        # Train model
        clf.fit(x_train, y_train)
        
        # Predict
        y_pred = clf.predict(x_test)
        
        # Calculate accuracy
        acc = accuracy_score(y_test, y_pred)
        print(f'\nClassifier {k} has accuracy: {acc}')
        
        # Create Confusion Matrix
        cm = confusion_matrix(y_test, y_pred)
        print(f'\nConfusion Matrix')
        print(cm)
        
        # Save model to disk
        dump(clf, 'models/model_' + k + '_gender.joblib')
        print('\nModel ' + k + ' saved to disk')
In [29]:
%%time

# Run the function to test all algorithms
train_evaluate_gender(classifiers, x1_train, x1_test, y1_train, y1_test)

print('\nTraining finished!')
Starting training for model XGboost...

Classifier XGboost has accuracy: 0.7899633544322966

Confusion Matrix
[[67804 15825]
 [18851 62615]]

Model XGboost saved to disk

Starting training for model DecisonTree...

Classifier DecisonTree has accuracy: 0.8299463945001363

Confusion Matrix
[[69716 13913]
 [14162 67304]]

Model DecisonTree saved to disk

Starting training for model RandomForest...

Classifier RandomForest has accuracy: 0.889487870619946

Confusion Matrix
[[73615 10014]
 [ 8231 73235]]

Model RandomForest saved to disk

Training finished!
Wall time: 5min 56s
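
The reported accuracies can be recovered from the confusion matrices, which also give the recall for each class. A short sketch using the XGBoost and RandomForest matrices printed above (the helper name summarize is an illustrative choice; rows are true labels, and LabelEncoder sorts labels alphabetically, so row 0 should correspond to 'female' and row 1 to 'male'):

```python
def summarize(cm):
    # Accuracy is the diagonal (correct predictions) over all predictions
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    accuracy = correct / total

    # Recall per class: correct predictions over all true members of that class
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return accuracy, recalls

cm_xgb = [[67804, 15825], [18851, 62615]]
cm_rf = [[73615, 10014], [8231, 73235]]

acc_xgb, recalls_xgb = summarize(cm_xgb)
acc_rf, recalls_rf = summarize(cm_rf)

print(acc_xgb, recalls_xgb)
print(acc_rf, recalls_rf)
```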

Model 2 - Predict Gender and Emotion

In [30]:
# Getting X values
x2 = df_final[df_final.columns[:-7]].values
In [31]:
# Using Label Encoder to codify the gender + emotion labels
y2 = LabelEncoder().fit_transform(df_final['LABEL_GENDER_EMOTION'])
In [32]:
# Train test split
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size = 0.33, stratify = y2, shuffle = True)
In [33]:
# Scale the training data
z_scaler = StandardScaler()

# First fit the scaler on the train dataset
fitted2 = z_scaler.fit(x2_train)

# Then transform both train and test data with the scaler fitted on x2_train
x2_train = fitted2.transform(x2_train)
x2_test = fitted2.transform(x2_test)
In [34]:
# Function to train each classifier and evaluate performance
def train_evaluate_gender_emotion(classifiers, x_train, x_test, y_train, y_test):
    
    # Loop through the classifiers dictionary
    for k, clf in classifiers.items():
        
        print("\nStarting training for model " + k + '...')
        
        # Train model
        clf.fit(x_train, y_train)
        
        # Predict
        y_pred = clf.predict(x_test)
        
        # Calculate accuracy
        acc = accuracy_score(y_test, y_pred)
        print(f'\nClassifier {k} has accuracy: {acc}')
        
        # Create Confusion Matrix
        cm = confusion_matrix(y_test, y_pred)
        print(f'\nConfusion Matrix')
        print(cm)
        
        # Save model to disk
        dump(clf, 'models/model_' + k + '_gender_emotion.joblib')
        print('\nModel ' + k + ' saved to disk')
In [35]:
%%time

# Run the function to test all algorithms
train_evaluate_gender_emotion(classifiers, x2_train, x2_test, y2_train, y2_test)

print('\nTraining finished!')
Starting training for model XGboost...

Classifier XGboost has accuracy: 0.2822193282655441

Confusion Matrix
[[4313  353  928 1059  888    5  335  987 1440   42  212  422  223    3
    68  404]
 [  94 7587  556   99   63   47  920  134   32 1083  177   74   57   13
   167  157]
 [1098 1794 2554  740  839   31 1036 1763   83  530  345  175  279   30
   192  440]
 [2066 1080  602 2870  806   19  514 1023  647   68  100  479  142    3
    82  380]
 [1911  888  917 1307 2157    8  850 1117  457  253  202  364  107   17
    85  281]
 [  49 2718  391   46   55  144  656  158   11  610  123   23   21   13
   105  149]
 [ 665 3512 1044  751  450   36 2238  587   35  623  260  150  160   10
   253  372]
 [1010 1064 1044  797  923   18  820 2937  209  118  191  264  178    2
   127  835]
 [1182  143  132  305  394    0  111  486 5005  418  567  642  628   10
   373  948]
 [   9 1382   85   78   15   10  242   96   38 7317  585   63  185   83
   933  196]
 [ 493  954  500  235  135    5  420  619  446 2099 2157  538  667   49
   963 1236]
 [ 735  349  266  555  166    5  411  545 1462 1208  722 1748  829   29
   739  614]
 [ 782  319  326  335  338    4  201  704  902 1355  823  896 1711   82
   902 1042]
 [   3  557  124    4    8    3  127  135   28 2441  276   20  172  135
   845  268]
 [ 146 1441  268  104  145    7  326  285  254 3382  647  553  555   58
  1843  815]
 [ 561  559  445  233  323    5  175  997  522 1316  883  591  861   79
   783 1877]]

Model XGboost saved to disk

Starting training for model DecisonTree...

Classifier DecisonTree has accuracy: 0.48235864199400347

Confusion Matrix
[[6628  132  611  754  683   66  318  636  479   50  263  328  296   29
   135  274]
 [ 137 6176  506  260  271  755  967  330   70  421  307  157  154  177
   381  191]
 [ 579  468 5780  567  639  332  772  661  193  188  418  243  317  120
   314  338]
 [ 722  280  587 5672  703  149  545  583  310   74  246  347  267   23
   165  208]
 [ 684  249  633  761 5738  193  533  643  225  109  214  262  250   61
   143  223]
 [  87  738  285  128  154 2217  431  193   47  210  174   76  131   91
   202  108]
 [ 339  985  727  491  552  433 5151  515  120  288  323  253  235  144
   349  241]
 [ 586  276  762  582  680  198  546 5146  209   76  300  302  303   59
   210  302]
 [ 498   58  180  313  224   40   95  191 6789  170  507  767  567  110
   317  518]
 [  46  406  200   61  117  229  311  104  138 5761  720  441  476  736
  1121  450]
 [ 301  304  405  201  232  168  348  255  479  716 4821  688  705  386
   765  742]
 [ 331  135  294  345  268   68  240  259  810  370  680 4374  698  244
   627  640]
 [ 314  147  289  264  236  104  227  307  604  476  692  671 4665  268
   623  835]
 [  30  141  131   35   71   98  146   54   98  717  376  217  290 1944
   511  287]
 [ 131  411  292  158  144  204  288  175  293 1140  794  660  645  555
  4343  596]
 [ 260  159  314  243  214  104  223  293  537  421  709  619  797  320
   567 4430]]

Model DecisonTree saved to disk

Starting training for model RandomForest...

Classifier RandomForest has accuracy: 0.6513037947848208

Confusion Matrix
[[8424  128  502  428  343   22  156  421  473   26  173  176  184   10
    62  154]
 [  23 9428  162   54   74  160  394   69   10  443  111   24   41   46
   166   55]
 [ 331  591 8219  231  339   77  474  462   77  205  241  111  198   57
   138  178]
 [ 478  427  480 7383  340   52  281  399  275   45  149  229  140   10
    50  143]
 [ 506  336  491  516 7275   22  323  413  202  139  159  152  160   47
    52  128]
 [  20 1298  160   45   64 2587  334   86   11  253  124   27   45   41
   130   47]
 [ 146 1417  540  306  276  121 6920  249   52  341  244   88  138   55
   157   96]
 [ 447  400  718  401  478   49  358 6522  183   59  233  153  178   17
   106  235]
 [ 295   33  102  159   89    6   43  132 8446  199  476  349  397   35
   210  373]
 [   5  308   76   15   23   89  120   26   17 9231  377   72  141  215
   502  100]
 [ 163  216  238  106   81   55  175  190  303  951 7001  362  491  113
   512  559]
 [ 199  109  206  206  124   15  125  137  784  584  623 5870  482  110
   424  385]
 [ 206   91  197  122  127   25  127  226  496  678  582  427 6348  130
   387  553]
 [   9  125   70   15   14   31   77   41   19 1292  278   87  179 2297
   484  128]
 [  74  341  165   69   60   85  119  123  170 1732  554  313  386  217
  6069  352]
 [ 154  109  208  133  110   25  110  260  361  629  722  416  769  162
   535 5507]]

Model RandomForest saved to disk

Training finished!
Wall time: 29min 27s
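
The accuracies above are measured per MFCC frame, not per recording. If a single label per recording is needed, one option (not something this notebook does) is to take a majority vote over the frame-level predictions of each file. A minimal sketch, where file_ids and frame_predictions are hypothetical per-frame lists:

```python
from collections import Counter, defaultdict

def majority_vote(file_ids, frame_predictions):
    # Count how often each label was predicted within each recording
    votes = defaultdict(Counter)
    for file_id, label in zip(file_ids, frame_predictions):
        votes[file_id][label] += 1

    # Keep the most frequent label; ties go to the label seen first
    return {file_id: counts.most_common(1)[0][0] for file_id, counts in votes.items()}

# Hypothetical frame-level predictions for two recordings
file_ids = ['a', 'a', 'a', 'b', 'b']
frame_predictions = ['male_sad', 'male_sad', 'male_calm', 'female_happy', 'female_happy']

file_labels = majority_vote(file_ids, frame_predictions)
print(file_labels)
```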

Model 3 - Predict Gender Using PCA

In [36]:
# Create a PCA object with 20 components
pca_20 = PCA(n_components = 20)
In [37]:
# Using the x/y data from model 1, since it was also for gender prediction
x1_train.shape
Out[37]:
(335191, 20)
In [39]:
# Applying PCA to the training data
pca_feature = pca_20.fit(x1_train)
In [40]:
pca_feature
Out[40]:
PCA(n_components=20)
In [41]:
# Plot
plt.figure(figsize = (12, 4))
plt.plot(pca_feature.explained_variance_ratio_, marker = 'o', label = 'Explained Variance Rate')
plt.plot(pca_feature.explained_variance_ratio_.cumsum(), marker = 'o', label = 'Cumulative Explained Variance Rate')
plt.legend()
plt.ylabel('Variance')
plt.xlabel('Number of Principal Components')
plt.title('Variance vs Number of Principal Components')
plt.xticks(np.arange(0, 20, step = 1))
Out[41]:
[Figure: 'Variance vs Number of Principal Components', showing the explained variance rate and the cumulative explained variance rate (y axis: Variance) against the number of principal components (x axis, ticks 0 to 19)]

The plot (orange line, cumulative explained variance) shows that 15 components explain over 95% of the variance in the data, so I'll use only 15 components instead of the initial 20.
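
The same choice can be made programmatically instead of reading it off the plot. The sketch below is an illustration with made-up variance ratios; components_for_variance is a hypothetical helper, and in the notebook it would be called with pca_feature.explained_variance_ratio_.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, threshold=0.95):
    # Walk the cumulative sum and return the first count that reaches the threshold
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= threshold:
            return k
    # Threshold never reached: keep every component
    return len(explained_variance_ratio)

# Made-up ratios, sorted in decreasing order as PCA returns them
ratios = [0.5, 0.2, 0.15, 0.08, 0.04, 0.03]
n_keep = components_for_variance(ratios)
print(n_keep)  # 5
```

scikit-learn's PCA also accepts a fraction between 0 and 1 as n_components (with svd_solver = 'full'), which selects the number of components by explained variance directly.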

In [42]:
# Projecting the training data onto the principal components
components = pca_feature.fit_transform(x1_train)
In [43]:
# Visualize
components
Out[43]:
array([[ 0.81793505,  2.2289948 ,  6.551832  , ...,  0.7566632 ,
         1.2090428 , -0.55567753],
       [-0.406427  , -3.8125827 ,  0.01774758, ..., -0.41411147,
         0.5542691 , -0.3927645 ],
       [-1.6900054 , -0.26272348, -0.10464392, ...,  0.20780295,
        -0.09791111,  0.17072834],
       ...,
       [-2.2944653 , -0.7926662 , -1.0375288 , ...,  0.08214777,
        -0.09014749,  0.46862382],
       [-1.5170457 ,  0.3078449 , -0.2936892 , ...,  0.10762074,
         0.01201031, -0.10196748],
       [ 0.74866325,  1.5854828 , -0.6777546 , ..., -0.4836019 ,
        -0.3262427 , -0.14921625]], dtype=float32)
In [44]:
# Check type
type(components)
Out[44]:
numpy.ndarray
In [45]:
# Converting the array to dataframe
components_df = pd.DataFrame(components)
In [46]:
# Adjusting index
components_df.index = pd.RangeIndex(start = 0, stop = len(components_df.index), step = 1)
In [47]:
# View
components_df.head()
Out[47]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0.817935 2.228995 6.551832 0.919350 -1.157539 -2.351449 1.184773 1.623234 0.657935 -2.205470 -0.191457 0.061119 -1.102526 -0.151148 -1.937881 -0.743959 -0.989330 0.756663 1.209043 -0.555678
1 -0.406427 -3.812583 0.017748 -0.988423 0.166966 -0.300564 -0.527874 1.114727 0.379517 -0.959077 0.783221 -0.439352 -0.050427 0.330571 0.413873 0.398849 0.401881 -0.414111 0.554269 -0.392765
2 -1.690005 -0.262723 -0.104644 -0.741844 0.059906 1.094772 -0.471815 -0.135232 0.029346 -1.054106 0.362792 -0.680469 -0.535968 0.015750 -0.052635 -0.473981 0.013689 0.207803 -0.097911 0.170728
3 -2.169122 0.356323 -0.581886 0.061808 -0.007264 -0.296470 -0.004363 -0.224322 0.181942 0.377460 -0.008394 0.583931 0.728293 -0.252462 -0.133282 0.010115 -0.112256 0.037229 0.023215 0.079963
4 4.793149 0.911040 -0.500708 0.387858 -1.655834 2.818822 -0.676771 -0.806656 1.783190 1.705665 0.195869 0.802911 -4.034620 -2.135656 1.371211 -0.161348 -1.030812 -0.592246 0.279889 -0.230941
In [48]:
# Extracting the top 15 components
x_train_pca = components_df[components_df.columns[0:15]].values
In [49]:
# Setting y
y_train_pca = y1_train
In [50]:
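# For test data, use the model 1 test split, projected with the PCA fitted on the
# training data and truncated to the same top 15 components
x_test_pca = pca_feature.transform(x1_test)[:, :15]
y_test_pca = y1_test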
In [51]:
len(x_train_pca)
Out[51]:
335191
In [52]:
len(y_train_pca)
Out[52]:
335191
In [53]:
len(x_test_pca)
Out[53]:
165095
In [54]:
len(y_test_pca)
Out[54]:
165095
In [55]:
# Function to train each classifier and evaluate performance
def train_evaluate_gender_pca(classifiers, x_train, x_test, y_train, y_test):
    
    # Loop through the classifiers dictionary
    for k, clf in classifiers.items():
        
        print("\nStarting training for model " + k + '...')
        
        # Train model
        clf.fit(x_train, y_train)
        
        # Predict
        y_pred = clf.predict(x_test)
        
        # Calculate accuracy
        acc = accuracy_score(y_test, y_pred)
        print(f'\nClassifier {k} has accuracy: {acc}')
        
        # Create Confusion Matrix
        cm = confusion_matrix(y_test, y_pred)
        print(f'\nConfusion Matrix')
        print(cm)
        
        # Save model to disk
        dump(clf, 'models/model_' + k + '_gender_pca.joblib')
        print('\nModel ' + k + ' saved to disk')
In [56]:
%%time

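# Make sure the test data is projected with the same PCA and truncated to the same 15 components
x_test_pca = pca_feature.transform(x1_test)[:, :15]

# Run the function to test all algorithms on the PCA features
# NOTE: the output recorded below was produced with the Model 2 arrays
# (x2_train, x2_test, y2_train, y2_test), not with the PCA features
train_evaluate_gender_pca(classifiers, x_train_pca, x_test_pca, y_train_pca, y_test_pca)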

print('\nTraining finished!')
Starting training for model XGboost...

Classifier XGboost has accuracy: 0.2822193282655441

Confusion Matrix
[[4313  353  928 1059  888    5  335  987 1440   42  212  422  223    3
    68  404]
 [  94 7587  556   99   63   47  920  134   32 1083  177   74   57   13
   167  157]
 [1098 1794 2554  740  839   31 1036 1763   83  530  345  175  279   30
   192  440]
 [2066 1080  602 2870  806   19  514 1023  647   68  100  479  142    3
    82  380]
 [1911  888  917 1307 2157    8  850 1117  457  253  202  364  107   17
    85  281]
 [  49 2718  391   46   55  144  656  158   11  610  123   23   21   13
   105  149]
 [ 665 3512 1044  751  450   36 2238  587   35  623  260  150  160   10
   253  372]
 [1010 1064 1044  797  923   18  820 2937  209  118  191  264  178    2
   127  835]
 [1182  143  132  305  394    0  111  486 5005  418  567  642  628   10
   373  948]
 [   9 1382   85   78   15   10  242   96   38 7317  585   63  185   83
   933  196]
 [ 493  954  500  235  135    5  420  619  446 2099 2157  538  667   49
   963 1236]
 [ 735  349  266  555  166    5  411  545 1462 1208  722 1748  829   29
   739  614]
 [ 782  319  326  335  338    4  201  704  902 1355  823  896 1711   82
   902 1042]
 [   3  557  124    4    8    3  127  135   28 2441  276   20  172  135
   845  268]
 [ 146 1441  268  104  145    7  326  285  254 3382  647  553  555   58
  1843  815]
 [ 561  559  445  233  323    5  175  997  522 1316  883  591  861   79
   783 1877]]

Model XGboost saved to disk

Starting training for model DecisonTree...

Classifier DecisonTree has accuracy: 0.4820860716557134

Confusion Matrix
[[6614  140  579  771  684   78  311  653  480   49  262  342  320   32
   123  244]
 [ 156 6187  494  260  264  758  988  315   68  417  311  171  146  156
   374  195]
 [ 577  505 5748  577  659  315  750  675  187  190  396  275  336  109
   301  329]
 [ 720  291  635 5650  737  154  504  584  332   72  223  344  256   34
   156  189]
 [ 648  262  671  752 5771  170  510  646  245  114  219  249  226   62
   162  214]
 [  85  713  282  144  162 2181  443  192   47  209  176   81  127  100
   212  118]
 [ 330  967  756  506  552  406 5161  514  124  295  342  261  208  127
   348  249]
 [ 618  285  732  605  675  213  512 5148  231   88  271  292  286   64
   217  300]
 [ 525   63  202  312  241   38  105  197 6749  174  487  751  560  126
   316  498]
 [  61  418  225   77  108  233  293  100  154 5771  710  425  434  740
  1094  474]
 [ 292  291  404  196  221  175  348  276  488  718 4816  708  686  392
   766  739]
 [ 334  137  291  334  270   64  240  264  788  343  683 4406  698  259
   644  628]
 [ 287  149  289  276  239   88  229  321  594  456  673  690 4682  273
   636  840]
 [  34  133  117   36   68  103  146   60   99  713  399  202  295 1917
   530  294]
 [ 127  407  281  156  155  208  311  185  296 1186  756  621  615  566
  4348  611]
 [ 257  156  302  226  198  115  221  313  518  413  675  631  856  298
   590 4441]]

Model DecisonTree saved to disk

Starting training for model RandomForest...

Classifier RandomForest has accuracy: 0.6494684878403344

Confusion Matrix
[[8413  136  511  445  345   16  144  428  466   25  177  175  182   15
    62  142]
 [  22 9403  181   45   72  146  423   81   12  436  124   24   34   47
   167   43]
 [ 328  601 8236  241  297   73  450  438   79  227  262  114  196   53
   145  189]
 [ 468  443  481 7365  329   41  308  399  288   53  142  230  155    5
    55  119]
 [ 525  319  485  529 7219   36  329  434  218  139  157  143  160   53
    48  127]
 [  25 1363  171   44   53 2534  329   72   15  270  117   18   45   39
   121   56]
 [ 161 1440  553  328  270  113 6847  262   42  318  258   80  132   59
   180  103]
 [ 485  410  665  389  473   62  351 6525  166   56  214  156  208   12
   118  247]
 [ 317   37  109  134   87    5   38  144 8453  227  472  338  409   43
   205  326]
 [   4  297   76   20   23   99  124   33   15 9220  347   85  141  224
   517   92]
 [ 179  223  219  108   80   67  170  203  284  948 6961  399  487  117
   524  547]
 [ 214  114  191  197  121   11  155  156  777  555  630 5796  524  105
   433  404]
 [ 204  105  175  126  123   30  106  229  511  625  600  434 6355  144
   420  535]
 [  12  140   75    8   13   36   80   40   24 1276  261   99  160 2316
   459  147]
 [  77  361  168   73   66   81  116  118  154 1748  551  324  376  204
  6032  380]
 [ 161  116  192  119  106   34  102  264  387  618  680  439  779  156
   508 5549]]

Model RandomForest saved to disk

Training finished!
Wall time: 30min 56s

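The confusion matrices above have 16 classes and the accuracies match Model 2 almost exactly (0.2822, 0.4821 and 0.6495 here versus 0.2822, 0.4824 and 0.6513 there), so this run used the Model 2 data (x2_train, x2_test, y2_train, y2_test) rather than the PCA features. It therefore says nothing about the effect of PCA on gender prediction, and for now I'll keep the first model. Next project is doing voice prediction with AI :)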