This project uses data from "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo, licensed under CC BY-NC-SA 4.0.
https://zenodo.org/record/1188976
The file used here is Audio_Speech_Actors_01-24.
Speech audio-only files (16bit, 48kHz .wav) from the RAVDESS. The full dataset of speech and song, audio and video (24.8 GB), is available from Zenodo. Construction and perceptual validation of the RAVDESS is described in the authors' Open Access paper in PLoS ONE.
Files
This portion of the RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
File naming convention
Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:
Filename identifiers
Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
Vocal channel (01 = speech, 02 = song).
Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
Repetition (01 = 1st repetition, 02 = 2nd repetition).
Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
Filename example: 03-01-06-01-02-01-12.wav = audio-only (03), speech (01), fearful (06), normal intensity (01), statement "Dogs are sitting by the door" (02), 1st repetition (01), actor 12 (even number, so female). A short parsing sketch follows.
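To make the convention concrete, here is a minimal parsing sketch; the parse_ravdess_filename helper and its emotion_codes lookup table are my own illustration, not part of the dataset or of any library.
# Illustrative helper (sketch): split a RAVDESS filename into its 7 identifier fields
import os

emotion_codes = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
                 '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}

def parse_ravdess_filename(path):
    # '03-01-06-01-02-01-12.wav' -> ['03', '01', '06', '01', '02', '01', '12']
    parts = os.path.basename(path).replace('.wav', '').split('-')
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {'modality': modality,
            'vocal_channel': 'speech' if channel == '01' else 'song',
            'emotion': emotion_codes[emotion],
            'intensity': 'normal' if intensity == '01' else 'strong',
            'statement': statement,
            'repetition': repetition,
            'actor': int(actor),
            'gender': 'male' if int(actor) % 2 == 1 else 'female'}

parse_ravdess_filename('03-01-06-01-02-01-12.wav')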
How to cite the RAVDESS
Academic citation
If you use the RAVDESS in an academic publication, please use the following citation: Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.
All other attributions
If you use the RAVDESS in a form other than an academic publication, such as in a blog post, school project, or non-commercial product, please use the following attribution: "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo is licensed under CC BY-NC-SA 4.0.
# Imports
import joblib
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import librosa as lr
import librosa.display
import IPython.display as ipd
import seaborn as sns
import xgboost
import sklearn
import h2o
from h2o.automl import H2OAutoML
from glob import glob
from joblib import dump, load
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
%matplotlib inline
# Package versions
%reload_ext watermark
%watermark --iversions
%%time
# Looping through 24 folders, each with 60 samples, for a total of 1440 audio files.
# The folders are named 'Actor_01' through 'Actor_24'
# Defining the root directory name
root_dir = 'Audio_Speech_Actors_01-24'
# Dictionaries to receive the outputs
files = {}
sampling_rate = {}
# Loop through all 24 actor directories
for itens in range(1, 25):
    # Define the folder path for that specific actor (actor number zero-padded to two digits)
    audio_dir = root_dir + '/Actor_' + str(itens).zfill(2)
    # Extract the paths to all of the directory's '.wav' files
    audio_files = glob(audio_dir + '/*.wav')
    # Store the duration (in seconds) and sampling rate of each file, keyed by its path
    for x in audio_files:
        audio, sfreq = lr.load(x, sr = None)
        files[x] = len(audio) / sfreq
        sampling_rate[x] = sfreq
# Check that every file has the expected 48 kHz sampling rate
all(value==48000 for value in sampling_rate.values())
# Put everything into a dataframe
audio_df = pd.DataFrame()
for keys, values in files.items():
audio_df.at[keys,'file_length'] = values
audio_df
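As a quick sanity check (an optional sketch using standard pandas, not needed for the modeling below), the durations can be summarized:
# Optional sketch: summary statistics and a histogram of the file durations, in seconds
print(audio_df['file_length'].describe())
audio_df['file_length'].plot.hist(bins = 30, title = 'Distribution of file durations (s)')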
# Function to load an audio file, returning the waveform and its sampling rate
# Both are needed to compute the MFCCs for a file
def extract_audio_data(file):
audio, sfreq = lr.load(file, sr = None)
return audio, sfreq
# Checking the shape and sampling rate for one audio file
# The ratio (number of samples / sampling rate) gives the file_length in seconds
audio, sfreq = extract_audio_data(audio_df.index[1439])
print(f'\nShape of the series representing the audio (y): {audio.shape} \nSampling Rate (sr): {sfreq}')
# Function to extract the MFCC from a file
def extract_mfcc(file):
audio, sfreq = extract_audio_data(file)
    mfccs = lr.feature.mfcc(y = audio, sr = sfreq)
return mfccs
extract_mfcc(audio_df.index[0])
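The returned array has shape (20 MFCC coefficients, number of frames). As a quick visual check, the coefficients can be plotted over time; this cell is an optional sketch of mine using librosa's standard specshow plotting call and is not needed for the modeling below.
# Optional sketch: visualize the 20 MFCCs over time for the first file
audio, sfreq = extract_audio_data(audio_df.index[0])
mfccs = lr.feature.mfcc(y = audio, sr = sfreq)
plt.figure(figsize = (12, 4))
librosa.display.specshow(mfccs, sr = sfreq, x_axis = 'time')
plt.colorbar()
plt.title('MFCCs for ' + audio_df.index[0])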
%%time
# Create a list to store one dataframe (MFCC frames + identifiers) per audio file
list_data_frame = []
# Loop to extract the MFCCs and the filename identifiers from each file
for file_path in audio_df.index:
    # Call the function to obtain the MFCCs
    data = extract_mfcc(file_path)
    # Transpose the data to shape (n samples, n features)
    frame = pd.DataFrame(data.T, columns = ['mfcc' + str(x) for x in range(20)])
    # The slice positions below follow the filename convention and assume paths of the form
    # 'Audio_Speech_Actors_01-24/Actor_XX/03-01-XX-XX-XX-XX-XX.wav'
    # Extract the actor number (gender is derived from it below)
    frame['ID_ACTOR_GENDER'] = file_path[53:55]
    # Extract the emotion code
    frame['ID_EMOTION'] = file_path[41:43]
    # Extract the emotional intensity code
    frame['ID_EMOTION_INTENSITY'] = file_path[44:46]
    # Append to the list
    list_data_frame.append(frame)
# Concatenate the list elements into a single dataframe
identifiers_df = pd.concat(list_data_frame, ignore_index = True)
# Check result
identifiers_df
# Using ID_ACTOR_GENDER to define LABEL_GENDER
identifiers_df['LABEL_GENDER'] = list(map(lambda x: 'male' if int(x)%2 == 1 else 'female',
identifiers_df.ID_ACTOR_GENDER))
# Using ID_EMOTION to define LABEL_EMOTION
identifiers_df['LABEL_EMOTION'] = identifiers_df.ID_EMOTION.map({'01':'neutral',
'02':'calm',
'03':'happy',
'04':'sad',
'05':'angry',
'06':'fearful',
'07':'disgust',
'08':'surprised'})
# Using ID_EMOTION_INTENSITY to define LABEL_INTENSITY
identifiers_df['LABEL_INTENSITY'] = identifiers_df.ID_EMOTION_INTENSITY.map({'01':'normal',
'02':'strong'})
# Creating a GENDER + EMOTION label
identifiers_df['LABEL_GENDER_EMOTION'] = identifiers_df['LABEL_GENDER'] + '_' + identifiers_df['LABEL_EMOTION']
# Copying it as df_final
df_final = identifiers_df.copy()
df_final
df_final.dtypes
# Dataset balance by gender
sns.countplot(x = 'LABEL_GENDER', data = df_final)
# Dataset balance by emotion
sns.countplot(x = 'LABEL_EMOTION', data = df_final)
# Dataset balance by gender + emotion
plt.figure(figsize = (12,4))
sns.countplot(x = 'LABEL_GENDER_EMOTION', data = df_final)
plt.xticks(rotation = 45)
# Getting X values
x1 = df_final[df_final.columns[:-7]].values
# Using LabelEncoder to encode the gender labels
y1 = LabelEncoder().fit_transform(df_final['LABEL_GENDER'])
# Train test split
x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, test_size = 0.33, stratify = y1, shuffle = True)
# Scale the training data
z_scaler = StandardScaler()
# First fit the scaler on the train dataset
fitted = z_scaler.fit(x1_train)
# Then transform both train and test data
x1_train = fitted.transform(x1_train)
x1_test = fitted.transform(x1_test)
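As a design note, scikit-learn's Pipeline can bundle the scaler and the classifier so the scaler is fitted only on training data and reapplied automatically at prediction time. The sketch below is my own addition (it is not used in the rest of this notebook) and runs cross-validation on the unscaled features x1/y1 from above; it can take a while on the full frame-level data.
# Optional sketch: a scaler + classifier pipeline, cross-validated on the unscaled features.
# The scaler is re-fitted inside each training fold, so no scaling statistics leak from the test fold.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

gender_pipeline = Pipeline([('scaler', StandardScaler()),
                            ('clf', RandomForestClassifier())])
scores = cross_val_score(gender_pipeline, x1, y1, cv = 5)
print(scores.mean())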
# Creating a dictionary with ML algorithms to be used
classifiers = {'XGboost':XGBClassifier(),
               'DecisionTree':DecisionTreeClassifier(),
'RandomForest':RandomForestClassifier()}
# Function to train each classifier and evaluate performance
def train_evaluate_gender(classifiers, x_train, x_test, y_train, y_test):
# Loop through the classifiers dictionary
for k, clf in classifiers.items():
print("\nStarting training for model " + k + '...')
# Train model
clf.fit(x_train, y_train)
# Predict
y_pred = clf.predict(x_test)
# Calculate accuracy
acc = accuracy_score(y_test, y_pred)
print(f'\nClassifier {k} has accuracy: {acc}')
# Create Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f'\nConfusion Matrix')
print(cm)
# Save model to disk
dump(clf, 'models/model_' + k + '_gender.joblib')
print('\nModel ' + k + ' saved to disk')
%%time
# Run the function to test all algorithms
train_evaluate_gender(classifiers, x1_train, x1_test, y1_train, y1_test)
print('\nTraining finished!')
# Getting X values
x2 = df_final[df_final.columns[:-7]].values
# Using LabelEncoder to encode the gender + emotion labels
y2 = LabelEncoder().fit_transform(df_final['LABEL_GENDER_EMOTION'])
# Train test split
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size = 0.33, stratify = y2, shuffle = True)
# Scale the training data
z_scaler = StandardScaler()
# First fit the scaler on the train dataset
fitted2 = z_scaler.fit(x2_train)
# Then transform both train and test data
x2_train = fitted2.transform(x2_train)
x2_test = fitted2.transform(x2_test)
# Function to train each classifier and evaluate performance
def train_evaluate_gender_emotion(classifiers, x_train, x_test, y_train, y_test):
# Loop through the classifiers dictionary
for k, clf in classifiers.items():
print("\nStarting training for model " + k + '...')
# Train model
clf.fit(x_train, y_train)
# Predict
y_pred = clf.predict(x_test)
# Calculate accuracy
acc = accuracy_score(y_test, y_pred)
print(f'\nClassifier {k} has accuracy: {acc}')
# Create Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f'\nConfusion Matrix')
print(cm)
# Save model to disk
dump(clf, 'models/model_' + k + '_gender_emotion.joblib')
print('\nModel ' + k + ' saved to disk')
%%time
# Run the function to test all algorithms
train_evaluate_gender_emotion(classifiers, x2_train, x2_test, y2_train, y2_test)
print('\nTraining finished!')
# Create a PCA object with 20 components
pca_20 = PCA(n_components = 20)
# Using the x/y data from model 1, since it was also for gender prediction
x1_train.shape
# Applying PCA to the training data
pca_feature = pca_20.fit(x1_train)
pca_feature
# Plot
plt.figure(figsize = (12, 4))
plt.plot(pca_feature.explained_variance_ratio_, marker = 'o', label = 'Explained Variance Rate')
plt.plot(pca_feature.explained_variance_ratio_.cumsum(), marker = 'o', label = 'Cumulative Explained Variance Rate')
plt.legend()
plt.ylabel('Variance')
plt.xlabel('Number of Principal Components')
plt.title('Variance vs Number of Principal Components')
plt.xticks(np.arange(0, 20, step = 1))
The plot (orange line) shows that 15 components explain over 95% of the variance in the data, so I'll use only 15 components instead of the initial 20.
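The same cutoff can be read off programmatically; a small sketch using plain NumPy on the fitted PCA above:
# Number of components needed to reach 95% of the cumulative explained variance
cum_var = pca_feature.explained_variance_ratio_.cumsum()
n_components_95 = np.argmax(cum_var >= 0.95) + 1
print(n_components_95, cum_var[n_components_95 - 1])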
# Projecting the training data onto the principal components (the PCA was fitted above)
components = pca_feature.transform(x1_train)
# Visualize
components
# Check type
type(components)
# Converting the array to dataframe
components_df = pd.DataFrame(components)
# Adjusting index
components_df.index = pd.RangeIndex(start = 0, stop = len(components_df.index), step = 1)
# View
components_df.head()
# Extracting the top 15 components for the training data
x_train_pca = components_df[components_df.columns[0:15]].values
# Setting y
y_train_pca = y1_train
# For the test data, apply the PCA fitted on the training data and keep the same top 15 components
x_test_pca = pca_feature.transform(x1_test)[:, :15]
y_test_pca = y1_test
len(x_train_pca)
len(y_train_pca)
len(x_test_pca)
len(y_test_pca)
# Function to train each classifier and evaluate performance
def train_evaluate_gender_pca(classifiers, x_train, x_test, y_train, y_test):
# Loop through the classifiers dictionary
for k, clf in classifiers.items():
print("\nStarting training for model " + k + '...')
# Train model
clf.fit(x_train, y_train)
# Predict
y_pred = clf.predict(x_test)
# Calculate accuracy
acc = accuracy_score(y_test, y_pred)
print(f'\nClassifier {k} has accuracy: {acc}')
# Create Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f'\nConfusion Matrix')
print(cm)
# Save model to disk
dump(clf, 'models/model_' + k + '_gender_pca.joblib')
print('\nModel ' + k + ' saved to disk')
%%time
# Run the function to test all algorithms on the PCA-transformed gender data
train_evaluate_gender_pca(classifiers, x_train_pca, x_test_pca, y_train_pca, y_test_pca)
print('\nTraining finished!')
It looks like performance was lower with PCA, so I'll keep the first model. The next project will be voice prediction with AI :)
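To actually use the kept gender model on a new recording, one would load it back from disk, extract MFCCs the same way, and reuse the same scaler. This is a minimal sketch of mine, assuming the fitted scaler (fitted) is still in memory (it was not saved to disk) and using a hypothetical new_file path.
# Minimal inference sketch; 'new_file' is a hypothetical path to a .wav recording
model = load('models/model_XGboost_gender.joblib')
new_file = 'path/to/new_recording.wav'
mfcc_frames = extract_mfcc(new_file).T            # shape (n frames, 20 features)
mfcc_frames = fitted.transform(mfcc_frames)       # same scaling as the training data
frame_preds = model.predict(mfcc_frames)          # one prediction per frame
# Majority vote over the frames gives a single gender label for the file (integer as produced by LabelEncoder)
print(np.bincount(frame_preds).argmax())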