Multimodal Emotion Recognition

Matheus Schmitz
LinkedIn
Github Portfolio

Imports

Load Data

Configuration

Standardize Data

Standardize Timesteps

One-Hot Encode Labels

Data Loader

Model Architecture

Model 1: Acoustic

Dataloaders

Model

Training

Plotting

Model 2: Visual

Dataloaders

Model

Training

Plotting

Model 3: Late Fusion

Model 4: Early Fusion

Data Preparation

Dataloaders

Model

Training

Plotting

Model Comparison

Macro F1 Score

Accuracy

Balanced Accuracy

Confusion Matrix

Receiver Operating Characteristic (ROC) - Global

Receiver Operating Characteristic (ROC) - Stratified

Precision-Recall Curve (PRC) - Stratified

Analysis of Results

All metrics agreed on the models' relative performance, ranking them from best to worst as:

  1. Early Fusion
  2. Late Fusion
  3. Visual
  4. Acoustic
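As a hedged sketch (not the notebook's exact code), the comparison metrics above can be computed per model with scikit-learn; the labels and per-model predictions below are hypothetical placeholders standing in for the real test-set outputs.

```python
# Sketch: computing macro F1, accuracy, and balanced accuracy per model.
# y_true and model_preds are hypothetical placeholders, not real results.
from sklearn.metrics import f1_score, accuracy_score, balanced_accuracy_score

y_true = [0, 1, 2, 2, 1, 0, 3, 3]  # hypothetical emotion class labels
model_preds = {
    "Acoustic":     [0, 1, 1, 2, 0, 0, 3, 1],
    "Visual":       [0, 1, 2, 2, 0, 0, 3, 3],
    "Late Fusion":  [0, 1, 2, 2, 1, 0, 3, 1],
    "Early Fusion": [0, 1, 2, 2, 1, 0, 3, 3],
}

for name, y_pred in model_preds.items():
    print(f"{name:>12} | "
          f"macro F1: {f1_score(y_true, y_pred, average='macro'):.3f} | "
          f"accuracy: {accuracy_score(y_true, y_pred):.3f} | "
          f"balanced acc: {balanced_accuracy_score(y_true, y_pred):.3f}")
```

Macro F1 and balanced accuracy average per-class scores, so they penalize models that neglect rare emotion classes, which plain accuracy does not.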

From this ranking, the first obvious observation is that multimodal features can improve classification performance: both models using multimodal features outperform the two models built on unimodal features.

Early fusion outperforms late fusion, which is what I would expect, given that early fusion allows the model to learn co-dependent effects between acoustic and visual features. Interestingly, even though there is a significant performance gap between the acoustic and visual models, all other models that in one way or another consider visual features show only incremental performance gains. That is to say, visual features seem to be the most useful, and they can be made slightly better with the addition of acoustic features, especially under early fusion.
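The structural difference between the two fusion strategies can be sketched as follows; this is an illustrative simplification, not the notebook's architecture, and the feature dimensions and classifier head are hypothetical placeholders.

```python
# Sketch of late vs. early fusion. Dimensions are hypothetical.
import torch
import torch.nn as nn

N_CLASSES, ACOUSTIC_DIM, VISUAL_DIM = 4, 128, 512  # placeholder sizes

def make_head(in_dim):
    # Small classifier head shared in shape across all variants.
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, N_CLASSES))

acoustic_head = make_head(ACOUSTIC_DIM)
visual_head = make_head(VISUAL_DIM)
early_head = make_head(ACOUSTIC_DIM + VISUAL_DIM)  # sees both modalities

# A batch of 8 hypothetical feature vectors per modality.
acoustic = torch.randn(8, ACOUSTIC_DIM)
visual = torch.randn(8, VISUAL_DIM)

# Late fusion: each modality is classified independently, and the
# predictions are combined afterwards (here, by averaging softmax scores).
late_scores = (acoustic_head(acoustic).softmax(-1)
               + visual_head(visual).softmax(-1)) / 2

# Early fusion: features are concatenated *before* the classifier, so the
# network can learn co-dependent effects between the two modalities.
early_logits = early_head(torch.cat([acoustic, visual], dim=-1))
```

The key design point is where the modalities meet: late fusion can only combine per-modality opinions, while early fusion lets every hidden unit condition on both feature sets at once.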

One possible reason for such a small gap is my choice to keep the model architecture fixed across all modalities, so as to make the comparison fair. Yet, given that it is reasonable to expect more room for learning nuances when working with multimodal data, such a model would potentially benefit from a deeper or more complex architecture. That is to say, we might be able to extract better performance from the early fusion model, and thus see its performance margin widen, by optimizing the model architecture.

End
