DSCI 552 - Machine Learning for Data Science
Homework 5
Matheus Schmitz
USC ID: 5039286453
# Data Manipulation
import numpy as np
import pandas as pd
# Scikit-Learn
from sklearn.model_selection import train_test_split, KFold
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, hamming_loss, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# SMOTE
from imblearn.over_sampling import SMOTE
# Warnings
import warnings
warnings.filterwarnings('ignore')
# Control downsampling of data to avoid long processing time during development
# 1 means 100% of data (no downsampling), 0.5 means 50% of data, and so on
DEV_DOWNSAMPLING = 1
# Read csv
df = pd.read_csv('../data/Frogs_MFCCs.csv')
df = df.sample(frac=DEV_DOWNSAMPLING)
print(f'df.shape: {df.shape}')
df.head(3)
# Train-test split
df_train, df_test = train_test_split(df, test_size=0.3)
df_train.reset_index(inplace=True, drop=True)
df_test.reset_index(inplace=True, drop=True)
print(f'df_train.shape: {df_train.shape}')
print(f'df_test.shape: {df_test.shape}')
# Extract labels to be used
labels = [i for i in df.columns[-4:-1]]
labels
# Dataframes for Exact Match Loss
pred_train_labels = pd.DataFrame()
pred_test_labels = pd.DataFrame()
true_train_multilabel = df_train[labels].stack().groupby(level=0).apply(''.join).to_frame('true_train')
true_test_multilabel = df_test[labels].stack().groupby(level=0).apply(''.join).to_frame('true_test')
# Dataframes to store all results for comparison
summary = pd.DataFrame()
summary_multilabel = pd.DataFrame()
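The stack/groupby/join idiom above concatenates each row's labels into one string, so two rows match only when every label matches; this is what makes the Exact Match comparison possible. A minimal sketch of the idiom on made-up toy labels:
# Minimal sketch (toy, made-up labels) of the stack/groupby/join idiom used above
toy = pd.DataFrame({'Family': ['A', 'B'], 'Genus': ['x', 'y'], 'Species': ['1', '2']})
print(toy.stack().groupby(level=0).apply(''.join).to_frame('combined'))
# row 0 -> 'Ax1', row 1 -> 'By2'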
Exact Match Loss only counts a sample as correct when all of its labels are correctly classified. It's a strict metric.
Hamming Loss is the fraction of individual labels that are incorrectly predicted. It's a more lenient metric.
How to create a scorer with sklearn's make_scorer: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html
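To make the difference concrete, a small hand-made example (values are made up) where 1 of 3 samples has a wrong label set but only 2 of 9 individual labels are wrong. Note also that with greater_is_better=False, make_scorer negates the loss, so GridSearchCV's maximization still selects the lowest Hamming loss.
# Toy illustration (made-up values): exact match loss vs hamming loss
y_true_toy = np.array([[0, 1, 1], [1, 0, 0], [1, 1, 0]])
y_pred_toy = np.array([[0, 1, 1], [1, 0, 0], [0, 1, 1]])
print(1 - accuracy_score(y_true_toy, y_pred_toy))  # exact match loss = 1/3
print(hamming_loss(y_true_toy, y_pred_toy))        # hamming loss = 2/9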
# Create a loss function using the hamming_loss metric
hamm_loss = make_scorer(hamming_loss, greater_is_better=False)
# Pipeline to standardize then run SVC
svc = Pipeline([("standardize", StandardScaler()),
                ("svc", SVC(kernel="rbf", decision_function_shape='ovr'))])
# Grid with parameters to be tested via CV
param_grid = {'svc__C': np.logspace(-3, 3, 7),
              'svc__gamma': np.logspace(-3, 3, 7)}
# Instantiate GridSearchCV using hamming_loss as the scorer
gridCV = GridSearchCV(svc, param_grid, cv=10, n_jobs=-1, scoring=hamm_loss)
# Train one model for each label
for label in labels:
    # Get X's and Y's
    x_train = df_train.iloc[:, :-4].copy()
    y_train = df_train[label].copy()
    x_test = df_test.iloc[:, :-4].copy()
    y_test = df_test[label].copy()
    # Fit using grid search to find the best params
    gridCV.fit(x_train, y_train)
    # Predict
    pred_train = gridCV.predict(x_train)
    pred_test = gridCV.predict(x_test)
    pred_train_labels[label] = pred_train
    pred_test_labels[label] = pred_test
    # Store data for later comparison
    summary.at['C', f'SVM_{label}'] = gridCV.best_params_['svc__C']
    summary.at['gamma', f'SVM_{label}'] = gridCV.best_params_['svc__gamma']
    summary.at['strict_train', f'SVM_{label}'] = 1 - accuracy_score(y_true=y_train, y_pred=pred_train)
    summary.at['strict_test', f'SVM_{label}'] = 1 - accuracy_score(y_true=y_test, y_pred=pred_test)
    summary.at['lenient_train', f'SVM_{label}'] = hamming_loss(y_true=y_train, y_pred=pred_train)
    summary.at['lenient_test', f'SVM_{label}'] = hamming_loss(y_true=y_test, y_pred=pred_test)
    # Print model results for current label
    print(f'------------------------------ {label} ------------------------------')
    print('Best C Parameter: ', summary.at['C', f'SVM_{label}'])
    print('Best Gamma Parameter: ', summary.at['gamma', f'SVM_{label}'])
    print()
    print('Exact Match Loss | Training: ', summary.at['strict_train', f'SVM_{label}'])
    print('Exact Match Loss | Testing: ', summary.at['strict_test', f'SVM_{label}'])
    print()
    print('Hamming Loss | Training: ', summary.at['lenient_train', f'SVM_{label}'])
    print('Hamming Loss | Testing: ', summary.at['lenient_test', f'SVM_{label}'])
    print()
    print()
# Model Overall metrics
# Join all predicted label strings to calculate exact match loss
pred_train_multilabel = pred_train_labels.stack().groupby(level=0).apply(''.join).to_frame('pred_train')
pred_test_multilabel = pred_test_labels.stack().groupby(level=0).apply(''.join).to_frame('pred_test')
# Multilabel Multiclass Exact Match
summary_multilabel.at['strict_train', 'SVM'] = 1 - accuracy_score(y_true=true_train_multilabel, y_pred=pred_train_multilabel)
summary_multilabel.at['strict_test', 'SVM'] = 1 - accuracy_score(y_true=true_test_multilabel, y_pred=pred_test_multilabel)
# The overall hamming loss is simply the average across all labels
summary_multilabel.at['lenient_train', 'SVM'] = summary.iloc[-2, -3:].mean()
summary_multilabel.at['lenient_test', 'SVM'] = summary.iloc[-1, -3:].mean()
# Print model results for entire model
print(f'------------------------------ MODEL OVERALL ------------------------------')
print('Exact Match Loss | Training: ', summary_multilabel.at['strict_train', f'SVM'])
print('Exact Match Loss | Testing: ', summary_multilabel.at['strict_test', f'SVM'])
print()
print('Hamming Loss | Training: ', summary_multilabel.at['lenient_train', f'SVM'])
print('Hamming Loss | Testing: ', summary_multilabel.at['lenient_test', f'SVM'])
print()
print()
From Scikit-Learn's documentation on the dual parameter of LinearSVC: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
dual: bool, default=True
Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
Since this dataset has far more samples than features (thousands of rows vs. 22 MFCCs), dual=False is used below.
# Pipeline to standardize then run SVC
svc = Pipeline([("standardize", StandardScaler()),
                ("svc", LinearSVC(penalty="l1", multi_class='ovr', dual=False))])
# Grid with parameters to be tested via CV
param_grid = {'svc__C': np.logspace(-3, 3, 7)}
# Instantiate GridSearchCV using hamming_loss as the scorer
gridCV = GridSearchCV(svc, param_grid, cv=10, n_jobs=-1, scoring=hamm_loss)
# Train one model for each label
for label in labels:
    # Get X's and Y's
    x_train = df_train.iloc[:, :-4].copy()
    y_train = df_train[label].copy()
    x_test = df_test.iloc[:, :-4].copy()
    y_test = df_test[label].copy()
    # Fit using grid search to find the best params
    gridCV.fit(x_train, y_train)
    # Predict
    pred_train = gridCV.predict(x_train)
    pred_test = gridCV.predict(x_test)
    pred_train_labels[label] = pred_train
    pred_test_labels[label] = pred_test
    # Store data for later comparison
    summary.at['C', f'L1_{label}'] = gridCV.best_params_['svc__C']
    summary.at['strict_train', f'L1_{label}'] = 1 - accuracy_score(y_true=y_train, y_pred=pred_train)
    summary.at['strict_test', f'L1_{label}'] = 1 - accuracy_score(y_true=y_test, y_pred=pred_test)
    summary.at['lenient_train', f'L1_{label}'] = hamming_loss(y_true=y_train, y_pred=pred_train)
    summary.at['lenient_test', f'L1_{label}'] = hamming_loss(y_true=y_test, y_pred=pred_test)
    # Print model results for current label
    print(f'------------------------------ {label} ------------------------------')
    print('Best C Parameter: ', summary.at['C', f'L1_{label}'])
    print()
    print('Exact Match Loss | Training: ', summary.at['strict_train', f'L1_{label}'])
    print('Exact Match Loss | Testing: ', summary.at['strict_test', f'L1_{label}'])
    print()
    print('Hamming Loss | Training: ', summary.at['lenient_train', f'L1_{label}'])
    print('Hamming Loss | Testing: ', summary.at['lenient_test', f'L1_{label}'])
    print()
    print()
# Model Overall metrics
# Join all predicted label strings to calculate exact match loss
pred_train_multilabel = pred_train_labels.stack().groupby(level=0).apply(''.join).to_frame('pred_train')
pred_test_multilabel = pred_test_labels.stack().groupby(level=0).apply(''.join).to_frame('pred_test')
# Multilabel Multiclass Exact Match
summary_multilabel.at['strict_train', 'L1'] = 1 - accuracy_score(y_true=true_train_multilabel, y_pred=pred_train_multilabel)
summary_multilabel.at['strict_test', 'L1'] = 1 - accuracy_score(y_true=true_test_multilabel, y_pred=pred_test_multilabel)
# The overall hamming loss is simply the average across all labels
summary_multilabel.at['lenient_train', 'L1'] = summary.iloc[-2, -3:].mean()
summary_multilabel.at['lenient_test', 'L1'] = summary.iloc[-1, -3:].mean()
# Print model results for entire model
print(f'------------------------------ MODEL OVERALL ------------------------------')
print('Exact Match Loss | Training: ', summary_multilabel.at['strict_train', f'L1'])
print('Exact Match Loss | Testing: ', summary_multilabel.at['strict_test', f'L1'])
print()
print('Hamming Loss | Training: ', summary_multilabel.at['lenient_train', f'L1'])
print('Hamming Loss | Testing: ', summary_multilabel.at['lenient_test', f'L1'])
print()
print()
Route #1: Applying SMOTE to the Whole Dataset
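Before applying SMOTE per label, a minimal sketch on synthetic data (make_classification is used only for illustration) of what it does: it synthesizes new minority-class samples by interpolating between minority-class neighbors until the classes are balanced.
# Minimal sketch on synthetic data: SMOTE equalizes the class counts
from collections import Counter
from sklearn.datasets import make_classification
X_toy, y_toy = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print('before:', Counter(y_toy))  # heavily imbalanced, roughly 180 vs 20
X_res, y_res = SMOTE(random_state=0).fit_resample(X_toy, y_toy)
print('after: ', Counter(y_res))  # both classes now at the majority count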
# SMOTE: Data Preparation
smote = SMOTE(n_jobs=-1)
# Dictionaries to store the datasets for each SMOTE round (SMOTE needs to be done once per label)
master_dict_train = {}
master_dict_test = {}
# Create one SMOTE'd dataset per label
for label in labels:
    # Split data (required for SMOTE)
    x_train = df_train.iloc[:, :-4].copy()
    y_train = df_train[label].copy()
    x_test = df_test.iloc[:, :-4].copy()
    y_test = df_test[label].copy()
    # Apply SMOTE to the training set only; oversampling the test set would
    # distort the evaluation, so it is kept as-is
    tuple_train_smote = smote.fit_resample(x_train, y_train)
    tuple_test_smote = (x_test, y_test)
    # Get original column names
    col_names = [i for i in df.columns[:-4]]
    col_names.append(label)
    # Reconstruct the dataframes
    df_train_smote = pd.concat([tuple_train_smote[0], tuple_train_smote[1]], axis=1)
    df_train_smote.columns = col_names
    df_test_smote = pd.concat([tuple_test_smote[0], tuple_test_smote[1]], axis=1)
    df_test_smote.columns = col_names
    # Save dataframes to the dict
    master_dict_train[label] = df_train_smote
    master_dict_test[label] = df_test_smote
# Pipeline to standardize then run SVC
svc = Pipeline([("standardize", StandardScaler()),
                ("svc", LinearSVC(penalty="l1", multi_class='ovr', dual=False))])
# Grid with parameters to be tested via CV
param_grid = {'svc__C': np.logspace(-3, 3, 7)}
# Instantiate GridSearchCV using hamming_loss as the scorer
gridCV = GridSearchCV(svc, param_grid, cv=10, n_jobs=-1, scoring=hamm_loss)
# Train one model for each label
for label in labels:
    # Get X's and Y's
    x_train = master_dict_train[label].iloc[:, :-1].copy()
    y_train = master_dict_train[label][label].copy()
    x_test = master_dict_test[label].iloc[:, :-1].copy()
    y_test = master_dict_test[label][label].copy()
    # Fit using grid search to find the best params
    gridCV.fit(x_train, y_train)
    # Predict
    pred_train = gridCV.predict(x_train)
    pred_test = gridCV.predict(x_test)
    #pred_train_labels[label] = pred_train
    pred_test_labels[label] = pred_test
    # Store data for later comparison
    summary.at['C', f'SMOTE_{label}'] = gridCV.best_params_['svc__C']
    summary.at['strict_train', f'SMOTE_{label}'] = 1 - accuracy_score(y_true=y_train, y_pred=pred_train)
    summary.at['strict_test', f'SMOTE_{label}'] = 1 - accuracy_score(y_true=y_test, y_pred=pred_test)
    summary.at['lenient_train', f'SMOTE_{label}'] = hamming_loss(y_true=y_train, y_pred=pred_train)
    summary.at['lenient_test', f'SMOTE_{label}'] = hamming_loss(y_true=y_test, y_pred=pred_test)
    # Print model results for current label
    print(f'------------------------------ {label} ------------------------------')
    print('Best C Parameter: ', summary.at['C', f'SMOTE_{label}'])
    print()
    print('Exact Match Loss | Training: ', summary.at['strict_train', f'SMOTE_{label}'])
    print('Exact Match Loss | Testing: ', summary.at['strict_test', f'SMOTE_{label}'])
    print()
    print('Hamming Loss | Training: ', summary.at['lenient_train', f'SMOTE_{label}'])
    print('Hamming Loss | Testing: ', summary.at['lenient_test', f'SMOTE_{label}'])
    print()
    print()
# Model Overall metrics
# Join all predicted label strings to calculate exact match loss
#pred_train_multilabel = pred_train_labels.stack().groupby(level=0).apply(''.join).to_frame('pred_train')
pred_test_multilabel = pred_test_labels.stack().groupby(level=0).apply(''.join).to_frame('pred_test')
# Multilabel Multiclass Exact Match
#summary_multilabel.at['strict_train', 'SMOTE'] = 1 - accuracy_score(y_true=true_train_multilabel, y_pred=pred_train_multilabel)
summary_multilabel.at['strict_test', 'SMOTE'] = 1 - accuracy_score(y_true=true_test_multilabel, y_pred=pred_test_multilabel)
# The overall hamming loss is simply the average across all labels
summary_multilabel.at['lenient_train', 'SMOTE'] = summary.iloc[-2, -3:].mean()
summary_multilabel.at['lenient_test', 'SMOTE'] = summary.iloc[-1, -3:].mean()
# Print model results for entire model
print(f'------------------------------ MODEL OVERALL ------------------------------')
#print('Exact Match Loss | Training: ', summary_multilabel.at['strict_train', f'SMOTE'])
print('As each label on training data has its unique SMOTEd dataset, training Exact Match Loss cannot be calculated')
print('Exact Match Loss | Testing: ', summary_multilabel.at['strict_test', f'SMOTE'])
print()
print('Hamming Loss | Training: ', summary_multilabel.at['lenient_train', f'SMOTE'])
print('Hamming Loss | Testing: ', summary_multilabel.at['lenient_test', f'SMOTE'])
print()
print()
Route #2: Applying SMOTE to K-1 Folds
This approach is much slower than the previous one, which applies SMOTE once to the whole dataset and then runs GridSearchCV.
To keep the processing time reasonable, the param_grid was shrunk to the C range found in the summary above, plus and minus one order of magnitude.
Also, since cross-validation is already being performed via the K-fold splitting, the cv parameter inside GridSearchCV was reduced from 10 to 5, which halves the processing time: with 8 CPU cores, cv=10 requires two "cycles" (two folds must wait for a core to become available), while cv=5 completes in one.
Still, the cell below takes about 1 hour to run.
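For reference, a sketch (not used here) of how imblearn's own Pipeline could condense this: it re-applies SMOTE inside each GridSearchCV fold, on the training folds only, so the validation fold never sees synthetic samples. The manual loop below is kept instead so the per-fold metrics stay visible.
# Alternative sketch (not used here): imblearn's Pipeline applies SMOTE only to
# the training portion of each CV fold, avoiding leakage of synthetic samples
from imblearn.pipeline import Pipeline as ImbPipeline
svc_smote = ImbPipeline([("smote", SMOTE()),
                         ("standardize", StandardScaler()),
                         ("svc", LinearSVC(penalty="l1", multi_class='ovr', dual=False))])
gridCV_alt = GridSearchCV(svc_smote, {'svc__C': np.logspace(0, 3, 4)},
                          cv=10, n_jobs=-1, scoring=hamm_loss)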
%%time
# This cell takes close to 1 hour
# Instantiate SMOTE
smote = SMOTE(n_jobs=-1)
# Pipeline to standardize then run SVC
svc = Pipeline([("standardize", StandardScaler()),
                ("svc", LinearSVC(penalty="l1", multi_class='ovr', dual=False))])
# Grid with parameters to be tested via CV
param_grid = {'svc__C': np.logspace(0, 3, 4)}
# Instantiate GridSearchCV using hamming_loss as the scorer
gridCV = GridSearchCV(svc, param_grid, cv=5, n_jobs=-1, scoring=hamm_loss)
# KFold
kf = KFold(n_splits=10, shuffle=True)
# For each label, split the data in 10 folds, using 9 for training and 1 for validation
for label in labels:
    print(f'------------------------------ {label} ------------------------------')
    kfold_intermediate_results = pd.DataFrame()
    for fold_num, (idx_train, idx_valid) in enumerate(kf.split(df_train), 1):
        # Print current label and fold
        print(f'Working on Fold: {fold_num}')
        # Select all folds to be SMOTEd except for the validation fold
        x_train, y_train = smote.fit_resample(df_train.iloc[idx_train, :-4], df_train[label].iloc[idx_train])
        x_valid = df_train.iloc[idx_valid, :-4]
        y_valid = df_train[label].iloc[idx_valid]
        # Fit using grid search to find the best params
        gridCV.fit(x_train, y_train)
        # Predict on the train and validation folds to calculate metrics
        pred_train = gridCV.predict(x_train)
        pred_valid = gridCV.predict(x_valid)
        # Store K-Fold intermediate results
        kfold_intermediate_results.at['C', f'{fold_num}'] = gridCV.best_params_['svc__C']
        kfold_intermediate_results.at['strict_train', f'{fold_num}'] = 1 - accuracy_score(y_true=y_train, y_pred=pred_train)
        kfold_intermediate_results.at['strict_valid', f'{fold_num}'] = 1 - accuracy_score(y_true=y_valid, y_pred=pred_valid)
        kfold_intermediate_results.at['lenient_train', f'{fold_num}'] = hamming_loss(y_true=y_train, y_pred=pred_train)
        kfold_intermediate_results.at['lenient_valid', f'{fold_num}'] = hamming_loss(y_true=y_valid, y_pred=pred_valid)
    # After running all K-Folds get average results for the label
    kfold_intermediate_results['mean'] = kfold_intermediate_results.mean(axis=1)
    print()
    print(f'--- K-Fold Cross-Validation Results ---')
    print(f'Mean C Parameter: {kfold_intermediate_results["mean"]["C"]}')
    print()
    print(f'Mean Exact Match Loss | Training : {kfold_intermediate_results["mean"]["strict_train"]}')
    print(f'Mean Exact Match Loss | Validation : {kfold_intermediate_results["mean"]["strict_valid"]}')
    print()
    print(f'Mean Hamming Loss | Training : {kfold_intermediate_results["mean"]["lenient_train"]}')
    print(f'Mean Hamming Loss | Validation : {kfold_intermediate_results["mean"]["lenient_valid"]}')
    print()
    # Create a classifier using the mean C value
    svc_kfold = LinearSVC(penalty="l1", multi_class='ovr', dual=False,
                          C=kfold_intermediate_results.at['C', 'mean'])
    # Get X's and Y's - this time using the full datasets for training and testing
    x_train, y_train = smote.fit_resample(df_train.iloc[:, :-4], df_train[label])
    x_test = df_test.iloc[:, :-4].copy()
    y_test = df_test[label].copy()
    # Fit using the SVM model created with the mean C from K-Fold cross-validation
    svc_kfold.fit(x_train, y_train)
    # Predict
    pred_train = svc_kfold.predict(x_train)
    pred_test = svc_kfold.predict(x_test)
    #pred_train_labels[label] = pred_train
    pred_test_labels[label] = pred_test
    # Store data for later comparison
    summary.at['C', f'SMOTE_KF_{label}'] = svc_kfold.C
    summary.at['strict_train', f'SMOTE_KF_{label}'] = 1 - accuracy_score(y_true=y_train, y_pred=pred_train)
    summary.at['strict_test', f'SMOTE_KF_{label}'] = 1 - accuracy_score(y_true=y_test, y_pred=pred_test)
    summary.at['lenient_train', f'SMOTE_KF_{label}'] = hamming_loss(y_true=y_train, y_pred=pred_train)
    summary.at['lenient_test', f'SMOTE_KF_{label}'] = hamming_loss(y_true=y_test, y_pred=pred_test)
    # Print model results for current label
    print(f'--- Full Dataset Results ---')
    print('Exact Match Loss | Training: ', summary.at['strict_train', f'SMOTE_KF_{label}'])
    print('Exact Match Loss | Testing: ', summary.at['strict_test', f'SMOTE_KF_{label}'])
    print()
    print('Hamming Loss | Training: ', summary.at['lenient_train', f'SMOTE_KF_{label}'])
    print('Hamming Loss | Testing: ', summary.at['lenient_test', f'SMOTE_KF_{label}'])
    print()
    print()
# Model Overall metrics
# Join all predicted label strings to calculate exact match loss
#pred_train_multilabel = pred_train_labels.stack().groupby(level=0).apply(''.join).to_frame('pred_train')
pred_test_multilabel = pred_test_labels.stack().groupby(level=0).apply(''.join).to_frame('pred_test')
# Multilabel Multiclass Exact Match
#summary_multilabel.at['strict_train', 'SMOTE_KF'] = 1 - accuracy_score(y_true=true_train_multilabel, y_pred=pred_train_multilabel)
summary_multilabel.at['strict_test', 'SMOTE_KF'] = 1 - accuracy_score(y_true=true_test_multilabel, y_pred=pred_test_multilabel)
# The overall hamming loss is simply the average across all labels
summary_multilabel.at['lenient_train', 'SMOTE_KF'] = summary.iloc[-2, -3:].mean()
summary_multilabel.at['lenient_test', 'SMOTE_KF'] = summary.iloc[-1, -3:].mean()
# Print model results for entire model
print(f'------------------------------ MODEL OVERALL ------------------------------')
#print('Exact Match Loss | Training: ', summary_multilabel.at['strict_train', f'SMOTE_KF'])
print('As each label on training data has its unique SMOTEd dataset, training Exact Match Loss cannot be calculated')
print('Exact Match Loss | Testing: ', summary_multilabel.at['strict_test', f'SMOTE_KF'])
print()
print('Hamming Loss | Training: ', summary_multilabel.at['lenient_train', f'SMOTE_KF'])
print('Hamming Loss | Testing: ', summary_multilabel.at['lenient_test', f'SMOTE_KF'])
print()
print()
Quite interestingly, for all labels a lower training and validation error is achieved when the model's parameters are found with this approach than with the approach that uses only GridSearchCV and no K-fold loop.
Yet this does not translate into lower test error, suggesting that either the model is starting to overfit or it has reached the limit of what the dataset supports.
row_names = {'strict_train': 'Exact Match Loss | Train',
             'strict_test': 'Exact Match Loss | Test',
             'lenient_train': 'Hamming Loss | Train',
             'lenient_test': 'Hamming Loss | Test'}
# Summary of single-label classifiers
summary.rename(row_names)
Exact Match Loss is only stricter than Hamming Loss when there is more than one label to predict at a time, so it was expected that the two metrics would match on the single-label problems.
# Summary of multi-label classifiers
# As each label on training data has its unique SMOTEd dataset, training Exact Match Loss cannot be calculated
summary_multilabel.rename(row_names)
The original SVM classifier was the best-performing model. This was expected: an L1-penalized model can at best match an un-penalized one and will usually have a somewhat higher error, so the later models that employed L1 were bound to underperform, trading accuracy for the feature selection the penalty provides.
The more interesting aspect is how SMOTE seems to have worsened (increased) the misclassification rate. One way to dig deeper into this would be to check the class-stratified misclassification rate, to see whether the error on the rare classes was reduced at the expense of a higher error on the majority class.
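A hedged sketch of that follow-up check, assuming the per-label y_test and pred_test variables from the last training loop are still in scope (classification_report's per-class recall gives each class's error as 1 - recall):
# Sketch of the suggested follow-up (assumes y_test / pred_test from the last
# fitted label are still in scope): per-class recall shows whether the rare
# classes improved at the expense of the majority class
from sklearn.metrics import classification_report
print(classification_report(y_test, pred_test))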
DSCI 552 - Machine Learning for Data Science
Homework 5
Matheus Schmitz
USC ID: 5039286453
# tqdm is a progress bar
# Quite useful to know things are running when the processing time is long
!pip install tqdm
# Data Manipulation
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# K-Means
from sklearn.cluster import KMeans, MiniBatchKMeans
# Metrics
from sklearn.metrics import silhouette_score, hamming_loss
# Label Encoding
from sklearn.preprocessing import LabelEncoder
# Progress Bar
from tqdm.notebook import tqdm
# Warnings
import warnings
warnings.filterwarnings('ignore')
# Read csv
df = pd.read_csv('../data/Frogs_MFCCs.csv')
print(f'df.shape: {df.shape}')
df.head(3)
# Split features and labels
df_features = df.iloc[:, :-4]
df_labels = df.iloc[:, -4:-1].copy()  # copy so prediction columns can be added safely later
# KMeans: Takes about 3 minutes to train on all k values
# MiniBatchKMeans: Takes about 1 minute to train on all k values
# Dictionary to store silhouette score for each k
silhouettes = {}
# Train, predict and score K-Means on each k
for k in tqdm(range(2, 51)):
    kmeans = KMeans(n_clusters=k)
    #kmeans = MiniBatchKMeans(n_clusters=k)
    clusters = kmeans.fit_predict(df_features)
    silhouettes[k] = silhouette_score(df_features, clusters)
# Get the best K value and the associated Silhouette Score
best_k = max(silhouettes, key=lambda key: silhouettes[key])
print(f'Best K: {best_k}')
best_silhouette = silhouettes[best_k]
print(f'Silhouette Score: {best_silhouette:.5f}')
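Since matplotlib is already imported, a quick sketch to visualize the silhouette curve across the tested k values:
# Visualize the silhouette score across all tested k values
plt.figure(figsize=(8, 4))
plt.plot(list(silhouettes.keys()), list(silhouettes.values()), marker='o')
plt.axvline(best_k, color='red', linestyle='--', label=f'best k = {best_k}')
plt.xlabel('k (number of clusters)')
plt.ylabel('Silhouette Score')
plt.legend()
plt.show()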
# Instantiate a K-Means clusterer using the best_k
kmeans = KMeans(n_clusters=best_k)
#kmeans = MiniBatchKMeans(n_clusters=best_k)
# Train the K-Means and predict the clusters
clusters = kmeans.fit_predict(df_features)
# Add the predicted clusters to the dataframe with labels
df_labels['Cluster'] = clusters
# Group the dataframe by cluster
df_clusters = df_labels.groupby('Cluster')
# For each of the labels, check the most frequent class (the mode)
cluster_family = df_clusters['Family'].agg(pd.Series.mode)
cluster_genus = df_clusters['Genus'].agg(pd.Series.mode)
cluster_species = df_clusters['Species'].agg(pd.Series.mode)
# Summarize all on a dataframe
majority_classes = pd.DataFrame(data=[cluster_family, cluster_genus, cluster_species]).T
majority_classes
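One caveat with pd.Series.mode: when a cluster has a tie for the most frequent class, mode returns all tied values as an array instead of a single label, which would break the .map() lookups below. A hedged guard, assuming ties should resolve to the first modal value:
# Hedged guard (assumption: ties resolve to the first modal value) so every
# cell of majority_classes holds a single label rather than an array of ties
majority_classes = majority_classes.applymap(
    lambda m: m[0] if isinstance(m, np.ndarray) else m)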
Hamming Distance - From Scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html
The Hamming distance between 1-D arrays u and v, is simply the proportion of disagreeing components in u and v.
From this I assume the hamming distance between N-D arrays is the sum of the distances between their "inner" 1-D arrays.
Hamming Loss - From Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html
The Hamming loss is the fraction of labels that are incorrectly predicted.
Hence Scikit-Learn averages the loss over all labels, so the Hamming Distance can be recovered by multiplying sklearn's hamming_loss by the number of labels in the data.
Hamming Score is the complement of the Hamming Loss (1 - loss).
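A small numeric check of these relationships on made-up encoded labels:
# Numeric check (made-up values): scipy's hamming is the proportion of
# disagreeing positions, which matches sklearn's hamming_loss on flat arrays
from scipy.spatial.distance import hamming as scipy_hamming
u = np.array([0, 1, 2, 2])
v = np.array([0, 1, 2, 1])
print(scipy_hamming(u, v))          # 0.25 -> 1 of 4 positions disagree
print(hamming_loss(u, v))           # 0.25 -> same value
print(hamming_loss(u, v) * len(u))  # 1.0  -> the raw disagreement count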
# Use the majority_classes dataframe to assign predicted classes to each sample
df_labels['pred_Family'] = df_labels['Cluster'].map(majority_classes['Family'])
df_labels['pred_Genus'] = df_labels['Cluster'].map(majority_classes['Genus'])
df_labels['pred_Species'] = df_labels['Cluster'].map(majority_classes['Species'])
# Convert labels from strings to numeric in order to calculate hamming metrics.
# LabelBinarizer and OneHotEncoder cannot be used, as they would double the error:
# [0, 1] and [1, 0] have a hamming distance of 2, while [0] and [3] have a hamming
# distance of 1, which is correct since the classes have no hierarchy
LE = LabelEncoder()
df_labels['true_Family_encoded'] = LE.fit_transform(df_labels['Family'])
df_labels['pred_Family_encoded'] = LE.transform(df_labels['pred_Family'])
df_labels['true_Genus_encoded'] = LE.fit_transform(df_labels['Genus'])
df_labels['pred_Genus_encoded'] = LE.transform(df_labels['pred_Genus'])
df_labels['true_Species_encoded'] = LE.fit_transform(df_labels['Species'])
df_labels['pred_Species_encoded'] = LE.transform(df_labels['pred_Species'])
# Extract the true and predicted labels as arrays so they can be compared
true_labels_encoded = [data[['true_Family_encoded', 'true_Genus_encoded', 'true_Species_encoded']].values for clster, data in df_labels.groupby('Cluster')]
pred_labels_encoded = [data[['pred_Family_encoded', 'pred_Genus_encoded', 'pred_Species_encoded']].values for clster, data in df_labels.groupby('Cluster')]
# Calculate metrics
cluster_hamming_loss = [hamming_loss(np.vstack(true_labels_encoded).flatten(), np.vstack(pred_labels_encoded).flatten())]
cluster_hamming_score = [1-loss for loss in cluster_hamming_loss]
cluster_hamming_dist = [loss*len(majority_classes.columns) for loss in cluster_hamming_loss]
# Print average metrics
print(f'Average Hamming Loss: {np.mean(cluster_hamming_loss):.5f}')
print(f'Average Hamming Score: {np.mean(cluster_hamming_score):.5f}')
print(f'Average Hamming Distance: {np.mean(cluster_hamming_dist):.5f}')
# Read csv
df = pd.read_csv('../data/Frogs_MFCCs.csv')
# List to store the hamming distance in each iteration
hammings = []
# Perform the previous procedure (a + b + c) 50 times:
for iteration in tqdm(range(1, 51), desc='Monte-Carlo Simulation', ncols='90%'):
    # Split features and labels
    df_features = df.iloc[:, :-4]
    df_labels = df.iloc[:, -4:-1].copy()  # copy so prediction columns can be added safely
    #-------------------------------------------------------#
    #                (A) K-MEANS CLUSTERING                  #
    #-------------------------------------------------------#
    # Dictionary to store silhouette score for each k
    silhouettes = {}
    # Train, predict and score K-Means on each k
    # Note: the highest k is reduced to 10 here, based on the previous finding of best_k = 4
    for k in range(2, 11):
        #for k in tqdm(range(2, 51), desc='K-Means K ∈ {2, 3, ..., 50}', ncols='66%'):
        kmeans = KMeans(n_clusters=k, random_state=iteration)
        #kmeans = MiniBatchKMeans(n_clusters=k, random_state=iteration)
        clusters = kmeans.fit_predict(df_features)
        silhouettes[k] = silhouette_score(df_features, clusters)
    # Get the best K value and the associated Silhouette Score
    best_k = max(silhouettes, key=lambda key: silhouettes[key])
    print(f'Iteration {iteration} | Best K: {best_k}')
    best_silhouette = silhouettes[best_k]
    print(f'Iteration {iteration} | Silhouette Score: {best_silhouette:.5f}')
    #-------------------------------------------------------#
    #            (B) MAJORITY LABELS PER CLUSTER             #
    #-------------------------------------------------------#
    # Instantiate a K-Means clusterer using the best_k
    kmeans = KMeans(n_clusters=best_k, random_state=iteration)
    #kmeans = MiniBatchKMeans(n_clusters=best_k, random_state=iteration)
    # Train the K-Means and predict the clusters
    clusters = kmeans.fit_predict(df_features)
    # Add the predicted clusters to the dataframe with labels
    df_labels['Cluster'] = clusters
    # Group the dataframe by cluster
    df_clusters = df_labels.groupby('Cluster')
    # For each of the labels, check the most frequent class (the mode)
    cluster_family = df_clusters['Family'].agg(pd.Series.mode)
    cluster_genus = df_clusters['Genus'].agg(pd.Series.mode)
    cluster_species = df_clusters['Species'].agg(pd.Series.mode)
    # Summarize all on a dataframe
    majority_classes = pd.DataFrame(data=[cluster_family, cluster_genus, cluster_species]).T
    #-------------------------------------------------------#
    #   (C) HAMMING DISTANCE, HAMMING SCORE, HAMMING LOSS    #
    #-------------------------------------------------------#
    # Use the majority_classes dataframe to assign predicted classes to each sample
    df_labels['pred_Family'] = df_labels['Cluster'].map(majority_classes['Family'])
    df_labels['pred_Genus'] = df_labels['Cluster'].map(majority_classes['Genus'])
    df_labels['pred_Species'] = df_labels['Cluster'].map(majority_classes['Species'])
    # Convert labels from strings to numeric in order to calculate hamming metrics.
    # LabelBinarizer and OneHotEncoder cannot be used, as they would double the error:
    # [0, 1] and [1, 0] have a hamming distance of 2, while [0] and [3] have a hamming
    # distance of 1, which is correct since the classes have no hierarchy
    LE = LabelEncoder()
    df_labels['true_Family_encoded'] = LE.fit_transform(df_labels['Family'])
    df_labels['pred_Family_encoded'] = LE.transform(df_labels['pred_Family'])
    df_labels['true_Genus_encoded'] = LE.fit_transform(df_labels['Genus'])
    df_labels['pred_Genus_encoded'] = LE.transform(df_labels['pred_Genus'])
    df_labels['true_Species_encoded'] = LE.fit_transform(df_labels['Species'])
    df_labels['pred_Species_encoded'] = LE.transform(df_labels['pred_Species'])
    # Extract the true and predicted labels as arrays so they can be compared
    true_labels_encoded = [data[['true_Family_encoded', 'true_Genus_encoded', 'true_Species_encoded']].values for clster, data in df_labels.groupby('Cluster')]
    pred_labels_encoded = [data[['pred_Family_encoded', 'pred_Genus_encoded', 'pred_Species_encoded']].values for clster, data in df_labels.groupby('Cluster')]
    # Calculate metrics
    cluster_hamming_loss = [hamming_loss(np.vstack(true_labels_encoded).flatten(), np.vstack(pred_labels_encoded).flatten())]
    cluster_hamming_score = [1 - loss for loss in cluster_hamming_loss]
    cluster_hamming_dist = [loss * len(majority_classes.columns) for loss in cluster_hamming_loss]
    # Print average metrics
    print(f'Iteration {iteration} | Average Hamming Loss: {np.mean(cluster_hamming_loss):.5f}')
    print(f'Iteration {iteration} | Average Hamming Score: {np.mean(cluster_hamming_score):.5f}')
    print(f'Iteration {iteration} | Average Hamming Distance: {np.mean(cluster_hamming_dist):.5f}')
    print()
    #-------------------------------------------------------#
    #                   ITERATION METRICS                    #
    #-------------------------------------------------------#
    mean_hamming_distance = np.mean(cluster_hamming_dist)
    hammings.append(mean_hamming_distance)
print('Monte-Carlo Simulation Results:')
print(f'Hamming Distance | average = {np.mean(hammings):.5f}')
print(f'Hamming Distance | standard deviation = {np.std(hammings):.5f}')