Matheus Schmitz
LinkedIn
Github Portfolio
import pandas as pd
from tqdm import tqdm
imdb = pd.read_csv('./movies1/csv_files/imdb.csv').drop_duplicates()
print(f'imdb.shape: {imdb.shape}')
imdb[['Name', 'Director']].tail(3)
imdb.shape: (6407, 13)
 | Name | Director |
---|---|---|
6404 | The Runner | Austin Stark |
6405 | Free Radicals: A History of Experimental Film | Pip Chodorov |
6406 | Experimenter | Michael Almereyda |
rotten = pd.read_csv('./movies1/csv_files/rotten_tomatoes.csv').drop_duplicates()
print(f'rotten.shape: {rotten.shape}')
rotten[['Name', 'Director']].tail(3)
rotten.shape: (7390, 17)
 | Name | Director |
---|---|---|
7387 | 99 Homes | Ramin Bahrani |
7388 | Experimenter | Michael Almereyda |
7389 | The Gift | Joel Edgerton |
all_entries_df = pd.concat([imdb[['Name', 'Director']], rotten[['Name', 'Director']]], ignore_index=True).drop_duplicates()
print(f'all_entries_df.shape: {all_entries_df.shape}')
all_entries_df.tail(3)
all_entries_df.shape: (10946, 2)
 | Name | Director |
---|---|---|
13790 | Great Directors | Angela Ismailos |
13791 | Still Screaming: The Ultimate Scary Movie Retr... | Ryan Turek |
13793 | Adjust Your Tracking | Dan M. Kinem,Levi Peretic |
Reusing dataframe manipulation code from a previous project: https://matheus-schmitz.github.io/TED_Talks_Data_Analysis/
# As of now, for a given movie the Director column holds multiple director names separated by commas
# Generate multiple columns, each containing at most one director name for a given movie
movie_directors_df = pd.concat([all_entries_df[['Name']], all_entries_df['Director'].str.split(',', expand=True)], axis=1).drop_duplicates()
print(f'movie_directors_df.shape: {movie_directors_df.shape}')
movie_directors_df.tail(3)
movie_directors_df.shape: (10946, 33)
 | Name | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13790 | Great Directors | Angela Ismailos | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
13791 | Still Screaming: The Ultimate Scary Movie Retr... | Ryan Turek | None | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
13793 | Adjust Your Tracking | Dan M. Kinem | Levi Peretic | None | None | None | None | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
3 rows × 33 columns
# Turn the data back into having a single column for Directors, now with repeating rows for movies with more than one director
df_melted = movie_directors_df.melt(id_vars='Name').drop_duplicates()
print(f'df_melted.shape: {df_melted.shape}')
df_melted
df_melted.shape: (300074, 3)
 | Name | variable | value |
---|---|---|---|
0 | City of Missing Girls | 0 | Elmer Clifton |
1 | The Gay Divorcee | 0 | Mark Sandrich |
2 | The Divorcée | 0 | Robert Z. Leonard |
3 | Hells Angels on Wheels | 0 | Richard Rush |
4 | The Miracle Woman | 0 | Ketan Mehta |
... | ... | ... | ... |
350267 | Girl 27 | 31 | None |
350268 | American Grindhouse | 31 | None |
350269 | Great Directors | 31 | None |
350270 | Still Screaming: The Ultimate Scary Movie Retr... | 31 | None |
350271 | Adjust Your Tracking | 31 | None |
300074 rows × 3 columns
# Clean the dataframe so that again it has only two columns: Movie and Director
df_drops = df_melted.dropna(subset=['value']).copy()
df_drops.drop(labels=['variable'], axis=1, inplace=True)
df_drops.reset_index(inplace=True, drop=True)
df_drops.columns = ['Movie', 'Director']
df_drops.drop_duplicates(inplace=True)
print(f'df_drops.shape: {df_drops.shape}')
df_drops.tail(3)
df_drops.shape: (13125, 2)
 | Movie | Director |
---|---|---|
13650 | The ABCs of Death | Simon Barrette |
13651 | ABCs of Death 2 | Vincenzo Natali |
13652 | The ABCs of Death | Simon Barrett |
# Strip extra whitespace from names, as it was causing errors down the line
df_drops = df_drops.apply(lambda s: s.str.strip())
print(f'df_drops.shape: {df_drops.shape}')
df_drops.tail(3)
df_drops.shape: (13125, 2)
 | Movie | Director |
---|---|---|
13650 | The ABCs of Death | Simon Barrette |
13651 | ABCs of Death 2 | Vincenzo Natali |
13652 | The ABCs of Death | Simon Barrett |
# Get all unique Movie names
df_movies = df_drops['Movie'].copy()
df_movies.drop_duplicates(inplace=True)
print(f'df_movies.shape: {df_movies.shape}')
df_movies.tail(3)
df_movies.shape: (9283,)
10704    Great Directors
10705    Still Screaming: The Ultimate Scary Movie Retr...
10706    Adjust Your Tracking
Name: Movie, dtype: object
# Generate IDs for movies
ID_to_movie = {}
movie_to_ID = {}
for idx, item in df_movies.iteritems():
ID_to_movie[idx] = item
movie_to_ID[item] = idx
# Get all unique Director names
df_directors = df_drops['Director'].copy()
df_directors.drop_duplicates(inplace=True)
print(f'df_directors.shape: {df_directors.shape}')
df_directors.tail(3)
df_directors.shape: (7122,)
13649    Todd Rohal
13650    Simon Barrette
13651    Vincenzo Natali
Name: Director, dtype: object
# Generate IDs for directors (start at 50000 to avoid bumping into movie IDs)
ID_to_director = {}
director_to_ID = {}
for idx, item in df_directors.iteritems():
ID_to_director[idx + 50000] = item
director_to_ID[item] = idx + 50000
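As a quick sanity check on the ID scheme (a sketch, not part of the original pipeline): the movie IDs come from the DataFrame index, which stays far below 50000, so the +50000 offset keeps the two ID spaces disjoint.
# Sanity-check sketch: the movie and director ID ranges should not overlap
assert max(ID_to_movie.keys()) < 50000 <= min(ID_to_director.keys())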
# Plug the IDs back into the DataFrame with Movie/Director
df_ids = pd.DataFrame(columns = ['Movie', 'Director', 'Movie_ID', 'Director_ID'])
for idx, row in tqdm(df_drops.iterrows(), total=df_drops.shape[0]):
df_ids = df_ids.append({'Movie': row.Movie,
'Director': row.Director,
'Movie_ID': movie_to_ID[row.Movie],
'Director_ID': director_to_ID[row.Director]},
ignore_index=True)
print(f'df_ids.shape: {df_ids.shape}')
df_ids.tail(3)
100%|███████████████████████████████████████████████████████████████████████████| 13125/13125 [00:59<00:00, 220.32it/s]
df_ids.shape: (13125, 4)
 | Movie | Director | Movie_ID | Director_ID |
---|---|---|---|---|
13122 | The ABCs of Death | Simon Barrette | 5538 | 63650 |
13123 | ABCs of Death 2 | Vincenzo Natali | 5540 | 63651 |
13124 | The ABCs of Death | Simon Barrett | 5538 | 55545 |
import string
# Function to block based on the first letter of each word in the string, ignoring numbers ("Ice Age" and "Ice Age 2" get the same block)
# Also ignore "The" in movie names, as it is often omitted from titles, which can mess up the blocks
def generate_blocks(full_name):
block = ''.join(sorted([s[0].upper() for s in full_name.split() if s != "The" and s != "the"]))
# Strip number from the block as long as it won't result in a Null block
if len(block.strip('0123456789')) > 0:
block = block.strip('0123456789')
return block
# There are some missing values represented as "???" which become None if all punctuation is stripped, which then generates errors
def strip_punctuation_all(s):
return s.translate(str.maketrans('', '', string.punctuation))
# Strip only certain punctuation characters for text cleaning
def strip_punctuation_selective(s):
s = s.replace('.', ' ')
s = s.replace('(', ' ')
s = s.replace(')', ' ')
s = s.strip()
return s
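A quick illustrative check of the blocking scheme (not from the original notebook; the titles and name below are just hypothetical inputs) showing what generate_blocks produces once the selective punctuation stripping is applied:
# Illustrative sketch of the blocking behavior on a few hypothetical inputs
for example in ["The Ice Age", "Ice Age 2", "The ABCs of Death", "J.J. Abrams"]:
    print(example, '->', generate_blocks(strip_punctuation_selective(example)))
# "The Ice Age" and "Ice Age 2" both map to 'AI', so they land in the same block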
# When generating director blocks, make sure names written with initials (e.g. "J.J.") are handled too
df_ids['DirectorBlock'] = df_ids['Director'].apply(strip_punctuation_selective).apply(generate_blocks)
df_ids['MovieBlock'] = df_ids['Movie'].apply(strip_punctuation_selective).apply(generate_blocks)
df_ids.tail(3)
 | Movie | Director | Movie_ID | Director_ID | DirectorBlock | MovieBlock |
---|---|---|---|---|---|---|
13122 | The ABCs of Death | Simon Barrette | 5538 | 63650 | BS | ADO |
13123 | ABCs of Death 2 | Vincenzo Natali | 5540 | 63651 | NV | ADO |
13124 | The ABCs of Death | Simon Barrett | 5538 | 55545 | BS | ADO |
directorBlock_obs = df_ids[['Director_ID', 'DirectorBlock']].drop_duplicates()
directorBlock_obs.to_csv('full_directorBlock_obs.txt', sep='\t', header=False, index=False)
print(f'directorBlock_obs.shape: {directorBlock_obs.shape}')
directorBlock_obs.shape: (7122, 2)
directorName_obs = df_ids[['Director_ID', 'Director']].drop_duplicates()
directorName_obs.to_csv('full_directorName_obs.txt', sep='\t', header=False, index=False)
print(f'directorName_obs.shape: {directorName_obs.shape}')
directorName_obs.shape: (7122, 2)
directorOf_obs = df_ids[['Director_ID', 'Movie_ID']].drop_duplicates()
directorOf_obs.to_csv('full_directorOf_obs.txt', sep='\t', header=False, index=False)
print(f'directorOf_obs.shape: {directorOf_obs.shape}')
directorOf_obs.shape: (13125, 2)
movieBlock_obs = df_ids[['Movie_ID', 'MovieBlock']].drop_duplicates()
movieBlock_obs.to_csv('full_movieBlock_obs.txt', sep='\t', header=False, index=False)
print(f'movieBlock_obs.shape: {movieBlock_obs.shape}')
movieBlock_obs.shape: (9283, 2)
movieTitle_obs = df_ids[['Movie_ID', 'Movie']].drop_duplicates()
movieTitle_obs.to_csv('full_movieTitle_obs.txt', sep='\t', header=False, index=False)
print(f'movieTitle_obs.shape: {movieTitle_obs.shape}')
movieTitle_obs.shape: (9283, 2)
from strsimpy.jaro_winkler import JaroWinkler
from strsimpy.normalized_levenshtein import NormalizedLevenshtein
jaro_winkler = JaroWinkler().similarity
with open('full_simDirector_obs.txt', 'w') as f_out:
# Iterate over blocks
for block in tqdm(df_ids['DirectorBlock'].unique()):
# Then do pairwise comparison for all items on the current block
for left_item in df_ids[df_ids['DirectorBlock'] == block]['Director'].unique():
for right_item in df_ids[df_ids['DirectorBlock'] == block]['Director'].unique():
# Calculate similarity
similarity = jaro_winkler(left_item, right_item)
# Get IDs
left_id = director_to_ID[left_item]
right_id = director_to_ID[right_item]
# And write to the similarities file
f_out.write(f'{left_id}' + '\t' + f'{right_id}' + '\t' + f'{similarity}' + '\n')
100%|████████████████████████████████████████████████████████████████████████████████| 966/966 [00:22<00:00, 42.37it/s]
normalized_levenshtein = NormalizedLevenshtein().similarity
with open('full_simMovie_obs.txt', 'w') as f_out:
# Iterate over blocks
for block in tqdm(df_ids['MovieBlock'].unique()):
# Then do pairwise comparison for all items on the current block
for left_item in df_ids[df_ids['MovieBlock'] == block]['Movie'].unique():
for right_item in df_ids[df_ids['MovieBlock'] == block]['Movie'].unique():
# Calculate similarity
similarity = normalized_levenshtein(left_item.lower(), right_item.lower())
# Get IDs
left_id = movie_to_ID[left_item]
right_id = movie_to_ID[right_item]
# And write to the similarities file
f_out.write(f'{left_id}' + '\t' + f'{right_id}' + '\t' + f'{similarity}' + '\n')
100%|██████████████████████████████████████████████████████████████████████████████| 3609/3609 [00:42<00:00, 83.96it/s]
# All pairwise comparisons for each block of Movies
with open('full_sameMovie_target.txt', 'w') as f_out:
# Iterate over blocks
for block in tqdm(df_ids['MovieBlock'].unique()):
# Then do pairwise comparison for all items on the current block
for left_item in df_ids[df_ids['MovieBlock'] == block]['Movie_ID'].unique():
for right_item in df_ids[df_ids['MovieBlock'] == block]['Movie_ID'].unique():
                # And write the pair to the target file
f_out.write(f'{left_item}' + '\t' + f'{right_item}' + '\n')
100%|█████████████████████████████████████████████████████████████████████████████| 3609/3609 [00:21<00:00, 167.25it/s]
# All pairwise comparisons for each block of Directors
with open('full_sameDirector_target.txt', 'w') as f_out:
# Iterate over blocks
for block in tqdm(df_ids['DirectorBlock'].unique()):
# Then do pairwise comparison for all items on the current block
for left_item in df_ids[df_ids['DirectorBlock'] == block]['Director_ID'].unique():
for right_item in df_ids[df_ids['DirectorBlock'] == block]['Director_ID'].unique():
                # And write the pair to the target file
f_out.write(f'{left_item}' + '\t' + f'{right_item}' + '\n')
100%|████████████████████████████████████████████████████████████████████████████████| 966/966 [00:14<00:00, 68.64it/s]
According to the thread "Multiple Directors" on BlackBoard, we only need to evaluate on movies.
labeled = pd.read_csv('./movies1/csv_files/labeled_data.csv', skiprows=5)
print(f'labeled.shape: {labeled.shape}')
labeled.tail(3)
labeled.shape: (600, 10)
 | _id | ltable.Id | rtable.Id | ltable.Director | ltable.Name | ltable.Year | rtable.Director | rtable.Name | rtable.YearRange | gold |
---|---|---|---|---|---|---|---|---|---|---|
597 | 11103 | tt4562728 | the_divergent_series_allegiant | Robert Schwartzman | MF | 2015 | Robert Schwentke | The Divergent Series: Allegiant | 2015 2016 2017 | 0 |
598 | 11120 | tt4795692 | sisters_2015 | Sean Hanish | Sister Cities | 2016 | Em Cooper,Jason Moore | Sisters | 2014 2015 2016 | 0 |
599 | 11138 | tt4984930 | the_mend | Su Rynard | The Messenger | 2015 | John Magary | The Mend | 2014 2015 2016 | 0 |
# Get a dataframe with only the movie matchings and their gold labels
labeled_movie = labeled[['ltable.Name', 'rtable.Name', 'gold']]
print(f'labeled_movie.shape: {labeled_movie.shape}')
labeled_movie.tail(3)
labeled_movie.shape: (600, 3)
 | ltable.Name | rtable.Name | gold |
---|---|---|---|
597 | MF | The Divergent Series: Allegiant | 0 |
598 | Sister Cities | Sisters | 0 |
599 | The Messenger | The Mend | 0 |
# Encode the labeled_movie DataFrame into IDs
labeled_movie_IDs = pd.DataFrame(columns = ['l_movie_id', 'r_movie_id', 'gold'])
for idx, row in labeled_movie.iterrows():
    # There is one movie, "Dj Vu", which somehow is in the labeled dataset but not in either of the original source datasets
    # It would raise a KeyError, so I'm just skipping that one movie
try:
labeled_movie_IDs = labeled_movie_IDs.append({'l_movie_id': movie_to_ID[row["ltable.Name"]],
'r_movie_id': movie_to_ID[row["rtable.Name"]],
'gold': row.gold}, ignore_index=True)
    except KeyError:
pass
# There is also one entry which seems to be a duplicate and can cause errors later
labeled_movie_IDs.drop_duplicates(inplace=True)
print(f'labeled_movie_IDs.shape: {labeled_movie_IDs.shape}')
labeled_movie_IDs.tail(3)
labeled_movie_IDs.shape: (598, 3)
 | l_movie_id | r_movie_id | gold |
---|---|---|---|
596 | 7356 | 5865 | 0 |
597 | 8344 | 263 | 0 |
598 | 6158 | 582 | 0 |
# Save as a "tab-separated .txt" file which is what PSL requires
# Truth
labeled_movie_IDs.to_csv('labeled_sameMovie_truth.txt', sep='\t', header=False, index=False)
# Target
labeled_movie_IDs.drop('gold', axis=1).to_csv('labeled_sameMovie_target.txt', sep='\t', header=False, index=False)
Using all samples in the candidate dataset was generating a file too large for my computer to handle. So one reasonable approach is to downsample the data while ensuring that all samples from the labeled dataset are present in the downsampled version. The samples outside the labeled dataset are still important, as they help the Collective KG-based ER Rules, but since using all of them proved too much, downsampling them instead of removing them is a reasonable compromise.
# First downsample the DataFrame with all candidates
df_ids_downsampled = df_ids.sample(frac=0.5)
print(f'df_ids_downsampled.shape: {df_ids_downsampled.shape}')
df_ids_downsampled.tail(3)
df_ids_downsampled.shape: (6562, 6)
 | Movie | Director | Movie_ID | Director_ID | DirectorBlock | MovieBlock |
---|---|---|---|---|---|---|
10126 | Creative Control | Benjamin Dickinson | 10126 | 60126 | BD | CC |
1099 | The Dark Knight | Christopher Nolan | 1099 | 51089 | CN | DK |
10772 | Bambi | Graham Heid | 361 | 60793 | GH | B |
# Then add all pairs in the labeled data to the downsampled DataFrame
for idx, row in tqdm(labeled_movie_IDs.iterrows(), total=labeled_movie_IDs.shape[0]):
# Fetch the movie names
left_movie_name = ID_to_movie[row.l_movie_id]
right_movie_name = ID_to_movie[row.r_movie_id]
# Fetch the director IDs
left_director_id = directorOf_obs[directorOf_obs['Movie_ID'] == row.l_movie_id].Director_ID.values[0]
right_director_id = directorOf_obs[directorOf_obs['Movie_ID'] == row.r_movie_id].Director_ID.values[0]
# Fetch director names
left_director_name = ID_to_director[left_director_id]
right_director_name = ID_to_director[right_director_id]
# Generate director blocks
left_director_block = pd.Series(left_director_name).apply(strip_punctuation_selective).apply(generate_blocks)[0]
right_director_block = pd.Series(right_director_name).apply(strip_punctuation_selective).apply(generate_blocks)[0]
# Generate movie blocks
left_movie_block = pd.Series(left_movie_name).apply(strip_punctuation_selective).apply(generate_blocks)[0]
right_movie_block = pd.Series(right_movie_name).apply(strip_punctuation_selective).apply(generate_blocks)[0]
# Create one dictionary per movie (left and right)
left_dict = {'Movie': left_movie_name,
'Director': left_director_name,
'Movie_ID': row.l_movie_id,
'Director_ID': left_director_id,
'DirectorBlock': left_director_block,
'MovieBlock': left_movie_block}
right_dict = {'Movie': right_movie_name,
'Director': right_director_name,
'Movie_ID': row.r_movie_id,
'Director_ID': right_director_id,
'DirectorBlock': right_director_block,
'MovieBlock': right_movie_block}
    # Append both samples to the downsampled DataFrame
df_ids_downsampled = df_ids_downsampled.append([left_dict, right_dict], ignore_index=True)
# Appending dictionaries changes the column types to "object", return them to proper types
df_ids_downsampled = df_ids_downsampled.astype('string')
df_ids_downsampled['Movie_ID'] = df_ids_downsampled['Movie_ID'].astype('int')
df_ids_downsampled['Director_ID'] = df_ids_downsampled['Director_ID'].astype('int')
# Remove any duplicates in the downsampled DataFrame
df_ids_downsampled.drop_duplicates(inplace=True)
print(f'df_ids_downsampled.shape: {df_ids_downsampled.shape}')
df_ids_downsampled.tail(3)
100%|████████████████████████████████████████████████████████████████████████████████| 598/598 [00:06<00:00, 93.15it/s]
df_ids_downsampled.shape: (7004, 6)
 | Movie | Director | Movie_ID | Director_ID | DirectorBlock | MovieBlock |
---|---|---|---|---|---|---|
7751 | The Metropolitan Opera: Cavalleria Rusticana/p... | David McVicar | 3278 | 53278 | DM | CMOR |
7753 | The Divergent Series: Allegiant | Robert Schwentke | 5865 | 55103 | RS | ADS |
7756 | The Messenger | Oren Moverman | 6158 | 53979 | MO | M |
Filter All Files Based on the IDs in the Downsampled File
# All pairwise comparisons for each downsampled block of Movies
with open('downsampled_sameMovie_target.txt', 'w') as f_out:
# Iterate over blocks
for block in tqdm(df_ids_downsampled['MovieBlock'].unique()):
# Then do pairwise comparison for all items on the current block
for left_item in df_ids_downsampled[df_ids_downsampled['MovieBlock'] == block]['Movie_ID'].unique():
for right_item in df_ids_downsampled[df_ids_downsampled['MovieBlock'] == block]['Movie_ID'].unique():
                # And write the pair to the target file
f_out.write(f'{left_item}' + '\t' + f'{right_item}' + '\n')
100%|█████████████████████████████████████████████████████████████████████████████| 2362/2362 [00:13<00:00, 175.82it/s]
# All pairwise comparisons for each downsampled block of Directors
with open('downsampled_sameDirector_target.txt', 'w') as f_out:
# Iterate over blocks
for block in tqdm(df_ids_downsampled['DirectorBlock'].unique()):
# Then do pairwise comparison for all items on the current block
for left_item in df_ids_downsampled[df_ids_downsampled['DirectorBlock'] == block]['Director_ID'].unique():
for right_item in df_ids_downsampled[df_ids_downsampled['DirectorBlock'] == block]['Director_ID'].unique():
                # And write the pair to the target file
f_out.write(f'{left_item}' + '\t' + f'{right_item}' + '\n')
100%|████████████████████████████████████████████████████████████████████████████████| 782/782 [00:09<00:00, 85.00it/s]
downsampled_directorBlock_obs = directorBlock_obs[directorBlock_obs['Director_ID'].isin(df_ids_downsampled['Director_ID'].values)]
downsampled_directorBlock_obs.to_csv('downsampled_directorBlock_obs.txt', sep='\t', header=False, index=False)
print(f'downsampled_directorBlock_obs.shape: {downsampled_directorBlock_obs.shape}')
downsampled_directorBlock_obs.shape: (4452, 2)
downsampled_directorName_obs = directorName_obs[directorName_obs['Director_ID'].isin(df_ids_downsampled['Director_ID'].values)]
downsampled_directorName_obs.to_csv('downsampled_directorName_obs.txt', sep='\t', header=False, index=False)
print(f'downsampled_directorName_obs.shape: {downsampled_directorName_obs.shape}')
downsampled_directorName_obs.shape: (4452, 2)
downsampled_directorOf_obs = directorOf_obs[directorOf_obs['Director_ID'].isin(df_ids_downsampled['Director_ID'].values) | directorOf_obs['Movie_ID'].isin(df_ids_downsampled['Movie_ID'].values)]
downsampled_directorOf_obs.to_csv('downsampled_directorOf_obs.txt', sep='\t', header=False, index=False)
print(f'downsampled_directorOf_obs.shape: {downsampled_directorOf_obs.shape}')
downsampled_directorOf_obs.shape: (11331, 2)
downsampled_movieBlock_obs = movieBlock_obs[movieBlock_obs['Movie_ID'].isin(df_ids_downsampled['Movie_ID'].values)]
downsampled_movieBlock_obs.to_csv('downsampled_movieBlock_obs.txt', sep='\t', header=False, index=False)
print(f'downsampled_movieBlock_obs.shape: {downsampled_movieBlock_obs.shape}')
downsampled_movieBlock_obs.shape: (5616, 2)
downsampled_movieTitle_obs = movieTitle_obs[movieTitle_obs['Movie_ID'].isin(df_ids_downsampled['Movie_ID'].values)]
downsampled_movieTitle_obs.to_csv('downsampled_movieTitle_obs.txt', sep='\t', header=False, index=False)
print(f'downsampled_movieTitle_obs.shape: {downsampled_movieTitle_obs.shape}')
downsampled_movieTitle_obs.shape: (5616, 2)
jaro_winkler = JaroWinkler().similarity
with open('downsampled_simDirector_obs.txt', 'w') as f_out:
# Iterate over blocks
for block in tqdm(df_ids_downsampled['DirectorBlock'].unique()):
# Then do pairwise comparison for all items on the current block
for left_item in df_ids_downsampled[df_ids_downsampled['DirectorBlock'] == block]['Director'].unique():
            for right_item in df_ids_downsampled[df_ids_downsampled['DirectorBlock'] == block]['Director'].unique():
# Calculate similarity
similarity = jaro_winkler(left_item, right_item)
# Get IDs
left_id = director_to_ID[left_item]
right_id = director_to_ID[right_item]
# And write to the similarities file
f_out.write(f'{left_id}' + '\t' + f'{right_id}' + '\t' + f'{similarity}' + '\n')
100%|████████████████████████████████████████████████████████████████████████████████| 782/782 [00:14<00:00, 53.07it/s]
normalized_levenshtein = NormalizedLevenshtein().similarity
with open('downsampled_simMovie_obs.txt', 'w') as f_out:
# Iterate over blocks
for block in tqdm(df_ids_downsampled['MovieBlock'].unique()):
# Then do pairwise comparison for all items on the current block
for left_item in df_ids_downsampled[df_ids_downsampled['MovieBlock'] == block]['Movie'].unique():
for right_item in df_ids_downsampled[df_ids_downsampled['MovieBlock'] == block]['Movie'].unique():
# Calculate similarity
similarity = normalized_levenshtein(left_item.lower(), right_item.lower())
# Get IDs
left_id = movie_to_ID[left_item]
right_id = movie_to_ID[right_item]
# And write to the similarities file
f_out.write(f'{left_id}' + '\t' + f'{right_id}' + '\t' + f'{similarity}' + '\n')
100%|█████████████████████████████████████████████████████████████████████████████| 2362/2362 [00:22<00:00, 106.09it/s]
with open("data_file.data", 'w') as data_file:
data_file.write("""
predicates:
DirectorName/2:
- closed
- types:
- UniqueIntID
- UniqueStringID
DirectorBlock/2:
- closed
- block
- types:
- UniqueIntID
- UniqueStringID
DirectorOf/2:
- closed
- types:
- UniqueIntID
- UniqueIntID
MovieTitle/2:
- closed
- types:
- UniqueIntID
- UniqueStringID
MovieBlock/2:
- closed
- block
- types:
- UniqueIntID
- UniqueStringID
SimDirector/2:
- closed
- types:
- UniqueStringID
- UniqueStringID
SimMovie/2:
- closed
- types:
- UniqueStringID
- UniqueStringID
SameMovie/2:
- open
- types:
- UniqueIntID
- UniqueIntID
SameDirector/2:
- open
- types:
- UniqueIntID
- UniqueIntID
observations:
DirectorName: downsampled_directorName_obs.txt
DirectorBlock: downsampled_directorBlock_obs.txt
DirectorOf: downsampled_directorOf_obs.txt
MovieTitle: downsampled_movieTitle_obs.txt
MovieBlock: downsampled_movieBlock_obs.txt
SimDirector: downsampled_simDirector_obs.txt
SimMovie: downsampled_simMovie_obs.txt
targets:
SameMovie: downsampled_sameMovie_target.txt
SameDirector: downsampled_sameDirector_target.txt
""")
with open("psl_file.psl", 'w') as data_file:
data_file.write("""
// Look for text similarity.
40.0: DirectorName(D1, N1) & DirectorName(D2, N2) & SimDirector(N1, N2) & (D1 != D2) -> SameDirector(D1, D2) ^2
40.0: MovieTitle(M1, T1) & MovieTitle(M2, T2) & SimMovie(T1, T2) & (M1 != M2) -> SameMovie(M1, M2) ^2
// Pure transitivity (Director)
20.0: DirectorBlock(D1, B) & DirectorBlock(D2, B) & DirectorBlock(D3, B)
& SameDirector(D1, D2) & SameDirector(D2, D3)
& (D1 != D3) & (D1 != D2) & (D2 != D3)
-> SameDirector(D1, D3) ^2
// Pure transitivity (Movie)
20.0: MovieBlock(M1, B) & MovieBlock(M2, B) & MovieBlock(M3, B)
& SameMovie(M1, M2) & SameMovie(M2, M3)
& (M1 != M3) & (M1 != M2) & (M2 != M3)
-> SameMovie(M1, M3) ^2
// Collective KG-Based ER Rules (Codirector rectangle closure)
20.0: DirectorBlock(D1, B1) & DirectorBlock(D2, B1) & DirectorBlock(CD1, B2) & DirectorBlock(CD2, B2)
& DirectorOf(D1, M1) & DirectorOf(D2, M2)
& DirectorOf(CD1, M1) & DirectorOf(CD2, M2) & SameDirector(CD1, CD2)
& (D1 != CD1) & (D2 != CD2) & (M1 != M2)
-> SameDirector(D1, D2) ^2
// Collective KG-Based ER Rules (Movie rectangle closure)
10.0: DirectorBlock(D1, B1) & DirectorBlock(D2, B1)
& DirectorOf(D1, M1) & DirectorOf(D2, M2)
& SameMovie(M1, M2)
-> SameDirector(D1, D2) ^2
// Self-reference.
SameDirector(A, A) = 1.0 .
SameMovie(P, P) = 1.0 .
// Negative priors.
1.0: !SameDirector(D1, D2) ^2
1.0: !SameMovie(D1, D2) ^2
""")
%%time
# This might take a while...
!java -Xmx6g -jar psl-cli-2.2.1.jar --infer --model psl_file.psl --data data_file.data --output inferred_predicates
Ground Truth
df_truth = pd.read_csv('labeled_sameMovie_truth.txt', sep='\t', header=None)
print(f'df_truth.shape: {df_truth.shape}')
df_truth.head(3)
df_truth.shape: (598, 3)
 | 0 | 1 | 2 |
---|---|---|---|
0 | 7012 | 1050 | 0 |
1 | 46 | 20 | 0 |
2 | 26 | 72 | 0 |
# Convert to a dict with double index
dict_truth = {}
for idx, (idx_a, idx_b, truth) in tqdm(df_truth.iterrows(), total=df_truth.shape[0]):
dict_truth[(int(idx_a), int(idx_b))] = int(truth)
len(dict_truth)
100%|██████████████████████████████████████████████████████████████████████████████| 598/598 [00:00<00:00, 9875.56it/s]
598
# Create an inverse dictionary
dict_truth_inverse = {0:[], 1:[]}
for idx, (idx_a, idx_b, truth) in tqdm(df_truth.iterrows(), total=df_truth.shape[0]):
if truth >= 0.5:
dict_truth_inverse[1].append((int(idx_a), int(idx_b)))
else:
dict_truth_inverse[0].append((int(idx_a), int(idx_b)))
len(dict_truth_inverse[0]) + len(dict_truth_inverse[1])
100%|█████████████████████████████████████████████████████████████████████████████| 598/598 [00:00<00:00, 12454.84it/s]
598
Predictions
df_pred = pd.read_csv('./inferred_predicates/SAMEMOVIE.txt', sep='\t', header=None)
print(f'df_pred.shape: {df_pred.shape}')
df_pred.head(3)
df_pred.shape: (109876, 3)
 | 0 | 1 | 2 |
---|---|---|---|
0 | 103 | 1115 | 0.009894 |
1 | 13 | 4453 | 0.000428 |
2 | 79 | 2012 | 0.001250 |
My predictions do not cover every pair in the truth DataFrame: the truth data contains many true non-matches, and nearly all of those non-matching pairs don't even fall within the same block, so my code never generated candidate pairs for them, making it impossible for me to predict their non-match.
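To make that concrete, a small sketch (assuming df_pred and df_truth as loaded above) that counts how many of the labeled pairs actually appear among the generated prediction pairs:
# Sketch: how many labeled pairs are covered by the generated predictions?
pred_pairs = set(zip(df_pred[0].astype(int), df_pred[1].astype(int)))
truth_pairs = set(zip(df_truth[0].astype(int), df_truth[1].astype(int)))
print(f'Labeled pairs covered by predictions: {len(truth_pairs & pred_pairs)} / {len(truth_pairs)}')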
# Convert to a dict with double index
dict_pred = {}
for idx, (idx_a, idx_b, call) in tqdm(df_pred.iterrows(), total=df_pred.shape[0]):
dict_pred[(int(idx_a), int(idx_b))] = 1 if call >= 0.5 else 0
len(dict_pred)
100%|███████████████████████████████████████████████████████████████████████| 109876/109876 [00:06<00:00, 17207.69it/s]
109876
# Create an inverse dictionary
dict_pred_inverse = {0:[], 1:[]}
for idx, (idx_a, idx_b, call) in tqdm(df_pred.iterrows(), total=df_pred.shape[0]):
if call >= 0.5:
dict_pred_inverse[1].append((int(idx_a), int(idx_b)))
else:
dict_pred_inverse[0].append((int(idx_a), int(idx_b)))
len(dict_pred_inverse[0]) + len(dict_pred_inverse[1])
100%|███████████████████████████████████████████████████████████████████████| 109876/109876 [00:09<00:00, 11830.95it/s]
109876
"Of all calls I made, how many were correctly made?"
TP = 0
FP = 0
# dict_pred_inverse[1] = all the calls I made
for match in dict_pred_inverse[1]:
# Consider only the samples in the labeled dataset
if match in dict_truth.keys():
if dict_truth.get(match) == 1:
TP += 1
else:
FP += 1
precision = TP / (TP + FP)
print(f'TP: {TP:>4}')
print(f'FP: {FP:>4}')
print(f'Precision: {precision:.5f}')
TP:  173
FP:    2
Precision: 0.98857
"Of all calls I should have made, how many did I make?"
TP = 0
FN = 0
# dict_truth_inverse[1] = all the calls I should have made
for match in dict_truth_inverse[1]:
if dict_pred.get(match) == 1:
TP += 1
else:
FN += 1
recall = TP / (TP + FN)
print(f'TP: {TP:>4}')
print(f'FN: {FN:>4}')
print(f'Recall: {recall:.5f}')
TP:  173
FN:   16
Recall: 0.91534
F1 = (2 * precision * recall) / (precision + recall)
print(f'F1 Score: {F1:.5f}')
F1 Score: 0.95055
Grid Search Over Classification Threshold
# DataFrame to keep track of results at each threshold
grid_search_df = pd.DataFrame(columns = ['TP', 'FP', 'FN', 'Precision', 'Recall', 'F1 Score'])
# Generate thresholds in increments of 5%
thresholds = [x/100 for x in range(0,101,5)]
# Then loop over the thresholds
for threshold in tqdm(thresholds):
################################
### Predictions Dictionaries ###
################################
dict_pred = {}
for idx, (idx_a, idx_b, call) in df_pred.iterrows():
dict_pred[(int(idx_a), int(idx_b))] = 1 if call >= threshold else 0
dict_pred_inverse = {0:[], 1:[]}
for idx, (idx_a, idx_b, call) in df_pred.iterrows():
if call >= threshold:
dict_pred_inverse[1].append((int(idx_a), int(idx_b)))
else:
dict_pred_inverse[0].append((int(idx_a), int(idx_b)))
#################
### Precision ###
#################
TP = 0
FP = 0
# dict_pred_inverse[1] = all the calls I made
for match in dict_pred_inverse[1]:
# Consider only the samples in the labeled dataset
if match in dict_truth.keys():
if dict_truth.get(match) == 1:
TP += 1
else:
FP += 1
precision = TP / (TP + FP)
##############
### Recall ###
##############
TP = 0
FN = 0
# dict_truth_inverse[1] = all the calls I should have made
for match in dict_truth_inverse[1]:
if dict_pred.get(match) == 1:
TP += 1
else:
FN += 1
recall = TP / (TP + FN)
################
### F1-Score ###
################
F1 = (2 * precision * recall) / (precision + recall)
########################################
### DataFrame with Iteration Results ###
########################################
grid_search_df.at[threshold, 'TP'] = TP
grid_search_df.at[threshold, 'FP'] = FP
grid_search_df.at[threshold, 'FN'] = FN
grid_search_df.at[threshold, 'Precision'] = precision
grid_search_df.at[threshold, 'Recall'] = recall
grid_search_df.at[threshold, 'F1 Score'] = F1
# View results
grid_search_df
100%|██████████████████████████████████████████████████████████████████████████████████| 21/21 [05:18<00:00, 15.15s/it]
Threshold | TP | FP | FN | Precision | Recall | F1 Score |
---|---|---|---|---|---|---|
0.00 | 183 | 38 | 6 | 0.828054 | 0.968254 | 0.892683 |
0.05 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.10 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.15 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.20 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.25 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.30 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.35 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.40 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.45 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.50 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.55 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.60 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.65 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.70 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.75 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.80 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.85 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.90 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
0.95 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
1.00 | 173 | 2 | 16 | 0.988571 | 0.915344 | 0.950549 |
Precision-Recall Curve
grid_search_df.plot(x='Recall', y='Precision',
xlim=(0,1), ylim=(0,1),
                    xlabel='Recall', ylabel='Precision',
figsize=(6,6), legend=None,
title="Precision Recall Curve")
<AxesSubplot:title={'center':'Precision Recall Curve'}, xlabel='Recall', ylabel='Precision'>
Turns out that the PSL model with its collective rules is pretty good at pushing the probabilities to the extremes (either 0 or 1), to the point that there is no need to tweak the classification threshold to improve the model. The default 0.5 threshold is as good as it gets already!
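One quick way to see that polarization (a sketch, assuming df_pred as loaded above, with the inferred value in column 2) is to bin the inferred SameMovie scores:
# Sketch: bin the inferred SameMovie values to show how polarized they are
bins = pd.cut(df_pred[2], bins=[0.0, 0.05, 0.5, 0.95, 1.0], include_lowest=True)
print(bins.value_counts().sort_index())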
Threshold-F1 Curve
grid_search_df.plot(y='F1 Score',
xlim=(0,1), ylim=(0,1),
                    xlabel='Threshold', ylabel='F1 Score',
figsize=(6,6), legend=None,
title="Threshold F1 Curve")
<AxesSubplot:title={'center':'Threshold F1 Curve'}, xlabel='Threshold', ylabel='F1 Score'>
As expected from before: nothing to optimize here!
Matheus Schmitz
LinkedIn
Github Portfolio