Entity Resolution with Probabilistic Soft Logic (PSL)

Matheus Schmitz
LinkedIn
Github Portfolio

1. Merge Both Sources Into a Single Dataset with All Movies/Directors
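
Below is a minimal sketch of how the merge could look, assuming both sources were exported to CSV; the file and column names here are hypothetical, not the notebook's actual ones.

```python
import pandas as pd

# Hypothetical file names; the real exports may differ.
imdb = pd.read_csv('imdb_movies.csv')
afi = pd.read_csv('afi_movies.csv')

# Tag each record with its source so provenance survives the merge.
imdb['source'] = 'imdb'
afi['source'] = 'afi'

# Stack both sources into one dataframe with a shared schema.
movies = pd.concat([imdb, afi], ignore_index=True)
```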

2. Break Entries With Multiple Directors per Movie into Multiple Entries with One Director per Movie

Reusing dataframe manipulation code from a previous project: https://matheus-schmitz.github.io/TED_Talks_Data_Analysis/
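
A hedged sketch of the split, continuing from the merge above and assuming directors are stored as a single delimited string per movie (the delimiter is an assumption):

```python
# e.g. "Joel Coen|Ethan Coen" -> ["Joel Coen", "Ethan Coen"]
movies['director'] = movies['director'].str.split('|')

# One row per (movie, director) pair; all other columns are duplicated.
movies = movies.explode('director').reset_index(drop=True)
movies['director'] = movies['director'].str.strip()
```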

3. Generate Unique IDs
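
One simple scheme (an assumption, not necessarily what the notebook does) is to prefix a per-source counter with the source name, so IDs never collide across datasets:

```python
# Produces imdb_0, imdb_1, ..., afi_0, afi_1, ...
movies['id'] = movies['source'] + '_' + movies.groupby('source').cumcount().astype(str)
```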

4. Generate Blocks
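
The blocking key below is a hypothetical choice (a short prefix of the normalized title); the idea is to only compare records that share a cheap key, so the candidate set stays far smaller than the full cross product.

```python
# Normalize titles and take a short prefix as the blocking key.
movies['block'] = (movies['title'].str.lower()
                                  .str.replace(r'\W', '', regex=True)
                                  .str[:4])

# Candidate pairs are only formed across sources within the same block.
candidates = []
for _, block in movies.groupby('block'):
    for a in block.loc[block['source'] == 'imdb', 'id']:
        for b in block.loc[block['source'] == 'afi', 'id']:
            candidates.append((a, b))
```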

5. Create Observation Files
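
PSL reads its data as tab-separated files with no header, one ground atom per line. A small helper (name and file names hypothetical) to write them:

```python
import pandas as pd

def write_psl_file(df: pd.DataFrame, path: str) -> None:
    """PSL data files: tab-separated, no header, no index column."""
    df.to_csv(path, sep='\t', header=False, index=False)

write_psl_file(pd.DataFrame(candidates, columns=['id1', 'id2']), 'candidates.txt')
```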

6. Calculate String Similarities
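
The notebook may well use one of the usual string metrics (Levenshtein, Jaro-Winkler, etc.); as a dependency-free illustration, difflib's Ratcliff/Obershelp ratio gives a similarity in [0, 1]:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

similarity('The Godfather', 'Godfather, The')  # ~0.67
```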

7. Generate Targets From All Candidates
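
Targets are the atoms PSL should infer: one SameMovie(id1, id2) line per surviving candidate pair (predicate and file names as assumed throughout these sketches).

```python
import pandas as pd

# Every candidate pair becomes a target atom for PSL to score.
targets = pd.DataFrame(candidates, columns=['id1', 'id2'])
targets.to_csv('same_movie_targets.txt', sep='\t', header=False, index=False)
```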

8. Generate Truth from Labeled Data

According to the thread "Multiple Directors" on Blackboard, we only need to evaluate on movies.
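
A sketch of turning the labeled data into PSL's truth file, assuming a hypothetical CSV with ID pairs and a 0/1 match label:

```python
import pandas as pd

# Hypothetical labeled file: one (id1, id2, label) row per annotated pair.
labeled = pd.read_csv('labeled_movies.csv')
labeled[['id1', 'id2', 'label']].to_csv('same_movie_truth.txt',
                                        sep='\t', header=False, index=False)
```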

9. Generate Downsampled Files

Using all samples in the candidate dataset generated a file too large for my computer to handle. A reasonable approach is therefore to downsample the data while ensuring that all samples from the labeled dataset are present in the downsampled version. The samples outside the labeled dataset are still important, as they help the Collective KG-based ER Rules, but since using all of them proved too much, downsampling them instead of removing them is a reasonable compromise.
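
A sketch of that compromise: keep every candidate pair that touches the labeled data, plus a random fraction (rate hypothetical) of the rest.

```python
import pandas as pd

candidates = pd.read_csv('candidates.txt', sep='\t', names=['id1', 'id2'])
truth = pd.read_csv('same_movie_truth.txt', sep='\t', names=['id1', 'id2', 'label'])

# Pairs that touch the labeled data are always kept...
labeled_ids = set(truth['id1']) | set(truth['id2'])
in_labeled = candidates['id1'].isin(labeled_ids) | candidates['id2'].isin(labeled_ids)

# ...plus a random 10% (hypothetical rate) of the remaining pairs.
downsampled = pd.concat([candidates[in_labeled],
                         candidates[~in_labeled].sample(frac=0.10, random_state=42)])
```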

Filter All Files Based on IDs on the Downsampled File
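
Continuing the sketch above: every observation/target file is then restricted to IDs that survived the downsampling.

```python
import pandas as pd

kept_ids = set(downsampled['id1']) | set(downsampled['id2'])

def filter_by_ids(path: str) -> None:
    """Keep only rows whose first two (ID) columns both survived downsampling."""
    df = pd.read_csv(path, sep='\t', header=None)
    mask = df[0].isin(kept_ids) & df[1].isin(kept_ids)
    df[mask].to_csv(path, sep='\t', header=False, index=False)

for path in ['sim_title.txt', 'sim_director.txt', 'same_movie_targets.txt']:
    filter_by_ids(path)
```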

10. Define Data Structure
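
In pslpython terms, the data structure is a set of predicates: closed (fully observed) similarity predicates and one open SameMovie predicate whose values are inferred. A sketch, using the hypothetical file names from above:

```python
from pslpython.model import Model
from pslpython.partition import Partition
from pslpython.predicate import Predicate

model = Model('movie-er')

# Closed predicates are fully observed; the open one is what PSL infers.
sim_title = Predicate('SimTitle', closed=True, size=2)
sim_director = Predicate('SimDirector', closed=True, size=2)
same_movie = Predicate('SameMovie', closed=False, size=2)
for pred in (sim_title, sim_director, same_movie):
    model.add_predicate(pred)

# The third column of an observation file is the atom's truth value
# (here, the string similarity score).
sim_title.add_data_file(Partition.OBSERVATIONS, 'sim_title.txt')
sim_director.add_data_file(Partition.OBSERVATIONS, 'sim_director.txt')
same_movie.add_data_file(Partition.TARGETS, 'same_movie_targets.txt')
```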

11. Define Probabilistic Soft Logic (PSL) Rules
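
Illustrative rules in PSL syntax, continuing the model above (the weights and exact rule set are assumptions; `^2` is PSL's squared hinge loss). The transitivity rule is what makes the model collective:

```python
from pslpython.rule import Rule

# Similar title and similar director suggest the same movie.
model.add_rule(Rule('10.0: SimTitle(M1, M2) & SimDirector(M1, M2) -> SameMovie(M1, M2) ^2'))

# Collective rule: match decisions propagate transitively across pairs.
model.add_rule(Rule('5.0: SameMovie(M1, M2) & SameMovie(M2, M3) -> SameMovie(M1, M3) ^2'))

# Negative prior: a pair is a non-match unless evidence says otherwise.
model.add_rule(Rule('1.0: !SameMovie(M1, M2) ^2'))
```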

12. Run PSL Inference
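
Inference launches the PSL (Java) runtime and returns, for each open predicate, a dataframe of groundings with their inferred truth values:

```python
results = model.infer()

# results maps each open predicate to a dataframe whose 'truth'
# column holds the inferred match probability for that pair.
predictions = results[same_movie]
predictions.to_csv('inferred_same_movie.txt', sep='\t', header=False, index=False)
```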

13. Load Truth and Predictions

Ground Truth

Predictions

My DataFrame with predictions is smaller than the DataFrame with the truth because the truth DataFrame contains many true non-matches that my code never generated as candidates: nearly all of those non-matches don't fall within the same block, making it impossible for the model to score them.
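
One way to reconcile the two frames (a sketch, not necessarily the notebook's exact code): left-join the predictions onto the truth and score the never-generated pairs as 0, i.e. predicted non-matches.

```python
import pandas as pd

truth = pd.read_csv('same_movie_truth.txt', sep='\t', names=['id1', 'id2', 'label'])
preds = pd.read_csv('inferred_same_movie.txt', sep='\t', names=['id1', 'id2', 'score'])

# Pairs the blocking never produced get an implicit score of 0.
merged = truth.merge(preds, on=['id1', 'id2'], how='left')
merged['score'] = merged['score'].fillna(0.0)
```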

14. Evaluate PSL Model Performance

Precision

"Of all calls I made, how many were correctly made?"

Recall

"Of all calls I should have made, how many did I make?"

F1-Score
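
With the merged frame from above, all three metrics follow directly (scikit-learn, default 0.5 threshold):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = merged['label'].astype(int)
y_pred = (merged['score'] >= 0.5).astype(int)

print('Precision:', precision_score(y_true, y_pred))  # TP / (TP + FP)
print('Recall:   ', recall_score(y_true, y_pred))     # TP / (TP + FN)
print('F1:       ', f1_score(y_true, y_pred))         # harmonic mean of the two
```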

15. Improve the Model by Tweaking the Classification Threshold

Grid Search Over Classification Threshold
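
A sketch of the sweep, continuing from the merged frame above: compute F1 at each candidate threshold and keep the best.

```python
import numpy as np
from sklearn.metrics import f1_score

thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_true, (merged['score'] >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f'Best threshold: {best:.2f} (F1 = {max(f1s):.3f})')
```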

Precision-Recall Curve
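
The curve itself can be produced with scikit-learn, using the raw PSL scores rather than thresholded labels:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, _ = precision_recall_curve(y_true, merged['score'])
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
```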

It turns out that the PSL model, with its collective rules, is quite good at pushing the probabilities to the extremes (either 0 or 1), to the point that there is no need to tweak the classification threshold to improve the model. The default 0.5 threshold is as good as it gets!

Threshold-F1 Curve

As expected from before: nothing to optimize here!

The End

Matheus Schmitz
LinkedIn
Github Portfolio