Information Extraction with Snorkel

Matheus Schmitz
LinkedIn

Github Portfolio

State-of-the-art extraction techniques require massive labeled training sets, which are costly to obtain. To overcome this problem, Snorkel helps rapidly create training sets using the data programming paradigm. Developers start by writing a set of labeling functions, which are simply scripts that programmatically label data. The resulting labels are noisy, but Snorkel uses a generative model to learn how to combine the labeling functions' outputs and use them to label more data. This newly labeled data can then be used to train high-quality end models.

Prepare Data

Prepare Snorkel Environment

Let's install the packages we will use. In my testing, Snorkel v0.7 works best with Python 3.6.

Now let's uncompress the package and install Snorkel
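As a rough sketch, the setup in a notebook might look like this (the archive and directory names are assumptions, not taken from the original):

!tar -xzf snorkel-v0.7.0.tar.gz        # hypothetical archive name
!pip install ./snorkel-0.7.0           # install the uncompressed package
!python -m spacy download en           # English model for the Spacy parser used later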

Creating a Development Set

We need to preprocess our documents using Snorkel utilities, parsing them into a simple hierarchy of component parts of our input data, which we refer to as contexts. We'll also create candidates out of these contexts, which are the objects we want to classify; in this case, possible mentions of performances and the directors associated with them. Finally, we'll load some gold labels for evaluation.

All of this preprocessed input data is saved to a database. In Snorkel, if no database is specified, then a SQLite database at ./snorkel.db is created by default -- so no setup is needed here!

Initializing a SnorkelSession
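A minimal sketch of the session setup, assuming the standard Snorkel v0.7 API:

from snorkel import SnorkelSession

# Connects to the default SQLite database at ./snorkel.db
session = SnorkelSession()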

Loading the Corpus

Next, we load and pre-process the corpus of documents.

Running a CorpusParser

We'll use Spacy, an NLP preprocessing tool, to split our documents into sentences and tokens, and provide named entity annotations.
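A sketch of the parsing step; doc_preprocessor stands in for whichever document preprocessor fits the input files and is not defined here:

from snorkel.parser import CorpusParser
from snorkel.parser.spacy_parser import Spacy

# Spacy splits the documents into sentences and tokens and adds NER tags
corpus_parser = CorpusParser(parser=Spacy())
corpus_parser.apply(doc_preprocessor)   # doc_preprocessor is a hypothetical placeholder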

We can then use simple database queries (written in the syntax of SQLAlchemy, which Snorkel uses) to check how many documents and sentences were parsed:
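For example:

from snorkel.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())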

Generating Candidates

The next step is to extract candidates from our corpus. A Candidate in Snorkel is an object for which we want to make a prediction. In this case, the candidates are pairs of performances and directors mentioned in sentences.

The Spacy parser we used performs named entity recognition for us. Next, we'll split the documents into train and development splits and collect the associated sentences.

Writing a Director Name Matching Function

Our simple name matcher makes use of the fact that the names of the directors are mentions of person-type named entities in the documents. Fonduer provides a list of built-in matchers that can be used in many information extraction tasks. We will use PersonMatcher to extract director names.
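A sketch of the director matcher (the variable name is my own):

from snorkel.matchers import PersonMatcher

# Matches spans tagged as person-type named entities by the Spacy NER
director_matcher = PersonMatcher(longest_match_only=True)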

Writing a Performance (Movie) Matching Function

We know that normally each director name will contain at least two words (first name, last name). Considering additional middle names, we expect a maximum of four words per name.

Similarly, we assume the performance name to be a span of one to seven words.

We use the default Ngrams class provided by Fonduer to define these properties:
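A sketch of the two n-gram spaces, using the length limits described above (here via Snorkel's Ngrams class):

from snorkel.candidates import Ngrams

director_ngrams    = Ngrams(n_max=4)   # director names: up to four tokens
performance_ngrams = Ngrams(n_max=7)   # performance titles: up to seven tokens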

We create a candidate that is composed of a performance and a director mention, as defined above. We name this candidate performance_director, and we will extract all such candidates from the corpus.
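A sketch of the candidate definition:

from snorkel.models import candidate_subclass

# Binary relation candidate: (performance, director)
PerformanceDirector = candidate_subclass('performance_director', ['performance', 'director'])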

Create the Development Set

We create our development set by generating a dev_ids.csv file, which has a single column, id, and contains 50 random biography URLs. You can choose any subset of 50 biographies that contain both performance and director mentions.
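One way to generate the file is sketched below; the sampling is random, as described above:

import random
import pandas as pd
from snorkel.models import Document

# Sample 50 document names (biography URLs) for the development set
all_ids = [doc.name for doc in session.query(Document).all()]
dev_ids = random.sample(all_ids, 50)
pd.DataFrame({'id': dev_ids}).to_csv('dev_ids.csv', index=False)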

Finally, we'll apply the candidate extractor to the two sets of sentences. The results will be persisted in the database backend.
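A sketch of the extraction, assuming train_sents and dev_sents hold the two sentence sets and performance_matcher is the movie matcher defined earlier:

from snorkel.candidates import CandidateExtractor

cand_extractor = CandidateExtractor(
    PerformanceDirector,
    [performance_ngrams, director_ngrams],
    [performance_matcher, director_matcher],
)

# Split 0 = train, split 1 = dev; extracted candidates are persisted in snorkel.db
for split, sents in enumerate([train_sents, dev_sents]):
    cand_extractor.apply(sents, split=split)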

Label Documents in Development Set

We use the SentenceNgramViewer to label each mention. Click the green button to mark a candidate as correct, or the red button to mark it as incorrect. Your labeling results are automatically stored in the database.

The SentenceNgramViewer only shows candidates that are matched by the matchers, so the annotation works under the assumption that the matchers work perfectly.
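A sketch of launching the viewer on the development split:

from snorkel.viewer import SentenceNgramViewer

dev_cands = session.query(PerformanceDirector).filter(PerformanceDirector.split == 1).all()
sv = SentenceNgramViewer(dev_cands, session)
sv   # renders the labeling widget in the notebook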

Save labeled data as CSV

Define Labeling Functions (LFs)

Define the LFs which Snorkel uses to create a noise-aware training set.
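As an illustration (these are not the exact LFs used in this notebook), labeling functions are plain Python functions that take a candidate and return 1 (match), -1 (non-match), or 0 (abstain); the helpers in snorkel.lf_helpers inspect the text around the candidate spans:

from snorkel.lf_helpers import get_between_tokens

def LF_directed_by_between(c):
    # "directed" appearing between the two spans is strong evidence of a match
    return 1 if 'directed' in get_between_tokens(c) else 0

def LF_spans_too_far_apart(c):
    # Spans separated by many tokens are unlikely to be related
    return -1 if len(list(get_between_tokens(c))) > 10 else 0

LFs = [LF_directed_by_between, LF_spans_too_far_apart]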

Train Generative Model

Now, we'll train a model of the LFs to estimate their accuracies. Once the model is trained, we can combine the outputs of the LFs into a single, noise-aware training label set for our extractor. Intuitively, we'll model the LFs by observing how they overlap and conflict with each other.
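A sketch of building the label matrix and training the generative model; the hyperparameter values are the usual Snorkel tutorial defaults, not necessarily the ones used here:

from snorkel.annotations import LabelAnnotator
from snorkel.learning import GenerativeModel

labeler = LabelAnnotator(lfs=LFs)
L_train = labeler.apply(split=0)   # sparse matrix of LF outputs over the training candidates

gen_model = GenerativeModel()
gen_model.train(L_train, epochs=100, decay=0.95,
                step_size=0.1 / L_train.shape[0], reg_param=1e-6)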

Get detailed statistics of LFs before training the model
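For instance, the empirical coverage, overlap, and conflict statistics of the LFs can be inspected with:

L_train.lf_stats(session)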

Check Performance Before Generative Model Training

Report the weights of the LFs after generative model training

Now that we have learned the generative model, we will measure its performance using the provided test set.

Get detailed statistics of LFs learned by the model
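For example, assuming the GenerativeModel API from the Snorkel tutorials:

gen_model.learned_lf_stats()      # learned accuracy and other statistics per LF
gen_model.weights.lf_accuracy     # the raw learned accuracy weights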

Report the performance of your LFs after generative model training

We now apply the generative model to the training candidates to get the noise-aware training label set. We'll refer to these as the training marginals:
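A sketch:

from snorkel.annotations import save_marginals

train_marginals = gen_model.marginals(L_train)    # estimated P(true relation) per training candidate
save_marginals(session, L_train, train_marginals) # persist them for the discriminative model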

We'll look at the distribution of the training marginals:

Check distribution of the training marginals
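The distribution can be plotted with matplotlib:

import matplotlib.pyplot as plt

plt.hist(train_marginals, bins=20)
plt.xlabel('Training marginal (probability of a true relation)')
plt.ylabel('Number of candidates')
plt.show()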

The distribution looks good, with most marginals very close to either 0 or 1. It was initially quite poor, but as I kept adding labeling functions based on the FPs and FNs it steadily improved. The current distribution differentiates well between the classes, although it is clear from the plot that defining what is a match is much easier than defining what is NOT a match.

Look at some examples in one of the error buckets to improve the LFs. Below are the false positives that we did not label correctly.
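A sketch of pulling out the false-positive bucket on the development split, assuming gold labels were stored under the annotator name 'gold' and that the model exposes the error_analysis helper used in the Snorkel tutorials:

from snorkel.annotations import load_gold_labels
from snorkel.viewer import SentenceNgramViewer

L_dev = labeler.apply_existing(split=1)
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

# error_analysis returns the candidate buckets (tp, fp, tn, fn)
tp, fp, tn, fn = gen_model.error_analysis(session, L_dev, L_gold_dev)
SentenceNgramViewer(fp, session)   # inspect the false positives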

3. Adding a Distant Supervision Labeling Function

Distant supervision generates training data automatically using an external, imperfectly aligned training resource, such as a Knowledge Base.

We define an additional distant-supervision-based labeling function which uses DBpedia.
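A sketch of what such an LF could look like, assuming the known (performance, director) pairs have already been fetched from DBpedia into a set; load_dbpedia_pairs is a hypothetical helper and the fetching itself is omitted:

# known_pairs: set of (performance_title, director_name) tuples pulled from DBpedia (assumed)
known_pairs = load_dbpedia_pairs()   # hypothetical helper

def LF_dbpedia_known_pair(c):
    performance = c.performance.get_span().lower()
    director = c.director.get_span().lower()
    return 1 if (performance, director) in known_pairs else 0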

Check LFs Performance Before Generative Model Training

Check distribution of the training marginals

After adding the distant supervision LF, the distribution still looks good: most marginals remain very close to either 0 or 1 and the classes are well differentiated, although, as before, defining what is a match is clearly much easier than defining what is NOT a match.

4. Training a Discriminative Model

We now use the noisy training labels we generated to train our end extraction model. In particular, we will be training a Bi-LSTM.

Try tuning the hyper-parameters below to get your best F1 score
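A sketch of the Bi-LSTM training, assuming the PyTorch LSTM shipped with Snorkel v0.7 and the variable names used earlier; hidden_dim, lr, batch_size, and dropout reflect the tuning discussed below, while the remaining values are placeholder assumptions:

from snorkel.learning.pytorch import LSTM

train_kwargs = {
    'lr':            0.001,  # an order of magnitude below the starting value
    'embedding_dim': 100,    # assumption
    'hidden_dim':    20,     # smaller hidden layer for the tiny dataset
    'n_epochs':      30,     # assumption
    'dropout':       0.33,
    'batch_size':    32,
    'seed':          1701,
}

lstm = LSTM()
lstm.train(train_cands, train_marginals, X_dev=dev_cands, Y_dev=L_gold_dev, **train_kwargs)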

Hyper-Parameter Tuning

Surprisingly, I found that 50 neurons in the hidden layer were actually too many for this ultra-small (82 samples) dataset, and when I cut the neurons to 20 I saw a good increase in F1 score (mostly because with more neurons recall starts to worsen).

I reduced the learning rate by an order of magnitude, which also improved performance, presumably because with such a small, overfit-prone dataset we really have to train slowly.

Given the small training set size, one of the most significant changes I made was reducing the batch size from 64 to 32, which seems to have made overfitting slightly less of an issue.

Lastly, I increased the dropout rate from 0.2 to 0.33, as that showed improvements (further increases did not).

Report Performance of the Final Extractor

It took a lot of tweaking, but in the end I managed to convert the good distribution of marginals into good F1 scores with the trained model. This was mostly a result of tuning the hyperparameters with the small dataset in mind.

Generally speaking, my model's weak point is accurately predicting non-matches, as defining what is NOT a match proved to be the hardest aspect of developing the labeling functions, as can be seen from the difficulty of getting a bar stacked at 0 in the marginals plot.

The low accuracy for the negative class is a result of too many false positives, which I was forced to contend with precisely because of the difficulty of defining a non-match. If I use more relaxed labeling for non-matches, the model predicts nothing in the negative class, which makes for worse results, so I opted for a model which errs on the side of predicting negative, which was the lesser evil.

Finally, we use the new model to extract relations from the testing documents and save them to JSON files.
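A sketch, assuming the test candidates live in split 2 and using a hypothetical output file name:

import json

test_cands = session.query(PerformanceDirector).filter(PerformanceDirector.split == 2).all()
test_marginals = lstm.marginals(test_cands)   # predicted probability per candidate

extracted = [
    {
        'document': c.get_parent().document.name,
        'performance': c.performance.get_span(),
        'director': c.director.get_span(),
    }
    for c, p in zip(test_cands, test_marginals) if p > 0.5
]

with open('extracted_relations.json', 'w') as f:   # hypothetical file name
    json.dump(extracted, f, indent=2)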

End

Matheus Schmitz
LinkedIn

Github Portfolio