TED Talks Recommender

In [1]:
# General Packages
import numpy as np
import pandas as pd
import os
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Iterators
from collections import Counter
from itertools import islice
from operator import itemgetter
from tqdm import tqdm

# Text
import re
from textblob import TextBlob
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, MWETokenizer
from nltk.stem import porter, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.util import ngrams

# Scikit-Learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD, NMF
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.neighbors import NearestNeighbors
In [2]:
# Download nltk packages
nltk.download('punkt')
nltk.download('brown')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Matheus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Matheus\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Matheus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[2]:
True
In [3]:
# Path to CSVs
path = 'C:/Portfolio/TED_Talks_Recommender/'

# Load TED Main
ted_main = pd.read_csv(path + 'ted_main.csv')

# Load TED Transcripts
ted_transcripts = pd.read_csv(path + 'transcripts.csv')

# Merge them
ted_transcripts = pd.merge(ted_main, ted_transcripts, on='url')
ted_transcripts.head(3)
Out[3]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views transcript
0 4553 Sir Ken Robinson makes an entertaining and pro... 1164 TED2006 1140825600 60 Ken Robinson Ken Robinson: Do schools kill creativity? 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... Author/educator ['children', 'creativity', 'culture', 'dance',... Do schools kill creativity? https://www.ted.com/talks/ken_robinson_says_sc... 47227110 Good morning. How are you?(Laughter)It's been ...
1 265 With the same humor and humanity he exuded in ... 977 TED2006 1140825600 43 Al Gore Al Gore: Averting the climate crisis 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... Climate advocate ['alternative energy', 'cars', 'climate change... Averting the climate crisis https://www.ted.com/talks/al_gore_on_averting_... 3200520 Thank you so much, Chris. And it's truly a gre...
2 124 New York Times columnist David Pogue takes aim... 1286 TED2006 1140739200 26 David Pogue David Pogue: Simplicity sells 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i... [{'id': 1725, 'hero': 'https://pe.tedcdn.com/i... Technology columnist ['computers', 'entertainment', 'interface desi... Simplicity sells https://www.ted.com/talks/david_pogue_says_sim... 1636292 (Music: "The Sound of Silence," Simon & Garfun...

Exploring the Transcripts

In [4]:
# Return first n items of the iterable as a list
def take(n, iterable):
    return list(islice(iterable, n))

# Apply word counter, then sort by frequency, then extract top words
WC = Counter(" ".join(ted_transcripts['transcript']).split())
WC_sorted = {k: v for k, v in sorted(WC.items(), key=lambda item: item[1], reverse=True)}
top_words = take(15, WC_sorted.items())
top_words
Out[4]:
[('the', 225138),
 ('to', 144790),
 ('of', 131538),
 ('and', 123312),
 ('a', 119543),
 ('that', 88820),
 ('in', 83116),
 ('I', 75769),
 ('is', 67588),
 ('you', 57039),
 ('we', 53599),
 ('And', 42367),
 ('this', 40881),
 ('it', 40334),
 ('was', 34616)]

As far as the most common words go, everything looks fine: the top tokens are the usual high-frequency English words.

In [5]:
# Reading a couple of the transcripts
ted_transcripts['transcript'][2020][0:1000]
Out[5]:
"Nicole Paris: TEDYouth, make some noise!(Beatboxing) TEDYouth, make some —(Beatboxing)(Beatboxing ends)Are you ready?(Cheers and applause)Are you ready?Ed Cage: Yeah, yeah, yeah!(Beatboxing)(Laughter)EC: Y'all like that? Let me show you how we used to do it —NP: Get it pops, go ahead.EC: ... when I was growing up in the '90s.(Beatboxing)(Beatboxing ends)(Laughter)(Beatboxing)NP: Pops, pops, pops, pops, pops, pops, hold up, hold up, hold up, hold up! Oh my God. OK, he's trying to battle me. Hold on, right now, hold on. Do you remember when you used to beatbox me to sleep?EC: Yeah, yeah, I remember. That's when she was a little baby. We would do something like this.(Beatboxing)NP: I remember that.(Beatboxing)NP: All right, pops, pops, pops, chill out, chill out. Hold up, hold up, hold up.EC: Y'all remember the video. This is like a little payback or something for 50 million people calling me the loser.NP: Hold up, hold up. But a lot of people out there don't really know what beatboxing i"

The transcripts include some non-speech annotations, such as (Beatboxing) and (Laughter) in the example above. Luckily, these annotations appear to be demarcated as parentheticals (parentheses () or brackets []), which makes them easy to remove with regular expressions.
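
Note that the regex in the next cell only strips the round-bracket spans. As a minimal sketch (a hypothetical helper, not used in the cells below), the square-bracket markers such as [unclear] could be removed in the same way:

import re

# Hypothetical helper: strip both (...) and [...] annotations from a transcript.
def strip_parentheticals(text):
    text = re.sub(r'\([^)]*\)', ' ', text)   # e.g. (Laughter), (Applause)
    text = re.sub(r'\[[^\]]*\]', ' ', text)  # e.g. [unclear]
    return text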

In [6]:
# Checking how many different elements in the transcript have () or []
non_speech_elements = [key for key, value in WC_sorted.items() if '(' in key or '[' in key]
print("Number of non-speech elements:", len(non_speech_elements))
print("Most common non-speech elements:", non_speech_elements[:20])
Number of non-speech elements: 9399
Most common non-speech elements: ['(Laughter)', '(Applause)', 'you.(Applause)', 'much.(Applause)', '(Music)', '(Laughter)So', '(Laughter)And', 'you.(Applause)Thank', '(Laughter)But', '(Laughter)I', '(Laughter)So,', '(Applause)So', '[unclear]', '(Laughter)Now,', '(Applause)And', '(Audience:', '(Laughs)', 'you.(Applause)Chris', '(Applause)Thank', '(Laughter)This']
In [7]:
# Regular Expression approach to removing elements inside parenthesis
removed_parenthesis = re.sub(r'\([^)]*\)', ' ', 
                             ted_transcripts['transcript'][2020])

# Check result
removed_parenthesis[0:1000]
Out[7]:
"Nicole Paris: TEDYouth, make some noise!  TEDYouth, make some —  Are you ready? Are you ready?Ed Cage: Yeah, yeah, yeah!  EC: Y'all like that? Let me show you how we used to do it —NP: Get it pops, go ahead.EC: ... when I was growing up in the '90s.    NP: Pops, pops, pops, pops, pops, pops, hold up, hold up, hold up, hold up! Oh my God. OK, he's trying to battle me. Hold on, right now, hold on. Do you remember when you used to beatbox me to sleep?EC: Yeah, yeah, I remember. That's when she was a little baby. We would do something like this. NP: I remember that. NP: All right, pops, pops, pops, chill out, chill out. Hold up, hold up, hold up.EC: Y'all remember the video. This is like a little payback or something for 50 million people calling me the loser.NP: Hold up, hold up. But a lot of people out there don't really know what beatboxing is, where it started from.EC: Right, right.NP: Where it came from. So why don't you give them a little history — just a tickle — a bit of history of"

Exploring Text Cleaning Approaches

Tokenization

Exploring a couple of methods for tokenizing words:

1) Word tokenization with TextBlob().words

In [8]:
tokens = TextBlob(removed_parenthesis).words
tokens[:150]
Out[8]:
WordList(['Nicole', 'Paris', 'TEDYouth', 'make', 'some', 'noise', 'TEDYouth', 'make', 'some', '—', 'Are', 'you', 'ready', 'Are', 'you', 'ready', 'Ed', 'Cage', 'Yeah', 'yeah', 'yeah', 'EC', "Y'all", 'like', 'that', 'Let', 'me', 'show', 'you', 'how', 'we', 'used', 'to', 'do', 'it', '—NP', 'Get', 'it', 'pops', 'go', 'ahead.EC', 'when', 'I', 'was', 'growing', 'up', 'in', 'the', "'90s", 'NP', 'Pops', 'pops', 'pops', 'pops', 'pops', 'pops', 'hold', 'up', 'hold', 'up', 'hold', 'up', 'hold', 'up', 'Oh', 'my', 'God', 'OK', 'he', "'s", 'trying', 'to', 'battle', 'me', 'Hold', 'on', 'right', 'now', 'hold', 'on', 'Do', 'you', 'remember', 'when', 'you', 'used', 'to', 'beatbox', 'me', 'to', 'sleep', 'EC', 'Yeah', 'yeah', 'I', 'remember', 'That', "'s", 'when', 'she', 'was', 'a', 'little', 'baby', 'We', 'would', 'do', 'something', 'like', 'this', 'NP', 'I', 'remember', 'that', 'NP', 'All', 'right', 'pops', 'pops', 'pops', 'chill', 'out', 'chill', 'out', 'Hold', 'up', 'hold', 'up', 'hold', 'up.EC', "Y'all", 'remember', 'the', 'video', 'This', 'is', 'like', 'a', 'little', 'payback', 'or', 'something', 'for', '50', 'million', 'people', 'calling', 'me', 'the', 'loser.NP'])

Example: "OK, he's trying to battle me." → 'OK', 'he', "'s", 'trying', 'to', 'battle', 'me'

2) Word tokenization with NLTK's word_tokenize()

This approach keeps punctuation as separate tokens.

In [9]:
tokens = word_tokenize(removed_parenthesis)
print(tokens[:150])
['Nicole', 'Paris', ':', 'TEDYouth', ',', 'make', 'some', 'noise', '!', 'TEDYouth', ',', 'make', 'some', '—', 'Are', 'you', 'ready', '?', 'Are', 'you', 'ready', '?', 'Ed', 'Cage', ':', 'Yeah', ',', 'yeah', ',', 'yeah', '!', 'EC', ':', "Y'all", 'like', 'that', '?', 'Let', 'me', 'show', 'you', 'how', 'we', 'used', 'to', 'do', 'it', '—NP', ':', 'Get', 'it', 'pops', ',', 'go', 'ahead.EC', ':', '...', 'when', 'I', 'was', 'growing', 'up', 'in', 'the', "'90s", '.', 'NP', ':', 'Pops', ',', 'pops', ',', 'pops', ',', 'pops', ',', 'pops', ',', 'pops', ',', 'hold', 'up', ',', 'hold', 'up', ',', 'hold', 'up', ',', 'hold', 'up', '!', 'Oh', 'my', 'God', '.', 'OK', ',', 'he', "'s", 'trying', 'to', 'battle', 'me', '.', 'Hold', 'on', ',', 'right', 'now', ',', 'hold', 'on', '.', 'Do', 'you', 'remember', 'when', 'you', 'used', 'to', 'beatbox', 'me', 'to', 'sleep', '?', 'EC', ':', 'Yeah', ',', 'yeah', ',', 'I', 'remember', '.', 'That', "'s", 'when', 'she', 'was', 'a', 'little', 'baby', '.', 'We', 'would', 'do', 'something', 'like', 'this']

Example: "OK, he's trying to battle me." → 'OK', ',', 'he', "'s", 'trying', 'to', 'battle', 'me', '.'

3) Word tokenization with NLTK's wordpunct_tokenize()

This approach keeps punctuation AND splits contractions into three parts (e.g., he's → "he", "'", "s").

In [10]:
tokens = wordpunct_tokenize(removed_parenthesis)
print(tokens[:150])
['Nicole', 'Paris', ':', 'TEDYouth', ',', 'make', 'some', 'noise', '!', 'TEDYouth', ',', 'make', 'some', '—', 'Are', 'you', 'ready', '?', 'Are', 'you', 'ready', '?', 'Ed', 'Cage', ':', 'Yeah', ',', 'yeah', ',', 'yeah', '!', 'EC', ':', 'Y', "'", 'all', 'like', 'that', '?', 'Let', 'me', 'show', 'you', 'how', 'we', 'used', 'to', 'do', 'it', '—', 'NP', ':', 'Get', 'it', 'pops', ',', 'go', 'ahead', '.', 'EC', ':', '...', 'when', 'I', 'was', 'growing', 'up', 'in', 'the', "'", '90s', '.', 'NP', ':', 'Pops', ',', 'pops', ',', 'pops', ',', 'pops', ',', 'pops', ',', 'pops', ',', 'hold', 'up', ',', 'hold', 'up', ',', 'hold', 'up', ',', 'hold', 'up', '!', 'Oh', 'my', 'God', '.', 'OK', ',', 'he', "'", 's', 'trying', 'to', 'battle', 'me', '.', 'Hold', 'on', ',', 'right', 'now', ',', 'hold', 'on', '.', 'Do', 'you', 'remember', 'when', 'you', 'used', 'to', 'beatbox', 'me', 'to', 'sleep', '?', 'EC', ':', 'Yeah', ',', 'yeah', ',', 'I', 'remember', '.', 'That', "'", 's', 'when', 'she', 'was', 'a', 'little']

Example: "OK, he's trying to battle me." → 'OK', ',', 'he', "'", 's', 'trying', 'to', 'battle', 'me', '.'

4) Sentence tokenization with TextBlob().sentences

In [11]:
tokens = TextBlob(removed_parenthesis).sentences
tokens[:10]
Out[11]:
[Sentence("Nicole Paris: TEDYouth, make some noise!"),
 Sentence("TEDYouth, make some —  Are you ready?"),
 Sentence("Are you ready?Ed Cage: Yeah, yeah, yeah!"),
 Sentence("EC: Y'all like that?"),
 Sentence("Let me show you how we used to do it —NP: Get it pops, go ahead.EC: ... when I was growing up in the '90s."),
 Sentence("NP: Pops, pops, pops, pops, pops, pops, hold up, hold up, hold up, hold up!"),
 Sentence("Oh my God."),
 Sentence("OK, he's trying to battle me."),
 Sentence("Hold on, right now, hold on."),
 Sentence("Do you remember when you used to beatbox me to sleep?EC: Yeah, yeah, I remember.")]

Example: "OK, he's trying to battle me." → "OK, he's trying to battle me." (kept intact as a single sentence)

5) Sentence tokenization with NLTK's sent_tokenize()

In [12]:
tokens = sent_tokenize(removed_parenthesis)
print(tokens[:10])
['Nicole Paris: TEDYouth, make some noise!', 'TEDYouth, make some —  Are you ready?', 'Are you ready?Ed Cage: Yeah, yeah, yeah!', "EC: Y'all like that?", "Let me show you how we used to do it —NP: Get it pops, go ahead.EC: ... when I was growing up in the '90s.", 'NP: Pops, pops, pops, pops, pops, pops, hold up, hold up, hold up, hold up!', 'Oh my God.', "OK, he's trying to battle me.", 'Hold on, right now, hold on.', 'Do you remember when you used to beatbox me to sleep?EC: Yeah, yeah, I remember.']

Example: "OK, he's trying to battle me." → "OK, he's trying to battle me." (kept intact as a single sentence)

6) Noun phrases with TextBlob().noun_phrases

In [13]:
tokens = TextBlob(removed_parenthesis).noun_phrases
tokens[:150]
Out[13]:
WordList(['nicole paris', 'tedyouth', 'tedyouth', 'ed cage', 'yeah', 'ec', "y'all", 'np', 'pops', 'oh', 'god', 'hold', 'ec', 'yeah', 'np', 'np', 'hold', "y'all", 'hold', 'right', 'history —', 'tickle —', 'beatbox', 'york', 'york', 'york', 'yeah', 'well', 'louis', 'np', "y'all hands", 'ec', 'york', 'dj', 'simple —', 'simple beats', 'well', 'jam sessions', 'jam sessions consist', "'ll look", "'ll text", 'kitchen cooking', 'road trips', 'standing', 'aw', 'dad', 'naw', 'jam session', 'yeah.np', 'tiny bit', 'jam session', 'np', "y'all", 'jam session', 'ec', "y'all", 'jam session', 'np', 'sorry', "ca n't", 'yeah', 'kick', 'np', 'ec', "y'all", 'np', 'thank', 'eg', 'thank', 'np', 'thank'])

Well, this one definitely doesn't cut it!

Lemmatization

In [14]:
# Using WordNetLemmatizer()
lemmatizer = WordNetLemmatizer()
lemmatized_text = [lemmatizer.lemmatize(w) for w in TextBlob(removed_parenthesis).words]
print(lemmatized_text[:100])
['Nicole', 'Paris', 'TEDYouth', 'make', 'some', 'noise', 'TEDYouth', 'make', 'some', '—', 'Are', 'you', 'ready', 'Are', 'you', 'ready', 'Ed', 'Cage', 'Yeah', 'yeah', 'yeah', 'EC', "Y'all", 'like', 'that', 'Let', 'me', 'show', 'you', 'how', 'we', 'used', 'to', 'do', 'it', '—NP', 'Get', 'it', 'pop', 'go', 'ahead.EC', 'when', 'I', 'wa', 'growing', 'up', 'in', 'the', "'90s", 'NP', 'Pops', 'pop', 'pop', 'pop', 'pop', 'pop', 'hold', 'up', 'hold', 'up', 'hold', 'up', 'hold', 'up', 'Oh', 'my', 'God', 'OK', 'he', "'s", 'trying', 'to', 'battle', 'me', 'Hold', 'on', 'right', 'now', 'hold', 'on', 'Do', 'you', 'remember', 'when', 'you', 'used', 'to', 'beatbox', 'me', 'to', 'sleep', 'EC', 'Yeah', 'yeah', 'I', 'remember', 'That', "'s", 'when', 'she']

Stemming

In [15]:
stemmer = nltk.stem.porter.PorterStemmer()
stemmed_text = [stemmer.stem(w) for w in TextBlob(removed_parenthesis).words]
print(stemmed_text[:100])
['nicol', 'pari', 'tedyouth', 'make', 'some', 'nois', 'tedyouth', 'make', 'some', '—', 'are', 'you', 'readi', 'are', 'you', 'readi', 'Ed', 'cage', 'yeah', 'yeah', 'yeah', 'EC', "y'all", 'like', 'that', 'let', 'me', 'show', 'you', 'how', 'we', 'use', 'to', 'do', 'it', '—np', 'get', 'it', 'pop', 'go', 'ahead.ec', 'when', 'I', 'wa', 'grow', 'up', 'in', 'the', "'90", 'NP', 'pop', 'pop', 'pop', 'pop', 'pop', 'pop', 'hold', 'up', 'hold', 'up', 'hold', 'up', 'hold', 'up', 'Oh', 'my', 'god', 'OK', 'he', "'s", 'tri', 'to', 'battl', 'me', 'hold', 'on', 'right', 'now', 'hold', 'on', 'Do', 'you', 'rememb', 'when', 'you', 'use', 'to', 'beatbox', 'me', 'to', 'sleep', 'EC', 'yeah', 'yeah', 'I', 'rememb', 'that', "'s", 'when', 'she']

Applying the Best Text Cleaning Approaches

In [59]:
def corpus_cleaner(corpus, stem='lemmatizer'):

    '''
    Take a corpus of documents and apply the best cleaning steps from above:
    
    1. Remove parentheticals
    2. Tokenize into words with wordpunct_tokenize()
    3. Lowercase and remove stopwords
    4. Lemmatize (or stem)
    
    Output = a list (corpus) of cleaned documents as space-joined strings
    '''
    
    # Define stemmer 
    if stem == 'lemmatizer':
        lemmatizer = nltk.stem.WordNetLemmatizer()
    elif stem == 'stemmer':
        stemmer = nltk.stem.porter.PorterStemmer()
    else:
        raise ValueError("Invalid stemmer, choose either 'lemmatizer' or 'stemmer'.")
    
    # Set stopwords
    stop = stopwords.words('english')
    stop += ['.', ',',':','...','!"','?"', "'", '"',' - ',' — ',',"','."','!', ';','♫♫','♫',\
             '.\'"','[',']','—',".\'", 'ok','okay','yeah','ya','stuff', ' 000 ',' em ',\
             ' oh ','thank','thanks','la','was','wa','?','like','go',' le ',' ca ',' I '," ? ","s", " t ","ve","re", \
             'oh', 'sort', 'maybe', 'guy', 'applause']
   
    output_corpus = []
    
    for document in corpus:
        cleaned_doc = []
        
        # Remove parentheticals
        clean_parens = re.sub(r'\([^)]*\)', ' ', document)
        
        # Tokenize
        for word in wordpunct_tokenize(clean_parens):

            # Remove stopwords
            if word.lower() not in stop:
                
                # Lemmatize or Stem
                if stem == 'lemmatizer':
                    cleaned_word = lemmatizer.lemmatize(word.lower())
                elif stem == 'stemmer':
                    cleaned_word = stemmer.stem(word.lower())
                else:
                    raise ValueError("Invalid stemmer, choose either 'lemmatizer' or 'stemmer'.")
            
                # Add to document
                cleaned_doc.append(cleaned_word.lower())
            
        # After cleaning all words for a document, add it to the corpus
        output_corpus.append(' '.join(cleaned_doc))
        
    return output_corpus
In [60]:
%%time
# Execute function to clean the whole corpus
cleaned_corpus = corpus_cleaner(ted_transcripts['transcript'])
Wall time: 33.8 s
In [61]:
# Check result
print(cleaned_corpus[2020][0:100])
nicole paris tedyouth make noise tedyouth make ready ready ed cage ec let show used np get pop ahead

N-gram Models

Checking whether unigrams, bigrams or trigrams make for more useful representations

In [19]:
# Function to extract most common n-grams
def top_n_gram(corpus, n=2, top_ngrams=10):
    counter = Counter()
    for doc in tqdm(corpus):
        words = TextBlob(doc).words
        n_grams = ngrams(words, n)
        counter += Counter(n_grams)
    for n_gram, count in counter.most_common(top_ngrams):
        print('%30s - %i' % (' '.join(n_gram), count))
In [20]:
# 15 Most common Unigrams
top_n_gram(cleaned_corpus, n=1, top_ngrams=15)
100%|██████████████████████████████████████| 2467/2467 [00:36<00:00, 66.76it/s]
                           one - 21266
                        people - 19800
                         thing - 14872
                          know - 13592
                          year - 13109
                         going - 12878
                          time - 12621
                         think - 12320
                             u - 12071
                           get - 11941
                           see - 11806
                         would - 11611
                        really - 11046
                           way - 10647
                         world - 10532

In [21]:
# 15 Most common Bigrams
top_n_gram(cleaned_corpus, n=2, top_ngrams=15)
100%|██████████████████████████████████████| 2467/2467 [02:52<00:00, 14.28it/s]
                      year ago - 2082
                    little bit - 1607
                      year old - 1365
                  united state - 1103
                     one thing - 1041
                  around world - 938
                      new york - 894
                       can not - 877
                    first time - 751
                     every day - 692
                   many people - 656
                     last year - 604
                  every single - 573
                       one day - 559
                       10 year - 542
In [22]:
# 15 Most common Trigrams
top_n_gram(cleaned_corpus, n=3, top_ngrams=15)
100%|██████████████████████████████████████| 2467/2467 [03:12<00:00, 12.78it/s]
                 new york city - 236
                  000 year ago - 135
                 new york time - 127
                   10 year ago - 118
              every single day - 109
              million year ago - 109
           people around world - 100
                  two year ago - 100
                  world war ii - 99
                 one two three - 97
               couple year ago - 96
                   20 year ago - 83
                 five year old - 78
               talk little bit - 71
                spend lot time - 71
In [23]:
# The trigrams don't seem to have much useful information that can't be captured by bigrams
# Hence I'll limit my N-gram range to unigrams and bigrams
NGRAM_MIN = 1
NGRAM_MAX = 2

Vectorizing Data

Vectorizing = Turning words into numerical representations
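
To make the idea concrete, here is a minimal illustration on a toy two-document corpus (the toy_* names are purely illustrative): CountVectorizer maps each document to a row of token counts over the shared vocabulary, and TfidfVectorizer would reweight those counts by how rare each token is across documents. The matrices below are the same idea, just with ~2,467 documents and a much larger vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ['the cat sat', 'the cat sat on the mat']
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

print(toy_vectorizer.get_feature_names())  # ['cat', 'mat', 'on', 'sat', 'the']
print(toy_matrix.toarray())                # [[1 0 0 1 1]
                                           #  [1 1 1 1 2]]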

In [69]:
# Instantiate TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range = (NGRAM_MIN, NGRAM_MAX),
                             stop_words = 'english',
                             max_df = 0.5,
                             max_features = len(cleaned_corpus))

# Obtain vectorized data
vect_data = vectorizer.fit_transform(cleaned_corpus)

# Check result
plt.figure(figsize=(8, 8))
plt.spy(vect_data, markersize=0.01)
plt.show()

TF-IDF vectorization doesn't seem to pair well with the topic modeling techniques further down the pipeline, so I'll use CountVectorizer instead.

From a look at the LDA paper, this seems to happen because LDA already performs a kind of built-in TF-IDF-style reweighting...

Source: http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf

In [70]:
# Instantiate CountVectorizer
vectorizer = CountVectorizer(ngram_range = (NGRAM_MIN, NGRAM_MAX),
                             stop_words = 'english',
                             max_df = 0.5,
                             max_features = len(cleaned_corpus))

# Obtain vectorized data
vect_data = vectorizer.fit_transform(cleaned_corpus)

# Check result
plt.figure(figsize=(8, 8))
plt.spy(vect_data, markersize=0.01)
plt.show()

The word vectors seem to be well distributed across the documents.

Topic Modeling

In [71]:
NUM_TOPICS = 15
#NUM_TOPICS = np.random.randint(10, 25)
print(f'Modeling with {NUM_TOPICS} topics!')
Modeling with 15 topics!

1) Latent Dirichlet Allocation (LDA)

In [72]:
# Instantiate object
LDA_obj = LatentDirichletAllocation(n_components = NUM_TOPICS,
                                    max_iter = NUM_TOPICS,
                                    batch_size = 32,
                                    learning_method = 'online',
                                    n_jobs = -1)

# Obtain clustered data
LDA_data = LDA_obj.fit_transform(vect_data)

# Create dictionary with the top words in each topic
LDA_topics_dict = {}
for idx, topic in tqdm(enumerate(LDA_obj.components_), total=len(LDA_obj.components_)):
    LDA_topics_dict[idx] = [vectorizer.get_feature_names()[i] for i in topic.argsort()][:-15:-1]

# Print the topics and their common words
for k, v in LDA_topics_dict.items():
    print(f'Topic {k}:')
    print(" ".join(v))
    print('')
100%|██████████████████████████████████████████| 15/15 [04:02<00:00, 16.15s/it]
Topic 0:
country percent government money state company business africa global dollar million market economy social

Topic 1:
school kid child story family old home student year old teacher parent young told community

Topic 2:
feel experience word believe story love self live mind person social culture reason sense

Topic 3:
robot art rule sound image object body light space eye pattern create artist color

Topic 4:
guy love word book wanted minute try hand feel yes person thinking everybody pretty

Topic 5:
car energy percent ca dollar 000 oil water million cost power billion climate technology

Topic 6:
data technology computer information internet phone medium online digital using video device example company

Topic 7:
brain machine science example technology model understand information experiment computer language scientist learning behavior

Topic 8:
food plant eat fish choice farmer percent feed farm crop eating diet grow seed

Topic 9:
planet earth water animal ocean specie universe sea 000 star light ice mar fish

Topic 10:
cell patient cancer disease drug health body doctor medical blood treatment gene medicine heart

Topic 11:
city building design space project material water built map build create street house architecture

Topic 12:
war violence state police country american refugee group prison military law conflict weapon muslim

Topic 13:
woman men child girl baby family sex mother black female care boy male percent

Topic 14:
game play music video dna playing player real genome played video game piece toy tool


In [73]:
# For each document, classify its theme as the highest-scoring topic
document_topics = np.argmax(LDA_data, axis=1)

# Then convert to dataframe
pred_labels = pd.DataFrame(document_topics)
pred_labels.head()
Out[73]:
0
0 1
1 5
2 4
3 11
4 0
In [74]:
# Convert the topic numbers to more understandable topic names
topic_names = pred_labels.copy()
for topic_code in range(pred_labels.nunique()[0]):
    topic_names[0][topic_names[0] == topic_code] = ' '.join(LDA_topics_dict[topic_code][:5])
    
topic_names.head()
Out[74]:
0
0 school kid child story family
1 car energy percent ca dollar
2 guy love word book wanted
3 city building design space project
4 country percent government money state
In [75]:
# Visualize topic frequency
fig, ax = plt.subplots(figsize=(15,12))
plt.tick_params(labelsize=15)
sns.countplot(y=topic_names[0].values)
plt.show()
In [76]:
# Interactive visualization for topic modeling with pyLDAvis
import pyLDAvis, pyLDAvis.sklearn
from IPython.display import display 
    
# Setup to run in Jupyter notebook
pyLDAvis.enable_notebook()

# Create the visualization
vis = pyLDAvis.sklearn.prepare(LDA_obj, vect_data, vectorizer)

# Export as a standalone HTML web page
pyLDAvis.save_html(vis, 'lda.html')

# Let's view it!
display(vis)
In [32]:
# Option to skip testing other methods since we'll be using LDA
SKIP = False

2) Non-Negative Matrix Factorization (NMF)

In [33]:
if not SKIP:

    # Instantiate NMF object
    NMF_obj = NMF(n_components = NUM_TOPICS)

    # Obtain clustered data
    NMF_data = NMF_obj.fit_transform(vect_data)

    # Create dictionary with the top words in each topic
    NMF_topics_dict = {}
    for idx, topic in tqdm(enumerate(NMF_obj.components_), total=len(NMF_obj.components_)):
        NMF_topics_dict[idx] = [vectorizer.get_feature_names()[i] for i in topic.argsort()][:-15:-1]

    # Print the topics and their common words
    for k, v in NMF_topics_dict.items():
        print(f'Topic {k}:')
        print(" ".join(v))
        print('')
100%|██████████████████████████████████████████| 15/15 [02:41<00:00, 10.74s/it]
Topic 0:
love feel word person experience guy friend hand man god mind moment old wanted

Topic 1:
woman men girl man sex gender boy young female mother black male pm violence

Topic 2:
brain neuron cell body memory animal area sleep light mind ability child region control

Topic 3:
country state government percent africa china global united economic india money war political economy

Topic 4:
water ocean food animal planet fish specie earth sea plant 000 percent area tree

Topic 5:
cell cancer patient disease drug body blood tumor health stem cell stem doctor organ medicine

Topic 6:
child school kid family food teacher education student parent old mother girl percent community

Topic 7:
city building space design street community public neighborhood flag project built new york york architecture

Topic 8:
computer technology design machine sort example project information using building internet create learning book

Topic 9:
universe space planet galaxy star black light earth hole black hole telescope particle energy sun

Topic 10:
data information number patient health using web decision drug company percent algorithm map study

Topic 11:
car energy ca dollar percent power oil em cost company money nuclear mile fuel

Topic 12:
game play video video game playing real player hour sound social online music feel win

Topic 13:
robot body build building animal leg play sort foot rule ant lab video task

Topic 14:
story book film tell story told read telling live picture mother character africa movie wanted


3) Latent Semantic Analysis (LSA)

In [34]:
if not SKIP:

    # Instantiate LSA object
    LSA_obj = TruncatedSVD(n_components = NUM_TOPICS)

    # Obtain clustered data
    LSA_data = LSA_obj.fit_transform(vect_data)

    # Create dictionary with the top words in each topic
    LSA_topics_dict = {}
    for idx, topic in tqdm(enumerate(LSA_obj.components_), total=len(LSA_obj.components_)):
        LSA_topics_dict[idx] = [vectorizer.get_feature_names()[i] for i in topic.argsort()][:-15:-1]

    # Print the topics and their common words
    for k, v in LSA_topics_dict.items():
        print(f'Topic {k}:')
        print(" ".join(v))
        print('')
100%|██████████████████████████████████████████| 15/15 [02:41<00:00, 10.78s/it]
Topic 0:
woman country story percent child brain technology school million 000 example kid course city

Topic 1:
woman country men child girl school story family kid man mother young community boy

Topic 2:
brain woman cell men cancer body patient child girl love neuron disease story man

Topic 3:
country cell cancer percent brain disease patient health drug africa state government data dollar

Topic 4:
woman water planet earth men cancer cell space universe energy ocean light star black

Topic 5:
brain country woman neuron state china energy planet power men global universe political government

Topic 6:
child water school kid food brain planet city family earth ocean animal fish area

Topic 7:
city building brain car design woman street cell community public architecture flag space project

Topic 8:
story cancer cell country feel love war god political state body book believe american

Topic 9:
universe black space child galaxy city hole black hole data star light country image building

Topic 10:
data story water city information ocean health patient child car fish care doctor shark

Topic 11:
car ca energy em dollar kid love money oil universe cost percent solar hour

Topic 12:
game cancer play city health patient care black doctor video feel video game experience playing

Topic 13:
robot child car ca family care health em patient baby body power energy mother

Topic 14:
game child story cell play data car video country family video game africa light india


4) Latent Semantic Analysis (LSA) + Normalization

In [35]:
if not SKIP:

    # Instantiate LSA_norm object
    LSA_norm_obj = TruncatedSVD(n_components = NUM_TOPICS)

    # Normalize the vectorized data
    stdScale = Normalizer()
    vect_data_norm = stdScale.fit_transform(vect_data)

    # Obtain clustered data
    LSA_norm_data = LSA_norm_obj.fit_transform(vect_data_norm)

    # Create dictionary with the top words in each topic
    LSA_norm_topics_dict = {}
    for idx, topic in tqdm(enumerate(LSA_norm_obj.components_), total=len(LSA_norm_obj.components_)):
        LSA_norm_topics_dict[idx] = [vectorizer.get_feature_names()[i] for i in topic.argsort()][:-15:-1]

    # Print the topics and their common words
    for k, v in LSA_norm_topics_dict.items():
        print(f'Topic {k}:')
        print(" ".join(v))
        print('')
100%|██████████████████████████████████████████| 15/15 [02:59<00:00, 11.98s/it]
Topic 0:
story woman child country percent technology school kid love feel old 000 course city

Topic 1:
woman child men story girl school family country kid mother man young love told

Topic 2:
country city percent government dollar state global million company money africa business 000 economy

Topic 3:
brain cell woman cancer disease patient data percent health drug body country information gene

Topic 4:
woman water planet earth ocean animal men light specie energy body sea space cell

Topic 5:
child school kid water food family teacher education animal disease parent cell percent student

Topic 6:
city woman building design cell project brain school community cancer space patient car men

Topic 7:
city brain story country feel love war state street patient cell mind community disease

Topic 8:
brain child woman country city school planet neuron language space universe earth education computer

Topic 9:
story child country image information technology book cell art africa space building data cancer

Topic 10:
data city information car planet number story earth universe child star image map light

Topic 11:
story water brain food animal information robot company fish ocean book business film guy

Topic 12:
child music technology robot car sound play water data family video game machine patient

Topic 13:
car cell story energy dollar brain technology money billion universe light machine million 000

Topic 14:
music sound play city country cell data video piece africa 000 school story language


None of the topic modeling techniques is a clear winner here...

Since LDA is widely considered the state-of-the-art technique, I'll employ it in the recommender system.

Recommender

In [77]:
def get_recommendation(TARGET_ID, NUM_RECOMMENDATIONS = 5):
    
    '''
    Requires the following objects created earlier in this notebook:
    1. Trained vectorizer
    2. Trained LDA_obj
    3. Transformed LDA_data
    4. topic_names (DataFrame with the modeled topic for each TED Talk)
    5. ted_transcripts DataFrame containing both CSVs already merged
    '''
    
    # Vectorize the document corresponding to the TARGET_ID
    target_vector = vectorizer.transform([cleaned_corpus[TARGET_ID]])
    
    # Model the vector with the trained LDA_obj
    target_modeled = LDA_obj.transform(target_vector)
    
    # Fit a KNN algorithm on the whole dataset modeled with LDA
    NN = NearestNeighbors(n_neighbors=NUM_RECOMMENDATIONS+1, metric='cosine', algorithm='brute', n_jobs=-1)
    NN.fit(LDA_data)
    
    # Find the nearest neighbors of the LDA vector corresponding to the TARGET_ID
    results = NN.kneighbors(target_modeled)
    recommend_list = results[1][0]
    similarity_scores = results[0][0]

    # Loop to extract relevant information about the recommendations
    titles, modeled_topics, tags, descriptions = [], [] ,[], []
    for idx in recommend_list:
        titles.append(ted_transcripts.loc[idx,'title'])
        modeled_topics.append(topic_names.iloc[idx,0])
        tags.append(ted_transcripts.loc[idx,'tags'])
        descriptions.append(ted_transcripts.loc[idx,'description'])

    # Put recommendations in a dataframe for outputting
    output_df = pd.DataFrame({'ID': recommend_list,
                              'Similarity Score': similarity_scores,
                              'Title': titles,
                              'Modeled Topic': modeled_topics,
                              'Tags': tags,
                              'Description': descriptions})
    
    # Customize index to specify that the first row is the TED Talk from the TARGET_ID 
    custom_index = np.arange(1, NUM_RECOMMENDATIONS+1).tolist()
    custom_index.insert(0, 'Base')
    output_df.set_index([custom_index], inplace=True)
    
    return output_df
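
A small design note: get_recommendation() re-fits a NearestNeighbors model on LDA_data on every call. For repeated lookups, a minimal sketch (the query_neighbors name is hypothetical, reusing the vectorizer, LDA_obj, cleaned_corpus and LDA_data objects from above) would fit the index once and reuse it:

# Hypothetical variant: fit the cosine neighbor index once, then query it repeatedly.
NN_index = NearestNeighbors(metric='cosine', algorithm='brute', n_jobs=-1)
NN_index.fit(LDA_data)

def query_neighbors(target_id, num_recommendations=5):
    # Embed the target talk with the already-trained vectorizer and LDA model
    target_vector = vectorizer.transform([cleaned_corpus[target_id]])
    target_modeled = LDA_obj.transform(target_vector)
    # Ask for one extra neighbor: the closest match is the target talk itself
    distances, indices = NN_index.kneighbors(target_modeled,
                                             n_neighbors=num_recommendations + 1)
    return indices[0], distances[0]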
In [78]:
get_recommendation(2020, NUM_RECOMMENDATIONS=10)
Out[78]:
ID Similarity Score Title Modeled Topic Tags Description
Base 2020 2.220446e-16 A beatboxing lesson from a father-daughter duo guy love word book wanted ['TEDYouth', 'art', 'entertainment', 'family',... Nicole Paris was raised to be a beatboxer -- w...
1 2382 4.954671e-03 Songs that bring history to life guy love word book wanted ['history', 'live music', 'music'] Rhiannon Giddens pours the emotional weight of...
2 179 8.770358e-03 The music wars guy love word book wanted ['entertainment', 'humor', 'music', 'technology'] New York Times tech columnist David Pogue perf...
3 2315 9.144099e-03 "Rollercoaster" guy love word book wanted ['guitar', 'live music', 'music', 'performance... Singer, songwriter and actress Sara Ramirez is...
4 304 9.628546e-03 Playing invisible turntables guy love word book wanted ['entertainment', 'humor', 'illusion', 'live m... Human beatbox James "AudioPoet" Burchfield per...
5 934 1.248823e-02 Try something new for 30 days guy love word book wanted ['culture', 'success'] Is there something you've always meant to do, ...
6 1665 1.252386e-02 The Museum of Four in the Morning guy love word book wanted ['entertainment', 'humor', 'online video', 'sp... Beware: Rives has a contagious obsession with ...
7 165 1.274940e-02 A performance of "Mathemagic" guy love word book wanted ['education', 'entertainment', 'magic', 'math'... In a lively show, mathemagician Arthur Benjami...
8 2232 1.351661e-02 "St. James Infirmary Blues" guy love word book wanted ['art', 'live music', 'music', 'performance', ... Singer Rhiannon Giddens joins international mu...
9 191 1.370682e-02 Juggle and jest guy love word book wanted ['collaboration', 'entertainment', 'humor', 'p... Illustrious jugglers the Raspyni Brothers show...
10 1844 1.381693e-02 A magical search for a coincidence guy love word book wanted ['entertainment', 'illusion', 'magic'] Small coincidences. They happen all the time a...

End