# General Packages
import numpy as np
import pandas as pd
import os
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Iterators
from collections import Counter
from itertools import islice
from operator import itemgetter
from tqdm import tqdm
# Text
import re
from textblob import TextBlob
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, MWETokenizer
from nltk.stem import porter, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.util import ngrams
# Scikit-Learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD, NMF
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.neighbors import NearestNeighbors
# Download nltk packages (wordnet is required by WordNetLemmatizer below)
nltk.download('punkt')
nltk.download('brown')
nltk.download('stopwords')
nltk.download('wordnet')
# Path to CSVs
path = 'C:/Portfolio/TED_Talks_Recommender/'
# Load TED Main
ted_main = pd.read_csv(path + 'ted_main.csv')
# Load TED Transcripts
ted_transcripts = pd.read_csv(path + 'transcripts.csv')
# Merge them
ted_transcripts = pd.merge(ted_main, ted_transcripts, on='url')
ted_transcripts.head(3)
# Return first n items of the iterable as a list
def take(n, iterable):
    return list(islice(iterable, n))
# Apply word counter, then sort by frequency, then extract top words
WC = Counter(" ".join(ted_transcripts['transcript']).split())
WC_sorted = {k: v for k, v in sorted(WC.items(), key=lambda item: item[1], reverse=True)}
top_words = take(15, WC_sorted.items())
top_words
As far as common words go, everything seems fine.
# Reading a couple of the transcripts
ted_transcripts['transcript'][2020][0:1000]
There seem to be some non-speech sounds and other cues added to the transcripts. In the example above we can see (Beatboxing) and (Laughter). Luckily, these non-speech cues appear to be demarcated as parentheticals (parentheses () or brackets []), which makes it easy to remove them using regular expressions.
# Checking how many different elements in the transcript have () or []
non_speech_elements = [key for key, value in WC_sorted.items() if '(' in key or '[' in key]
print("Number of non-speech elements:", len(non_speech_elements))
print("Most common non-speech elements:", non_speech_elements[:20])
# Regular expression approach to removing elements inside parentheses
removed_parenthesis = re.sub(r'\([^)]*\)', ' ', ted_transcripts['transcript'][2020])
# Check result
removed_parenthesis[0:1000]
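The same idea extends to square brackets, which the note above mentions but the snippet handles only for parentheses; a minimal sketch of a combined pattern (assuming bracketed cues like [Music] follow the same convention):
# Remove both (...) and [...] cues in a single pass
removed_both = re.sub(r'\([^)]*\)|\[[^\]]*\]', ' ', ted_transcripts['transcript'][2020])
removed_both[0:1000]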
Exploring a couple of methods for tokenizing words:
1) Word tokenization with TextBlob().words
tokens = TextBlob(removed_parenthesis).words
tokens[:150]
OK, he's trying to battle me. → 'OK', 'he', "'s", 'trying', 'to', 'battle', 'me'
2) Word tokenization with NLTK's word_tokenize()
This approach keeps punctuation
tokens = word_tokenize(removed_parenthesis)
print(tokens[:150])
OK, he's trying to battle me. → 'OK', ',', 'he', "'s", 'trying', 'to', 'battle', 'me', '.'
3) Word tokenization with NLTK's wordpunct_tokenize()
This approach keeps punctuation AND splits contractions into three parts (e.g. he's → "he", "'", "s")
tokens = wordpunct_tokenize(removed_parenthesis)
print(tokens[:150])
OK, he's trying to battle me. → 'OK', ',', 'he', "'", 's', 'trying', 'to', 'battle', 'me', '.'
4) Sentence tokenization with TextBlob().sentences
tokens = TextBlob(removed_parenthesis).sentences
tokens[:10]
OK, he's trying to battle me. → OK, he's trying to battle me.
5) Sentence tokenization with NLTK's sent_tokenize()
tokens = sent_tokenize(removed_parenthesis)
print(tokens[:10])
OK, he's trying to battle me. → OK, he's trying to battle me.
6) Noun phrases with TextBlob().noun_phrases
tokens = TextBlob(removed_parenthesis).noun_phrases
tokens[:150]
Well, this one definitely doesn't cut it!
# Using WordNetLemmatizer()
lemmatizer = WordNetLemmatizer()
lemmatized_text = [lemmatizer.lemmatize(w) for w in TextBlob(removed_parenthesis).words]
print(lemmatized_text[:100])
stemmer = nltk.stem.porter.PorterStemmer()
stemmed_text = [stemmer.stem(w) for w in TextBlob(removed_parenthesis).words]
print(stemmed_text[:100])
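To get a concrete feel for the difference between the two, a quick side-by-side on a few sample words (the word list is just an illustrative choice):
# Compare lemmatization vs. stemming on a handful of sample words
for w in ['running', 'studies', 'better', 'technologies']:
    print(f"{w:15s} lemma: {lemmatizer.lemmatize(w):15s} stem: {stemmer.stem(w)}")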
def corpus_cleaner(corpus, stem='lemmatizer'):
    '''
    Take a corpus of documents and apply the best cleaning steps from above.
    1. Remove parentheticals
    2. Tokenize into words with wordpunct_tokenize()
    3. Lowercase and remove stopwords
    4. Lemmatize (or stem)
    5. Lowercase again before adding the word to the document
    Output = A list (corpus) of cleaned documents (strings)
    '''
    # Define stemmer
    if stem == 'lemmatizer':
        lemmatizer = nltk.stem.WordNetLemmatizer()
    elif stem == 'stemmer':
        stemmer = nltk.stem.porter.PorterStemmer()
    else:
        raise ValueError("Invalid stemmer, choose either 'lemmatizer' or 'stemmer'.")
    # Set stopwords
    stop = stopwords.words('english')
    stop += ['.', ',', ':', '...', '!"', '?"', "'", '"', ' - ', ' — ', ',"', '."', '!', ';', '♫♫', '♫',
             '.\'"', '[', ']', '—', ".\'", 'ok', 'okay', 'yeah', 'ya', 'stuff', ' 000 ', ' em ',
             ' oh ', 'thank', 'thanks', 'la', 'was', 'wa', '?', 'like', 'go', ' le ', ' ca ', ' I ', " ? ", "s", " t ", "ve", "re",
             'oh', 'sort', 'maybe', 'guy', 'applause']
    output_corpus = []
    for document in corpus:
        cleaned_doc = []
        # Remove parentheticals
        clean_parens = re.sub(r'\([^)]*\)', ' ', document)
        # Tokenize
        for word in wordpunct_tokenize(clean_parens):
            # Remove stopwords
            if word.lower() not in stop:
                # Lemmatize or stem
                if stem == 'lemmatizer':
                    cleaned_word = lemmatizer.lemmatize(word.lower())
                elif stem == 'stemmer':
                    cleaned_word = stemmer.stem(word.lower())
                else:
                    raise ValueError("Invalid stemmer, choose either 'lemmatizer' or 'stemmer'.")
                # Add to document
                cleaned_doc.append(cleaned_word.lower())
        # After cleaning all words for a document, add it to the corpus
        output_corpus.append(' '.join(cleaned_doc))
    return output_corpus
%%time
# Execute function to clean the whole corpus
cleaned_corpus = corpus_cleaner(ted_transcripts['transcript'])
# Check result
print(cleaned_corpus[2020][0:100])
Checking whether unigrams, bigrams or trigrams make for more useful representations
# Function to extract most common n-grams
def top_n_gram(corpus, n=2, top_ngrams=10):
    counter = Counter()
    for doc in tqdm(corpus):
        words = TextBlob(doc).words
        n_grams = ngrams(words, n)
        counter += Counter(n_grams)
    for n_gram, count in counter.most_common(top_ngrams):
        print('%30s - %i' % (' '.join(n_gram), count))
# 15 Most common Unigrams
top_n_gram(cleaned_corpus, n=1, top_ngrams=15)
# 15 Most common Bigrams
top_n_gram(cleaned_corpus, n=2, top_ngrams=15)
# 15 Most common Trigrams
top_n_gram(cleaned_corpus, n=3, top_ngrams=15)
# The trigrams don't seem to have much useful information that can't be captured by bigrams
# Hence I'll limit my N-gram range to unigrams and bigrams
NGRAM_MIN = 1
NGRAM_MAX = 2
Vectorizing = Turning words into numerical representations
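As a minimal sketch of what the vectorizer produces (the two toy sentences below are made up purely for illustration):
# Tiny demonstration: two toy sentences turned into a document-term matrix
toy_docs = ['the brain learns by making connections', 'making music connects people']
toy_vect = TfidfVectorizer()
toy_matrix = toy_vect.fit_transform(toy_docs)
print(toy_vect.get_feature_names())      # vocabulary learned from the toy corpus
print(toy_matrix.toarray().round(2))     # each row is a document, each column a term weight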
# Instantiate TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range = (NGRAM_MIN, NGRAM_MAX),
stop_words = 'english',
max_df = 0.5,
max_features = len(cleaned_corpus))
# Obtain vectorized data
vect_data = vectorizer.fit_transform(cleaned_corpus)
# Check result
plt.figure(figsize=(8, 8))
plt.spy(vect_data, markersize=0.01)
plt.show()
TF-IDF vectorization doesn't seem to play well with the topic modeling techniques further down the pipeline, so I'll use CountVectorizer instead.
From a look at the LDA paper, this seems to happen because LDA already has a sort of built-in TF-IDF weighting...
Source: http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf
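A quick sanity check of the difference in inputs (scikit-learn's LatentDirichletAllocation is generally described as working on raw term counts; the snippet below only illustrates the count vs. weight distinction):
# CountVectorizer yields integer term counts, TfidfVectorizer yields real-valued weights
count_sample = CountVectorizer().fit_transform(cleaned_corpus[:5])
tfidf_sample = TfidfVectorizer().fit_transform(cleaned_corpus[:5])
print(count_sample.dtype, count_sample[:1].toarray().max())   # integer counts
print(tfidf_sample.dtype, tfidf_sample[:1].toarray().max())   # fractional weights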
# Instantiate CountVectorizer
vectorizer = CountVectorizer(ngram_range = (NGRAM_MIN, NGRAM_MAX),
stop_words = 'english',
max_df = 0.5,
max_features = len(cleaned_corpus))
# Obtain vectorized data
vect_data = vectorizer.fit_transform(cleaned_corpus)
# Check result
plt.figure(figsize=(8, 8))
plt.spy(vect_data, markersize=0.01)
plt.show()
The word vectors seem to be well distributed across the documents.
NUM_TOPICS = 15
#NUM_TOPICS = np.random.randint(10, 25)
print(f'Modeling with {NUM_TOPICS} topics!')
# Instantiate object
LDA_obj = LatentDirichletAllocation(n_components = NUM_TOPICS,
max_iter = NUM_TOPICS,
batch_size = 32,
learning_method = 'online',
n_jobs = -1)
# Obtain clustered data
LDA_data = LDA_obj.fit_transform(vect_data)
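Each row of LDA_data should be a document's distribution over the NUM_TOPICS topics, so rows should sum to roughly 1; a quick check:
# Inspect the document-topic matrix returned by fit_transform
print(LDA_data.shape)             # (n_documents, NUM_TOPICS)
print(LDA_data[2020].round(3))    # topic mixture for one talk
print(LDA_data[2020].sum())       # should sum to roughly 1 (topic proportions)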
# Create dictionary with the most common words in each topic
LDA_topics_dict = {}
for idx, topic in tqdm(enumerate(LDA_obj.components_), total=len(LDA_obj.components_)):
    LDA_topics_dict[idx] = [vectorizer.get_feature_names()[i] for i in topic.argsort()][:-15:-1]
# Print the topics and their common words
for k, v in LDA_topics_dict.items():
    print(f'Topic {k}:')
    print(" ".join(v))
    print('')
# For each document, classify its theme as the highest scoring topic
document_topics = np.argmax(LDA_data, axis=1)
# Then convert to dataframe
pred_labels = pd.DataFrame(document_topics)
pred_labels.head()
# Convert the numbers to more understandable topic names
topic_names = pred_labels.copy()
for topic_code in range(pred_labels.nunique()[0]):
    topic_names[0][topic_names[0] == topic_code] = ' '.join(LDA_topics_dict[topic_code][:5])
topic_names.head()
# Visualize topic frequency
fig, ax = plt.subplots(figsize=(15,12))
plt.tick_params(labelsize=15)
sns.countplot(y=topic_names[0].values)
plt.show()
# Interactive visualization for topic modeling with pyLDAvis
import pyLDAvis, pyLDAvis.sklearn
from IPython.display import display
# Setup to run in Jupyter notebook
pyLDAvis.enable_notebook()
# Create the visualization
vis = pyLDAvis.sklearn.prepare(LDA_obj, vect_data, vectorizer)
# Export as a standalone HTML web page
pyLDAvis.save_html(vis, 'lda.html')
# Let's view it!
display(vis)
# Option to skip testing other methods since we'll be using LDA
SKIP = False
if not SKIP:
    # Instantiate NMF object
    NMF_obj = NMF(n_components = NUM_TOPICS)
    # Obtain clustered data
    NMF_data = NMF_obj.fit_transform(vect_data)
    # Create dictionary with the most common words in each topic
    NMF_topics_dict = {}
    for idx, topic in tqdm(enumerate(NMF_obj.components_), total=len(NMF_obj.components_)):
        NMF_topics_dict[idx] = [vectorizer.get_feature_names()[i] for i in topic.argsort()][:-15:-1]
    # Print the topics and their common words
    for k, v in NMF_topics_dict.items():
        print(f'Topic {k}:')
        print(" ".join(v))
        print('')
if not SKIP:
    # Instantiate LSA object
    LSA_obj = TruncatedSVD(n_components = NUM_TOPICS)
    # Obtain clustered data
    LSA_data = LSA_obj.fit_transform(vect_data)
    # Create dictionary with the most common words in each topic
    LSA_topics_dict = {}
    for idx, topic in tqdm(enumerate(LSA_obj.components_), total=len(LSA_obj.components_)):
        LSA_topics_dict[idx] = [vectorizer.get_feature_names()[i] for i in topic.argsort()][:-15:-1]
    # Print the topics and their common words
    for k, v in LSA_topics_dict.items():
        print(f'Topic {k}:')
        print(" ".join(v))
        print('')
if not SKIP:
    # Instantiate LSA_norm object
    LSA_norm_obj = TruncatedSVD(n_components = NUM_TOPICS)
    # Normalize the vectorized data
    stdScale = Normalizer()
    vect_data_norm = stdScale.fit_transform(vect_data)
    # Obtain clustered data
    LSA_norm_data = LSA_norm_obj.fit_transform(vect_data_norm)
    # Create dictionary with the most common words in each topic
    LSA_norm_topics_dict = {}
    for idx, topic in tqdm(enumerate(LSA_norm_obj.components_), total=len(LSA_norm_obj.components_)):
        LSA_norm_topics_dict[idx] = [vectorizer.get_feature_names()[i] for i in topic.argsort()][:-15:-1]
    # Print the topics and their common words
    for k, v in LSA_norm_topics_dict.items():
        print(f'Topic {k}:')
        print(" ".join(v))
        print('')
None of the topic modeling techniques seems to be a clear winner here...
Since LDA is considered to be the state-of-the-art technique, I'll employ it in the recommender system.
def get_recommendation(TARGET_ID, NUM_RECOMMENDATIONS=5):
    '''
    Requires the following objects from earlier in this notebook:
    1. Trained vectorizer
    2. Trained LDA_obj
    3. Transformed LDA_data
    4. topic_names (dataframe with the modeled topic for each TED Talk)
    5. ted_transcripts dataframe containing both CSVs already merged
    '''
    # Vectorize the document corresponding to the TARGET_ID
    target_vector = vectorizer.transform([cleaned_corpus[TARGET_ID]])
    # Model the vector with the trained LDA_obj
    target_modeled = LDA_obj.transform(target_vector)
    # Fit a KNN algorithm on the whole dataset modeled with LDA
    NN = NearestNeighbors(n_neighbors=NUM_RECOMMENDATIONS+1, metric='cosine', algorithm='brute', n_jobs=-1)
    NN.fit(LDA_data)
    # Find the nearest neighbors of the LDA vector corresponding to the TARGET_ID
    # (kneighbors returns cosine distances, where lower means more similar)
    results = NN.kneighbors(target_modeled)
    recommend_list = results[1][0]
    distances = results[0][0]
    # Loop to extract relevant information about the recommendations
    titles, modeled_topics, tags, descriptions = [], [], [], []
    for idx in recommend_list:
        titles.append(ted_transcripts.loc[idx, 'title'])
        modeled_topics.append(topic_names.iloc[idx, 0])
        tags.append(ted_transcripts.loc[idx, 'tags'])
        descriptions.append(ted_transcripts.loc[idx, 'description'])
    # Put recommendations in a dataframe for outputting
    output_df = pd.DataFrame({'ID': recommend_list,
                              'Cosine Distance': distances,
                              'Title': titles,
                              'Modeled Topic': modeled_topics,
                              'Tags': tags,
                              'Description': descriptions})
    # Customize index to specify that the first row is the TED Talk from the TARGET_ID
    custom_index = np.arange(1, NUM_RECOMMENDATIONS+1).tolist()
    custom_index.insert(0, 'Base')
    output_df.set_index([custom_index], inplace=True)
    return output_df
get_recommendation(2020, NUM_RECOMMENDATIONS=10)
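To recommend from a talk chosen by title rather than a hard-coded index, one could look the ID up first; a small sketch (assuming the merged dataframe keeps its default integer index, and the search term is only an example):
# Hypothetical lookup: find the row index of a talk whose title contains a substring
matches = ted_transcripts[ted_transcripts['title'].str.contains('creativity', case=False)]
print(matches[['title']].head())
# get_recommendation(matches.index[0], NUM_RECOMMENDATIONS=5)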