Data Science labs blog

Topic Modeling with LSA, PLSA, LDA & lda2Vec algorithms

Overview

LSA

Code

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

documents = ["doc1.txt", "doc2.txt", "doc3.txt"]

# raw documents to tf-idf matrix:
vectorizer = TfidfVectorizer(input='filename',  # entries of documents are file paths
                             stop_words='english',
                             use_idf=True,
                             smooth_idf=True)

# SVD to reduce dimensionality:
svd_model = TruncatedSVD(n_components=100,  # number of latent dimensions
                         algorithm='randomized',
                         n_iter=10)

# pipeline of tf-idf + SVD, fit to and applied to documents:
svd_transformer = Pipeline([('tfidf', vectorizer),
                            ('svd', svd_model)])
svd_matrix = svd_transformer.fit_transform(documents)

# svd_matrix can later be used to compare documents, compare words,
# or compare queries with documents
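Once documents are embedded in the reduced LSA space, the usual way to compare them is cosine similarity between their row vectors. A minimal sketch, with made-up stand-in vectors in place of actual rows of `svd_matrix`:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two LSA document vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical stand-ins for two rows of svd_matrix
doc_a = np.array([0.8, 0.1, 0.3])
doc_b = np.array([0.7, 0.2, 0.4])
print(cosine_similarity(doc_a, doc_b))
```

The same function works for comparing a query vector (transformed through the fitted pipeline) against every document row.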

PLSA

https://www.slideshare.net/NYCPredictiveAnalytics/introduction-to-probabilistic-latent-semantic-analysis
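The slides linked above derive PLSA's EM updates: the E-step computes the topic posterior P(z|d,w), and the M-step re-estimates P(w|z) and P(z|d) from the expected counts. A toy NumPy sketch of those updates, on a hypothetical 2-document, 4-word count matrix (not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical term-document counts n(d, w): 2 docs x 4 words
N = np.array([[3.0, 2.0, 0.0, 0.0],
              [0.0, 1.0, 4.0, 2.0]])
D, W = N.shape
K = 2  # number of latent topics

theta = rng.random((D, K)); theta /= theta.sum(axis=1, keepdims=True)  # P(z|d)
phi = rng.random((K, W)); phi /= phi.sum(axis=1, keepdims=True)        # P(w|z)

for _ in range(50):
    # E-step: posterior P(z|d,w) proportional to P(z|d) * P(w|z)
    joint = theta[:, None, :] * phi.T[None, :, :]        # shape (D, W, K)
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate from expected counts n(d,w) * P(z|d,w)
    expected = N[:, :, None] * post
    phi = expected.sum(axis=0).T                         # (K, W)
    phi /= phi.sum(axis=1, keepdims=True)
    theta = expected.sum(axis=1)                         # (D, K)
    theta /= theta.sum(axis=1, keepdims=True)

print(np.round(theta, 2))   # topic mixture per document
```

With these counts, EM tends to separate the two documents onto different topics, illustrating how PLSA learns P(z|d) per document rather than from a generative prior.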

LDA

https://cs.stanford.edu/~ppasupat/a9online/1140.html
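LDA's generative story, as in the notes linked above: for each document, draw a topic mixture θ ~ Dirichlet(α); for each word, draw a topic z ~ θ and then a word w ~ φ_z. A toy simulation of that process, with a hypothetical vocabulary and made-up topic-word distributions:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["gene", "dna", "ball", "team"]      # hypothetical vocabulary
# hypothetical per-topic word distributions phi_z = P(w | z)
phi = np.array([[0.5, 0.5, 0.0, 0.0],       # topic 0: "genetics"
                [0.0, 0.0, 0.5, 0.5]])      # topic 1: "sports"
alpha = np.array([0.5, 0.5])                # Dirichlet prior over topics

def generate_document(n_words):
    theta = rng.dirichlet(alpha)            # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)                    # draw a topic
        words.append(vocab[rng.choice(len(vocab), p=phi[z])])  # draw a word
    return words

print(generate_document(8))
```

Inference (as in the gensim code below) runs this story in reverse: given only the words, recover likely θ and φ.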

Code

from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamodel import LdaModel

document = "This is some document..."

# load id->word mapping (the dictionary)
id2word = Dictionary.load_from_text('wiki_en_wordids.txt')

# load corpus iterator
mm = MmCorpus('wiki_en_tfidf.mm')

# extract 100 LDA topics, updating once every 10,000 documents
lda = LdaModel(corpus=mm, id2word=id2word, num_topics=100,
               update_every=1, chunksize=10000, passes=1)

# use LDA model: transform new doc to bag-of-words, then apply lda
doc_bow = id2word.doc2bow(document.split())
doc_lda = lda[doc_bow]

# doc_lda is a sparse list of (topic id, weight) pairs giving the
# weighted presence of each topic in the doc
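Since `doc_lda` is a sparse list of (topic id, weight) pairs, downstream code often just picks the dominant topic. A sketch using hypothetical output pairs (the actual ids and weights depend on the trained model):

```python
# hypothetical (topic id, weight) pairs like those returned by lda[doc_bow]
doc_lda = [(3, 0.61), (17, 0.25), (42, 0.14)]

# dominant topic = pair with the largest weight
top_topic, weight = max(doc_lda, key=lambda tw: tw[1])
print(top_topic, weight)  # prints: 3 0.61
```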

LDA in Deep Learning: lda2vec

https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec
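The central move in lda2vec, per the Stitch Fix post linked above, is to form a context vector as the sum of a word vector and a document vector, where the document vector is itself a (sparse) mixture over learned topic vectors. A toy NumPy sketch of that composition, with made-up dimensions and random stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
n_topics, dim = 3, 5
topic_matrix = rng.standard_normal((n_topics, dim))   # hypothetical topic vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

doc_weights = softmax(rng.standard_normal(n_topics))  # document's topic mixture
doc_vector = doc_weights @ topic_matrix               # mixture of topic vectors
word_vector = rng.standard_normal(dim)                # hypothetical word vector
context_vector = word_vector + doc_vector             # lda2vec context vector
print(context_vector.shape)
```

In the real model, the word vectors, topic vectors, and document weights are all trained jointly (with a sparsity-inducing Dirichlet-like loss on the weights); this sketch only shows how the pieces combine.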

Conclusion

LSA, PLSA, and LDA all represent documents as mixtures of latent topics, trading off the simplicity of a linear-algebraic decomposition (SVD) against a full generative probabilistic model, while lda2vec extends LDA by learning it jointly with word embeddings.
