Introduction
Topic models, in a nutshell, are a type of statistical language model used for uncovering hidden structure in a collection of texts. In practical, more intuitive terms, you can think of topic modeling as a task of:
Dimensionality Reduction, where rather than representing a text T in its feature space as {Word_i: count(Word_i, T) for Word_i in Vocabulary}, you can represent it in a topic space as {Topic_i: Weight(Topic_i, T) for Topic_i in Topics} (see the short sketch after this list)
Unsupervised Learning, which can be compared to clustering: the number of topics, like the number of clusters, is a hyperparameter you specify up front. By doing topic modeling, we build clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each having a specific weight
Tagging, where we identify the abstract “topics” that occur in a collection of documents and best represent the information in them.
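To make the dimensionality-reduction view concrete, here is a minimal sketch using scikit-learn (assuming a recent scikit-learn and Python 3; the toy documents and the choice of 2 topics are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy documents and a choice of 2 topics, for illustration only
docs = ["the cat sat on the mat",
        "dogs and cats make friendly pets",
        "stock markets fell sharply today",
        "investors worry about the markets"]

# Word space: each document is a vector of word counts over the vocabulary
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Topic space: the same documents as mixtures of 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)

print(X.shape)          # (4, vocabulary_size)
print(doc_topic.shape)  # (4, 2): a much smaller, topic-based representation
```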
There are several existing algorithms you can use to perform topic modeling. The most common are Latent Semantic Analysis (LSA/LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).
In this article, we’ll take a closer look at LDA and implement our first topic model using the sklearn implementation in Python 2.7.
“ The latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics.” — Wikipedia
Latent Dirichlet Allocation(LDA) is one of the most common algorithms in topic modelling. LDA was proposed by J. K. Pritchard, M. Stephens and P. Donnelly in 2000 and rediscovered by David M. Blei, Andrew Y. Ng and Michael I. Jordan in 2003.
Each topic is simply a distribution over words. Each document contains a mixture of topics and words are drawn from these topics.
Each document can be described by a distribution of topics, and each topic can be described by a distribution of words.
Before going into the LDA method, let me remind you that not reinventing the wheel and going for the quick solution is usually the best start. Several providers have great APIs for topic extraction (free up to a certain number of calls): Google, Microsoft, MeaningCloud… I tried all three and they all work very well.
However, if your data is highly specific, and no generic topic can represent it, then you will have to go for a more personalized approach. This article focuses on one of these approaches: LDA.
Contents
What is topic modeling?
Latent Dirichlet Allocation (LDA)
Model development, evaluation and deployment
The full code (Python)
Stepping through the code
Model evaluation
Deploying the model on new transcripts
Conclusion
Theoretical Overview
LDA is a form of unsupervised learning that views documents as bags of words (i.e. order does not matter). LDA works by first making a key assumption: the way a document was generated was by picking a set of topics and then, for each topic, picking a set of words. Now you may be asking, “OK, so how does it find topics?” Well, the answer is simple: it reverse-engineers this process. To do this, it does the following for each document m:
Assume there are k topics across all of the documents
Distribute these k topics across document m (this distribution is known as α and can be symmetric or asymmetric, more on this later) by assigning each word a topic.
For each word w in document m, assume its topic is wrong but every other word is assigned the correct topic.
Probabilistically assign word w a topic based on two things:
- what topics are in document m
- how many times word w has been assigned a particular topic across all of the documents (this distribution is called β, more on this later)
Repeat this process a number of times for each document and you’re done!
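To make the resampling intuition concrete, here is a rough, non-optimized sketch of that loop in plain Python. It is only an illustration of the idea (real collapsed Gibbs sampling implementations are more careful); the toy corpus, the priors, and the number of iterations are all made up for the example:

```python
import random
from collections import defaultdict

K = 2                      # assume there are K topics across all documents
alpha, beta = 0.1, 0.01    # illustrative Dirichlet priors
docs = [["apple", "iphone", "ipad"], ["stocks", "markets", "iphone"]]
vocab = {w for d in docs for w in d}

# Start by randomly assigning each word a topic
assignments = [[random.randrange(K) for _ in d] for d in docs]

# Count tables: topics per document, words per topic
doc_topic = [defaultdict(int) for _ in docs]
topic_word = [defaultdict(int) for _ in range(K)]
topic_total = [0] * K
for m, d in enumerate(docs):
    for i, w in enumerate(d):
        z = assignments[m][i]
        doc_topic[m][z] += 1
        topic_word[z][w] += 1
        topic_total[z] += 1

for _ in range(100):       # "repeat this process a number of times"
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            # Assume this word's topic is wrong: remove it from the counts
            z = assignments[m][i]
            doc_topic[m][z] -= 1
            topic_word[z][w] -= 1
            topic_total[z] -= 1
            # Re-assign based on (1) which topics are in document m and
            # (2) how often word w has been assigned to each topic overall
            weights = [
                (doc_topic[m][k] + alpha) *
                (topic_word[k][w] + beta) / (topic_total[k] + beta * len(vocab))
                for k in range(K)
            ]
            z = random.choices(range(K), weights=weights)[0]
            assignments[m][i] = z
            doc_topic[m][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

print(assignments)         # final topic assignment for every word
```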
α is a matrix where each row is a document and each column represents a topic. A value in row i and column j represents how likely it is that document i contains topic j. A symmetric distribution would mean that every topic is evenly distributed throughout the document, while an asymmetric distribution favors certain topics over others. This affects the starting point of the model, and can be used when you have a rough idea of how the topics are distributed in order to improve results.
β is a matrix where each row represents a topic and each column represents a word. A value in row i and column j represents how likely it is that topic i contains word j. Usually each word is distributed evenly throughout the topic, such that no topic is biased towards certain words. This can be exploited, though, in order to bias certain topics to favor certain words. For example, if you know you have a topic about Apple products, it can be helpful to bias words like “iphone” and “ipad” for one of the topics in order to push the model towards finding that particular topic.
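If you want to experiment with this kind of seeding, gensim’s LdaModel (a different library from the sklearn implementation used elsewhere in this article) accepts eta as a (num_topics × num_words) matrix. A hedged sketch, with a toy corpus and made-up prior values, giving topic 0 a head start on Apple-related words:

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["iphone", "ipad", "battery"], ["stocks", "markets", "shares"]]  # toy corpus
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

num_topics = 2
eta = np.full((num_topics, len(dictionary)), 0.01)   # symmetric base prior
for word in ["iphone", "ipad"]:
    eta[0, dictionary.token2id[word]] = 1.0          # bias topic 0 towards these words

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
               eta=eta, passes=10, random_state=0)
print(lda.print_topics())
```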
LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.
Parameters of LDA
The alpha parameter is the Dirichlet prior concentration parameter that represents document-topic density: with a higher alpha, documents are assumed to be made up of more topics, resulting in a flatter topic distribution per document; with a lower alpha, each document is concentrated on just a few topics.
The beta parameter is the analogous prior concentration parameter that represents topic-word density: with a higher beta, topics are assumed to be made up of most of the words in the vocabulary, resulting in a flatter word distribution per topic; with a lower beta, each topic is dominated by a small set of words.
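In scikit-learn’s implementation these two priors are exposed as doc_topic_prior (alpha) and topic_word_prior (beta). The values below are illustrative starting points only, not recommendations:

```python
from sklearn.decomposition import LatentDirichletAllocation

# Low priors: few topics per document, few dominant words per topic
sparse_lda = LatentDirichletAllocation(n_components=10, doc_topic_prior=0.1,
                                       topic_word_prior=0.01, random_state=0)

# High priors: documents spread over many topics, topics spread over many words
smooth_lda = LatentDirichletAllocation(n_components=10, doc_topic_prior=1.0,
                                       topic_word_prior=1.0, random_state=0)
```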
LDA Implementation
The complete code is available as a Jupyter Notebook on GitHub
Loading data
Data cleaning
Exploratory analysis
Preparing data for LDA analysis
LDA model training
Analyzing LDA model results
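The notebook on GitHub walks through those steps in full; as a rough sketch of the core of them with scikit-learn (the toy documents below stand in for whatever cleaned data you load, and the parameter values are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the loaded and cleaned documents
docs = ["the economy and the markets recovered this quarter",
        "markets rallied as investors returned",
        "the team won the final game of the season",
        "the players celebrated the championship win"]

# Prepare the data: bag-of-words counts
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Train the LDA model
lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
lda.fit(X)

# Analyze the results: top words per topic
words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:5]]
    print("Topic {}: {}".format(k, ", ".join(top)))
```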
Fine-tuning
Number of topics: try out several numbers of topics to understand which amount makes sense. You actually need to see the topics to know whether your model makes sense or not. As with K-Means, LDA converges and the model makes sense at a mathematical level, but that does not mean it makes sense at a human level.
Cleaning your data: adding stop words that are too frequent in your topics and re-running your model is a common step. Keeping only nouns and verbs, removing templates from texts, testing different cleaning methods iteratively will improve your topics. Be prepared to spend some time here.
Alpha, Eta. If you’re not into technical stuff, forget about these. Otherwise, you can tweak alpha and eta to adjust your topics. Start with ‘auto’, and if the topics are not relevant, try other values. I recommend using low values of Alpha and Eta to have a small number of topics in each document and a small number of relevant words in each topic.
Increase the number of passes to have a better model. 3 or 4 is a good number, but you can go higher.
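As a rough illustration of trying several numbers of topics with more passes, here is a sketch using gensim (the toy corpus, the candidate topic counts, and the priors are all placeholders; with real data you would scan a wider range):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["economy", "markets", "recovered"],
         ["markets", "rallied", "investors"],
         ["team", "won", "final", "game"],
         ["players", "celebrated", "championship"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for num_topics in (2, 3, 4):            # try out several numbers of topics
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   passes=4, alpha="auto", eta="auto", random_state=0)
    print("--- {} topics ---".format(num_topics))
    for _, topic in lda.print_topics(num_words=5):
        print(topic)                     # do the topics make sense to a human?
```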
Assessing results
Are your topics interpretable?
Are your topics unique? (two different topics have different words)
Are your topics exhaustive? (are all your documents well represented by these topics?)
If your model follows these 3 criteria, it looks like a good model :)
Main advantages of LDA
It’s fast
Use the %time command in Jupyter to verify it. The model is usually fast to run. Of course, it depends on your data. Several factors can slow down the model:
Long documents
Large number of documents
Large vocabulary size (especially if you use n-grams with a large n)
It’s intuitive
Modelling topics as weighted lists of words is a simple approximation, yet a very intuitive approach if you need to interpret it. No embeddings, no hidden dimensions, just bags of words with weights.
It can predict topics for new unseen documents
Once the model has run, it is ready to allocate topics to any document. Of course, if your training dataset is in English and you want to predict the topics of a Chinese document, it won’t work. But if the new documents have the same structure and should have more or less the same topics, it will work.
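With scikit-learn, for instance, allocating topics to an unseen document is just a transform with the fitted vectorizer and model (a minimal, self-contained sketch with toy data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = ["markets rallied as investors returned",
              "the team won the final game of the season"]
vectorizer = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vectorizer.fit_transform(train_docs))

# Allocate topics to a document the model has never seen
new_doc = ["investors are nervous about the markets again"]
doc_topics = lda.transform(vectorizer.transform(new_doc))
print(doc_topics)   # one row of topic weights that sums to 1
```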
Main disadvantages of LDA
Lots of fine-tuning
While LDA is fast to run, it takes some effort to get good results from it. That’s why knowing in advance how to fine-tune it will really help you.
It needs human interpretation
Topics are found by a machine. A human needs to label them in order to present the results to non-expert people.
You cannot influence topics
Knowing that some of your documents talk about a topic you know, and not finding it in the topics found by LDA, will definitely be frustrating. And there’s no way to tell the model that some words should belong together. You have to sit and wait for LDA to give you what you want.
Let’s take a quick look at the approaches commonly used for evaluation:
Eye Balling Models
Top N words
Topics / Documents
Intrinsic Evaluation Metrics
Capturing model semantics
Topic interpretability
Human Judgements
What is a topic
Extrinsic Evaluation Metrics/Evaluation at task
Is the model good at performing predefined tasks, such as classification?
Natural language is messy, ambiguous and full of subjective interpretation, and sometimes trying to cleanse ambiguity reduces the language to an unnatural form. In this article, we’ll explore more about topic coherence, an intrinsic evaluation metric, and how you can use it to quantitatively justify the model selection.
What is Topic Coherence?
Before we understand topic coherence, let’s briefly look at the perplexity measure. Perplexity is also an intrinsic evaluation metric, and is widely used for language model evaluation. It captures how surprised a model is by new data it has not seen before, and is measured as the normalized log-likelihood of a held-out test set.
Focussing on the log-likelihood part, you can think of the perplexity metric as measuring how probable some new unseen data is given the model that was learned earlier. That is to say, how well does the model represent or reproduce the statistics of the held-out data.
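scikit-learn’s LDA implementation, for instance, exposes this directly as a perplexity() method on held-out documents (the toy corpus and the split below are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

docs = ["investors and markets rallied today",
        "the economy and the markets recovered",
        "markets fell as investors worried",
        "investors returned to the markets"]
train_docs, test_docs = train_test_split(docs, test_size=0.25, random_state=0)

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)
print(lda.perplexity(X_test))   # lower means less "surprised" by the held-out data
```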
However, recent studies have shown that predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and even sometimes slightly anti-correlated.
Optimizing for perplexity may not yield human interpretable topics
This limitation of perplexity measure served as a motivation for more work trying to model the human judgment, and thus Topic Coherence.
The concept of topic coherence combines a number of measures into a framework to evaluate the coherence between topics inferred by a model. But before that…
What is topic coherence?
Topic coherence measures score a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic. These measurements help distinguish between topics that are semantically interpretable and topics that are artifacts of statistical inference. But …
What is coherence?
A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical effort”.
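In practice, gensim’s CoherenceModel is a common way to compute such a score for a trained model. A minimal sketch on a toy corpus (the c_v measure used here is just one of several coherence measures it supports):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["economy", "markets", "recovered"],
         ["markets", "rallied", "investors"],
         ["team", "won", "final", "game"],
         ["players", "celebrated", "championship"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(coherence)   # higher usually means more human-interpretable topics
```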
Model Implementation
The complete code is available as a Jupyter Notebook on GitHub
Loading data
Data Cleaning
Phrase Modeling: Bi-grams and Tri-grams
Data transformation: Corpus and Dictionary
Base Model Performance
Hyperparameter Tuning
Final Model
Visualize Results
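For the visualization step, pyLDAvis is a common choice. A hedged sketch, assuming a gensim model, corpus, and dictionary like those in the earlier snippets, and pyLDAvis 3.x (where the gensim helper lives in pyLDAvis.gensim_models):

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# lda, corpus, dictionary: the trained gensim model and data from the earlier sketches
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")   # open the HTML file in a browser to explore the topics
```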