Topic Modeling Example

Using Latent Dirichlet Allocation (LDA)


Reference material:

We often have collections of documents, such as blog posts or news articles, that we would like to divide into natural groups so that we can understand them separately. Topic modeling is a method for unsupervised classification of such documents. Like clustering on numeric data, it finds natural groups of items without requiring labels in advance.

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
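
The two mixtures can be written compactly. The following is a sketch in the usual LDA notation; the symbols (theta for the per-document topic proportions, beta for the per-topic word probabilities, K for the number of topics, alpha for the Dirichlet prior) are introduced here for illustration and do not come from the text above.

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k,w}, \qquad
\sum_{k=1}^{K} \theta_{d,k} = 1, \quad \sum_{w} \beta_{k,w} = 1 .
```

Because every document has its own vector of proportions, two documents can share a topic to different degrees instead of falling into exactly one group.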

Our approach will be to fit LDA models with the topicmodels package, then tidy the resulting objects so that they can be manipulated with dplyr and visualized with ggplot2.
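
A minimal sketch of this workflow might look as follows. It rests on several assumptions that are not stated above: the tidytext package is used to supply the tidy() methods for LDA objects, the AssociatedPress document-term matrix that ships with topicmodels serves as example data, and the number of topics (k = 2) and the seed are arbitrary choices for illustration.

```r
library(topicmodels)  # LDA() and the AssociatedPress example data
library(tidytext)     # assumption: provides tidy() for LDA objects
library(dplyr)
library(ggplot2)

# Example document-term matrix bundled with topicmodels
data("AssociatedPress", package = "topicmodels")

# Fit a two-topic model; k and the seed are illustrative choices
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

# Per-topic word probabilities: one row per (topic, term) pair
ap_topics <- tidy(ap_lda, matrix = "beta")

# Per-document topic proportions: one row per (document, topic) pair
ap_documents <- tidy(ap_lda, matrix = "gamma")

# The ten most probable terms in each topic (ties may add rows)
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

# Bar chart of the top terms, one panel per topic
top_terms_plot <- ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

print(top_terms_plot)
```

Here the beta table corresponds to "each topic as a mixture of words" and the gamma table to "each document as a mixture of topics". Once the model is in this tidy form, the usual dplyr verbs and ggplot2 layers apply directly.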


Reinaldo (Rei) Sanchez-Arias
Assistant Professor of Data Science