Pattern and Trend Discovery in Large Document Collections using Topic Models
Ramesh Nallapati, Carnegie Mellon University
Notice: hosted by David Israel
Date: Friday February 01, 2008 at 11:00
Location: EJ228 (SRI E building) (Directions)
I will present my postdoctoral work at CMU on applying topic models to pattern and trend discovery in large document corpora. I will go over four models I developed in this framework:
(i) Multiscale Topic-tomography Model: This new topic model captures how topic content evolves over time. Our novel idea of combining topic models with multiscale Haar wavelet analysis lets the user zoom in and out of the topic-evolution display at various time resolutions. In addition, it overcomes the non-conjugacy between the prior and conditional distributions in Dynamic Topic Models by using a combination of beta-binomial distributions to model topic evolution.
(ii) Link-PLSA-LDA: This model combines the PLSA and LDA models to model text and hyperlinks simultaneously. Its new feature, relative to existing models such as PHITS and the Mixed Membership model, is that it allows topical information to propagate from one document to another via the hyperlinks. Our experiments on blog data indicate that the new model outperforms these models on the task of link prediction.
(iii) Detecting Within-Topic Word Correlation: In this work, we propose an algorithm that discovers both short-range and long-range dependencies between words within the topics discovered by LDA. The algorithm takes the topic assignments produced by LDA as input and uses a Markov Random Field structure learning algorithm to discover topical dependencies between words. The output of the model on the AP corpus reveals some interesting patterns.
(iv) Distributed LDA: In this work, I built a distributed implementation of variational inference for the LDA model. The implementation distributes the E-step workload for subsets of documents among the worker nodes, which in turn send their sufficient statistics to the master node for the execution of the M-step. Our experiments indicate that the implementation achieves more than a 10-fold speed-up over a serial implementation, allowing us to scale the model to large document collections.
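The division of labor described in (iv) can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the function names (`e_step`, `m_step`, `distributed_lda`), the hyperparameters `alpha` and `eta`, the round-robin sharding, and the use of threads as stand-ins for worker nodes are all assumptions made for illustration. Because Python threads share one interpreter, the sketch shows only the data flow (workers return per-shard sufficient statistics, the master sums them and re-estimates the topics), not the reported speed-up.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def digamma(x):
    """Digamma for x > 0 via the recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def e_step(shard, beta, alpha, iters=20):
    """Worker: variational E-step on one shard of documents.

    Each document is a dict {word_id: count}. Returns the shard's
    sufficient statistics: expected topic-word counts (K x V).
    """
    K, V = len(beta), len(beta[0])
    stats = [[0.0] * V for _ in range(K)]
    for doc in shard:
        n_tokens = sum(doc.values())
        gamma = [alpha + n_tokens / K] * K  # per-document topic posterior
        phi = {}
        for _ in range(iters):
            weights = [math.exp(digamma(g)) for g in gamma]
            new_gamma = [alpha] * K
            for w, c in doc.items():
                p = [beta[k][w] * weights[k] for k in range(K)]
                z = sum(p)
                phi[w] = [x / z for x in p]
                for k in range(K):
                    new_gamma[k] += c * phi[w][k]
            gamma = new_gamma
        for w, c in doc.items():
            for k in range(K):
                stats[k][w] += c * phi[w][k]
    return stats


def m_step(all_stats, K, V, eta=0.01):
    """Master: sum the workers' statistics and renormalize each topic."""
    beta = []
    for k in range(K):
        row = [eta + sum(s[k][w] for s in all_stats) for w in range(V)]
        total = sum(row)
        beta.append([x / total for x in row])
    return beta


def distributed_lda(docs, K, V, n_workers=2, em_iters=10, alpha=0.1, seed=0):
    """Run variational EM with the E-step split across n_workers shards."""
    rng = random.Random(seed)
    beta = []
    for _ in range(K):
        row = [rng.random() + 0.1 for _ in range(V)]
        total = sum(row)
        beta.append([x / total for x in row])
    shards = [docs[i::n_workers] for i in range(n_workers)]  # round-robin split
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(em_iters):
            current = beta  # topics broadcast to every worker this round
            all_stats = list(pool.map(lambda s: e_step(s, current, alpha), shards))
            beta = m_step(all_stats, K, V)
    return beta
```

Calling `distributed_lda(docs, K=2, V=4)` on a toy list of word-count dictionaries returns a K x V matrix of topic-word probabilities. Only the K x V statistics travel from each worker to the master, which is why the per-document E-step is the natural part to distribute.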
If time permits, I will also talk about my latest work in progress, which applies new developments in simultaneous sparse approximation to automatically discover representative documents in large document collections.
Please arrive at least 10 minutes early in order to sign in and be escorted to the conference room. SRI is located at 333 Ravenswood Avenue in Menlo Park. Visitors may park in the visitors lot in front of Building E, and should follow the instructions by the lobby phone to be escorted to the meeting room. Detailed directions to SRI, as well as maps, are available from the Visiting AIC web page.