%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{LDAvis details}
\documentclass[12pt]{article}
\usepackage{graphicx,amsmath}
\usepackage{epsfig}
\usepackage{rotating}
\usepackage[normalem]{ulem}
\usepackage{multirow}
\topmargin=-.5in
\oddsidemargin=0in
\textwidth=6.5in
\textheight=9in
\parindent 0.0in
\parskip .20in

\begin{document}

\section{Introduction}

In $\texttt{LDAvis}$, we visualize the fit of an LDA topic model to a corpus of documents. The data and model are described as follows:

$\textbf{Data:}$
\begin{itemize}
\item $D$ documents in the corpus
\item $n_d$ tokens in document $d$, for $d = 1...D$ (denoted \texttt{doc.length} in our package's R code)
\item $N = \sum_d n_d$ total tokens in the corpus
\item $W$ terms in the vocabulary
\item $M_w$ is defined as the frequency of term $w$ across the corpus, where $\sum_w M_w = N$ (denoted \texttt{term.frequency} in our package's R code)
\end{itemize}

$\textbf{Model:}$
\begin{itemize}
\item $K$ topics in the model
\item For document $d = 1...D$, the length-$K$ topic probability vector, $\boldsymbol{\theta}_d$, is drawn from a Dirichlet($\boldsymbol{\alpha}$) prior, where $\alpha_k > 0$ for topics $k = 1...K$.
\item For topic $k = 1...K$, the length-$W$ term probability vector, $\boldsymbol{\phi}_k$, is drawn from a Dirichlet($\boldsymbol{\beta}$) prior, where $\beta_w > 0$ for terms $w = 1...W$.
\item The probability model states that for the $j^{th}$ token from document $d$, a latent topic, $z_{dj}$, is drawn, where $P(z_{dj} = k) = \theta_{dk}$ for document $d = 1...D$, token $j = 1...n_d$, and topic $k = 1...K$.
\item Then, the $j^{th}$ token from the $d^{th}$ document, $Y_{dj}$, is drawn from the vocabulary of terms according to $P(Y_{dj} = w \mid z_{dj}) = \phi_{(z_{dj},w)}$, for document $d = 1...D$, token $j = 1...n_d$, and term $w = 1...W$.
\end{itemize}
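To make this notation concrete, here is a minimal sketch of the generative process above in R; the corpus sizes, hyperparameter values, and the \texttt{rdirichlet()} helper below are illustrative choices, not part of $\texttt{LDAvis}$.

<<generative-sketch, eval=FALSE>>=
# Minimal sketch of the LDA generative process (illustrative values only)
set.seed(1)
D <- 5; K <- 3; W <- 10                  # documents, topics, vocabulary size
alpha <- rep(0.1, K)                     # document-topic Dirichlet prior
beta  <- rep(0.01, W)                    # topic-term Dirichlet prior
doc.length <- rep(50, D)                 # n_d, tokens per document

# Draw Dirichlet vectors via normalized Gamma variates (base-R helper)
rdirichlet <- function(n, a) {
  x <- matrix(rgamma(n * length(a), shape = a), nrow = n, byrow = TRUE)
  x / rowSums(x)
}
theta <- rdirichlet(D, alpha)            # D x K document-topic distributions
phi   <- rdirichlet(K, beta)             # K x W topic-term distributions

# For each token: draw its topic z_dj, then draw the term Y_dj from that topic
docs <- lapply(1:D, function(d) {
  z <- sample(K, doc.length[d], replace = TRUE, prob = theta[d, ])
  sapply(z, function(k) sample(W, 1, prob = phi[k, ]))
})
@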
A number of algorithms can be used to fit an LDA model to a data set. Two of the most common are the collapsed Gibbs sampler (Griffiths and Steyvers, 2004) and variational Bayes (Blei et al., 2003).
%A key feature of $\texttt{LDAvis}$ is that we incorporate the prior distributions (which are a part of every LDA model) into the visualization, so that we can visualize aspects of the posterior distributions of the various parameters in the LDA model.
We don't take as input to $\texttt{LDAvis}$ the sampled topic assignments from the last iteration of a collapsed Gibbs sampler, for instance; rather, we take as inputs general quantities that can be estimated by any algorithm used to fit an LDA model, including the collapsed Gibbs sampler (Griffiths and Steyvers, 2004), variational Bayes (Blei et al., 2003), or others.

Our interactive visualization tool, $\texttt{LDAvis}$, requires five input arguments:
\begin{enumerate}
\item $\boldsymbol\phi$, the $K \times W$ matrix containing the estimated probability mass function over the $W$ terms in the vocabulary for each of the $K$ topics in the model. Note that $\phi_{kw} > 0$ for all $k \in 1...K$ and all $w \in 1...W$, because of the priors (although our software allows values of zero due to rounding). Each of the $K$ rows of $\boldsymbol\phi$ must sum to one.
\item $\boldsymbol\theta$, the $D \times K$ matrix containing the estimated probability mass function over the $K$ topics in the model for each of the $D$ documents in the corpus. Note that $\theta_{dk} > 0$ for all $d \in 1...D$ and all $k \in 1...K$, because of the priors (although, as above, our software accepts zeros due to rounding). Each of the $D$ rows of $\boldsymbol\theta$ must sum to one.
%\item $\boldsymbol\alpha$, the length-$K$ vector of Dirichlet hyperparameters that specify the prior distribution over topics for each document (we assume that the same vector, $\boldsymbol\alpha$, is used for each of the $D$ documents, but note that this prior doesn't have to be symmetric). We require $\alpha_k > 0$ for each topic $k \in 1...K$.
%\item $\boldsymbol\beta$, the length-$W$ vector of Dirichlet hyperparameters that specify the prior distribution over the $W$ terms in the vocabulary for each of the $K$ topics. We require $\beta_w > 0$ for each term $w \in 1...W$.
\item $n_d$, the number of tokens observed in document $d$, where $n_d$ is required to be an integer greater than zero, for documents $d = 1...D$. Denoted \texttt{doc.length} in our code.
\item \texttt{vocab}, the length-$W$ character vector containing the terms in the vocabulary (listed in the same order as the columns of $\boldsymbol\phi$).
\item $M_w$, the frequency of term $w$ across the entire corpus, where $M_w$ is required to be an integer greater than zero for each term $w = 1...W$. Denoted \texttt{term.frequency} in our code.
\end{enumerate}

In general, the prior parameters $\boldsymbol\alpha$ and $\boldsymbol\beta$ are specified by the modeler (although in some cases they are estimated from the data), $n_d$ and $M_w$ are computed from the data, and the algorithm used to fit the model produces point estimates of $\boldsymbol\phi$ and $\boldsymbol\theta$. When using the collapsed Gibbs sampler, we recommend using equations 6 and 7 from Griffiths and Steyvers (2004) to estimate $\boldsymbol\phi$ and $\boldsymbol\theta$. These are the ``smoothed'' estimates of the parameters, which incorporate the priors, rather than, for example, the matrices containing the counts of topic assignments to each document and term, which are a common output of Gibbs sampler implementations and don't necessarily incorporate the priors.

Two popular packages for fitting an LDA model to data are the R package \texttt{lda} (Chang, 2012) and the Java-based standalone software package \texttt{MALLET} (McCallum, 2002). Our package contains an example of using the \texttt{lda} package to fit a topic model to a corpus of movie reviews, available in the \texttt{inst/examples/reviews} directory of \texttt{LDAvis}.
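As a rough sketch of that workflow (the full, runnable version ships with the package), suppose that \texttt{documents}, \texttt{vocab}, and \texttt{term.table} already hold the corpus in the format expected by the \texttt{lda} package; the number of topics, hyperparameter values, and iteration count below are illustrative assumptions.

<<lda-workflow, eval=FALSE>>=
library(lda)
library(LDAvis)

K <- 20; alpha <- 0.02; eta <- 0.02      # illustrative settings
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab,
                                   num.iterations = 500, alpha = alpha, eta = eta)

# "Smoothed" estimates of theta and phi (equations 6 and 7 of Griffiths and
# Steyvers, 2004): add the prior mass to the counts, then normalize
theta <- t(apply(fit$document_sums + alpha, 2, function(x) x / sum(x)))  # D x K
phi   <- t(apply(t(fit$topics) + eta, 2, function(x) x / sum(x)))        # K x W

doc.length     <- sapply(documents, function(x) sum(x[2, ]))  # n_d
term.frequency <- as.integer(term.table)  # M_w (term.table: assumed corpus-wide counts)

json <- createJSON(phi = phi, theta = theta, doc.length = doc.length,
                   vocab = vocab, term.frequency = term.frequency)
serVis(json)   # view the visualization in a browser
@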
%We denote the total number of observed tokens in the corpus $N = \sum_d n_d$.
Next we introduce the idea of a ``pseudo-token''.
%When fitting the LDA model, the Dirichlet priors on the $D$ document-topic distributions and the $K$ topic-term distributions have the effect of adding ``pseudo-tokens'' to the data.
The Dirichlet distribution is the conjugate prior for the multinomial likelihood, and, as with the Beta-Binomial conjugate pair, the vectors specifying the Dirichlet priors can be thought of as adding additional (possibly partial) data points to the observed data to arrive at the posterior distribution. For example, suppose the first document in the corpus contains $n_1 = 50$ tokens, and we use the document-topic prior $\alpha_k = 0.01$ for each topic $k \in 1...K$, where we choose to fit a $K = 20$-topic model. This is equivalent to adding $K \times \bar{\alpha} = \sum_k \alpha_k = 20 \times 0.01 = 0.2$ ``pseudo-tokens'' to this document, spread out evenly across the $K$ topics. Compared to the 50 observed tokens, adding 0.2 ``pseudo-tokens'' will not influence the posterior heavily, meaning this is a ``weak'', ``flat'', or relatively ``non-informative'' prior.

If the prior $\boldsymbol\alpha$ is not symmetric, then the numbers of ``pseudo-tokens'' added to each topic, for each document, are not equal (which might be desired by the modeler, if he/she has some prior notion that some topics will be more prevalent than others -- see Wallach et al., ``Why Priors Matter''). Likewise, we also add $W \times \bar{\beta}$ ``pseudo-tokens'' to each topic, spread across the $W$ terms in the vocabulary according to the individual $\beta_w$'s.

These priors have the effect of ``smoothing'' the estimates of $\boldsymbol\phi$ and $\boldsymbol\theta$, which guarantees that each topic has a non-zero probability under each document, and each term has a non-zero probability under each topic. Our visualization, $\texttt{LDAvis}$, reflects this fact (modulo rounding error). We denote the total number of ``smoothed'' tokens in the data as the sum of the observed tokens and the ``pseudo-tokens'' added by the priors:
$$
N_\text{smoothed} = N + DK\bar{\alpha} + KW\bar{\beta}.
$$
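As a small illustration of this bookkeeping (the corpus sizes below are hypothetical):

<<pseudo-tokens, eval=FALSE>>=
# Hypothetical sizes, for illustration only
D <- 2000; K <- 20; W <- 14000; N <- 500000
alpha <- rep(0.01, K)                   # symmetric document-topic prior
beta  <- rep(0.01, W)                   # symmetric topic-term prior

sum(alpha)    # pseudo-tokens added to each document:  K * mean(alpha) = 0.2
sum(beta)     # pseudo-tokens added to each topic:     W * mean(beta)  = 140
N.smoothed <- N + D * sum(alpha) + K * sum(beta)   # N + DK*alpha.bar + KW*beta.bar
@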
\section{Definitions of visual elements in $\texttt{LDAvis}$}

Here we define the dimensions of the visual elements in $\texttt{LDAvis}$. There are essentially four sets of visual elements that can be displayed, depending on the state of the visualization. They are:
%-- two sets of circles that get displayed, $K$-at-a-time, in the left panel of the visualization, and two sets of horizontal bars that get displayed in a barchart, $r$-at-a-time ($r$ to be defined below), in the right panel of the visualization.
%At any one time, only $K$ circles are displayed on the left, but it's possible that either $r$ or $2r$ horizontal bars will be displayed at one time.
\begin{enumerate}
\item \textbf{Default Topic Circles:} $K$ circles, one to represent each topic, whose areas are set to be proportional to the overall prevalence of each topic across the $N$ total tokens in the corpus. The default topic circles are displayed when no term is highlighted.
\item \textbf{Red Bars:} $K \times W$ red horizontal bars, each of which represents the estimated number of times a given term was generated by a given topic. When a topic is selected, we show the red bars for the $R$ most \emph{relevant} terms for the selected topic, where $R = 30$ by default (see Sievert and Shirley (2014) for the definition of \emph{relevance}).
\item \textbf{Blue Bars:} $W$ blue horizontal bars, one to represent the overall frequency of each term in the corpus. When no topic is selected, we display the blue bars for the $R$ most salient terms in the corpus, and when a topic is selected, we display the blue bars for the $R$ most relevant terms. See Chuang et al. (2012) for the definition of the \emph{saliency} of a term in a topic model.
\item \textbf{Topic-Term Circles:} $K \times W$ circles whose areas are set to be proportional to the frequencies with which a given term is estimated to have been generated by the topics. When a given term, $w$, is highlighted, the $K$ default circles transition (i.e.\ their areas change) to the $K$ topic-term circles for term $w$.
\end{enumerate}

Let's define the dimensions of these visual elements:
\begin{enumerate}
\item The area of the \textbf{Default Circle} for topic $k$, $A^\text{default}_k$, is set to be proportional to $N_k/\sum_k N_k$, where $N_k$ is the estimated number of tokens that were generated by topic $k$ across the entire corpus. The formula for $N_k$ is:
$$
N_k = \sum_{d=1}^D \theta_{dk} n_d.
$$
It is straightforward to verify that $\sum_k N_k = N$.
\item The width of the \textbf{Red Bar} for topic $k$ and term $w$, denoted $P_{kw}$, is set to $\phi_{kw} \times N_k$ for all topics $k = 1...K$ and terms $w = 1...W$.
\item The width of the \textbf{Blue Bar} for term $w$ is set to $\sum_k P_{kw}$, the estimated total number of occurrences of term $w$ in the corpus (note that prior to version 0.3.2 of LDAvis, this width was set to $M_w$, the user-supplied frequency of term $w$ across the entire corpus).
\item The area of the \textbf{Topic-Term Circle} for term $w$ and topic $k$, denoted $A^\text{topic-term}_{kw}$, is set to be proportional to $P_{kw}/\sum_k P_{kw}$.
\end{enumerate}
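The quantities above could be computed in R along the following lines, given the inputs defined in Section 1; this is an illustration of the definitions, not the package's internal code.

<<element-dimensions, eval=FALSE>>=
# Given phi (K x W), theta (D x K), and doc.length (length D):
N.k <- colSums(theta * doc.length)       # N_k = sum_d theta_dk * n_d; sums to N
P   <- phi * N.k                         # P_kw = phi_kw * N_k  (K x W)

default.circle.area <- N.k / sum(N.k)    # relative areas of the K default topic circles
red.bar.width       <- P                 # red bar widths, one per topic-term pair
blue.bar.width      <- colSums(P)        # estimated corpus-wide frequency of each term
topic.term.area     <- t(t(P) / colSums(P))  # column w: relative circle areas when
                                             # term w is highlighted
@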
\section{Discussion}

Here we point out a few things about \texttt{LDAvis}:
\begin{enumerate}
\item Note that all of the visual elements represent frequencies (of various quantities in the training data), rather than conditional probabilities. For example, the area of topic-term circle $A^\text{topic-term}_{kw}$ could have been set to be proportional to $\phi_{kw}/\sum_k \phi_{kw}$, but instead we set it to be proportional to $P_{kw}/\sum_k P_{kw}$. So, suppose the term ``foo'' had a probability of 0.003 under, say, topic 20 and topic 45, and negligible probability under all other topics. One might expect that upon highlighting ``foo'', the topic-term circles would all disappear except for two equal-area topic-term circles representing topics 20 and 45. Instead, if, for example, topic 20 occurred twice as frequently as topic 45, then the topic-term circle for topic 20 would be twice as large as that for topic 45 when ``foo'' is highlighted. This reflects the fact that 2/3 of the occurrences of ``foo'' in the training data were estimated to have come from topic 20. In other words, we reflect the underlying (and potentially variable) frequencies of the topics themselves when we compute the areas of the topic-term circles. The same principle holds for the red bars and blue bars -- they visualize frequencies, rather than proportions, so that wider bars signify more frequent terms in the training data. We felt this was an important feature of the data to visualize, rather than building a visualization that simply displayed aspects of $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$, which are normalized and don't reflect the frequencies of the terms and topics in the data.
\item By default, we set the dimensions of the left panel to be 530 $\times$ 530 pixels, and we set the sum of the areas of the $K$ default topic circles to be $530^2/4$, so that these circles cover at most 1/4 of the panel (in practice, because circles can overlap, they cover less than 1/4 of the area of the panel). Likewise, the sum of the areas of the topic-term circles is set to be 1/4 of the area of the left panel of the display. This way the visualization looks OK for a range of numbers of topics, roughly $10 \leq K \leq 100$.
\item The centers of the default topic circles are laid out in two dimensions according to a multidimensional scaling (MDS) algorithm that is run on the inter-topic distance matrix. We use Jensen-Shannon divergence to compute distances between topics, and then we use the \texttt{cmdscale()} function in \texttt{R} to implement classical multidimensional scaling. The range of the first coordinate (along the $x$-axis) is not necessarily equal to that of the second coordinate (along the $y$-axis); thus we force the aspect ratio to be 1 to preserve the MDS distances. In practice (across the examples we've seen), the ranges of the $x$ and $y$ coordinates are within about 10\% of each other.
\end{enumerate}

\section{References}

1. Blei, David M., Ng, Andrew Y., and Jordan, Michael I. (2003). Latent Dirichlet Allocation, \emph{Journal of Machine Learning Research}, Volume 3, pages 993--1022.

2. Griffiths, Thomas L., and Steyvers, Mark (2004). Finding Scientific Topics, \emph{Proceedings of the National Academy of Sciences}, Volume 101, pages 5228--5235.

3. Chang, Jonathan (2012). lda: Collapsed Gibbs sampling methods for topic models. R package version 1.3.2. \texttt{http://CRAN.R-project.org/package=lda}.

4. McCallum, Andrew Kachites (2002). MALLET: A Machine Learning for Language Toolkit. \texttt{http://mallet.cs.umass.edu}.

5. Chuang, Jason, Manning, Christopher D., and Heer, Jeffrey (2012). Termite: Visualization Techniques for Assessing Textual Topic Models, \emph{Advanced Visual Interfaces}.

6. Sievert, Carson, and Shirley, Kenneth E. (2014). LDAvis: A method for visualizing and interpreting topics, \emph{Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces}.

\end{document}