--- title: "Basic usage" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Basic usage} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This vignette describes the most basic usage of the `sentopics` package by estimating an LDA model and analysis it's output. Two other vignettes, describing time series and topic models with sentiment are also available. ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#", fig.width = 7, fig.height = 4, fig.align = "center" ) ``` # Data The package is shipped with a sample of press conferences from the European Central bank. For ease of use, the press conferences have been pre-processed into a `tokens` object from the `quanteda` package. (See [quanteda's introduction](https://quanteda.io/articles/quickstart.html#tokenizing-texts-1) for details on these objects). The press conferences also contains meta-data which can be accessed using `docvars()`. The press conferences were obtained from [ECB's website](https://www.ecb.europa.eu/press/press_conference/monetary-policy-statement/html/index.en.html). The package also provides an helper function to replicate the creation of the dataset: `get_ECB_press_conferences()` ```{r} library("sentopics") data("ECB_press_conferences_tokens") print(ECB_press_conferences_tokens, 3) head(docvars(ECB_press_conferences_tokens)) ``` # Topic modeling ## Introduction `sentopics` implements three types of topic model. The simplest, Latent Dirichlet Allocation (LDA), assumes that textual documents are issued from a generative process involving $K$ topics. A given document $d$ is constituted of a list of words $d = (w_1, \dots, w_N)$, with $N$ being the document's length. Each word $w_i$ originates from a vocabulary consisting of $V$ distinct terms. Then, documents are generated from the following random process: 1. For each topic $k \in K$, a distribution $\phi_k$ over the vocabulary is drawn. This distribution represent the probability of a word appearing given it belong to the topic and is drawn from a Dirichlet distribution with hyperparameter $\beta$. $$\phi \sim Dirichlet(\beta)$$ 2. For each document, a mixture of the $K$ topics, $\theta_d$, assign the probability of a word in document $d$ being generated from topic $k$. This mixture is also drawn from a Dirichlet distribution with hyperparameter $\alpha$. $$\theta \sim Dirichlet(\alpha)$$ 3. For each word position $i$ of document $d$, the following sequence of draws is executed: 1. A latent topic assignment $z_i$ is drawn from the document mixture. $z_i \sim Multinomial(\theta)$ 2. A word $w_i$ is drawn from the topic's vocabulary distribution. $w_i \sim Multinomial(\phi_{z_i})$ In `sentopics` the LDA model is estimated through Gibbs sampling, that iteratively sample the topic assignment $z_i$ of every word of the corpus until reaching a convergence. The topic assignments are sampled from the following distribution: $$ p(z_i = k|w,z^{-i}) \propto \frac{n_{k,v,.}^{-i} + \beta}{n_{k,.,.}^{-i} + V\beta} \frac{n_{k,.,d}^{-i} + \alpha}{n_{.,.,d}^{-i} + K\alpha},$$ where $n_{k,v,d}$ is the count of words at index $v$ of the vocabulary, assigned to topic $k$ and part of document $d$. The replacement of one of the indices $\{k,v,d\}$ by a dot indicates instead the count for all topics, all vocabulary indices or all documents. The superscript $-i$ indicates that the current word position $i$ is left out from the count variables. 
## Estimating LDA models with `sentopics`

The estimation of an LDA model is easily replicated using the `LDA()` and `fit()` functions. The first function prepares the `R` object and initializes the assignment of the latent topics. The second function estimates the model using Gibbs sampling for a given number of iterations. Note that `fit()` may be used to iterate the model multiple times without resetting the estimation.

```{r}
set.seed(123)
lda <- LDA(ECB_press_conferences_tokens)
lda
lda <- fit(lda, iterations = 100)
lda
```

Internally, the `lda` object is stored as a list and contains the model's parameters and outputs.

```{r}
str(lda, max.level = 1, give.attr = FALSE)
```

- `tokens`: the initial tokens object used to create the model.
- `vocabulary`: a data.frame indexing the set of words.
- `K`: the number of topics.
- `alpha`: the hyperparameter of the document-topic mixtures.
- `beta`: the hyperparameter of the topic-word mixtures.
- `it`: the number of iterations of the model.
- `za`: the topic assignments of each word of the corpus.
- `theta`: the estimated document-topic mixtures.
- `phi`: the estimated topic-word mixtures.
- `logLikelihood`: the log-likelihood of the model at each iteration.

Estimated mixtures are easily accessible through the `$` operator, but the package also includes the `topWords()` function to extract the most probable words of each topic. `topWords()` supports three output types: a *long* `data.table`/`data.frame`, a `matrix`, or a `ggplot` object (the latter also accessible through the alias `plot_topWords()`).

```{r}
head(lda$theta)
topWords(lda, output = "matrix")
```

In addition, document-level analysis is facilitated through the `melt()` method, which joins estimated topic proportions to the document metadata present in the `tokens` input. This results in a *long* `data.table`/`data.frame` that can be used for plotting or easily reshaped to a wide format (for example using `data.table::dcast`, as sketched at the end of this vignette).

```{r}
melt(lda, include_docvars = TRUE)
```

To ease the analysis of the results, we can rename the default topic labels using the `sentopics_labels()` function. As a result, all outputs of the model will now display the custom labels.

```{r}
sentopics_labels(lda) <- list(
  topic = c("Inflation", "Fiscal policy", "Governing council", "Financial sector", "Uncertainty")
)
head(lda$theta)
plot_topWords(lda) + ggplot2::theme_grey(base_size = 9)
```

Besides modifying topic labels, it is also possible to merge topics into a broader theme. This is often useful when estimating a large number of topics (e.g., $K > 15$). The `mergeTopics()` function does this and re-labels topics accordingly.

```{r}
merged <- mergeTopics(lda, list(
  `Broad theme` = c(1, 3:5),
  `Fiscal policy` = 2
))
merged
```

Note that merging topics is only useful for presentation purposes. Calling `fit()` again on a model with merged topics will drastically change the results, as the current state of the model does not result from a standard estimation with the merged set of parameters.

Provided that the `plotly` package is installed, one can also directly use `plot()` on the estimated topic model to enjoy a dynamic view of topic proportions and their most probable words (presented as a screenshot hereafter to limit this vignette's size).

```{r, eval=FALSE}
plot(lda)
```

```{r, eval=FALSE, include=FALSE}
suppressWarnings({
  plotly::save_image(plot(lda), file = "plotly1.svg")
})
```

```{r, echo=FALSE}
knitr::include_graphics("plotly1.svg")
```
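As mentioned earlier, the *long* output of `melt()` can be reshaped to a wide document-by-topic table. Below is a minimal sketch using `data.table::dcast()`; the column names `.id`, `topic` and `prob` are assumptions about the melted output (check `names(melt(lda))` in your session), hence the chunk is not evaluated here.

```{r, eval=FALSE}
# Minimal sketch: reshape melt() output to one row per document and
# one column per topic. The column names .id, topic and prob are
# assumptions -- verify them with names(melt(lda)) before running.
library("data.table")
long <- melt(lda)
wide <- dcast(long, .id ~ topic, value.var = "prob")
head(wide)
```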