Type: Package
Title: Latent Dirichlet Allocation Using 'tidyverse' Conventions
Version: 0.0.5
Description: Implements an algorithm for Latent Dirichlet Allocation (LDA), Blei et al. (2003) https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf, using style conventions from the 'tidyverse', Wickham et al. (2019) <doi:10.21105/joss.01686>, and 'tidymodels', Kuhn et al. https://tidymodels.github.io/model-implementation-principles/. Fitting is done via collapsed Gibbs sampling. Also implements several novel features for LDA such as guided models and transfer learning based on ongoing and, as yet, unpublished research.
License: MIT + file LICENSE
URL: https://github.com/TommyJones/tidylda/
BugReports: https://github.com/TommyJones/tidylda/issues
Depends: R (≥ 3.5.0)
Imports: dplyr, generics, gtools, Matrix, methods, mvrsquared (≥ 0.1.0), Rcpp (≥ 1.0.2), rlang, stats, stringr, tibble, tidyr, tidytext
Suggests: ggplot2, knitr, parallel, quanteda, testthat, tm, slam, spelling, covr, rmarkdown
LinkingTo: Rcpp, RcppArmadillo, RcppProgress, RcppThread
Encoding: UTF-8
RoxygenNote: 7.2.3
Language: en-US
LazyData: true
VignetteBuilder: knitr
NeedsCompilation: yes
Packaged: 2024-04-20 21:24:04 UTC; tommy
Author: Tommy Jones [aut, cre], Brendan Knapp [ctb], Barum Park [ctb]
Maintainer: Tommy Jones <jones.thos.w@gmail.com>
Repository: CRAN
Date/Publication: 2024-04-22 18:20:02 UTC

Latent Dirichlet Allocation Using 'tidyverse' Conventions

Description

Implements an algorithm for Latent Dirichlet Allocation (LDA) using style conventions from the 'tidyverse' and specifically 'tidymodels'. Also implements several novel features for LDA such as guided models and transfer learning.


Augment method for tidylda objects

Description

augment appends observation-level model outputs.

Usage

## S3 method for class 'tidylda'
augment(
  x,
  data,
  type = c("class", "prob"),
  document_col = "document",
  term_col = "term",
  ...
)

Arguments

x

an object of class tidylda

data

a tidy tibble containing one row per original document-token pair, such as is returned by tdm_tidiers with column names c("document", "term") at a minimum.

type

one of either "class" or "prob"

document_col

character specifying the name of the column that corresponds to document IDs. Defaults to "document".

term_col

character specifying the name of the column that corresponds to term/token IDs. Defaults to "term".

...

other arguments passed to methods, currently not used

Details

The key statistic for augment is P(topic | document, token) = P(topic | token) * P(token | document). P(topic | token) are the entries of the 'lambda' matrix in the tidylda object passed with x. P(token | document) is taken to be the frequency of each token normalized within each document.
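
As an illustration, this statistic can be reproduced by hand from a fitted model's lambda matrix and document-normalized token frequencies. The following is a minimal sketch, not the package's internal code; the model m, the matrix dtm, and the document and token names are hypothetical.

# sketch: P(topic | document, token) for one document-token pair
# assumes a fitted tidylda model `m` and a dgCMatrix `dtm` (hypothetical names)
doc <- "doc_1"
tok <- "cancer"
p_token_given_doc <- dtm[doc, tok] / sum(dtm[doc, ])    # P(token | document)
p_topic_given_pair <- m$lambda[, tok] * p_token_given_doc
# p_topic_given_pair[k] is P(topic_k | document, token) for this pair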

Value

augment returns a tidy tibble containing one row per document-token pair, with one or more columns appended, depending on the value of type.

If type = 'prob', then one column per topic is appended. Its value is P(topic | document, token).

If type = 'class', then the most-probable topic for each document-token pair is returned. If multiple topics are equally probable, then the topic with the smallest index is returned by default.


Calculate a matrix whose rows represent P(topic_i|tokens)

Description

Use Bayes' rule to get P(topic|token) from the estimated parameters of a probabilistic topic model. The resulting "lambda" matrix can be used for classifying new documents in a frequentist context and supports augment.

Usage

calc_lambda(beta, theta, p_docs = NULL, correct = TRUE)

Arguments

beta

a beta matrix

theta

a theta matrix

p_docs

A numeric vector of length nrow(theta) that is proportional to the number of terms in each document, defaults to NULL.

correct

Logical. Do you want to set NAs or NaNs in the final result to zero? Useful when hitting computational underflow. Defaults to TRUE. Set to FALSE for troubleshooting or diagnostics.

Value

Returns a matrix whose rows correspond to topics and whose columns correspond to tokens. The i,j entry corresponds to P(topic_i|token_j)
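
For intuition, a toy sketch of the Bayes' rule step follows. It is illustrative only: it ignores the p_docs weighting and the correct adjustment, and assumes beta (topics x tokens) and theta (documents x topics) matrices as described above.

# toy sketch: lambda via Bayes' rule (ignores p_docs weighting)
p_topic <- colMeans(theta)                        # rough P(topic), equal document weights
joint   <- sweep(beta, 1, p_topic, "*")           # P(topic_i, token_j)
lambda  <- sweep(joint, 2, colSums(joint), "/")   # columns normalized: P(topic_i | token_j)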


Calculate R-squared for a tidylda Model

Description

Formats inputs and hands off to calc_rsquared

Usage

calc_lda_r2(dtm, theta, beta, threads)

Arguments

dtm

must be of class dgCMatrix

theta

a theta matrix

beta

a beta matrix

threads

number of parallel threads

Value

Numeric scalar between negative infinity and 1


Probabilistic coherence of topics

Description

Calculates the probabilistic coherence of a topic or topics. This approximates semantic coherence or human understandability of a topic.

Usage

calc_prob_coherence(beta, data, m = 5)

Arguments

beta

A numeric matrix or a numeric vector. The vector, or each row of the matrix, represents the numeric relationship between topic(s) and terms. For example, this relationship may be p(word|topic) or p(topic|word).

data

A document term matrix or term co-occurrence matrix. The preferred class is dgCMatrix. However, there is support for any Matrix-class object as well as several other commonly-used classes such as matrix, dfm, DocumentTermMatrix, and simple_triplet_matrix.

m

An integer for the number of words to be used in the calculation. Defaults to 5

Details

For each pair of words {a, b} in the top M words in a topic, probabilistic coherence calculates P(b|a) - P(b), where {a} is more probable than {b} in the topic. For example, suppose the top 4 words in a topic are {a, b, c, d}. Then, we calculate:

1. P(b|a) - P(b), P(c|a) - P(c), P(d|a) - P(d)
2. P(c|b) - P(c), P(d|b) - P(d)
3. P(d|c) - P(d)

All 6 differences are averaged together.
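
For intuition, a hand-rolled sketch of a single pairwise term follows. It assumes a document term matrix dtm containing the hypothetical terms "term_a" and "term_b"; this is not the package's internal implementation.

# sketch: P(b|a) - P(b) for two terms in a dtm (illustrative only)
a_docs <- dtm[, "term_a"] > 0
b_docs <- dtm[, "term_b"] > 0
p_b_given_a <- sum(a_docs & b_docs) / sum(a_docs)   # P(b|a)
p_b <- sum(b_docs) / nrow(dtm)                      # P(b)
p_b_given_a - p_b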

Value

Returns an object of class numeric corresponding to the probabilistic coherence of the input topic(s).

Examples

# Load a pre-formatted dtm and topic model
data(nih_sample_dtm)

# fit a model
set.seed(12345)
model <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 100, burnin = 50
)

calc_prob_coherence(beta = model$beta, data = nih_sample_dtm, m = 5)

Convert various objects to a dgCMatrix for use with internal functions and methods

Description

Presently, tidylda makes heavy usage of the dgCMatrix class. However, a user may have created a DTM (or TCM) in one of several classes. Since data could be in several formats, this function converts them to a dgCMatrix before passing them along.

Usage

convert_dtm(dtm)

Arguments

dtm

the data you want to convert

Value

an object of class dgCMatrix


Make a lexicon for looping over in the gibbs sampler

Description

One run of the Gibbs sampler and other magic to initialize some objects. Works in concert with initialize_topic_counts.

Usage

create_lexicon(Cd_in, Beta_in, dtm_in, alpha, freeze_topics)

Arguments

Cd_in

IntegerMatrix denoting counts of topics in documents

Beta_in

NumericMatrix denoting probability of words in topics

dtm_in

arma::sp_mat document term matrix

alpha

NumericVector prior for topics over documents

freeze_topics

bool if making predictions, set to TRUE

Details

Arguments ending in _in are copied and their copies modified in some way by this function. In the case of Cd_in and Beta_in, the only modification is that they are converted from matrices to nested std::vector for speed, reliability, and thread safety. dtm_in is transposed for speed when looping over columns.

Value

Returns a list with five entries.

Docs is a list of vectors. Each element is a document, and the contents are indices for tokens. Used as an iterator for the Gibbs sampler.

Zd is a list of vectors, similar to Docs. However, its contents are topic assignments of each document/token pair. Used as an iterator for Gibbs sampling.

Cd is a matrix counting the number of times each topic is sampled per document.

Cv is a matrix counting the number of times each topic is sampled per token.

Ck is a vector counting the total number of times each topic is sampled overall.

Cd, Cv, and Ck are derivatives of Zd.


Main C++ Gibbs sampler for Latent Dirichlet Allocation

Description

This is the C++ Gibbs sampler for LDA. "Abandon all hope, ye who enter here."

Usage

fit_lda_c(
  Docs,
  Zd_in,
  Cd_in,
  Cv_in,
  Ck_in,
  alpha_in,
  eta_in,
  iterations,
  burnin,
  optimize_alpha,
  calc_likelihood,
  Beta_in,
  freeze_topics,
  threads = 1L,
  verbose = TRUE
)

Arguments

Docs

List with one element for each document and one entry for each token as formatted by initialize_topic_counts

Zd_in

List with one element for each document and one topic assignment for each token, as formatted by initialize_topic_counts

Cd_in

IntegerMatrix denoting counts of topics in documents

Cv_in

IntegerMatrix denoting counts of tokens in topics

Ck_in

IntegerVector denoting counts of topics across all tokens

alpha_in

NumericVector prior for topics over documents

eta_in

NumericMatrix for prior of tokens over topics

iterations

int number of gibbs iterations to run in total

burnin

int number of burn in iterations

optimize_alpha

bool do you want to optimize alpha each iteration?

calc_likelihood

bool do you want to calculate the log likelihood each iteration?

Beta_in

NumericMatrix denoting probability of tokens in topics

freeze_topics

bool if making predictions, set to TRUE

threads

unsigned integer, how many parallel threads? For now, nothing is actually parallel

verbose

bool do you want to print out a progress bar?

Details

Arguments ending in _in are copied and their copies modified in some way by this function. In the case of eta_in and Beta_in, the only modification is that they are converted from matrices to nested std::vector for speed, reliability, and thread safety. In the case of all others, they may be explicitly modified during training.

Value

Returns a list with the following entries.

Cd is a matrix counting the number of times each topic is sampled per document.

Cv is a matrix counting the number of times each topic is sampled per token.

Cd_mean the same as Cd but values averaged across iterations greater than burnin iterations.

Cv_mean the same as Cv but values averaged across iterations greater than burnin iterations.

Cd_sum the same as Cd but values summed across iterations greater than burnin iterations.

Cv_sum the same as Cv but values summed across iterations greater than burnin iterations.

log_likelihood a matrix with two rows: one indexing iterations and one giving the log likelihood at each iteration.

alpha a vector of the document-topic prior

eta a matrix of the topic-token prior


Format alpha For Input into fit_lda_c

Description

There are a bunch of ways users could format alpha but the C++ Gibbs sampler in fit_lda_c only takes it one way. This function does the appropriate formatting. It also returns errors if the user inputs a malformed alpha.

Usage

format_alpha(alpha, k)

Arguments

alpha

the prior for topics over documents. Can be a numeric scalar or numeric vector.

k

the number of topics.

Value

Returns a list with two elements: alpha and alpha_class. alpha is the post-formatted version of alpha in the form of a k-length numeric vector. alpha_class is a character denoting whether the user-supplied alpha was a "scalar" or "vector".
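
For example, per the description above, a scalar alpha should come back as a k-length vector. A sketch (format_alpha is internal, hence the triple colon; the commented output is what the Value description implies, not verified output):

# sketch: scalar in, k-length vector out
out <- tidylda:::format_alpha(alpha = 0.1, k = 5)
# out$alpha should be c(0.1, 0.1, 0.1, 0.1, 0.1); out$alpha_class should be "scalar"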


Format eta For Input into fit_lda_c

Description

There are a bunch of ways users could format eta but the C++ Gibbs sampler in fit_lda_c only takes it one way. This function does the appropriate formatting. It also returns errors if the user inputs a malformed eta.

Usage

format_eta(eta, k, Nv)

Arguments

eta

the prior for words over topics. Can be a numeric scalar, numeric vector, or numeric matrix.

k

the number of topics.

Nv

the total size of the vocabulary as inherited from ncol(dtm) in tidylda.

Value

Returns a list with two elements: eta and eta_class. eta is the post-formatted version of eta in the form of a k by Nv numeric matrix. eta_class is a character denoting whether the user-supplied eta was a "scalar", "vector", or "matrix".
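
Similarly, a sketch for eta (internal function; the commented output is inferred from the Value description, not verified output):

# sketch: scalar in, k by Nv matrix out
out <- tidylda:::format_eta(eta = 0.05, k = 3, Nv = 4)
# out$eta should be a 3 x 4 matrix filled with 0.05; out$eta_class should be "scalar"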


Generate a sample of LDA posteriors

Description

Helper function called by both posterior.tidylda and predict.tidylda to generate samples from the posterior.

Usage

generate_sample(dir_par, matrix, times)

Arguments

dir_par

matrix of Dirichlet hyperparameters, one column per distribution to be sampled.

matrix

character of "theta" or "beta", indicating which posterior matrix dir_par's columns are from.

times

Integer, number of samples to draw.

Value

Returns a tibble with one row per parameter per sample.
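
Conceptually, each sample is a draw from a Dirichlet distribution parameterized by a column of dir_par. A rough sketch of that mechanic using gtools (an assumption about the internals, not the exact code):

# sketch: one posterior draw for the distribution in column 1 of dir_par
library(gtools)
draw <- rdirichlet(n = 1, alpha = dir_par[, 1])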


Glance method for tidylda objects

Description

glance constructs a single-row summary "glance" of a tidylda topic model.

Usage

## S3 method for class 'tidylda'
glance(x, ...)

Arguments

x

an object of class tidylda

...

other arguments passed to methods, currently not used

Value

glance returns a one-row tibble with the following columns:

num_topics: the number of topics in the model

num_documents: the number of documents used for fitting

num_tokens: the number of tokens covered by the model

iterations: number of total Gibbs iterations run

burnin: number of burn-in Gibbs iterations run

Examples


dtm <- nih_sample_dtm

lda <- tidylda(data = dtm, k = 10, iterations = 100, burnin = 75)

glance(lda)


Initialize topic counts for gibbs sampling

Description

Implementing seeded (or guided) LDA models and transfer learning means that we can't initialize topics with a uniform-random start. This function prepares data and then calls a C++ function, create_lexicon, that runs a single Gibbs iteration to populate topic counts (and other objects) used during the main Gibbs sampling run of fit_lda_c. In the event that you aren't using fancy seeding or transfer learning, this makes a random initialization by sampling from Dirichlet distributions parameterized by priors alpha and eta.

Usage

initialize_topic_counts(
  dtm,
  k,
  alpha,
  eta,
  beta_initial = NULL,
  theta_initial = NULL,
  freeze_topics = FALSE,
  threads = 1,
  ...
)

Arguments

dtm

a document term matrix or term co-occurrence matrix of class dgCMatrix.

k

the number of topics

alpha

the numeric vector prior for topics over documents as formatted by format_alpha

eta

the numeric matrix prior for tokens over topics as formatted by format_eta

beta_initial

if specified, a numeric matrix for the probability of tokens in topics. Must be specified for predictions or updates as called by predict.tidylda or refit.tidylda respectively.

theta_initial

if specified, a numeric matrix for the probability of topics in documents. Must be specified for updates as called by refit.tidylda

freeze_topics

if TRUE does not update counts of tokens in topics. This is TRUE for predictions.

threads

number of parallel threads, currently unused

...

Additional arguments, currently unused

Value

Returns a list with 5 elements: docs, Zd, Cd, Cv, and Ck. All of these are used by fit_lda_c.

docs is a list with one element per document. Each element is a vector of integers of length sum(dtm[j, ]) for the j-th document. The integer entries correspond to the zero-based column indices of the dtm.

Zd is a list of similar format to docs. The difference is that the integer values are zero-based topic indices.

Cd is a matrix of integers denoting how many times each topic has been sampled in each document.

Cv is similar to Cd but it counts how many times each topic has been sampled for each token.

Ck is an integer vector denoting how many times each topic has been sampled overall.

Note

All of Cd, Cv, and Ck should be derivable by summing over Zd in various ways.


Construct a new object of class tidylda

Description

Since all three of tidylda, refit.tidylda, and predict.tidylda call fit_lda_c, we need a way to format the resulting posteriors and other user-facing objects consistently. This function does that.

Usage

new_tidylda(
  lda,
  dtm,
  burnin,
  is_prediction = FALSE,
  alpha = NULL,
  eta = NULL,
  optimize_alpha = NULL,
  calc_r2 = NULL,
  calc_likelihood = NULL,
  call = NULL,
  threads
)

Arguments

lda

list output of fit_lda_c

dtm

a document term matrix or term co-occurrence matrix of class dgCMatrix

burnin

integer number of burnin iterations.

is_prediction

is this for a prediction (as opposed to initial fitting, or update)? Defaults to FALSE

alpha

output of format_alpha

eta

output of format_eta

optimize_alpha

did you optimize alpha when making a call to fit_lda_c? If is_prediction = TRUE, this argument is ignored.

calc_r2

did the user want to calculate R-squared when fitting the model? If is_prediction = TRUE, this argument is ignored.

calc_likelihood

did you calculate the log likelihood when making a call to fit_lda_c? If is_prediction = TRUE, this argument is ignored.

call

the result of calling match.call at the top of tidylda.

threads

number of parallel threads

Value

Returns an S3 object of class tidylda with the following slots:

beta is a numeric matrix whose rows are the posterior estimates of P(token|topic)

theta is a numeric matrix whose rows are the posterior estimates of P(topic|document)

lambda is a numeric matrix whose rows are the posterior estimates of P(topic|token), calculated using Bayes' rule. See calc_lambda.

alpha is the prior for topics over documents. If optimize_alpha is FALSE, alpha is what the user passed when calling tidylda. If optimize_alpha is TRUE, alpha is a numeric vector returned in the alpha slot from a call to fit_lda_c.

eta is the prior for tokens over topics. This is what the user passed when calling tidylda.

summary is the result of a call to summarize_topics

call is the result of match.call called at the top of tidylda

log_likelihood is a tibble whose columns are the iteration and log likelihood at that iteration. This slot is only populated if calc_likelihood = TRUE

r2 is a numeric scalar resulting from a call to calc_rsquared. This slot only populated if calc_r2 = TRUE

Note

In general, the arguments of this function should be what the user passed when calling tidylda.

burnin is used only to determine whether or not burn-in iterations were used when fitting the model. If burnin > -1 then posteriors are calculated using lda$Cd_mean and lda$Cv_mean respectively. Otherwise, posteriors are calculated using lda$Cd and lda$Cv.

The class of call isn't checked. It's just passed through to the object returned by this function. Might be useful if you are using this function for troubleshooting or something.


Abstracts and metadata from NIH research grants awarded in 2014

Description

This dataset holds information on research grants awarded by the National Institutes of Health (NIH) in 2014. The data set was downloaded in approximately January of 2015. It includes both 'projects' and 'abstracts' files.

Usage

data("nih_sample")

Format

For nih_sample, a tibble of 100 randomly-sampled grants' abstracts and metadata. For nih_sample_dtm, a dgCMatrix-class representing the document term matrix of abstracts from 100 randomly-sampled grants.

Source

National Institutes of Health ExPORTER https://reporter.nih.gov/exporter


Draw from the marginal posteriors of a tidylda topic model

Description

Sample from the marginal posteriors of a tidylda topic model. This is useful for quantifying uncertainty around the parameters of beta or theta.

Usage

posterior(x, ...)

## S3 method for class 'tidylda'
posterior(x, matrix, which, times, ...)

Arguments

x

An object of class tidylda.

...

Other arguments, currently not used.

matrix

A character of either 'theta' or 'beta', indicating from which matrix to draw posterior samples.

which

Row index of theta, for document, or beta, for topic, from which to draw samples. which may also be a vector of indices to sample from multiple documents or topics simultaneously.

times

Integer, number of samples to draw.

Value

posterior returns a tibble with one row per parameter per sample. The variable var is a facet for subsetting by document (for theta) or topic (for beta).

References

Heinrich, G. (2005) Parameter estimation for text analysis. Technical report. http://www.arbylon.net/publications/text-est.pdf

Examples


# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)

m <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 200, burnin = 175
)

# sample from the marginal posterior corresponding to topic 1
t1 <- posterior(
  x = m,
  matrix = "beta",
  which = 1,
  times = 100  
)

# sample from the marginal posterior corresponding to documents 5 and 6
d5 <- posterior(
  x = m,
  matrix = "theta",
  which = c(5, 6),
  times = 100
)


Get predictions from a Latent Dirichlet Allocation model

Description

Obtains predictions of topics for new documents from a fitted LDA model

Usage

## S3 method for class 'tidylda'
predict(
  object,
  new_data,
  type = c("prob", "class", "distribution"),
  method = c("gibbs", "dot"),
  iterations = NULL,
  burnin = -1,
  no_common_tokens = c("default", "zero", "uniform"),
  times = 100,
  threads = 1,
  verbose = TRUE,
  ...
)

Arguments

object

a fitted object of class tidylda

new_data

a DTM or TCM of class dgCMatrix or a numeric vector

type

one of "prob", "class", or "distribution". Defaults to "prob".

method

one of either "gibbs" or "dot". If "gibbs", Gibbs sampling is used and iterations must be specified.

iterations

If method = "gibbs", an integer number of iterations for the Gibbs sampler to run. A future version may include automatic stopping criteria.

burnin

If method = "gibbs", an integer number of burnin iterations. If burnin is greater than -1, the entries of the resulting "theta" matrix are an average over all iterations greater than burnin. Behavior is the same as documented in tidylda.

no_common_tokens

behavior when encountering documents that have no tokens in common with the model. Options are "default", "zero", or "uniform". See 'details', below for explanation of behavior.

times

Integer, number of samples to draw if type = "distribution". Ignored if type is "class" or "prob". Defaults to 100.

threads

Number of parallel threads, defaults to 1. Note: currently ignored; only single-threaded prediction is implemented.

verbose

Logical. Do you want to print a progress bar out to the console? Only active if method = "gibbs". Defaults to TRUE.

...

Additional arguments, currently unused

Details

If predict.tidylda encounters documents that have no tokens in common with the model in object it will engage in one of three behaviors based on the setting of no_common_tokens.

default (the default) sets all topics to 0 for offending documents. This enables continued computations downstream in a way that NA would not. However, if no_common_tokens == "default", then predict.tidylda will emit a warning for every such document it encounters.

zero has the same behavior as default but it emits a message instead of a warning.

uniform sets the probability of every topic to 1/k for offending documents. It does not emit a warning or message.

Value

type gives different outputs depending on whether the user selects "prob", "class", or "distribution". If "prob", the default, returns a "theta" matrix with one row per document and one column per topic. If "class", returns a vector with the topic index of the most likely topic in each document. If "distribution", returns a tibble with one row per parameter per sample. The number of samples is set by the times argument.

Examples


# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)

m <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 200, burnin = 175
)

str(m)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  iterations = 200, burnin = 175
)

# predict on held-out documents using the dot product
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))

# predict classes on held out documents
p3 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  type = "class",
  iterations = 100, burnin = 75
)

# predict distribution on held out documents
p4 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  type = "distribution",
  iterations = 100, burnin = 75,
  times = 10
)


Print Method for tidylda

Description

Print a summary for objects of class tidylda

Usage

## S3 method for class 'tidylda'
print(x, digits = max(3L, getOption("digits") - 3L), n = 5, ...)

Arguments

x

an object of class tidylda

digits

minimal number of significant digits

n

Number of rows to show in each displayed tibble.

...

further arguments passed to or from other methods

Value

Silently returns x

Examples


dtm <- nih_sample_dtm

lda <- tidylda(data = dtm, k = 10, iterations = 100)

print(lda)

lda

print(lda, digits = 2)


Get Count Matrices from Beta or Theta (and Priors)

Description

This function is a core component of initialize_topic_counts. See details, below.

Usage

recover_counts_from_probs(prob_matrix, prior_matrix, total_vector)

Arguments

prob_matrix

a numeric beta or theta matrix

prior_matrix

a matrix of same dimension as prob_matrix whose entries represent the relevant prior (alpha or eta)

total_vector

a vector of token counts of length ncol(prob_matrix)

Details

This function uses a probability matrix (theta or beta), its prior (alpha or eta, respectively), and a vector of counts to simulate what the Cd or Cv matrix would be at the end of a Gibbs run that resulted in that probability matrix.

For example, theta is calculated from a matrix of counts, Cd, and a prior, alpha. Specifically, the i,j entry of theta is given by

(Cd[i, j] + alpha[i, j]) / sum(Cd[, j] + alpha[, j])

Similarly, beta comes from

(Cv[i, j] + eta[i, j]) / sum(Cv[, j] + eta[, j])

(The above are written to be general with respect to alpha and eta being matrices. They could also be vectors or scalars.)

So, this function uses the above formulas to try and reconstruct Cd or Cv from theta and alpha or beta and eta, respectively. As of this writing, this method is experimental. In the future, there will be a paper with more technical details cited here.

The priors must be matrices for the purposes of the function. This is to support topic seeding and model updates. The former requires eta to be a matrix. The latter may require eta to be a matrix. Here, alpha is also required to be a matrix for compatibility.

All that said, for now initialize_topic_counts only uses this function to calculate Cd.
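
A minimal sketch of the inversion for Cd follows, solving the theta formula above for Cd. Orientation follows the formulas (columns index documents), and the name n_tokens is illustrative; as noted above, the method itself is experimental.

# sketch: invert theta[i, j] = (Cd[i, j] + alpha[i, j]) / sum(Cd[, j] + alpha[, j])
# n_tokens[j] is the total number of tokens in document j
denom <- n_tokens + colSums(alpha)         # sum(Cd[, j] + alpha[, j]) for each document j
Cd <- sweep(theta, 2, denom, "*") - alpha  # solve for Cd
Cd <- round(pmax(Cd, 0))                   # counts must be non-negative integers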

Value

Returns a matrix corresponding to the number of times each topic sampled for each document (Cd) or for each token (Cv) depending on whether or not prob_matrix/prior_matrix corresponds to theta/alpha or beta/eta respectively.


Objects exported from other packages

Description

These objects are imported from other packages. Follow the links below to see their documentation.

generics

augment, glance, refit, tidy


Update a Latent Dirichlet Allocation topic model

Description

Update an LDA model using collapsed Gibbs sampling.

Usage

## S3 method for class 'tidylda'
refit(
  object,
  new_data,
  iterations = NULL,
  burnin = -1,
  prior_weight = 1,
  additional_k = 0,
  additional_eta_sum = 250,
  optimize_alpha = FALSE,
  calc_likelihood = FALSE,
  calc_r2 = FALSE,
  return_data = FALSE,
  threads = 1,
  verbose = TRUE,
  ...
)

Arguments

object

a fitted object of class tidylda.

new_data

A document term matrix or term co-occurrence matrix of class dgCMatrix.

iterations

Integer number of iterations for the Gibbs sampler to run.

burnin

Integer number of burnin iterations. If burnin is greater than -1, the resulting "beta" and "theta" matrices are an average over all iterations greater than burnin.

prior_weight

Numeric, 0 or greater or NA. The weight of the beta as a prior from the base model. See Details, below.

additional_k

Integer number of topics to add, defaults to 0.

additional_eta_sum

Numeric magnitude of prior for additional topics. Ignored if additional_k is 0. Defaults to 250.

optimize_alpha

Logical. Experimental. Do you want to optimize alpha every iteration? Defaults to FALSE.

calc_likelihood

Logical. Do you want to calculate the log likelihood every iteration? Useful for assessing convergence. Defaults to FALSE.

calc_r2

Logical. Do you want to calculate R-squared after the model is trained? Defaults to FALSE.

return_data

Logical. Do you want new_data returned as part of the model object?

threads

Number of parallel threads, defaults to 1.

verbose

Logical. Do you want to print a progress bar out to the console? Defaults to TRUE.

...

Additional arguments, currently unused

Details

refit allows you to (a) update the probabilities (i.e. weights) of a previously-fit model with new data or additional iterations and (b) optionally use beta of a previously-fit LDA topic model as the eta prior for the new model. This is tuned by setting prior_weight = NA or prior_weight to a positive number, respectively.

prior_weight tunes how strong the base model is represented in the prior. If prior_weight = 1, then the tokens from the base model's training data have the same relative weight as tokens in new_data. In other words, it is like just adding training data. If prior_weight is less than 1, then tokens in new_data are given more weight. If prior_weight is greater than 1, then the tokens from the base model's training data are given more weight.

If prior_weight is NA, then the new eta is equal to eta from the old model, with new tokens folded in. (For handling of new tokens, see below.) Effectively, this just controls how the sampler initializes (described below), but does not give prior weight to the base model.

Instead of initializing token-topic assignments in the manner for new models (see tidylda), the update initializes in 2 steps:

First, topic-document probabilities (i.e. theta) are obtained by a call to predict.tidylda using method = "dot" for the documents in new_data. Next, both beta and theta are passed to an internal function, initialize_topic_counts, which assigns topics to tokens in a manner approximately proportional to the posteriors and executes a single Gibbs iteration.

refit handles the addition of new vocabulary by adding a flat prior over new tokens. Specifically, each entry in the new prior is equal to the 10th percentile of eta from the old model. The resulting model will have the total vocabulary of the old model plus any new vocabulary tokens. In other words, after running refit.tidylda, ncol(beta) >= ncol(new_data), where beta is from the new model and new_data is the additional data.

You can add additional topics by setting the additional_k parameter to an integer greater than zero. New entries to alpha have a flat prior equal to the median value of alpha in the old model. (Note that if alpha itself is a flat prior, i.e. scalar, then the new topics have the same value for their prior.) New entries to eta take their shape from the average of all previous topics in eta, scaled by additional_eta_sum.

Value

Returns an S3 object of class c("tidylda").

Note

Updates are, as of this writing, almost-surely useful, but their behaviors have not been optimized or well-studied. Caveat emptor!

Examples


# load a document term matrix
data(nih_sample_dtm)

d1 <- nih_sample_dtm[1:50, ]

d2 <- nih_sample_dtm[51:100, ]

# fit a model
m <- tidylda(d1,
  k = 10,
  iterations = 200, burnin = 175
)

# update an existing model by adding documents using old model as prior
m2 <- refit(
  object = m,
  new_data = rbind(d1, d2),
  iterations = 200,
  burnin = 175,
  prior_weight = 1
)

# use an old model to initialize new model and not use old model as prior
m3 <- refit(
  object = m,
  new_data = d2, # new documents only
  iterations = 200,
  burnin = 175,
  prior_weight = NA
)

# add topics while updating a model by adding documents
m4 <- refit(
  object = m,
  new_data = rbind(d1, d2),
  additional_k = 3,
  iterations = 200,
  burnin = 175
)


Summarize a topic model consistently across methods/functions

Description

Summarizes topics in a model. Called by tidylda and refit.tidylda and used to augment print.tidylda.

Usage

summarize_topics(theta, beta, dtm)

Arguments

theta

numeric matrix whose rows represent P(topic|document)

beta

numeric matrix whose rows represent P(token|topic)

dtm

a document term matrix or term co-occurrence matrix of class dgCMatrix.

Value

Returns a tibble with the following columns:

topic: the integer row number of beta

prevalence: the frequency of each topic throughout the corpus it was trained on, normalized so that it sums to 100

coherence: the result of a call to calc_prob_coherence using the default 5 most-probable terms in each topic

top_terms: the top 5 most-probable terms in each topic

Note

prevalence should be proportional to P(topic). It is calculated by weighting on document length. So, topics prevalent in longer documents get more weight than topics prevalent in shorter documents. It is calculated by

prevalence <- (rowSums(dtm) * theta) %>% colSums()

prevalence <- (prevalence * 100) %>% round(3)

An alternative calculation (not implemented here) might have been

prevalence <- (colSums(dtm) * t(beta)) %>% colSums()

prevalence <- (prevalence * 100) %>% round(3)


Tidy a matrix from a tidylda topic model

Description

Tidy the result of a tidylda topic model

Usage

## S3 method for class 'tidylda'
tidy(x, matrix, log = FALSE, ...)

## S3 method for class 'matrix'
tidy(x, matrix, log = FALSE, ...)

Arguments

x

an object of class tidylda or an individual beta, theta, or lambda matrix.

matrix

the matrix to tidy; one of 'beta', 'theta', or 'lambda'

log

do you want to have the result on a log scale? Defaults to FALSE

...

other arguments passed to methods, currently not used

Value

Returns a tibble.

If matrix = "beta" then the result is a table of one row per topic and token with the following columns: topic, token, beta

If matrix = "theta" then the result is a table of one row per document and topic with the following columns: document, topic, theta

If matrix = "lambda" then the result is a table of one row per topic and token with the following columns: topic, token, lambda

Note

If log = TRUE then "log_" will be appended to the name of the third column of the resulting table, e.g. "beta" becomes "log_beta".

Examples


dtm <- nih_sample_dtm

lda <- tidylda(data = dtm, k = 10, iterations = 100, burnin = 75)

tidy_beta <- tidy(lda, matrix = "beta")

tidy_theta <- tidy(lda, matrix = "theta")

tidy_lambda <- tidy(lda, matrix = "lambda")


Create a tidy tibble for a dgCMatrix

Description

Create a tidy tibble for a dgCMatrix. Will probably be a PR to tidytext in the future

Usage

tidy_dgcmatrix(x, ...)

Arguments

x

must be of class dgCMatrix

...

Extra arguments, not used

Value

Returns a tidy tibble in triplet form with columns "document", "term", and "count"
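
A minimal usage sketch, using the nih_sample_dtm dataset documented above (tidy_dgcmatrix is internal, hence the triple colon):

data(nih_sample_dtm)
tidylda:::tidy_dgcmatrix(nih_sample_dtm[1:2, ])
# a tibble with columns "document", "term", and "count"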


Utility function to tidy a simple triplet matrix

Description

Utility function to tidy a simple triplet matrix

Usage

tidy_triplet(x, triplets, row_names = NULL, col_names = NULL)

Arguments

x

Object with rownames and colnames

triplets

A data frame or list of i, j, x

row_names

rownames, if not gotten from rownames(x)

col_names

colnames, if not gotten from colnames(x)

Value

returns a triplet matrix in the form of a data frame. The first column indexes rows. The second column indexes columns. The third column contains the i,j values.
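
A minimal sketch, building the triplet representation with Matrix::summary (illustrative only; tidy_triplet is internal, hence the triple colon):

library(Matrix)
m <- Matrix(c(1, 0, 0, 2), nrow = 2, sparse = TRUE,
            dimnames = list(c("d1", "d2"), c("t1", "t2")))
trip <- summary(m)              # data frame with columns i, j, x
tidylda:::tidy_triplet(m, trip) # data frame indexed by the dimnames of m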

Note

This function was ported from tidytext, copyright 2017 David Robinson and Julia Silge. It was moved here for stability reasons, as it is internal to tidytext.


Fit a Latent Dirichlet Allocation topic model

Description

Fit a Latent Dirichlet Allocation topic model using collapsed Gibbs sampling.

Usage

tidylda(
  data,
  k,
  iterations = NULL,
  burnin = -1,
  alpha = 0.1,
  eta = 0.05,
  optimize_alpha = FALSE,
  calc_likelihood = TRUE,
  calc_r2 = FALSE,
  threads = 1,
  return_data = FALSE,
  verbose = TRUE,
  ...
)

Arguments

data

A document term matrix or term co-occurrence matrix. The preferred class is dgCMatrix. However, there is support for any Matrix-class object as well as several other commonly-used classes such as matrix, dfm, DocumentTermMatrix, and simple_triplet_matrix.

k

Integer number of topics.

iterations

Integer number of iterations for the Gibbs sampler to run.

burnin

Integer number of burnin iterations. If burnin is greater than -1, the resulting "beta" and "theta" matrices are an average over all iterations greater than burnin.

alpha

Numeric scalar or vector of length k. This is the prior for topics over documents.

eta

Numeric scalar, numeric vector of length ncol(data), or numeric matrix with k rows and ncol(data) columns. This is the prior for words over topics.

optimize_alpha

Logical. Do you want to optimize alpha every iteration? Defaults to FALSE. See 'details' below for more information.

calc_likelihood

Logical. Do you want to calculate the log likelihood every iteration? Useful for assessing convergence. Defaults to TRUE.

calc_r2

Logical. Do you want to calculate R-squared after the model is trained? Defaults to FALSE. See calc_lda_r2.

threads

Number of parallel threads, defaults to 1. See Details, below.

return_data

Logical. Do you want data returned as part of the model object?

verbose

Logical. Do you want to print a progress bar out to the console? Defaults to TRUE.

...

Additional arguments, currently unused

Details

This function calls a collapsed Gibbs sampler for Latent Dirichlet Allocation written using the excellent Rcpp package. Some implementation notes follow:

Topic-token and topic-document assignments are not initialized based on a uniform-random sampling, as is common. Instead, topic-token probabilities (i.e. beta) are initialized by sampling from a Dirichlet distribution with eta as its parameter. The same is done for topic-document probabilities (i.e. theta) using alpha. Then an internal function is called (initialize_topic_counts) to run a single Gibbs iteration to initialize assignments of tokens to topics and topics to documents.

When you use burn-in iterations (i.e. burnin > -1), the resulting beta and theta matrices are calculated by averaging over every iteration after the specified number of burn-in iterations. If you do not use burn-in iterations, then the matrices are calculated from the last iteration only. Ideally, you'd burn in every iteration before convergence, then average over the chain after it has converged (and thus every observation is independent).

If you set optimize_alpha to TRUE, then each element of alpha is proportional to the number of times each topic has been sampled that iteration, averaged with the value of alpha from the previous iteration. This lets you start with a symmetric alpha and drift into an asymmetric one. However, (a) this probably means that convergence will take longer to happen or may not happen at all, and (b) I make no guarantees that doing this will give you any benefit or that it won't hurt your model. Caveat emptor!

The log likelihood calculation is the same that can be found on page 9 of https://arxiv.org/pdf/1510.08628.pdf. The only difference is that the version in tidylda allows eta to be a vector or matrix. (A vector is used in this function; a matrix is used for model updates in refit.tidylda.) At present, the log likelihood function appears to be ok for assessing convergence, i.e. it has the right shape. However, as of this writing, it returns positive numbers rather than the expected negative numbers. Looking into that, but in the meantime, caveat emptor once again.

Parallelism is not currently implemented. The threads argument is a placeholder for planned enhancements.

Value

Returns an S3 object of class tidylda. See new_tidylda.

Examples

# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)
m <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 200, burnin = 175
)

str(m)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  iterations = 200, burnin = 175
)

# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))

Bridge function for fitting tidylda topic models

Description

Takes in arguments from various tidylda S3 methods and fits the resulting topic model. The arguments to this function are documented in tidylda.

Usage

tidylda_bridge(
  data,
  k,
  iterations,
  burnin,
  alpha,
  eta,
  optimize_alpha,
  calc_likelihood,
  calc_r2,
  threads,
  return_data,
  verbose,
  mc,
  ...
)

Value

Returns a tidylda S3 object as documented in new_tidylda.