Type: | Package |
Title: | Latent Dirichlet Allocation Using 'tidyverse' Conventions |
Version: | 0.0.5 |
Description: | Implements an algorithm for Latent Dirichlet Allocation (LDA), Blei et al. (2003) https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf, using style conventions from the 'tidyverse', Wickham et al. (2019) <doi:10.21105/joss.01686>, and 'tidymodels', Kuhn et al. https://tidymodels.github.io/model-implementation-principles/. Fitting is done via collapsed Gibbs sampling. Also implements several novel features for LDA such as guided models and transfer learning based on ongoing and, as yet, unpublished research. |
License: | MIT + file LICENSE |
URL: | https://github.com/TommyJones/tidylda/ |
BugReports: | https://github.com/TommyJones/tidylda/issues |
Depends: | R (≥ 3.5.0) |
Imports: | dplyr, generics, gtools, Matrix, methods, mvrsquared (≥ 0.1.0), Rcpp (≥ 1.0.2), rlang, stats, stringr, tibble, tidyr, tidytext |
Suggests: | ggplot2, knitr, parallel, quanteda, testthat, tm, slam, spelling, covr, rmarkdown |
LinkingTo: | Rcpp, RcppArmadillo, RcppProgress, RcppThread |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.3 |
Language: | en-US |
LazyData: | true |
VignetteBuilder: | knitr |
NeedsCompilation: | yes |
Packaged: | 2024-04-20 21:24:04 UTC; tommy |
Author: | Tommy Jones |
Maintainer: | Tommy Jones <jones.thos.w@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-04-22 18:20:02 UTC |
Latent Dirichlet Allocation Using 'tidyverse' Conventions
Description
Implements an algorithm for Latent Dirichlet Allocation (LDA) using style conventions from the 'tidyverse' and specifically 'tidymodels'. Also implements several novel features for LDA such as guided models and transfer learning.
Augment method for tidylda
objects
Description
augment
appends observation-level model outputs.
Usage
## S3 method for class 'tidylda'
augment(
x,
data,
type = c("class", "prob"),
document_col = "document",
term_col = "term",
...
)
Arguments
x |
an object of class tidylda. |
data |
a tidy tibble containing one row per original document-token pair, such as is returned by tdm_tidiers with column names c("document", "term") at a minimum. |
type |
one of either "class" or "prob" |
document_col |
character specifying the name of the column that corresponds to document IDs. Defaults to "document". |
term_col |
character specifying the name of the column that corresponds to term/token IDs. Defaults to "term". |
... |
other arguments passed to methods; currently not used. |
Details
The key statistic for augment
is P(topic | document, token) =
P(topic | token) * P(token | document). P(topic | token) are the entries
of the 'lambda' matrix in the tidylda
object passed
with x
. P(token | document) is taken to be the frequency of each
token normalized within each document.
Value
augment
returns a tidy tibble containing one row per document-token
pair, with one or more columns appended, depending on the value of type
.
If type = 'prob'
, then one column per topic is appended. Its value
is P(topic | document, token).
If type = 'class'
, then the most-probable topic for each document-token
pair is returned. If multiple topics are equally probable, then the topic
with the smallest index is returned by default.
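Examples
No examples ship with this method, so the following is a minimal usage sketch. It assumes the nih_sample_dtm data that ships with the package; the tibble of document-token pairs is built by hand from the sparse matrix via summary(), though any tidy table with document and term columns would do.
# load a pre-formatted dtm
library(Matrix)
data(nih_sample_dtm)
# fit a model
set.seed(12345)
lda <- tidylda(
data = nih_sample_dtm[1:20, ], k = 5,
iterations = 100, burnin = 50
)
# build one row per document-token pair from the sparse dtm
trip <- summary(nih_sample_dtm[1:20, ])
tidy_docs <- tibble::tibble(
document = rownames(nih_sample_dtm)[trip$i],
term = colnames(nih_sample_dtm)[trip$j]
)
# append P(topic | document, token) columns
aug <- augment(x = lda, data = tidy_docs, type = "prob")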
Calculate a matrix whose rows represent P(topic_i|tokens)
Description
Use Bayes' rule to get P(topic|token) from the estimated parameters of a
probabilistic topic model. The resulting "lambda" matrix can be used for
classifying new documents in a frequentist context and supports
augment
.
Usage
calc_lambda(beta, theta, p_docs = NULL, correct = TRUE)
Arguments
beta |
a beta matrix |
theta |
a theta matrix |
p_docs |
A numeric vector of length nrow(theta) that is proportional to the number of tokens in each document. |
correct |
Logical. Do you want to set NAs or NaNs in the final result to
zero? Useful when hitting computational underflow. Defaults to TRUE. |
Value
Returns a matrix
whose rows correspond to topics and whose columns
correspond to tokens. The i,j entry corresponds to P(topic_i|token_j).
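For intuition, the Bayes'-rule arithmetic can be sketched in a few lines of R. This is an illustrative sketch only, not the package's internal code; it assumes doc_lengths is a vector of token counts per document, used to weight theta when estimating P(topic).
# illustrative sketch of the Bayes'-rule calculation
lambda_sketch <- function(beta, theta, doc_lengths) {
p_topic <- colSums(theta * doc_lengths) / sum(doc_lengths) # P(topic)
p_token <- as.vector(p_topic %*% beta) # P(token)
# P(topic | token) = P(token | topic) * P(topic) / P(token)
(beta * p_topic) / matrix(p_token, nrow(beta), ncol(beta), byrow = TRUE)
}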
Calculate R-squared for a tidylda Model
Description
Formats inputs and hands off to calc_rsquared
Usage
calc_lda_r2(dtm, theta, beta, threads)
Arguments
dtm |
must be of class dgCMatrix |
theta |
a theta matrix |
beta |
a beta matrix |
threads |
number of parallel threads |
Value
Numeric scalar between negative infinity and 1
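The heavy lifting is done by calc_rsquared in the mvrsquared package. Conceptually, the fitted value for each document is its token count times its row of theta %*% beta. The sketch below illustrates the idea only; it is not the internal implementation, and it densifies the dtm, so it is unsuitable for large corpora.
# conceptual sketch of R-squared for LDA
lda_r2_sketch <- function(dtm, theta, beta) {
x <- as.matrix(dtm)
yhat <- rowSums(x) * (theta %*% beta) # expected token counts
ybar <- matrix(colMeans(x), nrow(x), ncol(x), byrow = TRUE)
1 - sum((x - yhat)^2) / sum((x - ybar)^2) # 1 - SSE / SST
}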
Probabilistic coherence of topics
Description
Calculates the probabilistic coherence of a topic or topics. This approximates semantic coherence or human understandability of a topic.
Usage
calc_prob_coherence(beta, data, m = 5)
Arguments
beta |
A numeric matrix or a numeric vector. The vector, or rows of the matrix represent the numeric relationship between topic(s) and terms. For example, this relationship may be p(word|topic) or p(topic|word). |
data |
A document term matrix or term co-occurrence matrix. The preferred
class is a dgCMatrix. |
m |
An integer for the number of words to be used in the calculation. Defaults to 5 |
Details
For each pair of words {a, b} in the top M words in a topic, probabilistic coherence calculates P(b|a) - P(b), where {a} is more probable than {b} in the topic. For example, suppose the top 4 words in a topic are {a, b, c, d}. Then, we calculate:
1. P(b|a) - P(b), P(c|a) - P(c), and P(d|a) - P(d)
2. P(c|b) - P(c) and P(d|b) - P(d)
3. P(d|c) - P(d)
All 6 differences are averaged together.
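To make the pairwise calculation concrete, here is a from-scratch sketch for a single topic. It mirrors the definition above rather than the package's internal code, and assumes dtm is a document term matrix and top_terms holds the topic's top m tokens in descending probability.
# illustrative sketch of probabilistic coherence for one topic
coherence_sketch <- function(dtm, top_terms) {
x <- as.matrix(dtm[, top_terms]) > 0 # document-level occurrence
p <- colMeans(x) # P(term)
m <- length(top_terms)
diffs <- c()
for (a in 1:(m - 1)) {
for (b in (a + 1):m) {
p_ab <- mean(x[, a] & x[, b]) # P(a and b)
diffs <- c(diffs, p_ab / p[a] - p[b]) # P(b|a) - P(b)
}
}
mean(diffs)
}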
Value
Returns an object of class numeric
corresponding to the
probabilistic coherence of the input topic(s).
Examples
# Load a pre-formatted dtm and topic model
data(nih_sample_dtm)
# fit a model
set.seed(12345)
model <- tidylda(
data = nih_sample_dtm[1:20, ], k = 5,
iterations = 100, burnin = 50
)
calc_prob_coherence(beta = model$beta, data = nih_sample_dtm, m = 5)
Convert various things to a dgCMatrix
to work with various functions
and methods
Description
Presently, tidylda
makes heavy usage of the dgCMatrix
class.
However, a user may have created a DTM (or TCM) in one of several classes.
Since data could be in several formats, this function converts them to a
dgCMatrix
before passing them along.
Usage
convert_dtm(dtm)
Arguments
dtm |
the data you want to convert |
Value
an object of class dgCMatrix
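A hedged illustration of the kind of coercion involved (convert_dtm is internal; this uses standard Matrix coercion rather than the package's own code):
library(Matrix)
m <- matrix(rpois(20, 1), nrow = 4,
dimnames = list(paste0("doc", 1:4), paste0("w", 1:5)))
dtm <- as(m, "CsparseMatrix") # dense matrix -> dgCMatrix
class(dtm)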
Make a lexicon for looping over in the gibbs sampler
Description
One run of the Gibbs sampler and other magic to initialize some objects.
Works in concert with initialize_topic_counts
.
Usage
create_lexicon(Cd_in, Beta_in, dtm_in, alpha, freeze_topics)
Arguments
Cd_in |
IntegerMatrix denoting counts of topics in documents |
Beta_in |
NumericMatrix denoting probability of words in topics |
dtm_in |
arma::sp_mat document term matrix |
alpha |
NumericVector prior for topics over documents |
freeze_topics |
bool. If making predictions, set to TRUE. |
Details
Arguments ending in _in
are copied and their copies modified in
some way by this function. In the case of Cd_in
and Beta_in
,
the only modification is that they are converted from matrices to nested
std::vector
for speed, reliability, and thread safety. dtm_in
is transposed for speed when looping over columns.
Value
Returns a list with five entries.
Docs
is a list of vectors. Each element is a document, and the contents
are indices for tokens. Used as an iterator for the Gibbs sampler.
Zd
is a list of vectors, similar to Docs. However, its contents are topic
assignments of each document/token pair. Used as an iterator for Gibbs
sampling.
Cd
is a matrix counting the number of times each topic is sampled per
document.
Cv
is a matrix counting the number of times each topic is sampled per token.
Ck
is a vector counting the total number of times each topic is sampled overall.
Cd
, Cv
, and Ck
are derivatives of Zd
.
Main C++ Gibbs sampler for Latent Dirichlet Allocation
Description
This is the C++ Gibbs sampler for LDA. "Abandon all hope, ye who enter here."
Usage
fit_lda_c(
Docs,
Zd_in,
Cd_in,
Cv_in,
Ck_in,
alpha_in,
eta_in,
iterations,
burnin,
optimize_alpha,
calc_likelihood,
Beta_in,
freeze_topics,
threads = 1L,
verbose = TRUE
)
Arguments
Docs |
List with one element for each document and one entry for each token
as formatted by create_lexicon. |
Zd_in |
List with one element for each document and one entry for each token
as formatted by create_lexicon. |
Cd_in |
IntegerMatrix denoting counts of topics in documents |
Cv_in |
IntegerMatrix denoting counts of tokens in topics |
Ck_in |
IntegerVector denoting counts of topics across all tokens |
alpha_in |
NumericVector prior for topics over documents |
eta_in |
NumericMatrix for prior of tokens over topics |
iterations |
int number of Gibbs iterations to run in total |
burnin |
int number of burn in iterations |
optimize_alpha |
bool do you want to optimize alpha each iteration? |
calc_likelihood |
bool do you want to calculate the log likelihood each iteration? |
Beta_in |
NumericMatrix denoting probability of tokens in topics |
freeze_topics |
bool. If making predictions, set to TRUE. |
threads |
unsigned integer, how many parallel threads? For now, nothing is actually parallel |
verbose |
bool do you want to print out a progress bar? |
Details
Arguments ending in _in
are copied and their copies modified in
some way by this function. In the case of eta_in
and Beta_in
,
the only modification is that they are converted from matrices to nested
std::vector
for speed, reliability, and thread safety. In the case
of all others, they may be explicitly modified during training.
Value
Returns a list with the following entries.
Cd
is a matrix counting the number of times each topic is sampled per
document.
Cv
is a matrix counting the number of times each topic is sampled per token.
Cd_mean
the same as Cd
but values averaged across iterations
greater than burnin
iterations.
Cv_mean
the same as Cv
but values averaged across iterations
greater than burnin
iterations.
Cd_sum
the same as Cd
but values summed across iterations
greater than burnin
iterations.
Cv_sum
the same as Cv
but values summed across iterations
greater than burnin
iterations.
log_likelihood
a matrix with one row indexing iterations and one
row of the log likelihood for each iteration.
alpha
a vector of the document-topic prior
_eta
a matrix of the topic-token prior
Format alpha
For Input into fit_lda_c
Description
There are a bunch of ways users could format alpha
but the C++ Gibbs
sampler in fit_lda_c
only takes it one way. This function does the
appropriate formatting. It also returns errors if the user inputs a malformed alpha.
Usage
format_alpha(alpha, k)
Arguments
alpha |
the prior for topics over documents. Can be a numeric scalar or numeric vector. |
k |
the number of topics. |
Value
Returns a list with two elements: alpha
and alpha_class
.
alpha
is the post-formatted version of alpha
in the form of a
k
-length numeric vector. alpha_class
is a character
denoting whether or not the user-supplied alpha
was a "scalar" or
"vector".
Format eta
For Input into fit_lda_c
Description
There are a bunch of ways users could format eta
but the C++ Gibbs
sampler in fit_lda_c
only takes it one way. This function does the
appropriate formatting. It also returns errors if the user inputs a malformed eta.
Usage
format_eta(eta, k, Nv)
Arguments
eta |
the prior for words over topics. Can be a numeric scalar, numeric vector, or numeric matrix. |
k |
the number of topics. |
Nv |
the total size of the vocabulary as inherited from ncol(dtm). |
Value
Returns a list with two elements: eta
and eta_class
.
eta
is the post-formatted version of eta
in the form of a
k
by Nv
numeric matrix. eta_class
is a character
denoting whether or not the user-supplied eta
was a "scalar",
"vector", or "matrix".
Generate a sample of LDA posteriors
Description
Helper function called by both posterior.tidylda and predict.tidylda to generate samples from the posterior.
Usage
generate_sample(dir_par, matrix, times)
Arguments
dir_par |
matrix of Dirichlet hyperparameters, one column per parameter of the distribution(s) being sampled. |
matrix |
character of "theta" or "beta", indicating which posterior
matrix |
times |
Integer, number of samples to draw. |
Value
Returns a tibble with one row per parameter per sample.
Glance method for tidylda
objects
Description
glance
constructs a single-row summary "glance" of a tidylda
topic model.
Usage
## S3 method for class 'tidylda'
glance(x, ...)
Arguments
x |
an object of class tidylda. |
... |
other arguments passed to methods; currently not used. |
Value
glance
returns a one-row tibble
with the
following columns:
num_topics
: the number of topics in the model
num_documents
: the number of documents used for fitting
num_tokens
: the number of tokens covered by the model
iterations
: number of total Gibbs iterations run
burnin
: number of burn-in Gibbs iterations run
Examples
dtm <- nih_sample_dtm
lda <- tidylda(data = dtm, k = 10, iterations = 100, burnin = 75)
glance(lda)
Initialize topic counts for gibbs sampling
Description
Implementing seeded (or guided) LDA models and transfer learning means that
we can't initialize topics with a uniform-random start. This function prepares
data and then calls a C++ function, create_lexicon
, that runs a single
Gibbs iteration to populate topic counts (and other objects) used during the
main Gibbs sampling run of fit_lda_c
. In the event that
you aren't using fancy seeding or transfer learning, this makes a random
initialization by sampling from Dirichlet distributions parameterized by
priors alpha
and eta
.
Usage
initialize_topic_counts(
dtm,
k,
alpha,
eta,
beta_initial = NULL,
theta_initial = NULL,
freeze_topics = FALSE,
threads = 1,
...
)
Arguments
dtm |
a document term matrix or term co-occurrence matrix of class dgCMatrix. |
k |
the number of topics |
alpha |
the numeric vector prior for topics over documents as formatted
by format_alpha. |
eta |
the numeric matrix prior for tokens over topics as formatted by format_eta. |
beta_initial |
if specified, a numeric matrix for the probability of tokens
in topics. Must be specified for predictions or updates as called by
predict.tidylda or refit.tidylda. |
theta_initial |
if specified, a numeric matrix for the probability of
topics in documents. Must be specified for updates as called by
refit.tidylda. |
freeze_topics |
if TRUE, topic-token counts are not updated; used for predictions. |
threads |
number of parallel threads, currently unused |
... |
Additional arguments, currently unused |
Value
Returns a list with 5 elements: docs
, Zd
, Cd
, Cv
,
and Ck
. All of these are used by fit_lda_c
.
docs
is a list with one element per document. Each element is a vector
of integers of length sum(dtm[j,])
for the j-th document. The integer
entries correspond to the zero-index column of the dtm
.
Zd
is a list of similar format as docs
. The difference is that
the integer values correspond to the zero-index for topics.
Cd
is a matrix of integers denoting how many times each topic has
been sampled in each document.
Cv
is similar to Cd
but it counts how many times each topic
has been sampled for each token.
Ck
is an integer vector denoting how many times each topic has been
sampled overall.
Note
All of Cd
, Cv
, and Ck
should be derivable by summing
over Zd in various ways.
Construct a new object of class tidylda
Description
Since all three of tidylda
,
refit.tidylda
, and
predict.tidylda
call fit_lda_c
,
we need a way to format the resulting posteriors and other user-facing
objects consistently. This function does that.
Usage
new_tidylda(
lda,
dtm,
burnin,
is_prediction = FALSE,
alpha = NULL,
eta = NULL,
optimize_alpha = NULL,
calc_r2 = NULL,
calc_likelihood = NULL,
call = NULL,
threads
)
Arguments
lda |
list output of fit_lda_c. |
dtm |
a document term matrix or term co-occurrence matrix of class dgCMatrix. |
burnin |
integer number of burnin iterations. |
is_prediction |
is this for a prediction (as opposed to initial fitting,
or update)? Defaults to FALSE. |
alpha |
output of format_alpha. |
eta |
output of format_eta. |
optimize_alpha |
did you optimize alpha while fitting? |
calc_r2 |
did the user want to calculate R-squared when fitting the model? If TRUE, R-squared is calculated and stored in the r2 slot of the returned object. |
calc_likelihood |
did you calculate the log likelihood when making a call
to fit_lda_c? |
call |
the result of calling match.call. |
threads |
number of parallel threads |
Value
Returns an S3 object of class tidylda
with the following slots:
beta
is a numeric matrix whose rows are the posterior estimates
of P(token|topic)
theta
is a numeric matrix whose rows are the posterior estimates of
P(topic|document)
lambda
is a numeric matrix whose rows are the posterior estimates of
P(topic|token), calculated using Bayes' rule.
See calc_lambda
.
alpha
is the prior for topics over documents. If optimize_alpha
is FALSE
, alpha
is what the user passed when calling
tidylda
. If optimize_alpha
is TRUE
,
alpha
is a numeric vector returned in the alpha
slot from a
call to fit_lda_c
.
eta
is the prior for tokens over topics. This is what the user passed
when calling tidylda
.
summary
is the result of a call to summarize_topics
call
is the result of match.call
called at the top
of tidylda
log_likelihood
is a tibble
whose columns are
the iteration and log likelihood at that iteration. This slot is only populated
if calc_likelihood = TRUE
r2
is a numeric scalar resulting from a call to
calc_rsquared
. This slot only populated if
calc_r2 = TRUE
Note
In general, the arguments of this function should be what the user passed
when calling tidylda
.
burnin
is used only to determine whether or not burn-in iterations
were used when fitting the model. If burnin > -1
then posteriors
are calculated using lda$Cd_mean
and lda$Cv_mean
respectively.
Otherwise, posteriors are calculated using lda$Cd and lda$Cv.
The class of call
isn't checked. It's just passed through to the
object returned by this function. Might be useful if you are using this
function for troubleshooting or something.
Abstracts and metadata from NIH research grants awarded in 2014
Description
This dataset holds information on research grants awarded by the National Institutes of Health (NIH) in 2014. The data set was downloaded in approximately January of 2015. It includes both 'projects' and 'abstracts' files.
Usage
data("nih_sample")
Format
For nih_sample
, a tibble
of 100 randomly-sampled
grants' abstracts and metadata. For nih_sample_dtm
, a
dgCMatrix-class
representing the document term matrix
of abstracts from 100 randomly-sampled grants.
Source
National Institutes of Health ExPORTER https://reporter.nih.gov/exporter
Draw from the marginal posteriors of a tidylda topic model
Description
Sample from the marginal posteriors of a tidylda
topic
model. This is useful for quantifying uncertainty around the parameters of
beta
or theta
.
Usage
posterior(x, ...)
## S3 method for class 'tidylda'
posterior(x, matrix, which, times, ...)
Arguments
x |
An object of class tidylda. |
... |
Other arguments, currently not used. |
matrix |
A character of either 'theta' or 'beta', indicating from which matrix to draw posterior samples. |
which |
Row index of the chosen matrix: document indices for "theta" or topic indices for "beta". |
times |
Integer, number of samples to draw. |
Value
posterior
returns a tibble with one row per parameter per sample.
Returns a data frame where each row is a single sample from the posterior.
Each column is the distribution over a single parameter. The variable var
is a facet for subsetting by document (for theta) or topic (for beta).
References
Heinrich, G. (2005) Parameter estimation for text analysis. Technical report. http://www.arbylon.net/publications/text-est.pdf
Examples
# load some data
data(nih_sample_dtm)
# fit a model
set.seed(12345)
m <- tidylda(
data = nih_sample_dtm[1:20, ], k = 5,
iterations = 200, burnin = 175
)
# sample from the marginal posterior corresponding to topic 1
t1 <- posterior(
x = m,
matrix = "beta",
which = 1,
times = 100
)
# sample from the marginal posterior corresponding to documents 5 and 6
d5 <- posterior(
x = m,
matrix = "theta",
which = c(5, 6),
times = 100
)
Get predictions from a Latent Dirichlet Allocation model
Description
Obtains predictions of topics for new documents from a fitted LDA model
Usage
## S3 method for class 'tidylda'
predict(
object,
new_data,
type = c("prob", "class", "distribution"),
method = c("gibbs", "dot"),
iterations = NULL,
burnin = -1,
no_common_tokens = c("default", "zero", "uniform"),
times = 100,
threads = 1,
verbose = TRUE,
...
)
Arguments
object |
a fitted object of class tidylda. |
new_data |
a DTM or TCM of class dgCMatrix. |
type |
one of "prob", "class", or "distribution". Defaults to "prob". |
method |
one of either "gibbs" or "dot". If "gibbs" Gibbs sampling is used
and |
iterations |
If method = "gibbs", an integer number of iterations for the Gibbs sampler to run. Ignored if method = "dot". |
burnin |
If method = "gibbs", an integer number of burn-in iterations. If burnin is greater than -1, the entries of the resulting "theta" matrix are an average over all iterations greater than burnin. Defaults to -1. |
no_common_tokens |
behavior when encountering documents that have no tokens
in common with the model. Options are " |
times |
Integer, number of samples to draw if type = "distribution". Ignored otherwise. Defaults to 100. |
threads |
Number of parallel threads, defaults to 1. Note: currently ignored; only single-threaded prediction is implemented. |
verbose |
Logical. Do you want to print a progress bar out to the console?
Only active if method = "gibbs". |
... |
Additional arguments, currently unused |
Details
If predict.tidylda
encounters documents that have no tokens in common
with the model in object
it will engage in one of three behaviors based
on the setting of no_common_tokens
.
default
(the default) sets all topics to 0 for offending documents. This
enables continued computations downstream in a way that NA
would not.
However, if no_common_tokens == "default"
, then predict.tidylda
will emit a warning for every such document it encounters.
zero
has the same behavior as default
but it emits a message
instead of a warning.
uniform
sets all topics to 1/k for offending documents. It does not emit a warning or message.
Value
type
gives different outputs depending on whether the user selects
"prob", "class", or "distribution". If "prob", the default, returns a
a "theta" matrix with one row per document and one column per topic. If
"class", returns a vector with the topic index of the most likely topic in
each document. If "distribution", returns a tibble with one row per
parameter per sample. Number of samples is set by the times
argument.
Examples
# load some data
data(nih_sample_dtm)
# fit a model
set.seed(12345)
m <- tidylda(
data = nih_sample_dtm[1:20, ], k = 5,
iterations = 200, burnin = 175
)
str(m)
# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
method = "gibbs",
iterations = 200, burnin = 175
)
# predict on held-out documents using the dot product
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")
# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))
# predict classes on held out documents
p3 <- predict(m, nih_sample_dtm[21:100, ],
method = "gibbs",
type = "class",
iterations = 100, burnin = 75
)
# predict distribution on held out documents
p4 <- predict(m, nih_sample_dtm[21:100, ],
method = "gibbs",
type = "distribution",
iterations = 100, burnin = 75,
times = 10
)
Print Method for tidylda
Description
Print a summary for objects of class tidylda
Usage
## S3 method for class 'tidylda'
print(x, digits = max(3L, getOption("digits") - 3L), n = 5, ...)
Arguments
x |
an object of class tidylda. |
digits |
minimal number of significant digits |
n |
Number of rows to show in each displayed tibble. |
... |
further arguments passed to or from other methods |
Value
Silently returns x
Examples
dtm <- nih_sample_dtm
lda <- tidylda(data = dtm, k = 10, iterations = 100)
print(lda)
lda
print(lda, digits = 2)
Get Count Matrices from Beta or Theta (and Priors)
Description
This function is a core component of initialize_topic_counts
.
See details, below.
Usage
recover_counts_from_probs(prob_matrix, prior_matrix, total_vector)
Arguments
prob_matrix |
a numeric probability matrix: theta or beta. |
prior_matrix |
a matrix of the same dimension as prob_matrix, representing the relevant prior (alpha or eta). |
total_vector |
a vector of token counts of length nrow(prob_matrix). |
Details
This function uses a probability matrix (theta or beta), its prior (alpha or eta, respectively), and a vector of counts to simulate what the Cd or Cv matrix would be at the end of a Gibbs run that resulted in that probability matrix.
For example, theta is calculated from a matrix of counts, Cd, and a prior, alpha. Specifically, the i,j entry of theta is given by
(Cd[i, j] + alpha[i, j]) / sum(Cd[, j] + alpha[, j])
Similarly, beta comes from
(Cv[i, j] + eta[i, j]) / sum(Cv[, j] + eta[, j])
(The above are written to be general with respect to alpha and eta being matrices. They could also be vectors or scalars.)
So, this function uses the above formulas to try and reconstruct Cd or Cv from theta and alpha or beta and eta, respectively. As of this writing, this method is experimental. In the future, there will be a paper with more technical details cited here.
The priors must be matrices for the purposes of the function. This is to support topic seeding and model updates. The former requires eta to be a matrix. The latter may require eta to be a matrix. Here, alpha is also required to be a matrix for compatibility.
All that said, for now initialize_topic_counts
only
uses this function to calculate Cd.
Value
Returns a matrix corresponding to the number of times each topic sampled
for each document (Cd
) or for each token (Cv
) depending on
whether or not prob_matrix
/prior_matrix
corresponds to
theta
/alpha
or beta
/eta
respectively.
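To make the inversion concrete, here is a sketch for a single row of theta, assuming alpha is a vector and n is the document's token count. The internal function handles matrix priors and is more careful about rounding; this is illustration only.
recover_row_sketch <- function(theta_row, alpha, n) {
counts <- theta_row * (n + sum(alpha)) - alpha # invert the formula above
pmax(round(counts), 0) # counts are non-negative integers
}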
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
Update a Latent Dirichlet Allocation topic model
Description
Update an LDA model using collapsed Gibbs sampling.
Usage
## S3 method for class 'tidylda'
refit(
object,
new_data,
iterations = NULL,
burnin = -1,
prior_weight = 1,
additional_k = 0,
additional_eta_sum = 250,
optimize_alpha = FALSE,
calc_likelihood = FALSE,
calc_r2 = FALSE,
return_data = FALSE,
threads = 1,
verbose = TRUE,
...
)
Arguments
object |
a fitted object of class tidylda. |
new_data |
A document term matrix or term co-occurrence matrix of class dgCMatrix. |
iterations |
Integer number of iterations for the Gibbs sampler to run. |
burnin |
Integer number of burn-in iterations. If burnin is greater than -1, the entries of the resulting beta and theta matrices are averaged over all iterations greater than burnin. Defaults to -1. |
prior_weight |
Numeric, 0 or greater, or NA. The weight given to the base model's beta in the prior for the new model. See Details, below. Defaults to 1. |
additional_k |
Integer number of topics to add, defaults to 0. |
additional_eta_sum |
Numeric magnitude of prior for additional topics.
Ignored if additional_k is 0. Defaults to 250. |
optimize_alpha |
Logical. Experimental. Do you want to optimize alpha
every iteration? Defaults to FALSE. |
calc_likelihood |
Logical. Do you want to calculate the log likelihood every iteration?
Useful for assessing convergence. Defaults to FALSE. |
calc_r2 |
Logical. Do you want to calculate R-squared after the model is trained?
Defaults to FALSE. |
return_data |
Logical. Do you want new_data returned as part of the model object? Defaults to FALSE. |
threads |
Number of parallel threads, defaults to 1. |
verbose |
Logical. Do you want to print a progress bar out to the console?
Defaults to TRUE. |
... |
Additional arguments, currently unused |
Details
refit
allows you to (a) update the probabilities (i.e. weights) of
a previously-fit model with new data or additional iterations and (b) optionally
use beta
of a previously-fit LDA topic model as the eta
prior
for the new model. This is tuned by setting prior_weight = NA or to a positive number, respectively.
prior_weight
tunes how strong the base model is represented in the prior.
If prior_weight = 1
, then the tokens from the base model's training data
have the same relative weight as tokens in new_data
. In other words,
it is like just adding training data. If prior_weight
is less than 1,
then tokens in new_data
are given more weight. If prior_weight
is greater than 1, then the tokens from the base model's training data are
given more weight.
If prior_weight
is NA
, then the new eta
is equal to
eta
from the old model, with new tokens folded in.
(For handling of new tokens, see below.) Effectively, this just controls
how the sampler initializes (described below), but does not give prior
weight to the base model.
Instead of initializing token-topic assignments in the manner for new
models (see tidylda
), the update initializes in 2
steps:
First, topic-document probabilities (i.e. theta
) are obtained by a
call to predict.tidylda
using method = "dot"
for the documents in new_data
. Next, both beta
and theta
are
passed to an internal function, initialize_topic_counts
,
which assigns topics to tokens in a manner approximately proportional to
the posteriors and executes a single Gibbs iteration.
refit
handles the addition of new vocabulary by adding a flat prior
over new tokens. Specifically, each entry in the new prior is equal to the
10th percentile of eta
from the old model. The resulting model will
have the total vocabulary of the old model plus any new vocabulary tokens.
In other words, after running refit.tidylda
ncol(beta) >= ncol(new_data)
where beta
is from the new model and new_data
is the additional data.
You can add additional topics by setting the additional_k
parameter
to an integer greater than zero. New entries to alpha
have a flat
prior equal to the median value of alpha
in the old model. (Note that
if alpha
itself is a flat prior, i.e. scalar, then the new topics have
the same value for their prior.) New entries to eta
have a shape
from the average of all previous topics in eta
and scaled by
additional_eta_sum
.
Value
Returns an S3 object of class c("tidylda").
Note
Updates are, as of this writing, almost-surely useful but their behaviors have not been optimized or well-studied. Caveat emptor!
Examples
# load a document term matrix
data(nih_sample_dtm)
d1 <- nih_sample_dtm[1:50, ]
d2 <- nih_sample_dtm[51:100, ]
# fit a model
m <- tidylda(d1,
k = 10,
iterations = 200, burnin = 175
)
# update an existing model by adding documents using old model as prior
m2 <- refit(
object = m,
new_data = rbind(d1, d2),
iterations = 200,
burnin = 175,
prior_weight = 1
)
# use an old model to initialize new model and not use old model as prior
m3 <- refit(
object = m,
new_data = d2, # new documents only
iterations = 200,
burnin = 175,
prior_weight = NA
)
# add topics while updating a model by adding documents
m4 <- refit(
object = m,
new_data = rbind(d1, d2),
additional_k = 3,
iterations = 200,
burnin = 175
)
Summarize a topic model consistently across methods/functions
Description
Summarizes topics in a model. Called by tidylda
and refit.tidylda
and used to augment
print.tidylda
.
Usage
summarize_topics(theta, beta, dtm)
Arguments
theta |
numeric matrix whose rows represent P(topic|document) |
beta |
numeric matrix whose rows represent P(token|topic) |
dtm |
a document term matrix or term co-occurrence matrix of class dgCMatrix. |
Value
Returns a tibble
with the following columns:
topic
is the integer row number of beta
.
prevalence
is the frequency of each topic throughout the corpus it
was trained on normalized so that it sums to 100.
coherence
makes a call to calc_prob_coherence
using the default 5 most-probable terms in each topic.
top_terms
displays the top 5 most-probable terms in each topic.
Note
prevalence
should be proportional to P(topic). It is calculated by
weighting on document length. So, topics prevalent in longer documents get
more weight than topics prevalent in shorter documents. It is calculated
by
prevalence <- (rowSums(dtm) * theta) %>% colSums()
prevalence <- (prevalence * 100) %>% round(3)
An alternative calculation (not implemented here) might have been
prevalence <- (colSums(dtm) * t(beta)) %>% colSums()
prevalence <- (prevalence * 100) %>% round(3)
Tidy a matrix from a tidylda
topic model
Description
Tidy the result of a tidylda
topic model
Usage
## S3 method for class 'tidylda'
tidy(x, matrix, log = FALSE, ...)
## S3 method for class 'matrix'
tidy(x, matrix, log = FALSE, ...)
Arguments
x |
an object of class tidylda. |
matrix |
the matrix to tidy; one of "beta", "theta", or "lambda". |
log |
do you want to have the result on a log scale? Defaults to FALSE. |
... |
other arguments passed to methods; currently not used. |
Value
Returns a tibble
.
If matrix = "beta"
then the result is a table of one row per topic
and token with the following columns: topic
, token
, beta
If matrix = "theta"
then the result is a table of one row per document
and topic with the following columns: document
, topic
, theta
If matrix = "lambda"
then the result is a table of one row per topic
and token with the following columns: topic
, token
, lambda
Functions
-
tidy(matrix)
: Tidy an individual matrix. Useful for predictions and called from tidy.tidylda
Note
If log = TRUE
then "log_" will be appended to the name of the third
column of the resulting table. e.g "beta
" becomes "log_beta
".
Examples
dtm <- nih_sample_dtm
lda <- tidylda(data = dtm, k = 10, iterations = 100, burnin = 75)
tidy_beta <- tidy(lda, matrix = "beta")
tidy_theta <- tidy(lda, matrix = "theta")
tidy_lambda <- tidy(lda, matrix = "lambda")
Create a tidy tibble for a dgCMatrix
Description
Create a tidy tibble for a dgCMatrix. Will probably be a PR to tidytext in the future
Usage
tidy_dgcmatrix(x, ...)
Arguments
x |
must be of class dgCMatrix |
... |
Extra arguments, not used |
Value
Returns a triplet matrix with columns "document", "term", and "count"
Utility function to tidy a simple triplet matrix
Description
Utility function to tidy a simple triplet matrix
Usage
tidy_triplet(x, triplets, row_names = NULL, col_names = NULL)
Arguments
x |
Object with rownames and colnames |
triplets |
A data frame or list of i, j, x |
row_names |
rownames, if not gotten from rownames(x) |
col_names |
colnames, if not gotten from colnames(x) |
Value
returns a triplet matrix in the form of a data frame. The first column indexes rows. The second column indexes columns. The third column contains the i,j values.
Note
This function ported from tidytext
, copyright
2017 David Robinson and Julia Silge. Moved the function here for stability
reasons, as it is internal to tidytext
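A hypothetical illustration of the output shape (tidy_triplet is internal, hence the ::: access; summary() on a sparse matrix supplies the i, j, x triplets):
library(Matrix)
x <- Matrix(c(1, 0, 0, 2), 2, 2, sparse = TRUE,
dimnames = list(c("d1", "d2"), c("w1", "w2")))
tidylda:::tidy_triplet(x, summary(x))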
Fit a Latent Dirichlet Allocation topic model
Description
Fit a Latent Dirichlet Allocation topic model using collapsed Gibbs sampling.
Usage
tidylda(
data,
k,
iterations = NULL,
burnin = -1,
alpha = 0.1,
eta = 0.05,
optimize_alpha = FALSE,
calc_likelihood = TRUE,
calc_r2 = FALSE,
threads = 1,
return_data = FALSE,
verbose = TRUE,
...
)
Arguments
data |
A document term matrix or term co-occurrence matrix. The preferred
class is a dgCMatrix. |
k |
Integer number of topics. |
iterations |
Integer number of iterations for the Gibbs sampler to run. |
burnin |
Integer number of burn-in iterations. If burnin is greater than -1, the entries of the resulting beta and theta matrices are averaged over all iterations greater than burnin. Defaults to -1. |
alpha |
Numeric scalar or vector of length k; the prior for topics over documents. Defaults to 0.1. |
eta |
Numeric scalar, numeric vector of length k, or numeric matrix with k rows and one column per vocabulary token; the prior for tokens over topics. Defaults to 0.05. |
optimize_alpha |
Logical. Do you want to optimize alpha every iteration?
Defaults to FALSE. |
calc_likelihood |
Logical. Do you want to calculate the log likelihood every iteration?
Useful for assessing convergence. Defaults to TRUE. |
calc_r2 |
Logical. Do you want to calculate R-squared after the model is trained?
Defaults to FALSE. |
threads |
Number of parallel threads, defaults to 1. See Details, below. |
return_data |
Logical. Do you want data returned as part of the model object? Defaults to FALSE. |
verbose |
Logical. Do you want to print a progress bar out to the console?
Defaults to TRUE. |
... |
Additional arguments, currently unused |
Details
This function calls a collapsed Gibbs sampler for Latent Dirichlet Allocation written using the excellent Rcpp package. Some implementation notes follow:
Topic-token and topic-document assignments are not initialized based on a
uniform-random sampling, as is common. Instead, topic-token probabilities
(i.e. beta
) are initialized by sampling from a Dirichlet distribution
with eta
as its parameter. The same is done for topic-document
probabilities (i.e. theta
) using alpha
. Then an internal
function is called (initialize_topic_counts
) to run
a single Gibbs iteration to initialize assignments of tokens to topics and
topics to documents.
When you use burn-in iterations (i.e. burnin > -1), the resulting
beta
and theta
matrices are calculated by averaging over every
iteration after the specified number of burn-in iterations. If you do not
use burn-in iterations, then the matrices are calculated from the last run
only. Ideally, you'd burn in every iteration before convergence, then average
over the chain after it's converged (and thus every observation is independent).
If you set optimize_alpha
to TRUE
, then each element of alpha
is proportional to the number of times each topic has been sampled that iteration
averaged with the value of alpha
from the previous iteration. This lets
you start with a symmetric alpha
and drift into an asymmetric one.
However, (a) this probably means that convergence will take longer to happen
or convergence may not happen at all. And (b) I make no guarantees that doing this
will give you any benefit or that it won't hurt your model. Caveat emptor!
The log likelihood calculation is the same that can be found on page 9 of
https://arxiv.org/pdf/1510.08628.pdf. The only difference is that the
version in tidylda
allows eta
to be a
vector or matrix. (Vector used in this function, matrix used for model
updates in refit.tidylda
.) At present, the
log likelihood function appears to be ok for assessing convergence. i.e. It
has the right shape. However, it is, as of this writing, returning positive
numbers, rather than the expected negative numbers. Looking into that, but
in the meantime caveat emptor once again.
Parallelism is not currently implemented. The threads
argument is a
placeholder for planned enhancements.
Value
Returns an S3 object of class tidylda
. See new_tidylda
.
Examples
# load some data
data(nih_sample_dtm)
# fit a model
set.seed(12345)
m <- tidylda(
data = nih_sample_dtm[1:20, ], k = 5,
iterations = 200, burnin = 175
)
str(m)
# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
method = "gibbs",
iterations = 200, burnin = 175
)
# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")
# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))
Bridge function for fitting tidylda
topic models
Description
Takes in arguments from various tidylda
S3 methods and fits the
resulting topic model. The arguments to this function are documented in
tidylda
.
Usage
tidylda_bridge(
data,
k,
iterations,
burnin,
alpha,
eta,
optimize_alpha,
calc_likelihood,
calc_r2,
threads,
return_data,
verbose,
mc,
...
)
Value
Returns a tidylda
S3 object as documented in new_tidylda
.