---
title: "Topic Modeling with BERTopic in R using reticulate and local LLMs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Topic Modeling with BERTopic in R using reticulate and local LLMs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(reticulate)

# Replace the path below with the path of your Python environment
# Then uncomment the command below:
# Tip: BERTOPICR_VENV should be the folder that contains `pyvenv.cfg`.
# Sys.setenv(
#   BERTOPICR_VENV = "C:/Users/teodo/Documents/R/bertopic/bertopic4r",
#   NOT_CRAN = "true"
# )

# 1. Define the libraries you need
required_modules <- c("bertopic", "umap", "hdbscan", "sklearn", "numpy", "plotly", "datetime", "sentence_transformers", "openai", "ollama")

# macOS: if reticulate fails to load Python libraries, run once per session.
if (identical(Sys.info()[["sysname"]], "Darwin")) {
  bertopicr::configure_macos_homebrew_zlib()
}

# Optional: point reticulate at a user-specified virtualenv
venv <- Sys.getenv("BERTOPICR_VENV")
if (nzchar(venv)) {
  venv_cfg <- file.path(venv, "pyvenv.cfg")
  if (file.exists(venv_cfg)) {
    reticulate::use_virtualenv(venv, required = TRUE)
  } else {
    message("Warning: BERTOPICR_VENV does not point to a valid virtualenv: ", venv)
  }
}

# Try to find python, but don't crash if it's missing (e.g. on another user's machine)
if (!reticulate::py_available(initialize = TRUE)) {
  try(reticulate::use_python(Sys.which("python"), required = FALSE), silent = TRUE)
}

# 2. Check if they are installed
python_ready <- tryCatch({
  # Attempt to initialize python and check modules
  py_available(initialize = TRUE) &&
  all(vapply(required_modules, py_module_available, logical(1)))
}, error = function(e) FALSE)

# 3. Only evaluate chunks when Python is ready and NOT_CRAN is set
run_chunks <- python_ready && identical(Sys.getenv("NOT_CRAN"), "true")
knitr::opts_chunk$set(eval = run_chunks)

if (!python_ready) {
  message("Warning: Required Python modules (bertopic, umap-learn) not found. Vignette code will not run.")
} else {
  message("Python environment ready: ", reticulate::py_config()$python)
  if (!identical(Sys.getenv("NOT_CRAN"), "true")) {
    message("Note: Set NOT_CRAN=true to run Python-dependent chunks locally.")
  }
}
```

The package `bertopicr` is based on the Python package `BERTopic` by Maarten Grootendorst (https://github.com/MaartenGr/BERTopic), which is described in his paper:

> **BERTopic: Neural topic modeling with a class-based TF-IDF procedure**  
> Maarten Grootendorst, 2022.  
> Available at: [arXiv:2203.05794](https://arxiv.org/abs/2203.05794)

The package `bertopicr` introduces functions to train and display topic model results of `BERTopic` models in `R`. The `R` package was created with the programming support of `OpenAI`'s large language models. 

**Note:** The code below requires a specific Python environment.
If you want to preview the outputs without running the full workflow, see the
pre-computed static snapshots in the sections below. For a shorter workflow and
model persistence helpers, see the vignettes `train_and_save_model.Rmd` and
`load_and_reuse_model.Rmd`, and the updated README.

## Load R packages

Python environment selection and checks are handled in the hidden setup chunk at the top of the vignette.

*Load* the `R` packages below and *initialize* a `Python` environment with the `reticulate` package.

By default, `reticulate` uses an isolated `Python` *virtual environment* named `r-reticulate` (cf. https://rstudio.github.io/reticulate/).

The `use_python()` and the `use_virtualenv()` functions enable you to specify an alternate `Python` environment (cf. https://rstudio.github.io/reticulate/).
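A minimal sketch of such a selection, assuming the environment path is supplied through the `BERTOPICR_VENV` environment variable that the hidden setup chunk also reads. `is_virtualenv()` is a hypothetical helper introduced here for illustration only; it is not part of `reticulate` or `bertopicr`.

```r
# Hypothetical helper: a virtualenv folder contains a `pyvenv.cfg` file.
is_virtualenv <- function(path) {
  nzchar(path) && file.exists(file.path(path, "pyvenv.cfg"))
}

venv <- Sys.getenv("BERTOPICR_VENV", unset = "")

if (is_virtualenv(venv) && requireNamespace("reticulate", quietly = TRUE)) {
  # Bind reticulate to the user-specified environment before Python starts.
  reticulate::use_virtualenv(venv, required = TRUE)
} else {
  message("BERTOPICR_VENV is not set or invalid; reticulate keeps its default.")
}
```

The check for `pyvenv.cfg` avoids binding `reticulate` to a folder that is not a virtual environment.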

```{r}
library(dplyr)
library(tidyr)
library(purrr)
library(utils)
library(tibble)
library(readr)
library(tictoc)
library(htmltools)
library(bertopicr)
```

*Note*: Avoid loading `R` libraries (such as `arrow`) that can conflict with the `Python` modules `BERTopic`, `pyarrow` and `plotly`.

*Note*: If you prefer a streamlined setup, `setup_python_environment()` (see README) can install the Python dependencies. The helper functions `train_bertopic_model()`, `save_bertopic_model()`, and `load_bertopic_model()` are showcased in `train_and_save_model.Rmd` and `load_and_reuse_model.Rmd`.

## Python packages

Use `bertopicr::setup_python_environment()` (see README) to install the required Python packages and configure your environment. The hidden setup chunk at the top of this vignette checks availability and will skip Python chunks if the environment is not ready. Install `ollama` or `lm-studio` if you want to serve local language models.

After setup, we import the Python packages used for topic modeling below.

```{r eval=run_chunks}
# Import necessary Python modules
py <- import_builtins()
np <- import("numpy")
umap <- import("umap")
UMAP <- umap$UMAP
hdbscan <- import("hdbscan")
HDBSCAN <- hdbscan$HDBSCAN
sklearn <- import("sklearn")
CountVectorizer <- sklearn$feature_extraction$text$CountVectorizer
bertopic <- import("bertopic")
plotly <- import("plotly")
datetime <- import("datetime")

```


## Text preparation

The *German texts* are in the *text_clean* column. They were segmented into smaller chunks (each about 100 tokens long) to optimize topic extraction with `bertopicr`. Special characters were removed with a cleaning function. The text chunks are all in lower case. 

```{r}
rds_path <- file.path("inst/extdata", "spiegel_sample.rds")
dataset <- read_rds(rds_path)
names(dataset)
dim(dataset)
```
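The chunking and cleaning described above were done before this vignette and are not part of the package. A rough base `R` sketch of such a step could look as follows; `chunk_text()` is a hypothetical helper, and the example sentence is made up. How non-ASCII letters such as umlauts are treated by `[:alpha:]` depends on the locale, so the example uses ASCII only.

```r
# Hypothetical helper: lower-case a text, replace non-letter characters by
# spaces, and split it into chunks of at most `chunk_size` tokens.
chunk_text <- function(text, chunk_size = 100L) {
  cleaned <- tolower(text)
  cleaned <- gsub("[^[:alpha:][:space:]]", " ", cleaned)
  tokens <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0L) return(character(0))
  # Assign every token to a chunk number: 1, 1, ..., 2, 2, ...
  groups <- ceiling(seq_along(tokens) / chunk_size)
  unname(vapply(split(tokens, groups), paste, character(1), collapse = " "))
}

chunks <- chunk_text("Die Grenze war 2015 offen. Viele Menschen kamen!",
                     chunk_size = 4L)
chunks
```

With `chunk_size = 4L` the seven remaining tokens end up in two chunks; for the data used here the chunk size was about 100 tokens.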

The collected `stopword` list includes German and English tokens and will be inserted into `Python`'s `CountVectorizer` before `c-TF-IDF` calculation. But the embedding model will process the text chunks in the *text_clean* column before `stopword` removal.

```{r}
stopwords_path <- file.path("inst/extdata", "all_stopwords.txt")
all_stopwords <- read_lines(stopwords_path)
```

Below are the *texts_cleaned* and *timestamps* objects, which we need during model preparation, topic extraction and visualization.

```{r}
texts_cleaned <- dataset$text_clean
titles <- dataset$doc_id
timestamps <- as.list(dataset$date)
# timestamps <- as.integer(dataset$year)

texts_cleaned[[1]]
```


## Model Preparation

For model preparation, we are going to use `reticulate` to interface with `Python` modules. The `R` code is essentially a conversion of `Python` code. 

The topic model preparation also includes *local* language models (served by `ollama` or `lm-studio`), which are accessed through an `OpenAI`-compatible endpoint.
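One practical detail of translating `Python` calls to `R`: `reticulate` converts plain `R` numbers (doubles) to `Python` floats and only integers (written with the `L` suffix) to `Python` ints. This is why the chunks in this section write `15L` rather than `15` for count-like arguments. A small base `R` sketch, with variable names chosen for illustration only:

```r
n_neighbors_double  <- 15   # R double  -> passed to Python as float
n_neighbors_integer <- 15L  # R integer -> passed to Python as int

is.integer(n_neighbors_double)   # FALSE
is.integer(n_neighbors_integer)  # TRUE

# Convert explicitly when a value comes from a calculation or user input.
min_cluster_size <- as.integer(round(0.01 * 5000))
min_cluster_size
```

Many `Python` functions reject a float where they expect an integer, so explicit conversion with `as.integer()` is the safer choice.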

### Embeddings

`SentenceTransformer` creates the necessary *embeddings* (vector representations of the text chunks) for topic modeling with `bertopic`. The first time `SentenceTransformer` is used with a specific model, the model has to be downloaded from the `huggingface` website (https://huggingface.co/), where many freely usable language models are hosted (https://huggingface.co/models).

```{r eval=run_chunks}
# Embed the sentences
sentence_transformers <- import("sentence_transformers")
SentenceTransformer <- sentence_transformers$SentenceTransformer
# choose an appropriate embeddings model
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = embedding_model$encode(texts_cleaned, show_progress_bar=TRUE)

```

### Dimension reduction

In the next two steps, the `umap` module reduces the number of dimensions of the embeddings, and the `hdbscan` module extracts clusters that can be evaluated by the topic pipeline.

```{r eval=run_chunks}
# Initialize the UMAP model (HDBSCAN follows in the next section)
umap_model <- UMAP(n_neighbors=15L, n_components=5L, min_dist=0.0, metric='cosine', random_state=42L)

```

Other dimension reduction methods (like `PCA` or `t-SNE`) can be used instead.

### Clustering

The `hdbscan` module extracts clusters that can be evaluated by the topic pipeline. Starting with `BERTopic` version 0.17.0, I had to reduce the number of parallel workers by setting `core_dist_n_jobs` to 1 to avoid out-of-memory errors on my `Windows` PC.

```{r eval=run_chunks}
# Integer arguments carry the L suffix so that Python receives ints, not floats
hdbscan_model <- HDBSCAN(min_cluster_size = 50L, min_samples = 20L,
                         metric = 'euclidean', cluster_selection_method = 'eom',
                         gen_min_span_tree = TRUE, prediction_data = TRUE,
                         core_dist_n_jobs = 1L)

```

Other `clustering` methods (like `kmeans`) can be used instead.

### c-TF-IDF

The `CountVectorizer` builds the term-count matrix from which `BERTopic` calculates the `c-TF-IDF` weights. These weights enable the `representation model` defined below to extract suitable keywords as descriptors of the extracted topics.

`Stopwords` are removed *after* `embeddings` creation, but *before* `keyword` extraction. Stopword removal is handled by the `CountVectorizer` through its `stop_words` argument.

```{r eval=run_chunks}
# Initialize CountVectorizer
vectorizer_model <- CountVectorizer(min_df=2L, ngram_range=tuple(1L, 3L), 
                                    max_features = 10000L, max_df = 50L,
                                    stop_words = all_stopwords)
sentence_vectors <- vectorizer_model$fit_transform(texts_cleaned)
sentence_vectors_dense <- np$array(sentence_vectors)
sentence_vectors_dense <- py_to_r(sentence_vectors_dense)

```
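To make the weighting more concrete: the BERTopic paper defines the class-based weight of term $t$ in topic (class) $c$ as $W_{t,c} = tf_{t,c} \cdot \log(1 + A / tf_t)$, where $tf_{t,c}$ is the frequency of the term in the topic, $tf_t$ its frequency across all topics and $A$ the average number of words per topic. The base `R` sketch below applies this formula to made-up counts. It is an illustration only; the `Python` implementation may apply additional normalization, so the numbers are not what `BERTopic` would return.

```r
# Toy term counts per topic (rows = topics, columns = terms); invented values.
counts <- matrix(c(4, 0, 1,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("topic_0", "topic_1"),
                                 c("grenze", "wahl", "jahr")))

A    <- mean(rowSums(counts))  # average number of words per topic
tf_t <- colSums(counts)        # frequency of each term across all topics
idf  <- log(1 + A / tf_t)      # class-based inverse frequency

# Multiply every column (term) by its idf value.
ctfidf <- sweep(counts, 2, idf, `*`)
round(ctfidf, 3)
```

The term `jahr` occurs in both topics and therefore receives a lower weight than `grenze`, which is specific to `topic_0`.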

### Representation models

In the example below, *multiple representation models* are used to extract keywords and descriptions for the identified topics: `KeyBERTInspired` (part of `BERTopic`), a language model (served locally by `ollama` through its `OpenAI`-compatible endpoint; models from `Groq` or other providers can be used as well), a Maximal Marginal Relevance model (`MMR`) and a `spacy` part-of-speech (`POS`) model. By default, `bertopic` creates only one representation, which is based on `c-TF-IDF`.

Before running the following chunk, make sure that all representation models are downloaded and installed, and set `BERTOPICR_ENABLE_REPR=true`. Otherwise, comment out the representation models that are missing or won't be used in the topic training pipeline. If you want a minimal workflow, use `train_bertopic_model()` instead (see `train_and_save_model.Rmd`).

*Note*: If you use the `train_bertopic_model()` helper (see the *Quick Start* section in the README and the vignette `load_and_reuse_model.Rmd`) instead of the procedure in this notebook, you can include only one representation model (*default = "none"*).

```{r eval=run_chunks && identical(Sys.getenv("BERTOPICR_ENABLE_REPR"), "true")}
# Initialize representation models
keybert_model <- bertopic$representation$KeyBERTInspired()
openai <- import("openai")
OpenAI <- openai$OpenAI
ollama <- import("ollama")
# lmstudio <- import("lmstudio")

# Point to the local server (ollama or lm-studio)
client <- OpenAI(base_url = 'http://localhost:11434/v1', api_key='ollama')
# client <- OpenAI(base_url = 'http://localhost:1234/v1', api_key='lm-studio')

prompt <- "
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"

# download an appropriate LLM to be hosted by ollama or lm-studio
openai_model <- bertopic$representation$OpenAI(client, 
                                               model = "gpt-oss:20b", 
                                               exponential_backoff = TRUE, 
                                               chat = TRUE, 
                                               prompt = prompt)

# download a language model from spacy.io before using it here
# Below a German spacy model is used
pos_model <- bertopic$representation$PartOfSpeech("de_core_news_lg")
# diversity set relatively high to reduce repetition of keyword word forms
mmr_model <- bertopic$representation$MaximalMarginalRelevance(diversity = 0.5)

# Combine all representation models
representation_model <- list(
  "KeyBERT" = keybert_model,
  "OpenAI" = openai_model,
  "MMR" = mmr_model,
  "POS" = pos_model
)

```

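The *prompt* describes the task the language model has to accomplish. The placeholders `[DOCUMENTS]` and `[KEYWORDS]` are filled in by `BERTopic` with representative documents and keywords of each topic, and the last lines of the prompt fix the format of the topic label that the model should return.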
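The `diversity` argument of the `MMR` model trades relevance against redundancy: a candidate keyword is rewarded for being similar to the topic and penalized for being similar to keywords that were already selected. The base `R` sketch below illustrates this selection rule with invented similarity values. `mmr_select()` is a hypothetical helper written for this vignette, not the `Python` implementation.

```r
# Hypothetical helper: greedy Maximal Marginal Relevance selection.
# doc_sim:  similarity of each candidate word to the topic
# word_sim: pairwise similarity between candidate words
mmr_select <- function(doc_sim, word_sim, top_n = 3, diversity = 0.5) {
  selected   <- which.max(doc_sim)  # start with the most relevant word
  candidates <- setdiff(seq_along(doc_sim), selected)
  while (length(selected) < top_n && length(candidates) > 0) {
    relevance  <- doc_sim[candidates]
    redundancy <- apply(word_sim[candidates, selected, drop = FALSE], 1, max)
    score <- (1 - diversity) * relevance - diversity * redundancy
    best  <- candidates[which.max(score)]
    selected   <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  names(doc_sim)[selected]
}

words   <- c("migration", "migranten", "grenze", "wahl")
doc_sim <- c(migration = 0.90, migranten = 0.88, grenze = 0.70, wahl = 0.20)
word_sim <- matrix(c(1.00, 0.95, 0.50, 0.10,
                     0.95, 1.00, 0.45, 0.10,
                     0.50, 0.45, 1.00, 0.20,
                     0.10, 0.10, 0.20, 1.00),
                   nrow = 4, byrow = TRUE, dimnames = list(words, words))

mmr_select(doc_sim, word_sim, top_n = 2, diversity = 0.5)
mmr_select(doc_sim, word_sim, top_n = 2, diversity = 0)
```

With `diversity = 0.5` the near-duplicate `migranten` is skipped in favour of `grenze`; with `diversity = 0` only relevance counts and `migranten` is selected.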

### Zeroshot keywords

`BERTopic` enables us to define a `zeroshot` list of topics that can be used to steer the topic model towards desired topic outcomes. In the topic model below, the zeroshot topic list is disabled, but it can be activated if needed.

```{r eval=run_chunks}
# We can define a number of topics of interest 
zeroshot_topic_list  <- list("german national identity", "minority issues in germany")

```

### Topic Model

In the next step, we initialize the BERTopic model pipeline and hyperparameters.

```{r eval=run_chunks}
# Initialize BERTopic model with pipeline models and hyperparameters
BERTopic <- bertopic$BERTopic
topic_model <- BERTopic(
  embedding_model = embedding_model,
  umap_model = umap_model,
  hdbscan_model = hdbscan_model,
  vectorizer_model = vectorizer_model,
  # zeroshot_topic_list = zeroshot_topic_list,
  # zeroshot_min_similarity = 0.85,
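  # falls back to BERTopic's default if the representation chunk was skipped
  representation_model = if (exists("representation_model")) representation_model else NULL,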
  calculate_probabilities = TRUE,
  top_n_words = 200L, # if you need more top words, insert the desired number here!!!
  verbose = TRUE
)

```


### Model Training

After all these preparatory steps, the topic model is ready to be trained with `topic_model$fit_transform(texts, embeddings)`.

```{r eval=run_chunks}
tictoc::tic()

# Fit the model and transform the texts
fit_transform <- topic_model$fit_transform(texts_cleaned, embeddings)
topics <- fit_transform[[1]]

# Now transform the texts to get the updated probabilities
transform_result <- topic_model$transform(texts_cleaned)
probs <- transform_result[[2]]  # Extract the updated probabilities

tictoc::toc()

```

We obtain the *topic assignments* with `topics <- fit_transform[[1]]` and the *topic probabilities* with `probs <- transform_result[[2]]`.


### Topic Dynamics

Since our dataset contains time-related metadata, we can use the `timestamps` for `dynamic topic modeling`, i.e., for discovering topic development or topic sequences through time. If your data doesn't contain any time-related column, skip or disable the timestamps and topics_over_time calculations.

```{r eval=run_chunks}
# Converting R Date to Python datetime
datetime <- import("datetime")

timestamps <- as.list(dataset$date)
# timestamps <- as.integer(dataset$year)

# Convert each R date object to an ISO 8601 string
timestamps <- lapply(timestamps, function(x) {
  format(x, "%Y-%m-%dT%H:%M:%S")  # ISO 8601 format
})

# Dynamic topic model
topics_over_time  <- topic_model$topics_over_time(texts_cleaned, timestamps, nr_bins=20L, global_tuning=TRUE, evolution_tuning=TRUE)

```
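The conversion in the chunk above can be checked without `Python`. The base `R` sketch below uses made-up dates; the binning with `cut()` is only meant to illustrate what a fixed number of bins (`nr_bins`) does to a time range and is not the `BERTopic` implementation.

```r
dates <- as.Date(c("2015-01-15", "2015-09-04", "2016-03-20", "2016-12-31"))

# Same format string as above; a Date has no time part,
# so the time is rendered as midnight.
iso <- format(dates, "%Y-%m-%dT%H:%M:%S")
iso[2]  # "2015-09-04T00:00:00"

# Illustration only: split the covered period into two equally wide bins.
bins <- cut(as.numeric(dates), breaks = 2, labels = FALSE)
table(bins)
```

Fewer bins give smoother but coarser topic trajectories; more bins require enough documents per bin to be informative.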


### Store Results

The *topic labels* and *probabilities* are stored in a dataframe named *results*, together with other variables and metadata.

```{r}
# Combine results with additional columns
results <- dataset |> 
  mutate(Topic = topics, 
         Probability = apply(probs, 1, max))  # Assuming the highest probability for each sentence

results <- results |> 
  mutate(row_id = row_number()) |> 
  select(row_id, everything())

head(results,10) |> rmarkdown::paged_table()

```
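The `Probability` column above is the row-wise maximum of the probability matrix. A toy example in base `R` shows what `apply(probs, 1, max)` computes; the matrix is invented, and the mapping of the first column to topic 0 is an assumption made for this illustration.

```r
# Toy probability matrix: 3 documents (rows) x 3 topics (columns).
probs_toy <- matrix(c(0.70, 0.20, 0.10,
                      0.05, 0.15, 0.80,
                      0.30, 0.40, 0.30),
                    nrow = 3, byrow = TRUE)

max_prob   <- apply(probs_toy, 1, max)  # highest probability per document
# Column index of the maximum, shifted so that column 1 corresponds to topic 0
best_topic <- max.col(probs_toy, ties.method = "first") - 1L

data.frame(doc = 1:3, best_topic, max_prob)
```

Note that the topic assigned by the clustering step can still be `-1` (outlier), even if one of the probabilities is comparatively high.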

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
results |>
  saveRDS("inst/extdata/spiegel_topic_results_df.rds", version = 2)
results |>
  write_csv("inst/extdata/spiegel_topic_results_df.csv")

```


## Results

The `R` package `bertopicr` will be used in this section to display the topic modeling results in the form of lists, data frames and visualizations. The names of the functions are nearly the same as in the `Python` package `BERTopic`. 

### Document information

The `get_document_info_df()` function creates a dataframe that contains the documents and associated topics, characteristic keywords, probability scores, representative documents of each topic and representation model results (e.g., keywords extracted by `KeyBERT`, `MMR`, `spacy`, and `LLM` descriptions of the topics).

```{r eval=run_chunks}
library(bertopicr)
document_info_df <- get_document_info_df(model = topic_model, 
                                         texts = texts_cleaned, 
                                         drop_expanded_columns = TRUE)
document_info_df |> head() |> rmarkdown::paged_table()
```

### Representative docs

First, create a data frame similar to *df_docs* below, which contains the columns Topic, Document and probs. Then use the `get_most_representative_docs()` function to extract representative documents of a chosen topic.

```{r eval=run_chunks}
# Create a data frame similar to df_docs
df_docs <- tibble(Topic = results$Topic,
                  Document = results$text_clean,
                  probs = results$Probability)
rep_docs <- get_most_representative_docs(df = df_docs, 
                                         topic_nr = 3, 
                                         n_docs = 5)
unique(rep_docs)
```

### Topic information

The function `get_topic_info_df()` creates another useful data frame. For each of the extracted topics, it shows the number of associated documents (or text chunks), the topic id (Name), characteristic keywords according to the chosen representation models and three (concatenated) representative documents.

```{r eval=run_chunks}
topic_info_df <- get_topic_info_df(model = topic_model, 
                                   drop_expanded_columns = TRUE)
head(topic_info_df) |> rmarkdown::paged_table()
```


### Words in Topics

The `get_topics_df()` function concentrates on the words associated with a certain topic and their scores. The outliers (Topic = -1) are usually not included in the analysis, but `BERTopic` offers a function to reduce the number of outliers and to update the topic model.

```{r eval=run_chunks}
topics_df <- get_topics_df(model = topic_model)
head(topics_df, 10)
```

### Topic Barchart

The `visualize_barchart()` function creates an interactive barchart with the top five words of the most frequently occurring topics.

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
visualize_barchart(model = topic_model, 
                   filename = "topics_topwords_interactive_barchart.html", # default
                   open_file = FALSE) # TRUE enables output in browser
```

![topics_topwords_interactive_barchart](images/topics_topwords_interactive_barchart.png)

We might prefer to create a customizable barchart with the `ggplot2` package by using the dataframe extracted by the `get_topics_df()` function and run the `plotly` library in `R` with the `ggplotly()` function for an interactive barchart. Due to possible conflicts between `R`'s `plotly` library and `Python`'s `plotly` implementation, the interactive barchart below is disabled. 

```{r eval=run_chunks}
library(ggplot2)

barchart <- topics_df |> 
  group_by(Topic) |> 
  filter(Topic >= 0 & Topic <= 8) |> 
  slice_head(n=5) |> 
  mutate(Topic = paste("Topic", as.character(Topic)), 
         Word = reorder(Word, Score)) |> 
  ggplot(aes(Score, Word, fill = Topic)) +
  geom_col() +
  facet_wrap(~ Topic, scales = "free") +
  theme(legend.position = "none")

# # Disabled to avoid potential conflicts
# library(plotly)
# ggplotly(barchart)
```

### Find Topics

The `find_topics_df()` function is useful for semantic search. It can identify topics that are associated with a chosen query or multiple queries.

```{r eval=run_chunks}
find_topics_df(model = topic_model, 
               queries = "migration", # user input
               top_n = 10, # default
               return_tibble = TRUE) # default
```

```{r eval=run_chunks}
find_topics_df(model = topic_model, 
               queries = c("migranten", "asylanten"),
               top_n = 5)
```


### Get Topics

The `get_topic_df()` function creates a dataframe that contains the top words extracted from a chosen topic.

```{r eval=run_chunks}
get_topic_df(model = topic_model, 
             topic_number = 0, 
             top_n = 5, # default is 10
             return_tibble = TRUE) # default
```

### Topic Distribution

The `visualize_distribution()` function produces an interactive barchart that displays the associated topics of a chosen document (or text chunk). The probability scores help to identify the most likely topic(s) of a document (or text chunk).

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
# default filename: topic_dist_interactive.html
visualize_distribution(model = topic_model, 
                       text_id = 1, # user input
                       probabilities = probs) # see model training
```

![topic_dist_interactive](images/topic_dist_interactive.png)

### Intertopic Distance Map

The semantic relatedness or distance of topics can be displayed in the form of a map with the `visualize_topics()` function. 

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
visualize_topics(model = topic_model, 
                 filename = "intertopic_distance_map") # default name
```

![intertopic_distance_map](images/intertopic_distance_map.png)

### Topic Similarity

We can create a `similarity matrix` by computing the cosine similarities between the generated topic embeddings. The resulting matrix indicates how similar the topics are to each other. To visualize the similarity matrix, we can use the `visualize_heatmap()` function.

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
visualize_heatmap(model = topic_model, 
                  filename = "topics_similarity_heatmap", 
                  auto_open = FALSE)
```

![topics_similarity_heatmap](images/topics_similarity_heatmap.png)
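For readers unfamiliar with the measure, the following base `R` sketch computes a cosine similarity matrix for three invented topic embeddings. `cosine_similarity()` is a hypothetical helper; the heatmap itself is produced by the `Python` backend.

```r
# Toy topic embeddings (rows = topics); the values are invented.
topic_embeddings <- rbind(topic_0 = c(1, 0, 0),
                          topic_1 = c(1, 1, 0),
                          topic_2 = c(0, 0, 2))

# Hypothetical helper: cosine similarity between all pairs of rows.
cosine_similarity <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale every row to length 1
  unit %*% t(unit)                # dot products of the unit vectors
}

sim <- cosine_similarity(topic_embeddings)
round(sim, 2)
```

The diagonal is 1 (every topic is identical to itself), and orthogonal embeddings such as `topic_0` and `topic_2` have a similarity of 0.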


### Topic hierarchy

A convenient way to display how the topics relate to each other is the `visualize_hierarchy()` function, which creates an interactive `dendrogram`.

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
visualize_hierarchy(model = topic_model, 
                    hierarchical_topics = NULL, # default
                    filename = "topic_hierarchy", # default name, html extension
                    auto_open = FALSE) # TRUE enables output in browser
```

An additional option is the creation of a `hierarchical topics` list that can be included in the interactive `dendrogram` and enables the user to identify joint expressions.

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
hierarchical_topics = topic_model$hierarchical_topics(texts_cleaned)
visualize_hierarchy(model = topic_model, 
                    hierarchical_topics = hierarchical_topics,
                    filename = "topic_hierarchy", # default name, html extension
                    auto_open = FALSE) # TRUE enables output in browser
```

![topic_hierarchy](images/topic_hierarchy.png)
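The idea behind such a dendrogram can be reproduced on a small scale with base `R`. The sketch below clusters four topics hierarchically from an invented similarity matrix; it is an analogy for illustration, not the procedure used by the `Python` backend.

```r
topic_names <- paste0("topic_", 0:3)

# Invented topic similarities: topics 0/1 and topics 2/3 are close pairs.
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(topic_names, topic_names))

# Turn similarities into distances and merge the closest topics first.
tree   <- hclust(as.dist(1 - sim), method = "ward.D2")
groups <- cutree(tree, k = 2)
groups
# plot(tree) would draw the corresponding static dendrogram
```

Cutting the tree at two groups recovers the two pairs of similar topics, which is the kind of structure the interactive dendrogram makes visible.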


### Visualize Documents

The `visualize_documents()` function displays the identified clusters associated with a certain topic in two dimensions. Usually, it is best to reduce the dimensionality of the embeddings with `UMAP` (or another dimension reduction method) to produce intelligible visual results. The interactive plot allows the user to select one or more clusters with a double-click of the mouse. 

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
# Reduce dimensionality of embeddings using UMAP
reduced_embeddings <- umap$UMAP(n_neighbors = 10L, n_components = 2L, min_dist = 0.0, metric = 'cosine')$fit_transform(embeddings)

visualize_documents(model = topic_model, 
                    texts = texts_cleaned, 
                    reduced_embeddings = reduced_embeddings, 
                    filename = "visualize_documents", # default extension html
                    auto_open = FALSE) # TRUE enables output in browser

```

![visualize_documents](images/visualize_documents.png)

After updating to `BERTopic` 0.17.0, the `visualize_documents()` function may fail to render the dots in the scatterplot. A simple **temporary fix** is to open the `Python` file `_documents.py`, which implements `visualize_documents()` in `BERTopic` (on my `Windows` system it sits in `anaconda3\envs\bertopic\Lib\site-packages\bertopic\plotting\`), and to change `go.Scattergl` to `go.Scatter` in the `fig.add_trace()` calls (it occurs twice in the `Python` script).


The `visualize_documents_2d()` function is a variant of the interactive plot above, but with additional *tooltips*. Set `n_components` = 2L in *reduced_embeddings*! 

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
# Reduce dimensionality of embeddings using UMAP (n_components = 2L !!!)
reduced_embeddings <- umap$UMAP(n_neighbors = 10L, n_components = 2L, min_dist = 0.0, metric = 'cosine')$fit_transform(embeddings)

visualize_documents_2d(model = topic_model, 
                       texts = texts_cleaned, 
                       reduced_embeddings = reduced_embeddings, 
                       custom_labels = FALSE, # default
                       hide_annotation = TRUE, # default
                       tooltips = c("Topic", "Name", "Probability", "Text"), # default
                       filename = "visualize_documents_2d", # default name
                       auto_open = FALSE) # TRUE enables output in browser
```


To create an interactive 3D plot, the `visualize_documents_3d()` function can be used. This function is not implemented in the `Python` package. Set `n_components` = 3L in *reduced_embeddings*!

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
# Reduce dimensionality of embeddings using UMAP
reduced_embeddings <- umap$UMAP(n_neighbors = 10L, n_components = 3L, min_dist = 0.0, metric = 'cosine')$fit_transform(embeddings)

visualize_documents_3d(model = topic_model, 
                       texts = texts_cleaned, 
                       reduced_embeddings = reduced_embeddings, 
                       custom_labels = FALSE, # default
                       hide_annotation = TRUE, # default
                       tooltips = c("Topic", "Name", "Probability", "Text"), # default
                       filename = "visualize_documents_3d", # default name
                       auto_open = FALSE) # TRUE enables output in browser
```

![visualize_documents_3d](images/visualize_documents_3d.png)

The legend of the *updated* `visualize_documents_3d()` function shows the Topic keywords (*Name*) instead of the Topic number and includes a few *tooltips*.

### Topic Development

We can also inspect how a chosen number of topics develops over a certain period of time. The `visualize_topics_over_time()` function assumes that the `timestamps`, the `topic model` and the `topics over time model` are already defined (see the Topic Dynamics section above). The `timestamps` need to be `integers` or in a suitable `date format`, such as the ISO 8601 strings created above.

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
visualize_topics_over_time(model = topic_model, 
                           # see Topic Dynamics section above
                           topics_over_time_model = topics_over_time,
                           top_n_topics = 10, # default is 20
                           filename = "topics_over_time") # default, html extension
```

![topics_over_time](images/topics_over_time.png)


### Groups

If our dataset includes categorical variables (groups, classes, etc.), we can use the `visualize_topics_per_class()` function to display an interactive barchart with the groups or classes associated with a chosen topic. With a double-click of the mouse, the user can choose a single topic and inspect the frequency of the groups.

```{r eval=run_chunks && identical(Sys.getenv("NOT_CRAN"), "true")}
classes = as.list(dataset$genre) # text types
topics_per_class = topic_model$topics_per_class(texts_cleaned, classes=classes)

visualize_topics_per_class(model = topic_model, 
                           topics_per_class = topics_per_class,
                           start = 0, # default
                           end = 10, # default
                           filename = "topics_per_class", # default, html extension 
                           auto_open = FALSE) # TRUE enables output in browser
```

![topics_per_class](images/topics_per_class.png)
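At its core, the chart shows how often each topic occurs within each group. A small cross-tabulation in base `R` illustrates this frequency count; the data frame is invented, and only the column name `genre` is taken from the dataset used here.

```r
# Toy data: one row per document with its group and assigned topic.
toy <- data.frame(
  genre = c("news", "news", "essay", "essay", "news", "interview"),
  Topic = c(0, 1, 0, 0, 1, 2)
)

freq <- table(toy$genre, toy$Topic)  # documents per genre and topic
freq

prop.table(freq, margin = 2)         # share of each genre within a topic
```

Relative shares (second table) are often easier to compare than raw counts when the groups differ strongly in size.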

We could also use the `ggplot2` and `plotly` packages in `R` to produce a similar looking customized interactive barchart.

## Wordcloud

Wordclouds are not implemented in `BERTopic`, but it is possible to extract the top n words from a topic. First, spin up a new `BERTopic` model with `top_n_words = 200L` (i.e., the 200 highest-scoring words per topic).

```{r eval=run_chunks}
BERTopic200 <- bertopic$BERTopic
topic_model200 <- BERTopic200(
  embedding_model = embedding_model,
  umap_model = umap_model,
  hdbscan_model = hdbscan_model,
  vectorizer_model = vectorizer_model,
  # zeroshot_topic_list = zeroshot_topic_list,
  # zeroshot_min_similarity = 0.85,
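  # falls back to BERTopic's default if the representation chunk was skipped
  representation_model = if (exists("representation_model")) representation_model else NULL,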
  calculate_probabilities = TRUE,
  top_n_words = 200L, # !!!
  verbose = TRUE
)

tictoc::tic()

# Fit the model and transform the texts
py_fit <- topic_model200$fit(texts_cleaned, embeddings)

# ask Python for all representations of the desired topic (full = TRUE);
# each entry is a list of (word, score) pairs with up to 200 words
py_topic200 <- py_fit$get_topic(1L, full = TRUE)

names(py_topic200)

rep_list <- py_topic200[["Main"]]

tictoc::toc()

```

Then create a dataframe to plot a wordcloud.

```{r eval=run_chunks}
df_wc <- data.frame(
  name = sapply(rep_list, `[[`, 1),
  freq = as.numeric(sapply(rep_list, `[[`, 2)),
  stringsAsFactors = FALSE
)

library(wordcloud2)
source("inst/extdata/wordcloud2a.R")

wordcloud2a(
  data            = df_wc,
  size            = 0.5,
  minSize         = 0,
  gridSize        = 1,
  fontFamily      = "Segoe UI",
  fontWeight      = "bold",
  color           = "random-dark",
  backgroundColor = "white",
  shape           = "circle",
  ellipticity     = 0.65
)

```


## Conclusion

`BERTopic` is an awesome topic modeling package in Python. The `bertopicr` package tries to bring some of the functionalities into an `R` programming environment with the magnificent `reticulate` package as interface to the `Python` backend. 

`BERTopic` offers a number of additional functions, which might be included in subsequent versions of `bertopicr`.