---
title: "Introduction to LDAShiny: Bibliometric Topic Modeling"
author: "Javier De La Hoz Maestre, María José Fernandez-Gomez, Susana Mendes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to LDAShiny: Bibliometric Topic Modeling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

**LDAShiny** is an interactive R package built with the [Shiny](https://shiny.posit.co/) framework and the [golem](https://thinkr-open.github.io/golem/) architecture. It provides a complete, graphical workflow for **Latent Dirichlet Allocation (LDA) topic modeling** applied to bibliometric data exported from [Scopus](https://www.scopus.com/) and [Web of Science (WoS)](https://clarivate.com/products/scientific-and-academic-research/research-discovery-and-workflow-solutions/webofscience-platform/).

The package was developed by the **GEMC Research Group** (Grupo de Estadística y Métodos Cuantitativos) at Universidad del Magdalena, Colombia.

### Citation

If you use LDAShiny in your research, please cite:

> De la Hoz-M., J.; Fernandez-Gomez, M. J.; Mendes, S. (2021). *LDAShiny: An
> R Package for Exploratory Review of Scientific Literature Based on a Bayesian
> Probabilistic Model and Machine Learning Tools.* Mathematics, 9(14), 1671.
> DOI: [10.3390/math9141671](https://doi.org/10.3390/math9141671)

---

## Installation

Install the development version from GitHub:

```{r install-github}
# install.packages("remotes")
remotes::install_github("your_user/LDAShiny")
```

Or, once published on CRAN:

```{r install-cran}
install.packages("LDAShiny")
```

---

## Launching the Application

Start the interactive dashboard with a single call:

```{r launch}
library(LDAShiny)
run_LDAShiny()
```

By default, the application accepts file uploads up to **500 MB**.
You can adjust this limit:

```{r launch-custom}
run_LDAShiny(max_upload_mb = 1000)
```

The dashboard opens in your default web browser and presents five sequential modules in the left-hand sidebar, each building on the output of the previous one.

---

## Workflow Overview

The full analysis pipeline consists of five modules:

```
┌──────────────────────┐
│ 1. Data Import       │  Upload Scopus CSV + WoS TXT → merged data.frame
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│ 2. Text Preprocessing│  Tokenise · Stopwords · Stemming · DTM
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│ 3. Inference (K)     │  Run LDA for k_min..k_max → coherence curve
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│ 4. Final LDA Model   │  Train model at optimal K → β, γ, word clouds
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│ 5. Trend Analysis    │  Linear regression of topic intensity over time
└──────────────────────┘
```

---

## Module 1: Data Import

### Supported formats

| Source          | Format  | Notes                              |
|-----------------|---------|------------------------------------|
| Scopus          | `.csv`  | Standard Scopus export             |
| Web of Science  | `.txt`  | Plain-text tagged format           |
| Integrated file | `.xlsx` | Previously exported merged dataset |

### How to use

1. Select **"Merge Scopus + Web of Science"** from the action picker.
2. Upload one or more Scopus CSV files and one or more WoS TXT files.
3. Click **"Process and Merge"**.

The module:

- Standardises column names across both sources (`doi`, `title`, `year`, `Journal`, `abstract`, `database`).
- Deduplicates records by DOI (case-insensitive). Records without a DOI are deduplicated by exact row content.
- Displays a summary table, a missing-values report, and a preview of the top 10 records.
- Allows customisable bar or lollipop charts of journal and year distributions, with full export control (PNG, TIFF, JPEG, PDF).
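
The deduplication rule described in the list above can be illustrated outside the app. The following is a minimal base R sketch, not the package's internal code: the `merged` data frame and its values are hypothetical, and only the column names come from the shared schema.

```{r dedup-sketch}
# Hypothetical merged table using the shared schema (values are made up).
merged <- data.frame(
  doi      = c("10.1000/ABC", "10.1000/abc", NA, NA, "10.1000/xyz"),
  title    = c("Paper A", "Paper A", "Paper B", "Paper B", "Paper C"),
  year     = c(2020, 2020, 2019, 2019, 2021),
  Journal  = c("J1", "J1", "J2", "J2", "J3"),
  abstract = c("text a", "text a", "text b", "text b", "text c"),
  database = c("Scopus", "WoS", "Scopus", "Scopus", "WoS"),
  stringsAsFactors = FALSE
)

# TRUE for rows that carry a non-empty DOI.
has_doi <- !is.na(merged$doi) & nzchar(trimws(merged$doi))

# Records with a DOI: keep the first occurrence of each DOI, ignoring case.
with_doi <- merged[has_doi, ]
with_doi <- with_doi[!duplicated(tolower(trimws(with_doi$doi))), ]

# Records without a DOI: drop rows that exactly repeat an earlier row.
without_doi <- merged[!has_doi, ]
without_doi <- without_doi[!duplicated(without_doi), ]

deduplicated <- rbind(with_doi, without_doi)
nrow(deduplicated)
```

In this toy data, the two spellings of the same DOI collapse into one record and the two identical DOI-less rows collapse into one, leaving three records.
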

The merged dataset can be downloaded as an `.xlsx` file for reuse via the **Load Integrated Excel File** option in future sessions.

### Internal helper functions

Two internal functions handle the parsing and standardisation steps:

- `parse_wos(path)`: reads a WoS plain-text export and returns a `data.frame` with columns `doi`, `title`, `year`, `Journal`, `abstract`, `database`. Missing fields (e.g. absent DOI) are set to `NA`.
- `standardize_scopus(df)`: maps Scopus column names to the shared schema. Any missing column is filled with `NA`.

---

## Module 2: Text Preprocessing

This module converts the free-text field (typically `abstract`) into a **Document-Term Matrix (DTM)** ready for LDA.

### Options

| Option             | Default    | Description                                          |
|--------------------|------------|------------------------------------------------------|
| Text column        | `abstract` | Column used as the document text                     |
| Document ID column | `title`    | Column used as row identifier                        |
| Min / Max n-gram   | 1 / 2      | Unigrams and bigrams by default                      |
| Stemming (Porter)  | Enabled    | Reduces words to their root form                     |
| Remove numbers     | Enabled    |                                                      |
| Remove punctuation | Enabled    |                                                      |
| Sparse filter      | 0.995      | Removes terms appearing in fewer than 0.5 % of docs  |
| Custom stopwords   | (none)     | Upload an `.xlsx` file with one word per row         |
| CPUs               | `max - 1`  | Parallel cores for DTM construction                  |

### Output

After clicking **"Run Preprocessing"**:

- **DTM Summary** tab shows the raw and post-filter matrix dimensions.
- **Term Frequency** tab presents an interactive table (`tf_mat`) with term, document frequency, and inverse document frequency columns.
- The filtered DTM (`.rds`) and term-frequency table (`.xlsx`) can be downloaded for external use.

### Technical details

The preprocessing pipeline uses:

1. `textmineR::CreateDtm()` for tokenisation, n-gram construction, and stopword removal.
2. `SnowballC::wordStem()` (Porter algorithm) for optional stemming.
3. `quanteda` + `tm::removeSparseTerms()` for sparsity filtering.
4. Conversion back to a `Matrix::dgCMatrix` for memory-efficient LDA fitting.

---

## Module 3: LDA Inference (Selecting K)

Choosing the right number of topics **K** is a critical step. This module fits multiple LDA models over a user-defined range of K values and evaluates each using **mean topic coherence**, a measure of the semantic interpretability of topics.

### Settings

| Parameter  | Default | Description                                    |
|------------|---------|------------------------------------------------|
| k start    | 5       | Minimum number of topics to test               |
| k end      | 40      | Maximum number of topics to test               |
| k step     | 1       | Increment between successive K values          |
| Iterations | 500     | Gibbs sampler iterations per model             |
| Burn-in    | 50      | Initial iterations discarded before sampling   |
| Alpha      | 0.1     | Document-topic concentration (Dirichlet prior) |
| CPUs       | max − 1 | Parallel workers (one model per core)          |

### Interpreting results

- The **Coherence Table** tab lists the coherence score for every K tested.
- The **Coherence Plot** tab shows the curve; the peak indicates the optimal K.

A higher coherence score means that the top terms of a topic tend to co-occur frequently in the same documents, producing more semantically coherent topics.

The plot is fully customisable (colors, themes, font sizes) and can be exported in PNG, TIFF, JPEG, or PDF formats.

### Recommendation

If the coherence curve has multiple local maxima, prefer the smaller K for a more parsimonious model, unless domain knowledge justifies a larger number of topics.

---

## Module 4: Final LDA Model Training

Once an optimal K is identified, this module trains the definitive LDA model. The **K field is pre-populated** with the value selected in Module 3, though it can be overridden.
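
When deciding whether to override the pre-populated K, the Module 3 recommendation (prefer the smaller K among competing peaks) can be applied to a coherence table by hand. The sketch below is only an illustration: the `coh` data frame, its column names, and the tolerance `tol` are assumptions, not part of the package.

```{r choose-k-sketch}
# Hypothetical coherence table; column names and values are assumptions.
coh <- data.frame(
  k         = c(5, 10, 15, 20, 25, 30, 35, 40),
  coherence = c(0.08, 0.12, 0.11, 0.125, 0.10, 0.09, 0.095, 0.07)
)

# An interior point is a local maximum when it exceeds both neighbours.
n <- nrow(coh)
is_peak <- c(
  FALSE,
  coh$coherence[2:(n - 1)] > coh$coherence[1:(n - 2)] &
    coh$coherence[2:(n - 1)] > coh$coherence[3:n],
  FALSE
)
peaks <- coh[is_peak, ]

# Global optimum: the K with the highest mean coherence.
k_best <- coh$k[which.max(coh$coherence)]

# Parsimonious alternative: the smallest local maximum whose coherence lies
# within `tol` of the best score (`tol` is an arbitrary, hypothetical choice).
tol <- 0.01
k_parsimonious <- min(peaks$k[peaks$coherence >= max(coh$coherence) - tol])

peaks
c(k_best = k_best, k_parsimonious = k_parsimonious)
```

Whether a slightly lower coherence is an acceptable price for fewer topics remains a judgment call that should be checked against the top terms of each candidate model.
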

### Parameters

| Parameter      | Default               | Description                             |
|----------------|-----------------------|-----------------------------------------|
| K              | (from inference)      | Number of topics                        |
| Iterations     | 500                   | Gibbs sampler iterations                |
| Burn-in        | 50                    | Warm-up iterations                      |
| Alpha          | 50 / K                | Document-topic prior (auto-scaled)      |
| Beta           | 0.05                  | Topic-term prior                        |
| Optimize Alpha | Yes                   | Whether to update alpha during sampling |
| Metrics        | Likelihood, Coherence | Optional: also compute R²               |

### Results tabs

After clicking **"Train Final LDA Model"**:

| Tab                            | Content                                              |
|--------------------------------|------------------------------------------------------|
| Model Evaluation Metrics       | K, iterations, α, β, mean coherence, log-likelihood  |
| Top Terms Matrix               | Top M terms per topic (configurable, default M = 20) |
| Document Topic Assignment      | Dominant topic for each document                     |
| Top Documents per Topic        | Top M documents per topic ranked by γ weight         |
| Topic-Term Weights (Beta)      | Full β matrix in tidy long format                    |
| Document-Topic Weights (Gamma) | Full γ matrix in tidy long format                    |
| Topic Word Cloud               | Interactive word cloud per topic                     |

#### Understanding Beta (β) and Gamma (γ)

- **β (phi matrix)**: probability of each term given a topic. A high β value for a term in a topic means that term is strongly associated with that topic.
- **γ (theta matrix)**: probability of each topic given a document. A high γ value means the document strongly belongs to that topic.

#### Word Clouds

Select any topic from the dropdown, set the maximum number of words, choose a color palette (from `RColorBrewer`), and click **"Generate Word Cloud"**. The word size is proportional to the term's β weight. Word clouds can be exported in PNG, JPEG, PDF, or TIFF format.

### Downloads

All result tables are available as `.xlsx` files.
The full trained model object can be saved as an `.rds` file:

```{r load-model}
# Load a previously saved model
lda_model <- readRDS("lda_final_model.rds")

# Inspect the phi (topic-term) matrix
head(lda_model$phi[, 1:5])

# Inspect the theta (document-topic) matrix
head(lda_model$theta[, 1:5])
```

---

## Module 5: Topic Trend Analysis

This module examines how topics have evolved over time by fitting a **simple linear regression** of mean topic intensity (mean γ) against publication year, separately for each topic.

### Classification

Each topic is assigned one of three trend categories:

| Category              | Criterion                                   | Plot color |
|-----------------------|---------------------------------------------|------------|
| **HOT (Increasing)**  | Slope > 0, p-value < threshold              | Red        |
| **COLD (Decreasing)** | Slope < 0, p-value < threshold              | Light blue |
| **EQUAL (Stable)**    | p-value ≥ threshold (non-significant slope) | Grey       |

The default significance threshold is **p = 0.05**, adjustable via the P-Value Threshold input.

### How to use

1. After training the LDA model in Module 4, navigate to **Trend Analysis**.
2. Set the p-value threshold (default: 0.05).
3. Click **"Run Trend Analysis"**.
4. Select any topic from the dropdown to visualise its trend.

### Results tabs

| Tab                | Content                                               |
|--------------------|-------------------------------------------------------|
| Topic-Year Data    | Raw yearly mean γ per topic                           |
| Regression Results | Slope estimate and p-value for each topic             |
| Topic Trend Plot   | Scatter plot with fitted regression line, color-coded |

The trend plot is fully customisable and exportable in multiple formats.

### Downloads

- **Regression Results (Excel)**: full table of slopes, p-values, and trend classifications for all topics.
- **Topic-Year Data (Excel)**: the source data used for the regression.
- **Plot**: current trend plot exported in the chosen format.
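
The classification rule of this module can be reproduced on exported topic-year data with base R. This is a minimal sketch under stated assumptions, not the package's implementation: the `topic_year` data frame, its column names (`year`, `topic`, `mean_gamma`), and the helper `classify_trend()` are hypothetical.

```{r trend-sketch}
# Hypothetical topic-year table: yearly mean gamma for three topics.
topic_year <- data.frame(
  year  = rep(2015:2022, times = 3),
  topic = rep(c("t_1", "t_2", "t_3"), each = 8),
  mean_gamma = c(
    0.10, 0.12, 0.13, 0.15, 0.17, 0.18, 0.21, 0.22,  # rising
    0.30, 0.27, 0.26, 0.23, 0.22, 0.19, 0.17, 0.16,  # falling
    0.20, 0.21, 0.19, 0.20, 0.21, 0.19, 0.20, 0.20   # flat
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fit mean_gamma ~ year for one topic and label the trend.
classify_trend <- function(df, p_threshold = 0.05) {
  fit <- lm(mean_gamma ~ year, data = df)
  coefs <- summary(fit)$coefficients
  slope <- coefs["year", "Estimate"]
  p_value <- coefs["year", "Pr(>|t|)"]
  trend <- if (p_value >= p_threshold) {
    "EQUAL"   # non-significant slope
  } else if (slope > 0) {
    "HOT"     # significant increase
  } else {
    "COLD"    # significant decrease
  }
  data.frame(
    topic = df$topic[1], slope = slope, p_value = p_value, trend = trend,
    stringsAsFactors = FALSE
  )
}

# One regression per topic, stacked into a single results table.
results <- do.call(
  rbind,
  lapply(split(topic_year, topic_year$topic), classify_trend)
)
results
```

With these made-up values the three topics are labelled HOT, COLD, and EQUAL respectively. Because each topic gets its own test, a stricter threshold may be worth considering when many topics are examined at once.
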

---

## Tips and Best Practices

**Data quality**

- Remove records with empty abstracts before importing; these produce empty document vectors that inflate DTM sparsity.
- Ensure year information is numeric and complete for valid trend analysis.

**Preprocessing**

- Start with the default sparse filter (0.995) and lower it (e.g. to 0.99) if the DTM is very large; a lower value removes more rare terms.
- Upload a custom stopword list (`.xlsx`, one term per row) to remove domain-specific noise (e.g. "study", "result", "method").

**Choosing K**

- Inspect the coherence curve carefully; a broad plateau often indicates a range of valid K values.
- Validate the final model qualitatively by reading the top terms of each topic and checking they form coherent themes.

**Reproducibility**

- Download the merged dataset after Module 1 so future sessions can start directly from the **Load Integrated Excel File** option.
- Save the trained model (`.rds`) to avoid retraining for downstream analyses.

---

## Session Information

```{r session-info}
sessionInfo()
```

---

## References

- De la Hoz-M., J., Fernandez-Gomez, M. J., & Mendes, S. (2021). LDAShiny: An R Package for Exploratory Review of Scientific Literature Based on a Bayesian Probabilistic Model and Machine Learning Tools. *Mathematics*, 9(14), 1671.
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. *Journal of Machine Learning Research*, 3, 993–1022.
- Chang, W., Cheng, J., Allaire, J., et al. (2023). *shiny: Web Application Framework for R*. R package version 1.7.5.
- Jones, T. (2019). *textmineR: Functions for Text Mining and Topic Modeling*. R package.