---
title: "SportMiner: Text Mining and Topic Modeling for Sport Science Literature"
author:
  - name: Praveen D Chougale
    affiliation: IIT Bombay
    email: praveenmaths89@gmail.com
  - name: Usha Ananthakumar
    affiliation: IIT Bombay
    email: usha@som.iitb.ac.in
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{SportMiner: Text Mining and Topic Modeling for Sport Science Literature}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  fig.width = 7,
  fig.height = 5
)
```

# Introduction

The exponential growth of scientific literature in sport science domains presents both opportunities and challenges for researchers. While vast amounts of knowledge are being generated, systematically synthesizing it and identifying research trends has become increasingly difficult. **SportMiner** addresses this challenge by providing a comprehensive, integrated toolkit for mining, analyzing, and visualizing sport science literature.

## Motivation

Traditional literature review methods are time-consuming and potentially biased. Researchers need automated tools to:

1. **Efficiently retrieve** relevant papers from large databases
2. **Systematically process** textual content at scale
3. **Discover latent themes** using advanced topic modeling
4. **Visualize trends** in publication patterns and research focus
5. **Identify knowledge gaps** and emerging research directions

## Related Work

Several R packages address components of text mining and topic modeling:

- **tm** [@Feinerer2008]: General text mining framework
- **topicmodels** [@Grun2011]: Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM)
- **stm** [@Roberts2019]: Structural Topic Models with covariates
- **tidytext** [@Silge2016]: Text mining using tidy data principles

However, no existing package provides an integrated workflow specifically designed for scientific literature analysis with domain-specific features for sport science research.

## Contributions

**SportMiner** makes the following contributions:

1. **Integrated workflow** from data retrieval through visualization
2. **Scopus API integration** for systematic literature searches
3. **Multiple topic modeling algorithms** (LDA, CTM, STM) with comparison tools
4. **Publication-ready visualizations** following modern design principles
5. **Keyword co-occurrence networks** for understanding research connections
6. **Temporal trend analysis** for tracking research evolution

# Installation and Setup

## Installation

Install the released version from CRAN:

```{r installation}
install.packages("SportMiner")
```

Or install the development version from GitHub:

```{r install-dev}
# install.packages("devtools")
devtools::install_github("praveenchougale/SportMiner")
```

## API Configuration

SportMiner uses the Scopus API for literature retrieval. Obtain a free API key from the [Elsevier Developer Portal](https://dev.elsevier.com/).

```{r api-setup}
library(SportMiner)

# Option 1: Set directly in session
sm_set_api_key("your_api_key_here")

# Option 2: Store in .Renviron (recommended)
# usethis::edit_r_environ()
# Add: SCOPUS_API_KEY=your_api_key_here
# Restart R, then:
sm_set_api_key()
```

# Complete Workflow Example

This section demonstrates a complete analysis workflow, from literature search through topic modeling and visualization.
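For orientation, the following condensed sketch strings the main functions together in the order they are discussed below. Arguments are abbreviated, and the query is the basic example introduced in Step 1; see the individual steps for full argument lists.

```{r workflow-overview}
# Condensed end-to-end sketch of Steps 1-7 (placeholder query; arguments abbreviated)
papers    <- sm_search_scopus('TITLE-ABS-KEY("machine learning" AND "sports")', max_count = 200)
processed <- sm_preprocess_text(papers, text_col = "abstract")
dtm       <- sm_create_dtm(processed, min_term_freq = 3, max_term_freq = 0.5)
k_sel     <- sm_select_optimal_k(dtm, k_range = seq(4, 20, by = 2))
lda_model <- sm_train_lda(dtm, k = k_sel$optimal_k, seed = 1729)
sm_plot_topic_terms(lda_model, n_terms = 10)
```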
## Step 1: Literature Retrieval

Scopus queries follow a structured syntax with field codes and Boolean operators.

### Basic Query Syntax

```{r query-basics}
# Search in title, abstract, and keywords
query_basic <- 'TITLE-ABS-KEY("machine learning" AND "sports")'

# Search specific fields
query_title    <- 'TITLE("performance prediction")'
query_abstract <- 'ABS("neural networks")'
query_keywords <- 'KEY("injury prevention")'
```

### Boolean Operators and Filters

```{r query-advanced}
# Complex query with multiple conditions
query <- paste0(
  'TITLE-ABS-KEY(',
  '("machine learning" OR "deep learning" OR "artificial intelligence") ',
  'AND ("sports" OR "athlete*" OR "performance") ',
  'AND NOT "e-sports"',
  ') ',
  'AND DOCTYPE(ar) ',                    # Articles only
  'AND PUBYEAR > 2018 ',                 # Published after 2018
  'AND LANGUAGE(english) ',              # English only
  'AND SUBJAREA(MEDI OR HEAL OR COMP)'   # Relevant subject areas
)
```

### Available Search Filters

**Document Type Filters:**

- `DOCTYPE(ar)`: Journal articles
- `DOCTYPE(re)`: Review articles
- `DOCTYPE(cp)`: Conference papers

**Date Filters:**

- `PUBYEAR = 2024`: Exact year
- `PUBYEAR > 2019`: After 2019
- `PUBYEAR > 2019 AND PUBYEAR < 2025`: Between years

**Subject Area Filters:**

- `SUBJAREA(MEDI)`: Medicine
- `SUBJAREA(HEAL)`: Health Professions
- `SUBJAREA(COMP)`: Computer Science
- `SUBJAREA(PSYC)`: Psychology

### Execute Search

```{r search-execution}
papers <- sm_search_scopus(
  query = query,
  max_count = 200,
  batch_size = 100,
  view = "COMPLETE",
  verbose = TRUE
)

# Inspect results
dim(papers)
head(papers[, c("title", "year", "author_keywords")])
```

The function returns a data frame with columns including `title`, `abstract`, `author_keywords`, `year`, `doi`, and `eid`.

## Step 2: Text Preprocessing

Raw abstracts require preprocessing before topic modeling.

```{r preprocess}
processed_data <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  doc_id_col = "doc_id",
  min_word_length = 3
)

head(processed_data)
```

The preprocessing pipeline performs:

1. **Tokenization**: Split text into individual words
2. **Lowercasing**: Convert to lowercase
3. **Stop word removal**: Remove common words (the, and, of, etc.)
4. **Number removal**: Remove numeric tokens
5. **Stemming**: Reduce words to root forms using the Porter stemmer
6. **Filtering**: Keep only words of at least the minimum length

## Step 3: Document-Term Matrix

Create a sparse matrix representation of term frequencies.

```{r dtm}
dtm <- sm_create_dtm(
  word_counts = processed_data,
  min_term_freq = 3,
  max_term_freq = 0.5
)

# Matrix dimensions
print(paste("Documents:", dtm$nrow, "| Terms:", dtm$ncol))

# Sparsity: percentage of zero cells (dtm$v holds the non-zero entries)
sparsity <- 100 * (1 - length(dtm$v) / (dtm$nrow * dtm$ncol))
print(paste("Sparsity:", round(sparsity, 2), "%"))
```

Parameters `min_term_freq` and `max_term_freq` control vocabulary size:

- `min_term_freq`: Minimum document frequency (removes rare terms)
- `max_term_freq`: Maximum document proportion (removes very common terms)

## Step 4: Optimal Topic Number Selection

Determine the appropriate number of topics using model evaluation metrics.

```{r optimal-k}
k_selection <- sm_select_optimal_k(
  dtm = dtm,
  k_range = seq(4, 20, by = 2),
  method = "gibbs",
  plot = TRUE
)

# View results
print(k_selection$metrics)
print(paste("Optimal k:", k_selection$optimal_k))
```

The function compares models across different values of $k$ using perplexity, a measure of model fit (lower is better).
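For reference, perplexity is the exponentiated negative per-word log-likelihood of the evaluated documents [@Blei2003]:

$$
\mathrm{perplexity}(D) = \exp\!\left( -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right),
$$

where $M$ is the number of documents, $\mathbf{w}_d$ the word sequence of document $d$, and $N_d$ its length in tokens. Lower values indicate that the model assigns higher probability to the observed words.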
## Step 5: Train Topic Model

Fit a Latent Dirichlet Allocation (LDA) model with the optimal number of topics.

```{r train-lda}
lda_model <- sm_train_lda(
  dtm = dtm,
  k = k_selection$optimal_k,
  method = "gibbs",
  iter = 2000,
  alpha = 50 / k_selection$optimal_k,  # Symmetric Dirichlet prior
  seed = 1729
)

# Examine top terms per topic
terms_matrix <- topicmodels::terms(lda_model, 10)
print(terms_matrix)
```

LDA [@Blei2003] models each document as a mixture of topics, where each topic is a distribution over words. The Gibbs sampling method [@Griffiths2004] estimates the model parameters through Markov chain Monte Carlo.

## Step 6: Model Comparison

Compare multiple topic modeling approaches.

```{r compare-models}
comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)

# View metrics
print(comparison$metrics)
print(paste("Recommended model:", comparison$recommendation))

# Extract best model
best_model <- comparison$models[[tolower(comparison$recommendation)]]
```

The function compares the following models:

- **LDA**: Standard Latent Dirichlet Allocation
- **CTM**: Correlated Topic Model [@Blei2007], which allows topics to be correlated
- **STM**: Structural Topic Model [@Roberts2014] (planned; not yet implemented)

## Step 7: Visualization

### Topic Terms Visualization

Display the most important terms for each topic.

```{r plot-terms, fig.cap="Top terms per topic with beta weights"}
plot_terms <- sm_plot_topic_terms(
  model = lda_model,
  n_terms = 10
)
print(plot_terms)
```

The visualization shows term importance (beta values) within each topic. A higher beta indicates greater relevance to the topic.

### Topic Frequency Distribution

Show how topics are distributed across the document collection.

```{r plot-frequency, fig.cap="Document distribution across topics"}
plot_freq <- sm_plot_topic_frequency(
  model = lda_model,
  dtm = dtm
)
print(plot_freq)
```

### Topic Trends Over Time

Examine how topic prevalence changes over publication years.

```{r plot-trends, fig.cap="Topic prevalence trends over time"}
# Ensure papers have doc_id matching DTM rownames
papers$doc_id <- rownames(dtm)

plot_trends <- sm_plot_topic_trends(
  model = lda_model,
  dtm = dtm,
  metadata = papers,
  year_col = "year",
  doc_id_col = "doc_id"
)
print(plot_trends)
```

This visualization reveals emerging and declining research themes over time.

### Keyword Co-occurrence Network

Analyze relationships between author keywords.

```{r keyword-network, fig.cap="Author keyword co-occurrence network"}
network_plot <- sm_keyword_network(
  data = papers,
  keyword_col = "author_keywords",
  min_cooccurrence = 3,
  top_n = 30
)
print(network_plot)
```

Network analysis reveals:

- **Node size**: Keyword frequency
- **Edge width**: Co-occurrence strength
- **Communities**: Clusters of related keywords

# Advanced Usage

## Custom Preprocessing

Override the default preprocessing parameters.

```{r custom-preprocess}
processed_custom <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  doc_id_col = "doc_id",
  min_word_length = 4,                                 # Longer minimum word length
  custom_stopwords = c("study", "research", "paper")   # Additional stopwords
)
```

## Hyperparameter Tuning

LDA performance depends on hyperparameters.

```{r hyperparameters}
# Test different alpha values
alphas <- c(0.1, 0.5, 1.0)

results <- lapply(alphas, function(a) {
  model <- sm_train_lda(dtm, k = 10, alpha = a, seed = 1729)
  perplexity <- topicmodels::perplexity(model, dtm)
  data.frame(alpha = a, perplexity = perplexity)
})

# Compare results
do.call(rbind, results)
```
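Beyond perplexity, it can help to look at how `alpha` shapes the document-topic mixtures. The sketch below (not a SportMiner function) uses `topicmodels::posterior()` to inspect per-document topic proportions; smaller `alpha` values typically yield more peaked mixtures, with each document dominated by fewer topics.

```{r alpha-gamma}
# Refit with a small alpha and inspect the document-topic matrix (documents x topics)
model_sparse <- sm_train_lda(dtm, k = 10, alpha = 0.1, seed = 1729)
gamma <- topicmodels::posterior(model_sparse)$topics

# Distribution of each document's dominant-topic weight;
# values near 1 indicate documents concentrated on a single topic
summary(apply(gamma, 1, max))
```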
## Exporting Results

Save models and visualizations for publication.

```{r export}
# Save model
saveRDS(lda_model, "lda_model.rds")

# Save plots
ggplot2::ggsave("topic_terms.png", plot_terms, width = 12, height = 8, dpi = 300)
ggplot2::ggsave("topic_trends.png", plot_trends, width = 12, height = 6, dpi = 300)

# Export document-topic assignments
topics <- topicmodels::topics(lda_model, 1)
papers$dominant_topic <- paste0("Topic_", topics)
write.csv(papers, "papers_with_topics.csv", row.names = FALSE)

# Export topic-term matrix
beta <- topicmodels::posterior(lda_model)$terms
write.csv(beta, "topic_term_matrix.csv")
```

# Case Study: Sports Analytics Literature

This case study demonstrates SportMiner on a systematic review of the sports analytics literature.

## Research Question

What are the main research themes in sports analytics over the past decade, and how have they evolved?

## Method

```{r case-study}
# Comprehensive search query
query_case <- paste0(
  'TITLE-ABS-KEY(',
  '("sports analytics" OR "sports data science" OR "sports informatics" OR ',
  '"performance analysis" OR "match analysis") ',
  'AND ("data" OR "analytics" OR "statistics" OR "modeling")',
  ') ',
  'AND DOCTYPE(ar OR re) ',
  'AND PUBYEAR > 2013 ',
  'AND LANGUAGE(english)'
)

# Retrieve papers
papers_case <- sm_search_scopus(query_case, max_count = 500, verbose = TRUE)

# Full preprocessing pipeline
processed_case <- sm_preprocess_text(papers_case, text_col = "abstract")
dtm_case <- sm_create_dtm(processed_case, min_term_freq = 5, max_term_freq = 0.4)

# Model selection
k_case <- sm_select_optimal_k(dtm_case, k_range = seq(6, 18, by = 2), plot = TRUE)

# Train final model
model_case <- sm_train_lda(dtm_case, k = k_case$optimal_k, iter = 2000, seed = 1729)

# Visualizations
terms_plot <- sm_plot_topic_terms(model_case, n_terms = 12)
trends_plot <- sm_plot_topic_trends(model_case, dtm_case, papers_case)
```

## Results Interpretation

The topic model with $k = 12$ topics identified distinct research themes:

1. **Performance prediction models**: Machine learning for outcome forecasting
2. **Injury prevention**: Biomechanical analysis and risk assessment
3. **Tactical analysis**: Team strategy and formation analysis
4. **Player evaluation**: Rating systems and talent identification
5. **Training optimization**: Load monitoring and periodization
6. **Computer vision**: Automated video analysis
7. **Wearable sensors**: Real-time monitoring systems
8. **Network analysis**: Team dynamics and interactions
9. **Social media analytics**: Fan engagement analysis
10. **Betting markets**: Prediction markets and odds analysis
11. **Fantasy sports**: Player selection algorithms
12. **Officials and refereeing**: Decision-making analysis

Temporal trends reveal:

- Increasing focus on deep learning and AI (2018-2024)
- Declining emphasis on traditional statistical methods
- Emerging interest in explainable AI and interpretability

# Computational Performance

SportMiner is designed for efficiency with large document collections.

## Benchmarks

```{r benchmarks}
# Test on varying document sizes
sizes <- c(100, 500, 1000, 2000)

times <- sapply(sizes, function(n) {
  subset_dtm <- dtm_case[1:min(n, dtm_case$nrow), ]
  system.time({
    sm_train_lda(subset_dtm, k = 10, iter = 1000)
  })["elapsed"]
})

# Display results
data.frame(documents = sizes, time_seconds = times)
```

## Optimization Tips

1. **Start small**: Test on a subset before the full corpus
2. **Reduce iterations**: Use 500-1000 for exploration, 2000+ for final models
3. **Parallel processing**: Enable for large k ranges in `sm_select_optimal_k()`, or parallelize manually (see the sketch after this list)
4. **DTM filtering**: Aggressive term filtering reduces computational burden
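As a minimal sketch of manual parallelization, assuming you want explicit control over the candidate models, base R's **parallel** package can distribute `sm_train_lda()` calls across cores (fork-based `mclapply()` works on Linux/macOS; on Windows a cluster-based approach is needed). This is not a SportMiner API, just one way to spread the work.

```{r parallel-k}
library(parallel)

# Fit one LDA per candidate k on separate cores (fork-based; Linux/macOS)
k_values <- seq(6, 18, by = 2)
models <- mclapply(k_values, function(k) {
  sm_train_lda(dtm_case, k = k, iter = 1000, seed = 1729)
}, mc.cores = 4)

# Compare fit across k by perplexity (lower is better)
perplexities <- sapply(models, topicmodels::perplexity, newdata = dtm_case)
data.frame(k = k_values, perplexity = perplexities)
```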
# Best Practices

## Reproducibility

Always set random seeds for reproducible results:

```{r reproducibility}
sm_train_lda(dtm, k = 10, seed = 1729)
sm_compare_models(dtm, k = 10, seed = 1729)
```

## Query Design

1. **Start broad, refine iteratively**: Begin with general queries, narrow based on results
2. **Test on the Scopus web interface**: Verify query syntax and result counts
3. **Document your queries**: Save queries in a text file or R script
4. **Consider synonyms**: Include alternative terms and spellings

## Preprocessing Decisions

- **min_word_length**: 3 is standard; use 4 for technical domains
- **min_term_freq**: Higher values (5-10) for large corpora reduce noise
- **max_term_freq**: 0.5-0.8 removes very common domain terms
- **Stemming**: Reduces vocabulary but may decrease interpretability

## Model Selection

1. **Don't rely solely on metrics**: Inspect topic terms for interpretability
2. **Check topic coherence**: Topics should be semantically meaningful
3. **Consider domain knowledge**: Validate topics with subject matter experts
4. **Multiple k values**: Research questions may be answerable at different granularities

## Visualization

All plots use `theme_sportminer()` for consistent aesthetics:

```{r custom-viz}
library(ggplot2)

# Customize theme parameters
plot_terms + theme_sportminer(base_size = 14, grid = FALSE)
```

# Summary

**SportMiner** provides an integrated, efficient workflow for analyzing sport science literature. The package combines database querying, text preprocessing, topic modeling, and visualization in a unified framework. Researchers can rapidly identify research trends, discover thematic structures, and track field evolution over time.

## Key Features

- Scopus API integration with flexible query syntax
- Robust text preprocessing pipeline
- Multiple topic modeling algorithms with comparison tools
- Publication-ready visualizations with sensible defaults
- Keyword network analysis for understanding research connections
- Comprehensive documentation and reproducible examples

## Future Development

Planned enhancements include:

- Additional databases (PubMed, Web of Science)
- Structural Topic Models (STM) with metadata covariates
- Interactive visualizations with **shiny**
- Topic coherence metrics beyond perplexity
- Multilingual support
- Integration with **bibliometrix** for citation analysis

## Acknowledgments

We thank the reviewers for valuable feedback that improved this package.

# References

::: {#refs}
:::

# Computational Details

```{r session-info}
sessionInfo()
```

# Appendix: Function Reference

## Data Retrieval Functions

- `sm_set_api_key()`: Configure Scopus API credentials
- `sm_search_scopus()`: Search the Scopus database
- `sm_get_indexed_keywords()`: Retrieve indexed keywords for papers

## Preprocessing Functions

- `sm_preprocess_text()`: Tokenize and clean text data
- `sm_create_dtm()`: Create a document-term matrix

## Topic Modeling Functions

- `sm_train_lda()`: Fit an LDA model
- `sm_select_optimal_k()`: Select the optimal number of topics
- `sm_compare_models()`: Compare LDA, CTM, and STM

## Visualization Functions

- `sm_plot_topic_terms()`: Visualize top terms per topic
- `sm_plot_topic_frequency()`: Show topic distribution
- `sm_plot_topic_trends()`: Plot topic trends over time
- `sm_keyword_network()`: Create keyword co-occurrence network
- `theme_sportminer()`: Custom ggplot2 theme