---
title: "Installation, Initialization, and Data Cleaning"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Installation, Initialization, and Data Cleaning}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Prerequisites

leadeR relies on [spaCy](https://spacy.io/), a Python NLP library, via the [spacyr](https://spacyr.quanteda.io/) R package. You will need:

- **Python** (3.8 or later)
- **spaCy** with an English language model

Install spaCy and the English model from a terminal:

```bash
pip install spacy
python -m spacy download en_core_web_sm
```

## Installing leadeR

Install leadeR from GitHub:

```{r}
# install.packages("remotes")
remotes::install_github("mmukaigawara/leadeR")
```

## Initialization

Before calling any leadeR function, initialize spaCy. Optionally, set a seed so that bootstrap results are reproducible.

```{r}
library(leadeR)
library(data.table)

spacyr::spacy_initialize()
set.seed(02138)
```

## Sample data

The package ships with three speeches by John F. Kennedy:

| Dataset       | Date               | Occasion                                    |
|---------------|--------------------|---------------------------------------------|
| `jfk19610120` | January 20, 1961   | Inaugural Address                           |
| `jfk19610925` | September 25, 1961 | Address Before the UN General Assembly      |
| `jfk19630610` | June 10, 1963      | Commencement Address at American University |

```{r}
head(jfk19610120)
```

## Text cleaning

Speech transcripts often contain editorial annotations in brackets, parentheses, or curly braces. The `clean_text()` function removes these annotations and normalizes whitespace.

```{r}
jfk1 <- clean_text(jfk19610120)
jfk2 <- clean_text(jfk19610925)
jfk3 <- clean_text(jfk19630610)
```

Depending on the source of the text, users may need additional cleaning steps, such as removing headers, footers, or speaker labels.
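
The extra cleaning steps mentioned above are not part of leadeR, but they can be sketched in base R. The example below is a minimal illustration, assuming the transcript is available as a character vector with one line per element. The `raw` vector and the header, footer, and speaker-label patterns are hypothetical and would need to be adapted to the actual source.

```{r}
# Hypothetical raw transcript lines; these are NOT part of the leadeR sample data.
raw <- c(
  "OFFICE OF THE PRESS SECRETARY",
  "THE PRESIDENT: Thank you all for coming.",
  "  We have much work ahead of us.  ",
  "Page 1 of 4"
)

# Drop lines that look like headers or footers (illustrative patterns only).
is_boilerplate <- grepl("^(OFFICE OF THE PRESS SECRETARY|Page [0-9]+ of [0-9]+)$", raw)
body <- raw[!is_boilerplate]

# Strip a leading all-caps speaker label such as "THE PRESIDENT: ".
body <- sub("^[A-Z][A-Z .]*:[[:space:]]*", "", body)

# Trim leading and trailing whitespace left over from the source.
body <- trimws(body)
body
```

Whether such steps should run before or after `clean_text()` depends on the source; the patterns here are deliberately strict so that ordinary sentences are not removed by accident.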