---
title: "SemanticDistance_Monologues_Lists"
author: "Jamie Reilly, Hannah R. Mechtenberg, Emily B. Myers, Jonathan E. Peelle"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{SemanticDistance_Monologues_Lists}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
vignetteBuilder: knitr
output:
  rmarkdown::html_vignette:
    toc: yes
---
```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
For our purposes, a monologue is an ordered language sample that contains no turn or speaker information. Monologues include narratives, stories, and the like. An unordered list is a bag-of-words where word order no longer matters. You might be interested in applying k-means, hierarchical clustering, or graph metrics to elucidate structure within a list of words. Alternatively, you might be interested in evaluating distances as an ordered time series. Whatever your goal, the vignette that follows illustrates how to prep and clean your text data using `SemanticDistance`.
```{r, message=FALSE, echo=F, warning=F}
# Load SemanticDistance
library(SemanticDistance)
```
# A typical monologue transcript
A sample monologue transcript, `Monologue_Typical`, is included in the package:
```{r, echo=F}
knitr::kable(head(Monologue_Typical, 1), format = "pipe")
```
# A typical unordered word list
A sample unordered word list, `Unordered_List`, is included in the package:
```{r, echo=F}
knitr::kable(head(Unordered_List, 1), format = "pipe")
```
# Step 1: Clean Monologue or List: `clean_monologue_or_list`
Transforms all text to lowercase, then optionally cleans (omits stopwords and non-alphabetic characters), lemmatizes (transforms morphological derivatives of words to their standard dictionary entries), and splits multiword utterances into a one-word-per-row format. You can generally leave `split_strings` in its default state (TRUE). `clean_monologue_or_list` appends several new variables to your original dataframe: **id_row_orig**, a numeric identifier marking the original row where a word or group of words appeared; **id_row_postsplit**, a unique identifier marking each word's ordered position in the dataframe after splitting multiword utterances across rows; and **word_clean**, the result of all cleaning operations, needed for distance calculations.
Function Arguments:

- `dat` raw dataframe with at least one column of text
- `wordcol` quoted column name where your target text lives (e.g., 'mytext')
- `omit_stops` logical; omit stopwords (default TRUE)
- `lemmatize` logical; transform each raw word to its lemmatized form (default TRUE)
```{r, message=FALSE}
Monologue_Cleaned <- clean_monologue_or_list(dat=Monologue_Typical, wordcol='mytext', omit_stops=TRUE, lemmatize=TRUE)
knitr::kable(head(Monologue_Cleaned, 12), format = "pipe", digits=2)
```
# Step 2: Choose Distance Option/Compute Distances
## Option 1: Ngram-to-Word Distance: `dist_ngram2word`
Computes cosine distance for two semantic models (embedding-based and experiential) using a rolling ngram approach, comparing each group of words (ngram) to the next word. **Important:** the function looks backward from the target word, skipping over NAs, until it fills the desired ngram size.
Function Arguments:

- `dat` dataframe of a monologue transcript cleaned and prepped with the `clean_monologue_or_list` function
- `ngram` window size preceding each new content word; ngram=1 means each word is compared to the word before it
```{r, message=FALSE}
Ngram2Word_Dists1 <- dist_ngram2word(dat=Monologue_Cleaned, ngram=1) #distance word-to-word
head(Ngram2Word_Dists1)
```
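Each of these comparisons boils down to a cosine distance between semantic vectors. As a rough, self-contained sketch of that computation (the vectors below are made up for illustration and are not taken from the package's models):

```{r, message=FALSE}
# Illustrative only: cosine distance between two hypothetical vectors,
# the quantity computed between an ngram and the next word
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
v_ngram <- c(0.2, 0.5, 0.1)  # made-up averaged ngram vector
v_word  <- c(0.1, 0.4, 0.3)  # made-up next-word vector
cosine_distance(v_ngram, v_word)
```

A distance of 0 indicates identical vector directions; values approaching 1 (or beyond, for negatively correlated vectors) indicate greater semantic distance.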
## Option 2: Ngram-to-Ngram Distance: `dist_ngram2ngram`
The user specifies an ngram size (e.g., ngram=2). Distance is computed from each two-word chunk to the next, iterating down the dataframe until there are no more words to 'fill out' the last ngram. Note that this distance function **only works on monologue transcripts**, where no speakers are delineated and word order matters.
Function Arguments:

- `dat` dataframe with a monologue sample cleaned and prepped
- `ngram` chunk size (chunk-to-chunk); ngram=2 means chunks of 2 words are compared to the next chunk
```{r, message=FALSE}
Ngram2Ngram_Dist1 <- dist_ngram2ngram(dat=Monologue_Cleaned, ngram=2)
head(Ngram2Ngram_Dist1)
```
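To make the chunking scheme concrete, here is a toy sketch of which words get paired under ngram=2, assuming a window that slides one word at a time (the actual function averages the semantic vectors within each chunk before computing distance; this example only shows the grouping, using an invented word sequence):

```{r, message=FALSE}
# Illustrative only: pairing of consecutive 2-word chunks in a toy sequence
words <- c("dog", "chase", "ball", "park", "grass")
ngram <- 2
starts <- seq_len(length(words) - 2 * ngram + 1)
data.frame(
  chunk1 = sapply(starts, function(i) paste(words[i:(i + ngram - 1)], collapse = " ")),
  chunk2 = sapply(starts, function(i) paste(words[(i + ngram):(i + 2 * ngram - 1)], collapse = " "))
)
```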
## Option 3: Anchor-to-Word Distance: `dist_anchor`
Models semantic distance from each successive new word to the average of the semantic vectors for the first block of N content words (the anchor). This anchored distance provides a metric of overall semantic drift as a language sample unfolds relative to a fixed starting point.
Function Arguments:

- `dat` dataframe with a monologue sample cleaned and prepped using `clean_monologue_or_list`
- `anchor_size` number of words in the initial chunk used for chunk-to-new-word comparisons
```{r, message=FALSE}
Anchored_Dists1 <- dist_anchor(dat=Monologue_Cleaned, anchor_size=4)
head(Anchored_Dists1)
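The anchored logic can be sketched in a few lines of base R. The vectors below are random stand-ins, not the package's embedding or experiential norms; the point is only that the anchor is a fixed mean vector and every later word is compared against it:

```{r, message=FALSE}
# Illustrative only: anchored comparison with made-up vectors
set.seed(1)
vecs <- matrix(rnorm(7 * 3), nrow = 7)   # 7 toy "words", 3 toy dimensions
anchor <- colMeans(vecs[1:4, ])          # mean of first 4 vectors (anchor_size = 4)
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
# distance from each later word to the fixed anchor: an index of drift
apply(vecs[5:7, ], 1, cosine_distance, b = anchor)
```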
```