---
title: "SemanticDistance Dialogues"
author: "Jamie Reilly, Hannah R. Mechtenberg, Emily B. Myers, Jonathan E. Peelle"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{SemanticDistance Dialogues}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
vignetteBuilder: knitr
output:
  rmarkdown::html_vignette:
    toc: yes
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

```{r, message=FALSE, echo=FALSE, warning=FALSE}
# Load SemanticDistance
library(SemanticDistance)
```

# Dialogues

A dialogue can be a conversation transcript or any language sample where talker/interlocutor information matters (e.g., computing semantic distance across turns in a conversation). Your dataframe should contain, at minimum, a text column and a speaker/talker column.

A sample dialogue transcript (`Dialogue_Typical`) is included in the package:

```{r, eval=TRUE, message=FALSE, warning=FALSE}
knitr::kable(head(Dialogue_Typical, 6), format = "pipe")
```

# Step 1: Clean Dialogue Transcript (clean_dialogue)

Decide on your cleaning parameters (e.g., stopwords? lemmatization?). Specify these in the argument(s) to your function calls.
Arguments to `clean_dialogue()` are:

- `dat` your raw dataframe with at least one column of text AND a talker column
- `wordcol` column name (quoted) containing the text you want cleaned
- `who_talking` column name (quoted) containing the talker ID (will convert to factor)
- `omit_stops` omits stopwords, T/F default is TRUE
- `lemmatize` transforms raw word to lemmatized form, T/F default is TRUE

```{r, message=FALSE}
Dialogue_Cleaned <- clean_dialogue(dat = Dialogue_Typical, wordcol = "text", who_talking = "speaker",
                                   omit_stops = TRUE, lemmatize = TRUE)
knitr::kable(head(Dialogue_Cleaned, 12), format = "pipe")
```

# Step 2: Compute Semantic Distances

## Dialogue Distance Turn-to-Turn (dist_dialogue)

`dist_dialogue()` averages the semantic vectors of all content words in a turn, then computes the cosine distance to the averaged semantic vector of the content words in the subsequent turn. Note: this function only works on dialogue samples marked by a talker variable (e.g., conversation transcripts). You just need to feed it a transcript formatted with `clean_dialogue()`. `dist_dialogue()` returns a summary dataframe containing distance values aggregated by talker and turn (`id_turn`).

Arguments to `dist_dialogue()` are:

- `dat` dataframe containing a dialogue sample cleaned and prepped using `clean_dialogue()`
- `who_talking` column name (quoted) containing the talker ID

```{r, message=FALSE}
DialogueDists <- dist_dialogue(dat = Dialogue_Cleaned, who_talking = "speaker")
knitr::kable(head(DialogueDists, 12), format = "pipe", digits = 2)
```
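
To make the turn-to-turn logic concrete, here is a minimal base R sketch of the idea described above: average the word vectors within each turn, then take the cosine distance between each turn's average and the next turn's average. This is an illustration only, not the package's implementation. The `toy` dataframe, its 3-dimensional vectors (`v1` to `v3`), the `cos_dist()` helper, and the `cosdist_next` column are all hypothetical names invented for this example; real semantic vectors have many more dimensions, and where `dist_dialogue()` places the missing value for a turn with no neighbor may differ.

```r
# Toy data: one row per content word, with a made-up 3-dimensional vector.
# These are NOT real embeddings; they only illustrate the arithmetic.
toy <- data.frame(
  id_turn = c(1, 1, 2, 2, 3),
  speaker = c("A", "A", "B", "B", "A"),
  v1 = c(1, 0, 0, 0, 1),
  v2 = c(0, 1, 0, 1, 0),
  v3 = c(0, 0, 1, 1, 0)
)

# Cosine distance = 1 - cosine similarity (hypothetical helper)
cos_dist <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Average the word vectors within each turn (one row per turn)
turn_means <- aggregate(cbind(v1, v2, v3) ~ id_turn + speaker, data = toy, FUN = mean)
turn_means <- turn_means[order(turn_means$id_turn), ]

# Distance from each turn's mean vector to the next turn's mean vector;
# the final turn has no successor, so its distance is NA
m <- as.matrix(turn_means[, c("v1", "v2", "v3")])
n_turns <- nrow(m)
turn_means$cosdist_next <- c(
  vapply(seq_len(n_turns - 1), function(i) cos_dist(m[i, ], m[i + 1, ]), numeric(1)),
  NA_real_
)

turn_means[, c("id_turn", "speaker", "cosdist_next")]
```

With these toy vectors, the distance from turn 1 to turn 2 is about 0.5 (the turns share one dimension), and the distance from turn 2 to turn 3 is 1 (the averaged vectors are orthogonal).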