---
title: "SemanticDistance Dialogues"
author: "Jamie Reilly, Hannah R. Mechtenberg, Emily B. Myers, Jonathan E. Peelle"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{SemanticDistance Dialogues}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
vignetteBuilder: knitr
output:
  rmarkdown::html_vignette:
    toc: yes
---
```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
```{r, message=FALSE, echo=F, warning=F}
# Load SemanticDistance
library(SemanticDistance)
```
# Dialogues
A dialogue can be a conversation transcript or any language sample where talker/interlocutor information matters (e.g., computing semantic distance across turns in a conversation). At a minimum, your dataframe should contain a text column and a speaker/talker column.

The package includes a sample dialogue transcript, `Dialogue_Typical`:
```{r, eval=T, message=F, warning=F}
knitr::kable(head(Dialogue_Typical, 6), format = "pipe")
```
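
If you are preparing your own transcript, the following is a minimal sketch of the expected shape, assuming one row per utterance. The column names `text` and `speaker` mirror the ones used in the function calls in this vignette; the utterances are invented for illustration and `My_Dialogue` is a hypothetical object name.

```{r, message=FALSE}
# Toy transcript (invented data): a text column plus a talker column
My_Dialogue <- data.frame(
  text = c("Do you like dogs", "Yes I have two dogs at home", "What are their names"),
  speaker = c("A", "B", "A"),
  stringsAsFactors = FALSE
)
# Inspect the structure before passing it to the cleaning step
str(My_Dialogue)
```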
# Step 1: Clean Dialogue Transcript (clean_dialogue)
Decide on your cleaning parameters (e.g., stopwords? lemmatization?). Specify these in the argument(s) to your function calls.
Arguments to `clean_dialogue()` are:

- `dat`: your raw dataframe with at least one column of text and a talker column
- `wordcol`: quoted column name containing the text you want cleaned
- `who_talking`: quoted column name containing the talker ID (converted to a factor)
- `omit_stops`: omits stopwords, T/F; default is TRUE
- `lemmatize`: transforms each raw word to its lemmatized form, T/F; default is TRUE

```{r, message=FALSE}
Dialogue_Cleaned <- clean_dialogue(dat=Dialogue_Typical, wordcol="text", who_talking="speaker", omit_stops=TRUE, lemmatize=TRUE)
knitr::kable(head(Dialogue_Cleaned, 12), format = "pipe")
```
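
To see how the cleaning parameters change the output, one option is to rerun the cleaner with a different setting and compare the result against `Dialogue_Cleaned`. The sketch below uses only the arguments described in this step; `Dialogue_KeepStops` is a hypothetical object name chosen for illustration.

```{r, message=FALSE}
library(SemanticDistance)
# Same sample transcript, but retain stopwords (lemmatization left on)
Dialogue_KeepStops <- clean_dialogue(dat=Dialogue_Typical, wordcol="text", who_talking="speaker", omit_stops=FALSE, lemmatize=TRUE)
knitr::kable(head(Dialogue_KeepStops, 12), format = "pipe")
```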
# Step 2: Compute Semantic Distances
## Dialogue Distance Turn-to-Turn (dist_dialogue)
`dist_dialogue` averages the semantic vectors of all content words in a turn, then computes the cosine distance to the averaged semantic vector of the content words in the subsequent turn. Note: this function only works on dialogue samples marked by a talker variable (e.g., conversation transcripts). You just need to feed it a transcript formatted with `clean_dialogue`. `dist_dialogue` returns a summary dataframe with distance values aggregated by talker and turn (`id_turn`).

Arguments to `dist_dialogue` are:

- `dat`: dataframe with a dialogue sample cleaned and prepped using `clean_dialogue`
- `who_talking`: quoted column name containing the talker ID, as passed in the call below

```{r, message=FALSE}
DialogueDists <- dist_dialogue(dat=Dialogue_Cleaned, who_talking="speaker")
knitr::kable(head(DialogueDists, 12), format = "pipe", digits=2)
```
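
The turn-to-turn logic can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the 3-dimensional vectors in `toy_vectors` are made-up numbers rather than real embeddings, `toy_words` is an invented cleaned transcript, and cosine distance is taken to be 1 minus cosine similarity.

```{r, message=FALSE}
# Toy "semantic vectors" (made-up values, one row per word)
toy_vectors <- rbind(
  dog   = c(0.9, 0.1, 0.0),
  cat   = c(0.8, 0.2, 0.1),
  leash = c(0.7, 0.0, 0.3),
  bank  = c(0.0, 0.9, 0.4),
  money = c(0.1, 0.8, 0.5)
)

# Toy cleaned transcript: one content word per row, with a talker label
toy_words <- data.frame(
  word = c("dog", "cat", "leash", "bank", "money"),
  speaker = c("A", "A", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# A new turn starts whenever the talker changes
runs <- rle(toy_words$speaker)
toy_words$id_turn <- rep(seq_along(runs$lengths), times = runs$lengths)

# Average the word vectors within each turn (one row per turn)
turn_means <- t(sapply(split(toy_words$word, toy_words$id_turn),
                       function(w) colMeans(toy_vectors[w, , drop = FALSE])))

# Cosine distance, assumed here to be 1 - cosine similarity
cosine_dist <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Distance from each turn to the turn that follows it
n_turns <- nrow(turn_means)
turn_dists <- sapply(seq_len(n_turns - 1), function(i) {
  cosine_dist(turn_means[i, ], turn_means[i + 1, ])
})
round(turn_dists, 2)
```

In this toy example the second distance is larger than the first, because the third turn (bank, money) points in a different direction from the animal-related turns that precede it.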