Type: | Package |
Version: | 0.1.1 |
Title: | Compute Semantic Distance Between Text Constituents |
Maintainer: | Jamie Reilly <jamie_reilly@temple.edu> |
Description: | Cleans and formats language transcripts guided by a series of transformation options (e.g., lemmatize words, omit stopwords, split strings across rows). 'SemanticDistance' computes two distinct metrics of cosine semantic distance (experiential and embedding). These values reflect pairwise cosine distance between different elements or chunks of a language sample. 'SemanticDistance' can process monologues (e.g., stories, ordered text), dialogues (e.g., conversation transcripts), word pairs arrayed in columns, and unordered word lists. Users specify options for how they wish to chunk distance calculations. These options include: rolling ngram-to-word distance (window of n-words to each new word), ngram-to-ngram distance (2-word chunk to the next 2-word chunk), pairwise distance between words arrayed in columns, matrix comparisons (i.e., all possible pairwise distances between words in an unordered list), turn-by-turn distance (talker to talker in a dialogue transcript). 'SemanticDistance' includes visualization options for analyzing distances as time series data and simple semantic network dynamics (e.g., clustering, undirected graph network). |
License: | LGPL (≥ 3) |
Encoding: | UTF-8 |
URL: | https://github.com/Reilly-ConceptsCognitionLab/SemanticDistance, https://reilly-conceptscognitionlab.github.io/SemanticDistance/ |
BugReports: | https://github.com/Reilly-ConceptsCognitionLab/SemanticDistance/issues |
Depends: | R (≥ 3.5) |
Imports: | ape, cluster, dendextend, dplyr, graphics, httr, igraph, lsa, magrittr, purrr, rlang, stats, stringi, stringr, textstem, tidyselect, tm, tidyr, textclean, tools, utils, wesanderson |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
VignetteBuilder: | knitr |
RoxygenNote: | 7.3.2 |
Collate: | 'backup.R' 'clean_dialogue.R' 'clean_monologue_or_list.R' 'clean_paired_cols.R' 'data.R' 'dist_anchor.R' 'dist_dialogue.R' 'dist_ngram2ngram.R' 'dist_ngram2word.R' 'dist_paired_cols.R' 'eval_kmeans_clustersize.R' 'globals.R' 'reexports.R' 'replacements_25.R' 'utils.R' 'wordlist_to_network.R' 'zzz.R' |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2025-08-27 08:02:07 UTC; Jamie |
Author: | Jamie Reilly |
Repository: | CRAN |
Date/Publication: | 2025-09-01 16:50:02 UTC |
A Typical Dialogue Transcript
Description
A sample dyadic conversation transcript where two people are conversing.
Usage
Dialogue_Typical
Format
## "Dialogue_Typical" A data frame with 5 rows and 2 columns:
- word
fictional text from a language transcript
- speaker
Mary or Peter: fictional speaker identities
...
The Grandfather Passage: A Standardized Reading Passage
Description
A monologue discourse sample. Grandfather Passage is a well-known test of reading aloud.
Usage
Grandfather_Passage
Format
## "Grandfather_Passage" A data frame with 1 observation of 1 variable:
- mytext
text from the Grandfather Passage unsplit
...
A Typical Monologue Transcript
Description
Dataframe with ordered text squashed into a single cell.
Usage
Monologue_Typical
Format
## "Monologue_Typical" A data frame with 1 row and 1 column
- mytext
text from a language transcript
...
SD15_2025_complete Experiential Semantic Distance Values
Description
Experiential semantic norms: each word is characterized across 15 z-scored sensorimotor and affective dimensions. Each word is one row.
Usage
SD15_2025_complete
Format
## "SD15_2025_complete" A data frame with 25,050 observations of 16 variables
- word
word characterized across 15 ratings
- Param_auditory_z
z-score of auditory salience from Lancaster Sensorimotor Norms
- Param_gustatory_z
z-score of gustatory salience from Lancaster Sensorimotor Norms
- Param_haptic_z
z-score of haptic salience from Lancaster Sensorimotor Norms
- Param_interoceptive_z
z-score of interoceptive salience from Lancaster Sensorimotor Norms
- Param_visual_z
z-score of visual salience from Lancaster Sensorimotor Norms
- Param_olfactory_z
z-score of olfactory salience from Lancaster Sensorimotor Norms
- Param_handarm_z
z-score of hand/arm motor salience from Lancaster Sensorimotor Norms
- Param_excitement_z
z-score of excitement salience from affectvec
- Param_surprised_z
z-score of surprise salience from affectvec
- Param_fear_z
z-score of fear salience from affectvec
- Param_anger_z
z-score of anger salience from affectvec
- Param_disgust_z
z-score of disgust salience from affectvec
- Param_sadness_z
z-score of sadness salience from affectvec
- Param_happiness_z
z-score of happiness salience from affectvec
- Param_contempr_z
z-score of contempt salience from affectvec
...
Stopword List
Description
List of stopwords
Usage
Temple_stops25
Format
## "Temple_stops25" A data frame with 829 observations of 4 variables
- id_orig
numeric identifier
- word
stopword target
- length
length in words
- pos
universal part-of-speech tag
...
Unordered_List
Description
No talker is delineated. A list of 17 words spanning 4 semantic categories; good for examining clustering
Usage
Unordered_List
Format
## "Unordered_List" A data frame with 1 row and 1 column:
- mytext
unsplit list of words containing musical instruments, weapons, fruits, and emotions
Word Pairs in Columns
Description
The first target word for computing distance is in one column; the second word is in another column.
Usage
Word_Pairs
Format
## "Word_Pairs" A data frame with 27 rows and 2 columns:
- word1
text corresponding to the first word in a pair to contrast
- word2
text corresponding to the second word in a pair to contrast
...
clean_dialogue
Description
Cleans a transcript with two or more talkers. User specifies the dataframe and the column name where the target text is stored, in addition to a factor variable corresponding to the identity of the person producing the corresponding text. Users also specify cleaning parameters for stopword removal and lemmatization (both defaulting to TRUE). The function splits and unlists the text so that the output is in a one-row-per-word format marked by a unique numeric identifier (i.e., 'id_orig'). The function appends a turn_count sequence used for aggregating all the words within each turn. If a speaker generates no complete observations because of stopword removal, the turn counter will not increment until a talker switch AND a complete observation are observed.
Usage
clean_dialogue(dat, wordcol, who_talking, omit_stops = TRUE, lemmatize = TRUE)
Arguments
dat |
a dataframe with at least one target column of string data |
wordcol |
quoted column name storing the strings that will be cleaned and split |
who_talking |
quoted column name with speaker/talker identities; this column will be factorized |
omit_stops |
T/F user wishes to remove stopwords (default is TRUE) |
lemmatize |
T/F user wishes to lemmatize each string (default is TRUE) |
Value
a dataframe
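Examples
A minimal sketch of typical usage with the bundled Dialogue_Typical sample; column names follow the dataset documentation above:

```r
library(SemanticDistance)
# Clean a two-talker transcript; defaults remove stopwords and lemmatize
dlg_clean <- clean_dialogue(Dialogue_Typical,
                            wordcol = "word",
                            who_talking = "speaker")
head(dlg_clean)  # one row per word, with 'id_orig' and a turn counter
```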
clean_monologue_or_list
Description
Cleans and formats text. User specifies the dataframe and column name where target text is stored as arguments to the function. Default option is to lemmatize strings. Function splits and unlists text so that the output is in a one-row-per-word format marked by a unique numeric identifier (i.e., 'id_orig')
Usage
clean_monologue_or_list(dat, wordcol, omit_stops = TRUE, lemmatize = TRUE)
Arguments
dat |
a dataframe with at least one target column of string data |
wordcol |
quoted column name storing the strings that will be cleaned and split |
omit_stops |
option for omitting stopwords; default is TRUE |
lemmatize |
option for lemmatizing strings; default is TRUE |
Value
a dataframe
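Examples
A minimal sketch, assuming the bundled Monologue_Typical sample with its documented 'mytext' column:

```r
library(SemanticDistance)
# Clean and split a one-cell monologue into one-row-per-word format
mono_clean <- clean_monologue_or_list(Monologue_Typical, wordcol = "mytext")
head(mono_clean)  # each word marked by the 'id_orig' identifier
```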
clean_paired_cols
Description
Cleans a transcript where word pairs are arrayed in two columns.
Usage
clean_paired_cols(dat, wordcol1, wordcol2, lemmatize = TRUE)
Arguments
dat |
a dataframe with two columns of words you want pairwise distance for |
wordcol1 |
quoted column name storing the first string for comparison |
wordcol2 |
quoted column name storing the second string for comparison |
lemmatize |
T/F user wishes to lemmatize each string (default is TRUE) |
Value
a dataframe
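Examples
A minimal sketch using the bundled Word_Pairs data (columns 'word1' and 'word2' as documented above):

```r
library(SemanticDistance)
# Clean both columns of word pairs; lemmatization defaults to TRUE
pairs_clean <- clean_paired_cols(Word_Pairs,
                                 wordcol1 = "word1",
                                 wordcol2 = "word2")
```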
dist_anchor
Description
Function takes a dataframe cleaned using 'clean_monologue_or_list' and computes semantic distance between each new word and an initial anchor chunk of user-specified size (e.g., the first 10 words of the sample)
Usage
dist_anchor(dat, anchor_size = 10)
Arguments
dat |
a dataframe prepped using the 'clean_monologue_or_list' fn |
anchor_size |
an integer specifying the number of words in the initial chunk for comparison to new words as the sample unfolds |
Value
a dataframe
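Examples
A minimal sketch; the cleaning step and the default anchor_size = 10 follow the documentation above:

```r
library(SemanticDistance)
mono_clean <- clean_monologue_or_list(Monologue_Typical, wordcol = "mytext")
# Distance from each new word back to the initial 10-word anchor chunk
d_anchor <- dist_anchor(mono_clean, anchor_size = 10)
```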
dist_dialogue
Description
Function takes a dataframe cleaned using 'clean_dialogue' and computes two metrics of semantic distance turn-to-turn, indexing a 'talker' column. Sums all the respective semantic vectors within each turn, then computes cosine distance to the next turn's composite vector
Usage
dist_dialogue(dat, who_talking)
Arguments
dat |
a dataframe prepped using 'clean_dialogue' fn with talker data and turncount appended |
who_talking |
factor variable with two levels specifying an ID for the person producing the text in 'word_clean' |
Value
a dataframe
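Examples
A minimal sketch; passing the speaker column as a quoted name is an assumption carried over from the clean_dialogue interface:

```r
library(SemanticDistance)
dlg_clean <- clean_dialogue(Dialogue_Typical, wordcol = "word",
                            who_talking = "speaker")
# Turn-to-turn distances between composite turn vectors
d_turns <- dist_dialogue(dlg_clean, who_talking = "speaker")
```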
dist_ngram2ngram
Description
Function takes a dataframe cleaned using 'clean_monologue_or_list' and computes rolling chunk-to-chunk distance between chunks of a user-specified ngram size (e.g., 2-word chunks)
Usage
dist_ngram2ngram(dat, ngram)
Arguments
dat |
a dataframe prepped using the 'clean_monologue_or_list' fn |
ngram |
an integer specifying the window size of words for computing distance to a target word |
Value
a dataframe
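Examples
A minimal sketch with 2-word chunks, as in the example given in the Description:

```r
library(SemanticDistance)
mono_clean <- clean_monologue_or_list(Monologue_Typical, wordcol = "mytext")
# Rolling distance between successive 2-word chunks
d_chunks <- dist_ngram2ngram(mono_clean, ngram = 2)
```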
dist_ngram2word
Description
Function takes a dataframe cleaned using 'clean_monologue_or_list' and computes two metrics of semantic distance for each word relative to the average of the semantic vectors within an n-word window appearing before that word. User specifies the window (ngram) size. The window 'rolls' across the language sample providing distance metrics
Usage
dist_ngram2word(dat, ngram)
Arguments
dat |
a dataframe prepped using the 'clean_monologue_or_list' fn |
ngram |
an integer specifying the window size (in words) for computing distance to a target word; the window extends back, skipping NAs, until the number of content words equals the ngram size |
Value
a dataframe
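Examples
A minimal sketch; with ngram = 2, each word is compared to the averaged vectors of the two content words preceding it:

```r
library(SemanticDistance)
mono_clean <- clean_monologue_or_list(Monologue_Typical, wordcol = "mytext")
# Rolling window-to-word distances across the sample
d_roll <- dist_ngram2word(mono_clean, ngram = 2)
```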
dist_paired_cols
Description
Function takes a dataframe cleaned using 'clean_paired_cols' and computes two metrics of semantic distance for each word pair arrayed in Col1 vs. Col2
Usage
dist_paired_cols(dat)
Arguments
dat |
a dataframe prepped using 'clean_paired_cols' with word pairs arrayed in two columns |
Value
a dataframe
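Examples
A minimal sketch chaining the cleaning step documented under clean_paired_cols:

```r
library(SemanticDistance)
pairs_clean <- clean_paired_cols(Word_Pairs, wordcol1 = "word1",
                                 wordcol2 = "word2")
# Pairwise distances between word1 and word2 on each row
d_pairs <- dist_paired_cols(pairs_clean)
```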
GloVe Semantic Embeddings
Description
Word embeddings (300 dimensions, 59,061 words). Each word is one row.
Usage
glowca_25
Format
## "glowca_25" A data frame with 59061 observations of 301 variables
- word
word characterized across embeddings
- Param_1
embedding dimension 1
- Param_300
embedding dimension 300
...
Load all .rda files from a GitHub data folder into the package environment
Description
Load all .rda files from a GitHub data folder into the package environment
Usage
load_github_data(
repo = "Reilly-ConceptsCognitionLab/SemanticDistance_Data",
branch = "main",
data_folder = "data",
envir = parent.frame()
)
Arguments
repo |
GitHub repository (e.g., "username/repo") |
branch |
Branch name (default: "main") |
data_folder |
Remote folder containing .rda files (default: "data/") |
envir |
Environment to load into (default: package namespace) |
Value
nothing; loads data (.rda files) from the GitHub repository needed by other package functions
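Examples
A minimal sketch using the documented defaults (requires internet access):

```r
library(SemanticDistance)
# Pull the package's .rda lookup data from GitHub into the calling environment
load_github_data()
```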
wordlist_to_network
Description
Takes a vector of words with semantic distance ratings, converts them to a square matrix, then to a Euclidean distance matrix (all word pairs), then plots the words as either a cluster dendrogram or a simple igraph network
Usage
wordlist_to_network(
dat,
wordcol,
output = "dendrogram",
dist_type = "embedding"
)
Arguments
dat |
a dataframe with text in it (cleaned using the clean_monologue_or_list function) |
wordcol |
quoted argument identifying column in dataframe with target text |
output |
quoted argument for type of output; default is 'dendrogram', alternate is 'network' |
dist_type |
quoted argument specifying which semantic norms the distance matrix is computed from; default is 'embedding', the other option is 'SD15' |
Details
This function internally calls eval_kmeans_clustersize for cluster evaluation. The dendrogram visualization is based on hierarchical clustering of semantic distances.
Value
a plot of a dendrogram or an igraph network AND a cosine distance matrix
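Examples
A minimal sketch with the bundled Unordered_List sample; the name of the cleaned word column ('word_clean') is an assumption borrowed from the dist_dialogue documentation:

```r
library(SemanticDistance)
list_clean <- clean_monologue_or_list(Unordered_List, wordcol = "mytext")
# Plot a dendrogram over embedding-based distances; also returns the matrix
wordlist_to_network(list_clean, wordcol = "word_clean",
                    output = "dendrogram", dist_type = "embedding")
```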