Type: Package
Title: R Implementation of Wordpiece Tokenization
Version: 2.1.3
Description: Apply 'Wordpiece' (<doi:10.48550/arXiv.1609.08144>) tokenization to input text, given an appropriate vocabulary. The 'BERT' (<doi:10.48550/arXiv.1810.04805>) tokenization conventions are used by default.
Encoding: UTF-8
URL: https://github.com/macmillancontentscience/wordpiece
BugReports: https://github.com/macmillancontentscience/wordpiece/issues
Depends: R (≥ 3.3.0)
License: Apache License (≥ 2)
RoxygenNote: 7.1.2
Imports: dlr (≥ 1.0.0), fastmatch (≥ 1.1), memoise (≥ 2.0.0), piecemaker (≥ 1.0.0), rlang, stringi (≥ 1.0), wordpiece.data (≥ 1.0.2)
Suggests: covr, knitr, rmarkdown, testthat (≥ 3.0.0)
VignetteBuilder: knitr
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2022-03-03 14:19:39 UTC; jonathan.bratt
Author: Jonathan Bratt
Maintainer: Jonathan Bratt <jonathan.bratt@macmillan.com>
Repository: CRAN
Date/Publication: 2022-03-03 15:10:02 UTC
Determine Casedness of Vocabulary
Description
Determine Casedness of Vocabulary
Usage
.get_casedness(v)
## Default S3 method:
.get_casedness(v)
## S3 method for class 'wordpiece_vocabulary'
.get_casedness(v)
## S3 method for class 'character'
.get_casedness(v)
Arguments
v: An object of class wordpiece_vocabulary, or a character vector of tokens.
Value
TRUE if the vocabulary is case-sensitive, FALSE otherwise.
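Examples
These are illustrative sketches only; .get_casedness() is an internal helper, so it is accessed via ::: here, and the expected results are inferred from the descriptions above.
# An all-lowercase vocabulary, as a plain character vector.
uncased <- c("[UNK]", "some", "tokens")
wordpiece:::.get_casedness(uncased)                 # expected FALSE
# The method for wordpiece_vocabulary objects reads the is_cased attribute.
wordpiece:::.get_casedness(prepare_vocab(uncased))  # expected FALSE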
Determine Vocabulary Casedness
Description
Determine whether or not a wordpiece vocabulary is case-sensitive.
Usage
.infer_case_from_vocab(vocab)
Arguments
vocab: The vocabulary as a character vector.
Details
If none of the tokens in the vocabulary start with a capital letter, the vocabulary is assumed to be uncased. Note that tokens like "[CLS]" contain uppercase letters, but don't start with uppercase letters.
Value
TRUE if the vocabulary is cased, FALSE if uncased.
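Examples
A short sketch of the rule described in Details; the helper is internal, so it is accessed via ::: for illustration.
# "[CLS]" contains capitals but does not start with one, so it does not
# make a vocabulary cased.
wordpiece:::.infer_case_from_vocab(c("[CLS]", "hello", "world"))  # expected FALSE
wordpiece:::.infer_case_from_vocab(c("Hello", "world"))           # expected TRUE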
Constructor for Class wordpiece_vocabulary
Description
Constructor for Class wordpiece_vocabulary
Usage
.new_wordpiece_vocabulary(vocab, is_cased)
Arguments
vocab: Character vector of tokens.
is_cased: Logical; whether the vocabulary is cased.
Value
The vocabulary with is_cased attached as an attribute and the class wordpiece_vocabulary applied.
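Examples
An illustrative sketch of the low-level constructor; prepare_vocab() is the user-facing way to build a vocabulary, and this helper is internal, so it is accessed via ::: here.
v <- wordpiece:::.new_wordpiece_vocabulary(
  vocab = c("[UNK]", "some", "tokens"),
  is_cased = FALSE
)
class(v)             # includes "wordpiece_vocabulary"
attr(v, "is_cased")  # FALSE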
Process a Vocabulary for Tokenization
Description
Process a Vocabulary for Tokenization
Usage
.process_vocab(v)
## Default S3 method:
.process_vocab(v)
## S3 method for class 'wordpiece_vocabulary'
.process_vocab(v)
## S3 method for class 'character'
.process_vocab(v)
Arguments
v: An object of class wordpiece_vocabulary, or a character vector of tokens.
Value
A character vector of tokens for tokenization.
Process a Wordpiece Vocabulary for Tokenization
Description
Process a Wordpiece Vocabulary for Tokenization
Usage
.process_wp_vocab(v)
## Default S3 method:
.process_wp_vocab(v)
## S3 method for class 'wordpiece_vocabulary'
.process_wp_vocab(v)
## S3 method for class 'integer'
.process_wp_vocab(v)
## S3 method for class 'character'
.process_wp_vocab(v)
Arguments
v: An object of class wordpiece_vocabulary, or a vocabulary given as an integer or character vector.
Value
A character vector of tokens for tokenization.
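Examples
A hedged sketch of the calling pattern only; the exact processing steps are not documented above, and the helper is internal (accessed via :::).
v <- prepare_vocab(c("[UNK]", "some", "tokens"))
processed <- wordpiece:::.process_wp_vocab(v)
str(processed)  # a character vector of tokens ready for tokenization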
Validator for Objects of Class wordpiece_vocabulary
Description
Validator for Objects of Class wordpiece_vocabulary
Usage
.validate_wordpiece_vocabulary(vocab)
Arguments
vocab: wordpiece_vocabulary object to validate.
Value
vocab, if the object passes the checks; otherwise, aborts with an informative message.
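Examples
A minimal sketch; the specific checks are not documented above, so this only shows that a well-formed vocabulary should be returned unchanged (internal helper, accessed via :::).
v <- wordpiece:::.new_wordpiece_vocabulary(
  vocab = c("[UNK]", "some", "tokens"),
  is_cased = FALSE
)
wordpiece:::.validate_wordpiece_vocabulary(v)  # returns v if the checks pass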
Tokenize an Input Word-by-word
Description
Tokenize an Input Word-by-word
Usage
.wp_tokenize_single_string(words, vocab, unk_token, max_chars)
Arguments
words: Character; a vector of words (generated by space-tokenizing a single input).
vocab: Character vector of vocabulary tokens. The tokens are assumed to be in order of index, with the first index taken as zero to be compatible with Python implementations.
unk_token: Token to represent unknown words.
max_chars: Maximum length of word recognized.
Value
A named integer vector of tokenized words.
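Examples
A sketch assuming the helper accepts a plain character vocabulary as described above; the vocabulary and the expected splits are made up for illustration, and the function is internal (accessed via :::).
vocab <- c("[UNK]", "i", "love", "tac", "##os")
wordpiece:::.wp_tokenize_single_string(
  words = c("i", "love", "tacos"),
  vocab = vocab,
  unk_token = "[UNK]",
  max_chars = 100
)
# Expected: a named integer vector along the lines of
# c(i = 1, love = 2, tac = 3, "##os" = 4), with ids given by the
# zero-based positions of the tokens in the vocabulary.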
Tokenize a Word
Description
Tokenize a single "word" (no whitespace). The word can technically contain punctuation, but in BERT's tokenization, punctuation has been split out by this point.
Usage
.wp_tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)
Arguments
word: Word to tokenize.
vocab: Character vector of vocabulary tokens. The tokens are assumed to be in order of index, with the first index taken as zero to be compatible with Python implementations.
unk_token: Token to represent unknown words.
max_chars: Maximum length of word recognized.
Value
Input word as a list of tokens.
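Examples
A sketch under the same assumptions as above: a plain character vocabulary, a hypothetical token list, and access to the internal helper via :::.
vocab <- c("[UNK]", "tac", "##os")
wordpiece:::.wp_tokenize_word("tacos", vocab = vocab)    # expected "tac", "##os"
wordpiece:::.wp_tokenize_word("burrito", vocab = vocab)  # expected the unknown token "[UNK]"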
Load a vocabulary file, or retrieve from cache
Description
Load a vocabulary file, or retrieve from cache
Usage
load_or_retrieve_vocab(vocab_file)
Arguments
vocab_file: Path to vocabulary file. The file is assumed to be a text file with one token per line, with the line number corresponding to the index of that token in the vocabulary.
Value
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.
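Examples
A hedged sketch that writes a throwaway vocabulary file and loads it; unlike load_vocab(), this function may also store a processed copy in the wordpiece cache directory.
vocab_path <- tempfile(fileext = ".txt")
writeLines(c("[UNK]", "some", "example", "tokens"), vocab_path)
vocab <- load_or_retrieve_vocab(vocab_file = vocab_path)
attr(vocab, "is_cased")  # FALSE for this all-lowercase vocabulary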
Load a vocabulary file
Description
Load a vocabulary file
Usage
load_vocab(vocab_file)
Arguments
vocab_file: Path to vocabulary file. The file is assumed to be a text file with one token per line, with the line number corresponding to the index of that token in the vocabulary.
Value
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.
Examples
# Get path to sample vocabulary included with package.
vocab_path <- system.file("extdata", "tiny_vocab.txt", package = "wordpiece")
vocab <- load_vocab(vocab_file = vocab_path)
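# Continuing the example: inspect the inferred casedness attribute
# described in the Value section.
attr(vocab, "is_cased")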
Format a Token List as a Vocabulary
Description
We use a character vector with the class wordpiece_vocabulary to provide information about tokens used in wordpiece_tokenize. This function takes a character vector of tokens and puts it into that format.
Usage
prepare_vocab(token_list)
Arguments
token_list: A character vector of tokens.
Value
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.
Examples
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)
attr(my_vocab, "is_cased")
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- fastmatch
- rlang
- wordpiece.data
Set a Cache Directory for wordpiece
Description
Use this function to override the cache path used by wordpiece for the
current session. Set the WORDPIECE_CACHE_DIR
environment variable
for a more permanent change.
Usage
set_wordpiece_cache_dir(cache_dir = NULL)
Arguments
cache_dir: Character scalar; a path to a cache directory.
Value
A normalized path to a cache directory. The directory is created if the user has write access and the directory does not exist.
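Examples
A minimal sketch of pointing the cache at a temporary directory for the current session; the path used here is arbitrary.
cache_path <- set_wordpiece_cache_dir(file.path(tempdir(), "wordpiece_cache"))
cache_path  # normalized path now used for this session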
Retrieve Directory for wordpiece Cache
Description
The wordpiece cache directory is a platform- and user-specific path where wordpiece saves caches (such as a downloaded vocabulary). You can override the default location in a few ways:
- Option: wordpiece.dir. Use set_wordpiece_cache_dir to set a specific cache directory for this session.
- Environment variable: WORDPIECE_CACHE_DIR. Set this environment variable to specify a wordpiece cache directory for all sessions.
- Environment variable: R_USER_CACHE_DIR. Set this environment variable to specify a cache directory root for all packages that use the caching system.
Usage
wordpiece_cache_dir()
Value
A character vector with the normalized path to the cache.
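Examples
For example, to see where caches would currently be written:
wordpiece_cache_dir()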
Tokenize Sequence with Word Pieces
Description
Given a sequence of text and a wordpiece vocabulary, tokenizes the text.
Usage
wordpiece_tokenize(
text,
vocab = wordpiece_vocab(),
unk_token = "[UNK]",
max_chars = 100
)
Arguments
text: Character; text to tokenize.
vocab: Character vector of vocabulary tokens. The tokens are assumed to be in order of index, with the first index taken as zero to be compatible with Python implementations.
unk_token: Token to represent unknown words.
max_chars: Maximum length of word recognized.
Value
A list of named integer vectors, giving the tokenization of the input sequences. The integer values are the token ids, and the names are the tokens.
Examples
tokens <- wordpiece_tokenize(
text = c(
"I love tacos!",
"I also kinda like apples."
)
)
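# Continuing the example: each element of the result is a named integer
# vector, where the values are token ids and the names are the
# corresponding wordpiece tokens.
tokens[[1]]
names(tokens[[1]])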