Releases will be numbered with the following semantic versioning format:
<major>.<minor>.<patch>
And constructed with the following guidelines:
BUG FIXES
replace_emoticon replaced emoticon-like substrings
within actual words.
Spotted thanks to Carolyn Challoner; see issue #46.
replace_number failed if the number pattern
contained two leading decimals or hyphens. Spotted thanks to Stefano De
Sabbata; see issue #60.
replace_word_elongation failed for repetitions of the
same character in different cases (e.g.,
replace_word_elongation("Ooo") resulted in NA).
This has been corrected. Additionally, the
elongation.search.pattern, defined as
"(?i)(?:^|\\b)\\w*([a-z])(?:\\1{2,})\\w*($|\\b)", has been
moved externally, to a parameter, allowing the user to alter this pattern
if desired. Spotted thanks to Stefano De Sabbata; see issue #59.
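The effect of the case-insensitive flag in the pattern quoted above can be checked with base R's grepl(). This is a minimal sketch, not the package's internal code; the test strings and the case-sensitive variant (the same pattern with "(?i)" removed) are chosen purely for illustration.

```r
# Pattern quoted in the entry above.
elongation_pattern <- "(?i)(?:^|\\b)\\w*([a-z])(?:\\1{2,})\\w*($|\\b)"

# Illustrative variant with the case-insensitive flag removed (an assumption
# made for comparison only, not a pattern taken from the package).
case_sensitive_pattern <- sub("(?i)", "", elongation_pattern, fixed = TRUE)

x <- c("Ooo", "sooo good", "book", "hello")

# A letter followed by two or more repeats of itself, ignoring case.
with_flag <- grepl(elongation_pattern, x, perl = TRUE)

# Without the flag, "O" followed by "oo" is no longer treated as a repeat.
without_flag <- grepl(case_sensitive_pattern, x, perl = TRUE)

print(data.frame(x, with_flag, without_flag))
```

With the flag, "Ooo" and "sooo good" are detected as elongations while "book" and "hello" are not; without it, "Ooo" is missed.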
NEW FEATURES
replace_misspelling added as a way to replace
misspelled words with their most likely replacement using
hunspell in the backend. Suggested by Surin Space; see issue #39.
as_ordinal added as a convenience wrapper for
english::ordinal that takes integers and converts them to
ordinal form.
%like% added as a binary operator similar to SQL’s
LIKE.
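To give a feel for SQL-style LIKE matching in R, here is a rough base R sketch. The operator name %sql_like% is hypothetical, and so is the choice to ignore case; neither is taken from the package. The sketch maps the SQL wildcards "%" and "_" onto glob wildcards and lets utils::glob2rx() build the regular expression, so a literal "*" or "?" in the pattern would also be treated as a wildcard.

```r
# Hypothetical operator name; this is not the package's implementation.
`%sql_like%` <- function(x, pattern) {
  # Map SQL wildcards to glob wildcards: "%" -> "*", "_" -> "?".
  glob <- chartr("%_", "*?", pattern)
  # glob2rx() turns the glob into an anchored regular expression.
  # ignore.case = TRUE is an assumption made for this sketch.
  grepl(utils::glob2rx(glob), x, ignore.case = TRUE)
}

fruit <- c("apple", "Application", "pineapple")

starts_with_app    <- fruit %sql_like% "app%"    # "app" followed by anything
ends_with_apple    <- fruit %sql_like% "%apple"  # anything followed by "apple"
one_char_then_pple <- fruit %sql_like% "_pple"   # exactly one character, then "pple"

print(data.frame(fruit, starts_with_app, ends_with_apple, one_char_then_pple))
```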
MINOR FEATURES
fix_mdyyyy added to correct dates in the form of
m/d/yyyy to yyyy-mm-dd.
IMPROVEMENTS
replace_html picks up the ability to replace “«”
& “»” with ASCII equivalents “<<” & “>>”. Suggested
by Ilya Shutov; see issue #48.
All internal calls to grepl() now have
perl = TRUE added, as this generally speeds up matching.
Suggested by Kyle Haynes (see #51).
CHANGES
filter_element() and filter_row() have
been deprecated for a few years.
Version update to comply with changes in the glue package’s API.
BUG FIXES
fgsub had a bug in which the original
pattern matched the correct location in the
string, but the replacement was then applied to the entire
string rather than at the location of the first pattern match.
This meant the extracted string was used as a search term and might be found
in places other than the original location (e.g., a leading boundary in
‘^T’ replaced with ‘__’ may have led to ‘__he __itle’ rather than ‘__he
Title’ as expected in the string ‘The Title’). See #35 for
details. The fix adds some time to the computation but is
safer.
NEW FEATURES
replace_to/replace_from added to remove text from the
beginning of a string up to a character(s), or from a character(s) to
the end of a string, respectively.
The following replacement functions were added to provide
remediation for problems found in check_text:
replace_email, replace_hash,
replace_tag, & replace_url.
MINOR FEATURES
check_text picks up checks and
n arguments. The former allows the user to specify which
checks to conduct. The latter allows the user to truncate the output to
n number of elements with a closing ...[truncated].... This
makes the function more useful and the code easier to maintain.
IMPROVEMENTS
replace_non_ascii did not replace all non-ASCII
characters. This has been fixed by an explicit replacement of ‘[^ -~]+’,
which matches any run of characters outside the printable ASCII range.
See issue #34 for details.
Maintenance release to bring package up to date with the lexicon package API changes.
NEW FEATURES
match_tokens added to find all the tokens that match
a regex(es) within a given text vector. This is useful when combined with
the replace_tokens function.
Fixed (non-regex) versions of
drop_element/keep_element added to allow for
dropping or keeping elements specified by a known vector rather than a
regex.
The collapse and glue functions from
the glue package are reexported for easy string
manipulation.
replace_date added for normalizing dates.
replace_time added for normalizing time
stamps.
replace_money added for normalizing money
references.
mgsub picks up a safe argument using
the mgsub package as the backend. In addition,
mgsub_regex_safe was added to make the usage explicit. The safe
mode comes at the cost of speed.
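One reason a safe mode is useful: when replacements are applied one pattern at a time, a later pattern can match text produced by an earlier replacement. The sketch below uses only base R to contrast that sequential behavior with a single-pass replacement. It illustrates the idea and is not how the mgsub package is implemented; the example strings are made up.

```r
x <- "hey, how are you?"
pattern <- c("hey", "how", "are", "you")
replacement <- c("how", "are", "you", "hey")

# Sequential replacement: each step sees the output of the previous one.
sequential <- x
for (i in seq_along(pattern)) {
  sequential <- gsub(pattern[i], replacement[i], sequential, fixed = TRUE)
}

# Single pass: locate every match in the original string first, then
# substitute each match by looking it up in a named vector.
lookup <- setNames(replacement, pattern)
single_pass <- x
hits <- gregexpr(paste(pattern, collapse = "|"), single_pass)
regmatches(single_pass, hits) <- lapply(
  regmatches(single_pass, hits),
  function(hit) unname(lookup[hit])
)

print(sequential)   # "hey, hey hey hey?"
print(single_pass)  # "how, are you hey?"
```

The single-pass version does more bookkeeping per string, which is consistent with the speed trade-off noted above.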
IMPROVEMENTS
replace_names drops the replacement of
c('An', 'To', 'Oh', 'So', 'Do', 'He', 'Ha', 'In', 'Pa', 'Un')
which are likely words and not names.
replace_html picks up some additional symbol
replacements including:
c("™", "“", "”", "‘", "’", "•", "·", "⋅", "–", "—", "≠", "½", "¼", "¾", "°", "←", "→", "…").
NEW FEATURES
replace_kern added to replace a form of informal
emphasis in which the writer takes words >2 letters long, capitalizes
the entire word, and places spaces in between each letter. This was
contributed by Stack Overflow’s @ctwheels:
https://stackoverflow.com/a/47438305/1000343.
replace_internet_slang added to replace Internet
acronyms and abbreviations with machine friendly word
equivalents.
replace_word_elongation added to replace word
elongations (a.k.a. “word lengthening”) with the most likely normalized
word form. See http://www.aclweb.org/anthology/D11-105 for
details.
fgsub added for the ability to match, extract,
apply a function to the extracted strings, & replace the
original matches with the function's output. This performs similar
functionality to gsubfn::gsubfn but is less powerful. For
more powerful needs see the gsubfn package.
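The match, extract, transform, and replace cycle described for fgsub can be approximated in base R with gregexpr() and regmatches<-. This is a rough sketch of the idea rather than the package's implementation; the input string and the doubling function are invented for illustration.

```r
x <- "The Title has 2 parts and 10 pages"

# 1. Match: find every run of digits.
hits <- gregexpr("[0-9]+", x)

# 2. Extract the matched substrings.
extracted <- regmatches(x, hits)

# 3. Operate: apply a function to each extracted string (here, double it).
transformed <- lapply(extracted, function(s) as.character(as.numeric(s) * 2))

# 4. Replace: write the results back at the original match positions.
result <- x
regmatches(result, hits) <- transformed

print(result)  # "The Title has 4 parts and 20 pages"
```

Because the results are written back at the matched positions, the transformed text cannot end up in other parts of the string.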
BUG FIXES
replace_grade did not use fixed = TRUE for
its call to mgsub. This could result in the plus signs
being interpreted as meta-characters. This has been corrected.
NEW FEATURES
replace_names added to remove/replace common first
and last names from text data.
make_plural added to make a vector of singular noun
forms plural.
replace_emoji and
replace_emoji_identifier added for replacing emojis with
text or an identifier token for use in the sentimentr
package.
MINOR FEATURES
mgsub_regex and mgsub_fixed added to provide
wrappers for mgsub that make their use apparent without
setting the fixed argument.
replace_curly_quote added to replace curly quotes
with straight versions.
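The distinction that mgsub_regex and mgsub_fixed make explicit is the same one controlled by the fixed argument of base R's gsub(). A small sketch with an invented grade string shows why it matters when a pattern contains a metacharacter such as a plus sign:

```r
x <- "She earned an A+ on the paper"

# Regex matching: "+" is a quantifier, so "A+" matches only the "A"
# and the literal plus sign is left behind.
as_regex <- gsub("A+", "excellent", x)

# Fixed matching: "A+" is treated as the literal two-character string.
as_fixed <- gsub("A+", "excellent", x, fixed = TRUE)

print(as_regex)  # "She earned an excellent+ on the paper"
print(as_fixed)  # "She earned an excellent on the paper"
```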
IMPROVEMENTS
replace_non_ascii now uses
stringi::stri_trans_general to coerce more non-ASCII
characters to ASCII format.
check_text now checks for HTML characters/tags.
Thanks to @Peter Gensler for suggesting this (see issue #15).
CHANGES
filter_ functions deprecated in favor of
drop_/keep_ versions of filter functions. This
change was made to address the opposite meaning that
dplyr’s filter has, which retains rows
matching a pattern by default.
BUG FIXES
replace_tokens added to complement mgsub
for times when the user wants to replace fixed tokens with a single
value or remove them entirely. This yields an optimized solution that is
much faster than mgsub.
CHANGES
mgsub no longer uses trim = TRUE by
default.
BUG FIXES
check_text recommended
replace_incomplete rather than
add_missing_endmark when an endmark was missing.
NEW FEATURES
The replace_emoticon, replace_grade and
replace_rating functions have been moved from the
sentimentr package to textclean as
these are cleaning functions. This makes the functions more modular and
generalizable to all types of text cleaning. These functions are still
imported and exported by sentimentr.
replace_html added to remove HTML tags and replace
symbols with appropriate ASCII symbols.
add_missing_endmark added to detect missing
endmarks and replace with the desired symbol.
IMPROVEMENTS
replace_number now uses the english package
making it faster and more maintainable. In addition, the function now
handles decimal places as well.
BUG FIXES
check_text reported NA as non-ASCII. This
has been fixed.
NEW FEATURES
check_text added to report on potential problems in
a text vector.
replace_ordinal added to replace ordinal numbers
(e.g., 1st) with word representation (e.g., first).
swap added to swap two patterns
simultaneously.
filter_element added to exclude matching elements
from a vector.
This package is a collection of tools to clean and process text. Many of these tools have been taken from the qdap package and revamped to be more intuitive, better named, and faster.