hutils package
My name is Hugh. I’ve written some miscellaneous functions that don’t
seem to belong in a particular package. I’ve usually put these in
R/utils.R when I write a package. Thus,
hutils.
This vignette just goes through each exported function.
library(knitr)
suggested_packages <- c("geosphere", "nycflights13", "dplyr", "ggplot2", "microbenchmark")
opts_chunk$set(eval = all(vapply(suggested_packages, requireNamespace, quietly = TRUE, FUN.VALUE = FALSE)))
tryCatch({
  library(geosphere)
  library(nycflights13)
  library(dplyr, warn.conflicts = FALSE)
  library(ggplot2)
  library(microbenchmark)
  library(data.table, warn.conflicts = FALSE)
  library(magrittr)
  library(hutils, warn.conflicts = FALSE)
},
# requireNamespace does not detect errors like
# package ‘dplyr’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version
error = function(e) {
  opts_chunk$set(eval = FALSE)
})
options(digits = 4)
These are simple additions to magrittr’s aliases,
including: capitalized forms of and and or (AND and OR)
that invoke && and || (the ‘long-form’
logical operators) and nor / neither
functions.
The main motivation is to make the source code easier to indent. I occasionally find such source code easier to use.
nor (or neither, which is identical) returns
TRUE if and only if both arguments are
FALSE.
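The truth table is easy to check with a minimal base-R sketch. The name nor_sketch is hypothetical and this is not the package source; it only assumes that nor is built on the long-form || operator, as described above.

```r
# Hypothetical stand-in for illustration; not the hutils implementation.
nor_sketch <- function(x, y) {
  # `||` is the long-form OR: it works on length-one logicals and
  # stops evaluating as soon as the answer is known.
  !(x || y)
}

nor_sketch(FALSE, FALSE)  # TRUE: both arguments are FALSE
nor_sketch(TRUE, FALSE)   # FALSE
nor_sketch(FALSE, TRUE)   # FALSE
nor_sketch(TRUE, TRUE)    # FALSE
```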
coalesce and if_else
These are near drop-in replacements for the equivalent functions from
dplyr. They are included here because they are very useful
outside of the tidyverse, and may be required in circumstances where
importing dplyr (with all of its dependencies) would be
inappropriate.
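For readers unfamiliar with coalescing, here is a minimal base-R sketch of the idea: take the first non-missing value, element by element. The name coalesce_sketch is hypothetical, the candidates are assumed to have the same length as x, and the early exit only mirrors the short-circuit behaviour this vignette attributes to hutils::coalesce; the package's actual code may differ.

```r
# Hypothetical sketch; not the hutils or dplyr implementation.
coalesce_sketch <- function(x, ...) {
  # Early exit: with no missing values there is nothing to fill in,
  # so the remaining arguments are never checked for length or type.
  if (!anyNA(x)) {
    return(x)
  }
  for (candidate in list(...)) {
    missing_here <- is.na(x)
    x[missing_here] <- candidate[missing_here]
  }
  x
}

coalesce_sketch(c(1, NA, 3), c(10, 20, 30))  # 1 20 3
coalesce_sketch(c(1, 2, 3), "wrong type")    # 1 2 3 (second argument ignored)
```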
They attempt to be drop-in replacements but:
hutils::if_else only works with logical, integer, double, and character type vectors. Lists and factors won’t work.
hutils::coalesce short-circuits on its first argument; if there are no NAs in x then x is returned, even if the other vectors are the wrong length or type.
In addition, hutils::if_else is generally faster than
dplyr::if_else:
my_check <- function(values) {
all(vapply(values[-1], function(x) identical(values[[1]], x), logical(1)))
}
set.seed(2)
cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 100e3, replace = TRUE)
no <- sample(letters, size = 100e3, replace = TRUE)
na <- sample(letters, size = 100e3, replace = TRUE)
microbenchmark(dplyr = dplyr::if_else(cnd, yes, no, na),
hutils = hutils::if_else(cnd, yes, no, na),
check = my_check) %>%
print
cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 1, replace = TRUE)
no <- sample(letters, size = 100e3, replace = TRUE)
na <- sample(letters, size = 1, replace = TRUE)
microbenchmark(dplyr = dplyr::if_else(cnd, yes, no, na),
hutils = hutils::if_else(cnd, yes, no, na),
check = my_check) %>%
print
This speed advantage also appears to be true of
coalesce:
x <- sample(c(letters, NA), size = 100e3, replace = TRUE)
A <- sample(c(letters, NA), size = 100e3, replace = TRUE)
B <- sample(c(letters, NA), size = 100e3, replace = TRUE)
C <- sample(c(letters, NA), size = 100e3, replace = TRUE)
microbenchmark(dplyr = dplyr::coalesce(x, A, B, C),
hutils = hutils::coalesce(x, A, B, C),
check = my_check) %>%
print
The advantage is especially pronounced during short-circuits.
To drop a column from a data.table, you set it to
NULL.
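A minimal illustration of that canonical form, assuming only that data.table is installed (the table here is made up for the example):

```r
library(data.table)

DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
# Assigning NULL with `:=` removes the column by reference.
DT[, A := NULL]
names(DT)  # "B" "C"
```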
There’s nothing wrong with this, but I’ve found the following a
useful alias, especially in a magrittr pipe.
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
drop_col("A") %>%
drop_col("B")
# or
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
  drop_cols(c("A", "B"))
These functions simply invoke the canonical form, so won’t be any faster.
Additionally, one can drop columns by a regular expression using
drop_colr.
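The idea of dropping columns by pattern can be sketched in base R. The helper drop_colr_sketch is hypothetical and operates on a plain data.frame; it is an assumption about the behaviour, not the signature or source of drop_colr.

```r
# Hypothetical sketch: drop every column whose name matches `pattern`.
drop_colr_sketch <- function(df, pattern) {
  keep <- !grepl(pattern, names(df))
  df[, keep, drop = FALSE]
}

df <- data.frame(x_id = 1:3, x_val = 4:6, y = 7:9)
names(drop_colr_sketch(df, "^x_"))  # "y"
```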
drop_constant_cols
When a table is filtered, some columns of the result are often redundant because they now hold a single value. This function drops such constant columns.
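A rough base-R sketch of the idea follows. The helper drop_constant_cols_sketch and the toy table are hypothetical, and "constant" is assumed here to mean at most one distinct value.

```r
# Hypothetical sketch: remove columns that take a single value.
drop_constant_cols_sketch <- function(df) {
  is_constant <- vapply(df, function(v) length(unique(v)) <= 1L, logical(1))
  df[, !is_constant, drop = FALSE]
}

flights_like <- data.frame(origin = c("JFK", "JFK", "LGA"),
                           dest = c("LAX", "SFO", "LAX"),
                           year = 2013L)
# After filtering on origin, both `origin` and `year` are constant.
jfk_only <- flights_like[flights_like$origin == "JFK", ]
names(drop_constant_cols_sketch(jfk_only))  # "dest"
```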
drop_empty_cols
This function drops columns in which all the values are
NA.
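In base R the same effect might be sketched as below; drop_empty_cols_sketch is a hypothetical name and works on a plain data.frame.

```r
# Hypothetical sketch: remove columns that are entirely NA.
drop_empty_cols_sketch <- function(df) {
  all_na <- vapply(df, function(v) all(is.na(v)), logical(1))
  df[, !all_na, drop = FALSE]
}

df <- data.frame(a = c(1, NA), b = c(NA, NA), c = c("x", "y"))
names(drop_empty_cols_sketch(df))  # "a" "c"
```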
duplicated_rows
There are many useful functions for detecting duplicates in R.
However, in interactive use, I often want to not merely see which values
are duplicated, but also compare them to the original. This is
especially true when I am comparing duplicates across a subset
of columns in a data.table.
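The difference from duplicated() alone can be sketched in base R: scanning from both ends flags every member of a duplicated group, including the first occurrence. The data and column names below are made up, and this is only an approximation of what duplicated_rows returns.

```r
df <- data.frame(id  = c(1, 2, 2, 3, 3),
                 grp = c("a", "b", "b", "c", "d"),
                 val = 1:5)
key_cols <- c("id", "grp")

# duplicated() marks only the later copies; combining it with a scan
# from the end also marks the original row of each group.
in_dup_group <- duplicated(df[key_cols]) |
  duplicated(df[key_cols], fromLast = TRUE)
df[in_dup_group, ]  # rows 2 and 3, which share id = 2 and grp = "b"
```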
To emphasize the miscellany of this package, I now present
haversine_distance which simply returns the distance
between two points on the Earth, given their latitude and longitude.
I prefer this to other packages’ implementations. Although the
geosphere package can do a lot more than calculate
distances between points, I find the interface for
distHaversine unfortunate as it cannot be easily used
inside a data.frame. In addition, I’ve found the arguments
clearer in hutils::haversine_distance rather than trying to
remember whether to use byrow inside the
matrix function while passing to
distHaversine.
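For reference, the haversine formula itself is short. The sketch below is a hypothetical base-R version, not the package source; it assumes degrees as input, a spherical Earth of radius 6371 km, and kilometres as output.

```r
# Hypothetical sketch of the haversine formula.
haversine_sketch <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing `a` marginally above 1.
  2 * radius_km * asin(sqrt(pmin(1, a)))
}

# Vectorised arguments mean it can be used directly on columns.
df <- data.frame(lat_orig = c(0, -33.9), lon_orig = c(0, 151.2),
                 lat_dest = c(0, -37.8), lon_dest = c(180, 145.0))
df$distance_km <- with(df, haversine_sketch(lat_orig, lon_orig,
                                            lat_dest, lon_dest))
```

The first row spans half the equator, so its distance should be close to pi times the assumed radius.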
DT1 <- data.table(lat_orig = runif(1e5, -80, 80),
lon_orig = runif(1e5, -179, 179),
lat_dest = runif(1e5, -80, 80),
lon_dest = runif(1e5, -179, 179))
DT2 <- copy(DT1)
microbenchmark(DT1[, distance := haversine_distance(lat_orig, lon_orig,
                                                    lat_dest, lon_dest)],
               # distHaversine takes (lon, lat) pairs: origin, then destination
               DT2[, distance := distHaversine(cbind(lon_orig, lat_orig),
                                               cbind(lon_dest, lat_dest))])
rm(DT1, DT2)
mutate_other
There may be occasions where a categorical variable in a
data.table may need to be modified to reduce the number of
distinct categories. For example, you may want to plot a chart with a
set number of facets, or ensure the smooth operation of
randomForest, which accepts no more than 32 levels in a
feature.
mutate_other keeps the n most common categories
and changes the other categories to Other.
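The lumping step can be sketched on a plain character vector. The helper lump_other_sketch is hypothetical; it assumes character input and makes no promise about how ties between equally common categories are broken.

```r
# Hypothetical sketch: keep the n most common values, relabel the rest.
lump_other_sketch <- function(x, n) {
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  ifelse(x %in% keep, x, "Other")
}

x <- c("a", "a", "a", "b", "b", "c", "d")
lump_other_sketch(x, n = 2)
# "a" "a" "a" "b" "b" "Other" "Other"
```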
ngrep
This is a ‘dumb’ negation of grep. In recent versions of
R, grep has an invert argument (FALSE by default), and setting
invert = TRUE returns the non-matching elements. A slight advantage of
ngrep is that it’s shorter to type. But if you don’t have
arthritis, best use invert = TRUE or
!grepl.
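The base-R alternatives look like this (only base functions are used; the vector is made up):

```r
x <- c("apple", "banana", "cherry")

grep("an", x, invert = TRUE)                # 1 3
grep("an", x, invert = TRUE, value = TRUE)  # "apple" "cherry"
which(!grepl("an", x))                      # 1 3
```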
%notin%, %ein%, %enotin%, %pin%
These functions provide complementary functionality to
%in%:
%notin%
%notin% is the negation of %in%, but also
uses the package fastmatch to increase the speed of the
operation.
%ein% and %enotin%
The functions %ein% and %enotin% are
motivated by a different sort of problem. Consider the following
statement:
iris <- as.data.table(iris)
iris[Species %in% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)
On the face of it, this appears to give the average ratio of Iris
setosa and Iris versicolour irises. However, it only gives
the average ratio of setosa irises, as the correct spelling is
Iris versicolor not -our. This particular error is
easy to make (in fact, when I wrote this vignette, the first hit of
Google for iris dataset made the same spelling error), but
it’s easy to imagine similar mistakes, such as mistaking the
capitalization of a value. The functions %ein% and
%enotin% strive to reduce the occurrence of this mistake.
The functions operate exactly the same as %in% and
%notin% but raise an error if any element of the table of values to be
matched against is not present among the values:
iris <- as.data.table(iris)
iris[Species %ein% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)
The e stands for ‘exists’; i.e. they should be
read as “exists and in” and “exists and not in”.
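The check can be sketched in base R. The operator %ein_sketch% is hypothetical and its error message is invented for the example; it only illustrates the "exists and in" idea, not the package's implementation.

```r
# Hypothetical sketch of an "exists and in" operator.
`%ein_sketch%` <- function(x, table) {
  absent <- setdiff(table, x)
  if (length(absent) > 0L) {
    stop("Not present in the values: ", paste(absent, collapse = ", "))
  }
  x %in% table
}

species <- c("setosa", "versicolor", "virginica")
species %ein_sketch% c("setosa", "versicolor")  # TRUE TRUE FALSE

# The misspelling is caught instead of silently matching nothing.
res <- tryCatch(species %ein_sketch% c("setosa", "versicolour"),
                error = function(e) conditionMessage(e))
res  # "Not present in the values: versicolour"
```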
%pin%
This performs a partial match (i.e. grepl) but
with a possibly more readable or intuitive syntax.
If the RHS has more than one element, the matching is done on alternation (i.e. OR).
There is an important qualification: if the RHS is NULL,
then the result will be TRUE along the length of
x, contrary to the behaviour of %in%. This is
not entirely unexpected as NULL could legitimately be
interpreted as \(\varepsilon\), the
empty regular expression, which occurs in every string.
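Both points can be seen in a small base-R sketch. The operator %pin_sketch% is hypothetical and simply assumes the patterns are joined with | before being passed to grepl; the real %pin% may be implemented differently.

```r
# Hypothetical sketch of a partial-match operator.
`%pin_sketch%` <- function(x, patterns) {
  # Several patterns become one alternation; NULL collapses to "",
  # the empty pattern, which matches every string.
  grepl(paste0(patterns, collapse = "|"), x)
}

x <- c("apple", "banana", "cherry")
x %pin_sketch% "an"           # FALSE TRUE FALSE
x %pin_sketch% c("an", "rr")  # FALSE TRUE TRUE
x %pin_sketch% NULL           # TRUE TRUE TRUE
x %in% NULL                   # FALSE FALSE FALSE
```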
provide.dir
This is the same as dir.create but checks whether the
target exists or not and does nothing if it does. Motivated by
\providecommand in \(\rm\LaTeX{}\), which creates a macro only
if it does not exist already.
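A base-R sketch of the same idea, using a hypothetical helper provide_dir_sketch. Creating parent directories with recursive = TRUE is a choice made for this example, not a claim about provide.dir.

```r
# Hypothetical sketch: create a directory only if it is missing.
provide_dir_sketch <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

target <- file.path(tempdir(), "provide-dir-sketch")
provide_dir_sketch(target)
provide_dir_sketch(target)  # second call does nothing and does not warn
dir.exists(target)          # TRUE
```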
select_which
This provides a similar role to dplyr::select_if but was
originally part of package:grattan so has a different name.
It simply returns the columns whose values return
TRUE when Which is applied. Additional columns
(which may or may not satisfy Which) may be included by
using .and.dots. (To remove columns, you can use
drop_col).
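In base R, selecting columns by a predicate might be sketched as follows. The helper select_which_sketch and its also argument are hypothetical stand-ins for Which and .and.dots, and the resulting column order is an artefact of this sketch.

```r
# Hypothetical sketch: keep columns satisfying `predicate`, plus extras.
select_which_sketch <- function(df, predicate, also = character(0)) {
  satisfies <- vapply(df, predicate, logical(1))
  keep <- union(names(df)[satisfies], also)
  df[, keep, drop = FALSE]
}

df <- data.frame(id = c("a", "b"), x = 1:2, y = c(0.5, 1.5))
names(select_which_sketch(df, is.numeric))               # "x" "y"
names(select_which_sketch(df, is.numeric, also = "id"))  # "x" "y" "id"
```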
set_cols_first
Up to and including data.table 1.10.4, one could only
reorder the columns by supplying all the columns. You can use
set_cols_first and set_cols_last to put
columns first or last without supplying all the columns.
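The reordering amounts to building the full column order from a partial one. A base-R sketch with hypothetical helpers (these work on name vectors and do not modify a table by reference):

```r
# Hypothetical sketches: complete a partial column order.
cols_first_sketch <- function(all_cols, first) c(first, setdiff(all_cols, first))
cols_last_sketch  <- function(all_cols, last)  c(setdiff(all_cols, last), last)

df <- data.frame(a = 1, b = 2, c = 3, d = 4)
names(df[, cols_first_sketch(names(df), "c")])          # "c" "a" "b" "d"
names(df[, cols_last_sketch(names(df), c("a", "b"))])   # "c" "d" "a" "b"
```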
In some circumstances, you need to know that the key of
a data.table is unique. For example, you may expect a join
to be performed later, without specifying mult='first' or
permitting Cartesian joins. data.table does not require a
key to be unique and does not supply tools to check the
uniqueness of keys. hutils supplies two simple functions:
has_unique_key which when applied to a
data.table returns TRUE if and only if the
table has a key and it is unique.
set_unique_key does the same as setkey but
will error if the resultant key is not unique.
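The check described for has_unique_key can be approximated with data.table's own functions. The helper has_unique_key_sketch is hypothetical and assumes data.table is installed; it is not the package source.

```r
library(data.table)

# Hypothetical sketch: TRUE only if a key exists and has no duplicates.
has_unique_key_sketch <- function(DT) {
  haskey(DT) && anyDuplicated(DT, by = key(DT)) == 0L
}

DT <- data.table(id = c(1L, 2L, 2L), val = c("a", "b", "c"))
has_unique_key_sketch(DT)  # FALSE: no key is set
setkey(DT, id)
has_unique_key_sketch(DT)  # FALSE: id 2 appears twice
setkey(DT, id, val)
has_unique_key_sketch(DT)  # TRUE: each (id, val) pair is unique
```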
hutils v1.1.0
auc
The area under the (ROC) curve gives a single value to measure the tradeoff between true positives and false positives.
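One common way to compute this area is the rank-based (Mann-Whitney) formula: the proportion of positive/negative pairs in which the positive case receives the higher score. The sketch below uses that technique with a hypothetical helper auc_sketch; the package's auc may use a different method or argument names.

```r
# Hypothetical sketch of AUC via the rank-sum formula.
auc_sketch <- function(actual, score) {
  n_pos <- sum(actual)
  n_neg <- sum(!actual)
  r <- rank(score)  # ties receive average ranks
  (sum(r[actual]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.1, 0.2, 0.8, 0.9)
auc_sketch(c(FALSE, FALSE, TRUE, TRUE), score)  # 1: perfect separation
auc_sketch(c(FALSE, TRUE, FALSE, TRUE), score)  # 0.75: 3 of 4 pairs ordered correctly
```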
hutils v1.2.0
RQ
This is simply a shorthand to test whether a package needs installing. The package name need not be quoted, for convenience.
ahull
This locates the biggest rectangle beneath a curve:
if (!identical(Sys.info()[["sysname"]], "Darwin"))
ggplot(data.table(x = c(0, 1, 2, 3, 4), y = c(0, 1, 2, 0.1, 0))) +
geom_area(aes(x, y)) +
geom_rect(data = ahull(, c(0, 1, 2, 3, 4), c(0, 1, 2, 0.1, 0)),
aes(xmin = xmin,
xmax = xmax,
ymin = ymin,
ymax = ymax),
color = "red")
set.seed(101)
ahull_dt <-
data.table(x = c(0:100) / 100,
y = cumsum(rnorm(101, 0.05)))
if (!identical(Sys.info()[["sysname"]], "Darwin"))
ggplot(ahull_dt) +
geom_area(aes(x, y)) +
geom_rect(data = ahull(ahull_dt),
aes(xmin = xmin,
xmax = xmax,
ymin = ymin,
ymax = ymax),
color = "red") +
geom_rect(data = ahull(ahull_dt,
incl_negative = TRUE),
aes(xmin = xmin,
xmax = xmax,
ymin = ymin,
ymax = ymax),
color = "blue") +
geom_rect(data = ahull(ahull_dt,
incl_negative = TRUE,
minH = 4),
aes(xmin = xmin,
xmax = xmax,
ymin = ymin,
ymax = ymax),
color = "green") +
geom_rect(data = ahull(ahull_dt,
incl_negative = TRUE,
minW = 0.25),
aes(xmin = xmin,
xmax = xmax,
ymin = ymin,
ymax = ymax),
color = "white",
fill = NA)
hutils v1.3.0
weighted_quantile
Simply a version of quantile supporting weighted
values.
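Weighted quantiles have several competing definitions. The sketch below uses one simple convention (the smallest value whose cumulative weight share reaches the requested probability) with a hypothetical helper weighted_quantile_sketch; the package function may interpolate or break ties differently.

```r
# Hypothetical sketch using the "first value reaching the share" rule.
weighted_quantile_sketch <- function(x, w, p) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  vapply(p, function(prob) x[which(cum_share >= prob)[1]], numeric(1))
}

x <- c(10, 20, 30)
weighted_quantile_sketch(x, w = c(1, 1, 1), p = 0.5)  # 20
weighted_quantile_sketch(x, w = c(1, 1, 8), p = 0.5)  # 30: most weight sits on 30
```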
mutate_ntile
To add a column of ntiles (say, for later summarizing):
flights %>%
as.data.table %>%
.[, .(year, month, day, origin, dest, distance)] %>%
  mutate_ntile(distance, n = 5L)
You can use non-standard evaluation (as above) or you can quote the
col argument. Use character.only = TRUE to
ensure col is interpreted only as a character string.
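What an ntile column contains can be sketched on a bare vector. The helper ntile_sketch is hypothetical and uses one simple rule (rank scaled to n bins, ties broken by position); mutate_ntile may place bin boundaries differently.

```r
# Hypothetical sketch: assign each value to one of n roughly equal bins.
ntile_sketch <- function(x, n) {
  as.integer(ceiling(rank(x, ties.method = "first") * n / length(x)))
}

ntile_sketch(c(5, 1, 9, 3, 7, 2, 8, 4, 10, 6), n = 5L)
# 3 1 5 2 4 1 4 2 5 3
```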
hutils 1.4.0
%<->%
Referred to as swap in the documentation. Used to swap
values between object names.
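Swapping by name needs access to the caller's environment. The sketch below shows one way this could work in base R, using a hypothetical function swap_sketch rather than the package operator.

```r
# Hypothetical sketch: exchange the values bound to two names.
swap_sketch <- function(x, y) {
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  env <- parent.frame()
  tmp <- get(x_name, envir = env)
  assign(x_name, get(y_name, envir = env), envir = env)
  assign(y_name, tmp, envir = env)
  invisible(NULL)
}

a <- 1
b <- 2
swap_sketch(a, b)
c(a = a, b = b)  # a is now 2, b is now 1
```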
average_bearing
Determine the average bearing of vectors. Slightly more difficult than simply the average modulo 360, since the most acute sector is desired.
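To see why the arithmetic mean fails, consider bearings of 350 and 30 degrees: their mean is 190, which points almost the opposite way. A circular mean avoids this. The sketch below uses that standard technique with a hypothetical helper mean_bearing_sketch; it is not necessarily how average_bearing is implemented.

```r
# Hypothetical sketch: circular mean of bearings given in degrees.
mean_bearing_sketch <- function(bearings_deg) {
  rad <- bearings_deg * pi / 180
  # Average the unit vectors, then convert the direction back to degrees.
  (atan2(sum(sin(rad)), sum(cos(rad))) * 180 / pi) %% 360
}

mean(c(350, 30))                 # 190
mean_bearing_sketch(c(350, 30))  # approximately 10
```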
dir2
This is a faster version of list.files for Windows only,
utilizing the dir command on the command prompt.
replace_pattern_in
A cousin of find_pattern_in, but instead of collecting
the results, it replaces the contents sought with the replacement
provided.
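The core operation, for a single file, can be sketched in base R. The helper replace_in_file_sketch is hypothetical and handles one file only; it says nothing about how replace_pattern_in chooses which files to process.

```r
# Hypothetical sketch: rewrite one file with a pattern replaced.
replace_in_file_sketch <- function(path, pattern, replacement) {
  lines <- readLines(path)
  writeLines(gsub(pattern, replacement, lines), path)
  invisible(path)
}

tmp <- tempfile(fileext = ".R")
writeLines(c("x <- foo(1)", "y <- bar(2)"), tmp)
replace_in_file_sketch(tmp, "foo\\(", "baz(")
readLines(tmp)  # "x <- baz(1)" "y <- bar(2)"
```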