| Title: | Hierarchical Neyman-Pearson Classification for Ordered Classes |
| Version: | 0.2.0 |
| Description: | The Hierarchical Neyman-Pearson (H-NP) classification framework extends the Neyman-Pearson classification paradigm to multi-class settings where classes have a natural priority ordering. This is particularly useful for classification on imbalanced datasets, for example disease severity classification, where under-classification errors (misclassifying patients into less severe categories) are more consequential than other misclassifications. The package implements H-NP umbrella algorithms that control under-classification errors at user-specified control levels with high probability. It supports the creation of H-NP classifiers using scoring functions based on built-in classification methods (including logistic regression, support vector machines, and random forests), as well as user-trained scoring functions. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| Imports: | dplyr, e1071, MASS, nnet, randomForest |
| NeedsCompilation: | no |
| Packaged: | 2026-06-27 06:30:08 UTC; scsma |
| Author: | Che Shen [aut, cre] (Implementation and maintenance), Lujia Yang [aut] (Testing and debugging), Lijia Wang [aut] (Original theory and supervision), Shunan Yao [aut] (Supervision and debugging) |
| Maintainer: | Che Shen <chshen3-c@my.cityu.edu.hk> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-27 06:40:02 UTC |
Generate Synthetic Three-class Ball Data
Description
Generate a synthetic three-class dataset in which each class is sampled uniformly from a 3-dimensional ball with a given center and radius. Useful for demonstrating and testing the HNP Umbrella pipeline.
Usage
generate_ball_data(n, centers, radii)
Arguments
n |
Integer. Number of samples to draw for each class. |
centers |
A list of three numeric vectors of length 3, giving the center coordinates of the balls for classes "A", "B", and "C". |
radii |
Numeric vector of length 3 giving the radius of each class ball. |
Value
A data.frame with feature columns x1, x2, x3 and a factor
column y with levels c("A","B","C"), containing 3 * n rows.
Examples
set.seed(123)
data <- generate_ball_data(
n = 100,
centers = list(c(0, 0, 0), c(2, 2, 2), c(4, 0, 0)),
radii = c(1, 1, 1)
)
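Uniform sampling inside a ball can be sketched in base R as follows. This is only an illustration of the idea under stated assumptions, not the package's actual code: `sample_ball` is a hypothetical helper, and the direction-times-radius construction is one standard way to sample uniformly in volume.

```r
# Hypothetical helper (not part of the package): draw n points uniformly
# from a 3-dimensional ball with the given center and radius.
sample_ball <- function(n, center, radius) {
  # Random directions: normalize standard Gaussian vectors to unit length.
  z <- matrix(rnorm(n * 3), ncol = 3)
  dirs <- z / sqrt(rowSums(z^2))
  # Radii: the cube root makes the points uniform in volume, not in radius.
  r <- radius * runif(n)^(1 / 3)
  pts <- sweep(dirs * r, 2, center, "+")
  colnames(pts) <- c("x1", "x2", "x3")
  as.data.frame(pts)
}

set.seed(123)
centers <- list(c(0, 0, 0), c(2, 2, 2), c(4, 0, 0))
radii <- c(1, 1, 1)
parts <- Map(function(ctr, rad) sample_ball(100, ctr, rad), centers, radii)
ball_data <- do.call(rbind, parts)
ball_data$y <- factor(rep(c("A", "B", "C"), each = 100), levels = c("A", "B", "C"))
```

Drawing the radius as `radius * runif(n)^(1 / 3)` rather than `radius * runif(n)` avoids concentrating points near the center.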
HNP Under-classification Error Box Plot
Description
Summarize and visualize the under-classification errors and the
overall misclassification error across repeated runs for a T-class
problem. Given a list of confusion matrices (one per run), it draws grouped
boxplots of the class-wise under-classification errors and the overall
error. When a second list of confusion matrices is supplied, the two
methods (e.g. a classical classifier vs. H-NP) are compared side by side.
Optional control levels and tolerances add reference lines and
(1 - delta) quantile markers.
Usage
hnp_boxplot(
conf_1,
conf_2 = NULL,
levels = NULL,
tolerances = NULL,
name_1 = "Classical",
name_2 = "H-NP"
)
Arguments
conf_1 |
A list of |
conf_2 |
Optional list of |
levels |
Optional numeric vector of length |
tolerances |
Optional numeric vector of length |
name_1 |
Character. Legend label for the first method. |
name_2 |
Character. Legend label for the second method (used only when conf_2 is supplied). |
Value
Draws the boxplot and invisibly returns a list with two components:
classwise (a data.frame of per-class control level, tolerance,
under-classification error mean/sd, and violation rate) and overall
(a data.frame of the overall misclassification error mean and sd).
References
Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong (2024). "Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data." Journal of the American Statistical Association, 119(545), 39-51. doi:10.1080/01621459.2023.2270657
Examples
set.seed(123)
make_cm <- function() {
m <- matrix(sample(5:20, 9, replace = TRUE), 3, 3)
dimnames(m) <- list(c("1", "2", "3"), c("1", "2", "3"))
m
}
conf_classical <- replicate(20, make_cm(), simplify = FALSE)
conf_hnp <- replicate(20, make_cm(), simplify = FALSE)
tab <- hnp_boxplot(
conf_1 = conf_classical,
conf_2 = conf_hnp,
levels = c(0.1, 0.1),
tolerances = c(0.1, 0.1)
)
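The per-class summaries described under Value can be approximated by hand. The sketch below assumes that rows of each confusion matrix are true classes and columns are predicted classes, both ordered from most severe ("1") to least severe ("T"); the package may use a different orientation. `under_class_error` is a hypothetical helper.

```r
# Assumption: rows are true classes, columns are predicted classes,
# both ordered from most severe ("1") to least severe ("T").
under_class_error <- function(cm) {
  n_class <- nrow(cm)
  sapply(seq_len(n_class - 1), function(i) {
    # Share of true class i predicted as a less severe class (column > i).
    sum(cm[i, (i + 1):n_class]) / sum(cm[i, ])
  })
}

set.seed(123)
runs <- replicate(
  20,
  matrix(sample(5:20, 9, replace = TRUE), 3, 3),
  simplify = FALSE
)
errs <- t(sapply(runs, under_class_error))  # one row per run, one column per class

levels_alpha <- c(0.1, 0.1)  # control levels
delta <- c(0.1, 0.1)         # tolerances
summary_tab <- data.frame(
  class = seq_len(ncol(errs)),
  mean = colMeans(errs),
  sd = apply(errs, 2, sd),
  # Fraction of runs whose error exceeds the control level.
  violation_rate = colMeans(sweep(errs, 2, levels_alpha, ">")),
  # Empirical (1 - delta) quantile of the error across runs.
  quantile = unname(mapply(
    function(j, d) quantile(errs[, j], 1 - d),
    seq_len(ncol(errs)), delta
  ))
)
```

Under this reading, a method respects the high-probability guarantee for a class when its (1 - delta) quantile lies at or below the control level.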
Class Mapping Function for the HNP Algorithm: Map Ordered Class Labels to Internal Levels "1", ..., "T"
Description
Validate the class column and relabel the provided class names
as the internal factor levels "1", "2", ..., "T". The input order of the
labels in ... is treated as the importance order: the first label maps to
level "1" (highest priority), and the last label maps to
level "T" (lowest priority). Supports any number of classes T >= 2.
Usage
hnp_map_classes(data, class_col, ...)
Arguments
data |
A data.frame or data.table containing the dataset. |
class_col |
Character scalar. Name of the class/label column in |
... |
Two or more original class labels given in importance order. The first label maps to "1", the second to "2", and so on. Labels must be non-empty, non-NA, and unique. |
Value
A data.frame containing the input data, with class_col converted to a factor whose levels
are the internal labels c("1","2",...,"T").
Examples
df <- data.frame(y = c("low", "mid", "high", "mid"), x1 = rnorm(4))
df2 <- hnp_map_classes(df, class_col = "y", "high", "mid", "low")
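The relabelling itself amounts to a call to `factor()`. A minimal sketch follows, in which `map_to_internal` is a hypothetical stand-in and the specific validation checks are assumptions rather than the package's actual rules.

```r
# Hypothetical stand-in for the relabelling step (not the package function).
map_to_internal <- function(data, class_col, ...) {
  labels <- c(...)
  # Assumed validation: at least two unique, non-empty, non-NA labels,
  # and every observed class must appear among them.
  stopifnot(
    length(labels) >= 2,
    !anyNA(labels),
    all(nzchar(labels)),
    !anyDuplicated(labels),
    all(as.character(data[[class_col]]) %in% labels)
  )
  data[[class_col]] <- factor(
    as.character(data[[class_col]]),
    levels = labels,                          # importance order
    labels = as.character(seq_along(labels))  # internal "1", ..., "T"
  )
  data
}

df <- data.frame(y = c("low", "mid", "high", "mid"), x1 = c(0.3, -1.2, 0.8, 0.1))
df2 <- map_to_internal(df, "y", "high", "mid", "low")
as.character(df2$y)  # "3" "2" "1" "2"
```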
HNP Classifier's Performance Summary
Description
Evaluate a T-class classifier produced by
the HNP pipeline (or any compatible model/function). Given features and
true labels, it computes the confusion matrix, class-wise false
positive/negative rates, under- and over-classification errors, overall
accuracy, and a row-normalized error table. The class priority order is
resolved from importance_order, from attributes attached to the
classifier, or inferred from the labels.
Usage
hnp_summary(classifier, X, Y, importance_order = NULL)
Arguments
classifier |
Either a function |
X |
A data.frame of features (or a score matrix) to predict on. |
Y |
A vector of true class labels of length |
importance_order |
Optional character vector giving the class priority
order (most severe first). If |
Value
A list of evaluation results with components confusion_matrix,
false_positive_rate, false_negative_rate, overall_accuracy,
predictions, predictions_sample, under_classification_error,
over_classification_error, total_under_classification_error,
total_over_classification_error, error_table, class_levels,
internal_levels, label_mapping, importance_order, and order_source.
References
Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong (2024). "Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data." Journal of the American Statistical Association, 119(545), 39-51. doi:10.1080/01621459.2023.2270657
Examples
set.seed(123)
n <- 300
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- sample(c("A", "B", "C"), n, replace = TRUE)
clf <- hnp_umbrella(
X = X, Y = Y,
levels = c(0.1, 0.1),
tolerances = c(0.1, 0.1),
importance_order = c("A", "B", "C"),
method = "logistic"
)
result <- hnp_summary(clf, X = X, Y = Y, importance_order = c("A", "B", "C"))
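Several of the returned quantities can be reproduced from a plain confusion matrix. The sketch below uses toy labels in internal coding and assumes rows are true classes and columns are predictions; the layout used inside the package is not confirmed here.

```r
# Toy labels in internal coding: "1" is the most severe class.
lv <- c("1", "2", "3")
truth <- factor(c("1", "1", "1", "2", "2", "2", "3", "3", "3", "3"), levels = lv)
pred  <- factor(c("1", "2", "3", "2", "2", "1", "3", "3", "2", "3"), levels = lv)

# Assumption: rows are true classes, columns are predictions.
cm <- table(truth = truth, predicted = pred)
error_table <- prop.table(cm, margin = 1)  # row-normalized
overall_accuracy <- sum(diag(cm)) / sum(cm)

n_class <- nrow(cm)
# Under-classification: true class i predicted as a less severe class j > i.
under <- sapply(seq_len(n_class - 1), function(i) sum(error_table[i, (i + 1):n_class]))
# Over-classification: true class i predicted as a more severe class j < i.
over <- sapply(2:n_class, function(i) sum(error_table[i, 1:(i - 1)]))
```

Here `under` has one entry for each class except the least severe one, and `over` has one entry for each class except the most severe one.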
HNP Umbrella Algorithm
Description
Implementation of the Hierarchical Neyman-Pearson (HNP) Umbrella
algorithm for general T-class classification (T >= 2). Classes are
ordered by importance_order and re-labelled internally to the classes "1", ..., "T".
The algorithm trains (or reuses) a scoring-type classifier (e.g. random forest, support
vector machine, or logistic regression) and selects thresholds that control the
under-classification error of the more severe classes at the requested
levels with the given tolerances, either by treating upper bounds as thresholds or by
a grid search that minimizes the weighted remaining error.
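The role of a control level and a tolerance can be illustrated with the order-statistic rule familiar from binary Neyman-Pearson umbrella algorithms. The sketch below covers only the most severe class and assumes i.i.d. continuous scores; the package's procedure for the remaining classes is not reproduced here, and `np_threshold` is a hypothetical helper.

```r
# Hypothetical helper: pick a threshold for the most severe class.
# scores: held-out scores of class-1 samples (higher = more likely class 1).
# With the threshold set to the k-th smallest score, the under-classification
# error P(score < threshold | class 1) exceeds alpha with probability
# pbinom(k - 1, n, alpha), assuming i.i.d. continuous scores.
np_threshold <- function(scores, alpha, delta) {
  n <- length(scores)
  violation <- pbinom(seq_len(n) - 1, size = n, prob = alpha)
  feasible <- which(violation <= delta)
  # Too few samples to meet this (alpha, delta) pair.
  if (length(feasible) == 0) return(NA_real_)
  # The largest feasible rank gives the least conservative threshold.
  k <- max(feasible)
  sort(scores)[k]
}

set.seed(123)
class1_scores <- runif(100)
t1 <- np_threshold(class1_scores, alpha = 0.1, delta = 0.1)
```

A result of `NA` signals that the sample is too small for the requested level and tolerance, since even the smallest score would violate the bound too often.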
Usage
hnp_umbrella(
X,
Y,
levels,
tolerances,
importance_order,
method = "logistic",
pretrained_model = NULL,
input_is_score = FALSE,
grid_search = TRUE,
grid_set = NULL,
max_grid = 15,
max_combinations = 2000,
hnp_split = NULL,
verbose = FALSE
)
Arguments
X |
A data.frame of features, or an |
Y |
A vector of class labels of length |
levels |
Numeric vector of length |
tolerances |
Numeric vector of length |
importance_order |
Character vector giving the class priority order, most severe first. Defines the mapping to internal levels "1", ..., "T". |
method |
Character string: one of 'randomforest', 'svm', or 'logistic'. The base method used when training a scoring-type classifier internally. |
pretrained_model |
Optional. A user-supplied scoring model: a fitted
model object, a function returning an |
input_is_score |
Logical. If |
grid_search |
Logical. If |
grid_set |
Optional list of length |
max_grid |
Integer. Maximum number of candidate thresholds considered per class in the grid search. |
max_combinations |
Integer. Maximum number of threshold combinations explored in the grid search. |
hnp_split |
Optional list of length |
verbose |
Logical. If |
Value
An HNP Umbrella classifier: a function of the form function(new_data, output_internal = FALSE)
that returns predicted class labels (original labels by default, internal
labels if output_internal = TRUE), with attributes recording the selected
thresholds, objective value, importance_order, label mapping, and other
metadata. Returns NULL if no feasible classifier is found.
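A classifier of this shape can be pictured as a closure over scores and thresholds. The sketch below assumes a sequential rule in which a sample is assigned to the most severe class whose score reaches its threshold, and otherwise to the least severe class. `make_threshold_classifier`, `score_fun`, and the attribute names are hypothetical and need not match the package's internals.

```r
# Hypothetical constructor; names are illustrative, not the package's internals.
make_threshold_classifier <- function(score_fun, thresholds, original_labels) {
  n_class <- length(original_labels)
  stopifnot(length(thresholds) == n_class - 1)
  clf <- function(new_data, output_internal = FALSE) {
    scores <- score_fun(new_data)       # n x (T - 1) matrix of scores
    pred <- rep(n_class, nrow(scores))  # default: least severe class
    # Walk from the least to the most severe class so that the most severe
    # class whose score clears its threshold wins.
    for (i in rev(seq_len(n_class - 1))) {
      pred[scores[, i] >= thresholds[i]] <- i
    }
    if (output_internal) as.character(pred) else original_labels[pred]
  }
  attr(clf, "thresholds") <- thresholds
  attr(clf, "importance_order") <- original_labels
  clf
}

toy_scores <- function(d) cbind(d$s1, d$s2)
new_data <- data.frame(s1 = c(0.9, 0.2, 0.1), s2 = c(0.1, 0.8, 0.2))
toy_clf <- make_threshold_classifier(toy_scores, c(0.5, 0.5), c("A", "B", "C"))
toy_clf(new_data)                          # "A" "B" "C"
toy_clf(new_data, output_internal = TRUE)  # "1" "2" "3"
```

Because a real call may return NULL when no feasible classifier exists, it is worth checking `is.null()` on the result before predicting with it.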
References
Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong (2024). "Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data." Journal of the American Statistical Association, 119(545), 39-51. doi:10.1080/01621459.2023.2270657
Examples
set.seed(123)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- sample(c("A", "B", "C"), n, replace = TRUE, prob = c(0.2, 0.3, 0.5))
clf <- hnp_umbrella(
X = X, Y = Y,
levels = c(0.1, 0.1),
tolerances = c(0.1, 0.1),
importance_order = c("A", "B", "C"),
method = "logistic"
)
preds <- clf(X)
Neural Network Classifier Helper Function
Description
Fit a single-hidden-layer softmax neural network
(nnet::nnet) for multi-class classification and return the trained model
together with the metadata needed to produce class scores. This can be used
to supply a pretrained scoring model to the HNP Umbrella pipeline.
Usage
train_nn_and_get_scores(X, Y)
Arguments
X |
A data.frame or matrix of predictors/features. |
Y |
A vector or factor of class labels of length |
Value
A list with components: model (the fitted nnet object),
class_levels (the ordered class labels), and feature_names (the column
names of X).
Examples
set.seed(123)
X <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
Y <- factor(sample(c("1", "2", "3"), 60, replace = TRUE))
fit <- train_nn_and_get_scores(X, Y)
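For orientation, class scores can be obtained from `nnet::nnet` directly, as sketched below. The hidden-layer size and the formula interface are illustrative choices; the settings used inside `train_nn_and_get_scores` are not documented here and may differ.

```r
library(nnet)

set.seed(123)
X <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
Y <- factor(sample(c("1", "2", "3"), 60, replace = TRUE))

train_df <- cbind(X, y = Y)
# size = 4 hidden units is an arbitrary illustrative choice.
nn_fit <- nnet(y ~ ., data = train_df, size = 4, trace = FALSE)

# For a factor response with more than two levels, nnet fits a softmax output,
# so each row of the score matrix holds class probabilities.
scores <- predict(nn_fit, newdata = X, type = "raw")
head(scores)
```

The columns of `scores` follow the factor levels of `Y`, which is why keeping the ordered class labels alongside the model is useful when the scores are later compared against thresholds.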