Title: | Various Methods for the Two Sample Problem in D>1 Dimensions |
Version: | 1.0.0 |
Description: | The routine twosample_test() in this package runs the two-sample test using various test statistic for multivariate data. The user can also run several tests and then find a p value adjusted for simultaneous inference. The p values are found via permutation or via the parametric bootstrap. The routine twosample_power() allows the estimation of the power of the tests. The routine run.studies() allows a user to quickly study the power of a new method and how it compares to those included in the package. For details of the methods and references see the included vignettes. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
LinkingTo: | Rcpp |
Imports: | Rcpp, parallel, stats, FNN, copula, ade4, gTests, igraph, lsa, microbenchmark, mvtnorm, Ball |
Suggests: | rmarkdown, knitr, ggplot2, egg, testthat (≥ 3.0.0) |
VignetteBuilder: | knitr |
Depends: | R (≥ 3.5) |
LazyData: | true |
NeedsCompilation: | yes |
Packaged: | 2025-07-21 17:52:08 UTC; Wolfgang |
Author: | Wolfgang Rolke |
Maintainer: | Wolfgang Rolke <wolfgang.rolke@upr.edu> |
Repository: | CRAN |
Date/Publication: | 2025-07-22 11:40:11 UTC |
MD2sample: Various Methods for the Two Sample Problem in D>1 Dimensions
Description
The routine twosample_test() in this package runs the two-sample test using various test statistic for multivariate data. The user can also run several tests and then find a p value adjusted for simultaneous inference. The p values are found via permutation or via the parametric bootstrap. The routine twosample_power() allows the estimation of the power of the tests. The routine run.studies() allows a user to quickly study the power of a new method and how it compares to those included in the package. For details of the methods and references see the included vignettes.
Author(s)
Maintainer: Wolfgang Rolke wolfgang.rolke@upr.edu (ORCID)
Friedman-Rafsky (FR) test
Description
FR test is a multivariate generalization of nonparametric two-sample test. This function is an implementation with customized options, including a visualization of the minimum spanning tree (MST).
Usage
FR.test(
samp1,
samp2,
use.cosine = FALSE,
binary = FALSE,
binary.cutoff = 2,
plot.MST = FALSE,
col = c("#F0E442", "#56B4E9"),
label.names = c("Sample 1", "Sample 2"),
vertex.size = 5,
edge.width = 1,
...
)
Arguments
samp1 |
Numeric matrix or data frame for Sample 1. Rows are multivariate dimensions, and columns are samples. E.g. gene by cell. |
samp2 |
Numeric matrix or data frame for Sample 2. |
use.cosine |
An option if to use cosine distance. Logical variable. By default ( |
binary |
An option if to use binary values. Logical variable. Default: |
binary.cutoff |
Numeric value for binary cutoff. Binary value = 1 if greater than |
plot.MST |
Boolean variable indicating if to plot the minimum spanning tree (MST). Default: |
col |
Character vector of length two for customized colors of the nodes in MST. Default: |
label.names |
Character vector of length two for customized names of the two samples. Default: |
vertex.size , edge.width , ... |
Additional plotting parammeters passed to |
Value
Test statistics and p-values.
runs |
Total number of subtrees. |
runs.samp1 |
Number of subtrees of Sample 1. |
runs.samp2 |
Number of subtrees of Sample 2. |
stat |
The standardized FR statistic. |
p.value |
P-value of the FR test. |
equivalent to R command order
Description
equivalent to R command order
Usage
Order(x)
Arguments
x |
numeric vector |
Value
integer vector
equivalent to R command rank
Description
equivalent to R command rank
Usage
Rank(x)
Arguments
x |
numeric vector |
Value
integer vector
Find test statistics for continuous data
Description
Find test statistics for continuous data
Usage
TS_cont(x, y, TSextra)
Arguments
x |
A matrix. |
y |
A matrix |
TSextra |
list with some helper functions |
Value
A numeric vector with test statistics
This function runs a number of two sample tests which find their own p value.
Description
This function runs a number of two sample tests which find their own p value.
Usage
TS_cont_pval(x, y)
Arguments
x |
a matrix of numbers if data is continuous or of counts if data is discrete. |
y |
a matrix of numbers if data is continuous or of counts if data is discrete. |
Value
A list of two numeric vectors, the test statistics and the p values.
Find test statistics for discrete data
Description
Find test statistics for discrete data
Usage
TS_disc(x, y, vx, vy, TSextra)
Arguments
x |
a vector of integers. |
y |
a vector of integers. |
vx |
a vector of real. |
vy |
a vector of real. |
TSextra |
list with extra infos |
Value
A numeric vector with test statistics
Count Events in Bins
Description
This function discretizes continuous data in two dimensions
Usage
bincounterR(
x,
y,
nbins = c(5, 5),
Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2)
)
Arguments
x |
a matrix of numbers or a list |
y |
a matrix of numbers |
nbins |
=c(5,5) number of bins in x and in y direction |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds |
Value
a list counts and vals
This function calculates the test statistics
Description
This function calculates the test statistics
Usage
calcTS(dta, TS, typeTS, TSextra)
Arguments
dta |
data set |
TS |
routine |
typeTS |
format of TS |
TSextra |
list passed to TS function |
Value
A vector of numbers
Create case studies
Description
This function creates the functions needed to run the various case studies.
Usage
case.studies(
which,
n = 200,
nx = n,
ny = n,
nbins = -1,
Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
ReturnCaseNames = FALSE
)
Arguments
which |
name of the case study. |
n |
=200, sample size for both data sets |
nx |
=n, sample size for first data sets |
ny |
=n, sample size for second data sets |
nbins |
=-1, number of bins for chi-square test, 2D only |
Ranges |
= matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), ranges of variables |
ReturnCaseNames |
=FALSE, should list of case studies be returned? |
Value
a list of functions and vectors
Change Marginals of 2D Data
Description
This function modifies the marginals of some data set
Usage
change.marginals(dta, which, theta)
Arguments
dta |
a list with two data sets |
which |
method for modifying the data sets. |
theta |
how much to modify |
Value
a list of functions
Chi-square test for continuous data, example of user supplied function
Description
This function does the chi square two sample test for continuous data in two dimensions
Usage
chiTS.cont(x, y, TSextra)
Arguments
x |
a matrix of numbers |
y |
a matrix of numbers |
TSextra |
list with some info |
Value
either a test statistic or a p value
Chi-square test for discrete data, example of user supplied function
Description
This function does the chi square two sample test for discrete data in two dimensions
Usage
chiTS.disc(x, y, vals_x, vals_y, TSextra)
Arguments
x |
a vector with counts |
y |
a vector with counts |
vals_x |
a vector with points |
vals_y |
a vector with points |
TSextra |
list with some info |
Value
either a test statistic or a p value
This function does the chi square two sample test for continuous data in two dimensions
Description
This function does the chi square two sample test for continuous data in two dimensions
Usage
chisq2D_test_cont(
dta_x,
dta_y,
Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
nbins = c(5, 5),
minexpcount = 5
)
Arguments
dta_x |
a matrix of numbers |
dta_y |
a matrix of numbers |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds |
nbins |
=c(5,5) number of bins in x and y direction |
minexpcount |
=5 minimum counts required per bin |
Value
a list with statistics and p values
This function does the chi square two sample test for continuous data in two dimensions
Description
This function does the chi square two sample test for continuous data in two dimensions
Usage
chisq2D_test_disc(dta, minexpcount = 5)
Arguments
dta |
a list with values and counts |
minexpcount |
=5 minimum counts required per bin |
Value
a list with statistics and p values
This function does the tests by Chen and Friedman
Description
This function does the tests by Chen and Friedman
Usage
edge.tests(x, y)
Arguments
x |
a matrix of numbers |
y |
a matrix of numbers |
Value
a list with statistics and p values
find pairwise Euclidean distances
Description
find pairwise Euclidean distances
Usage
find_dist(dta)
Arguments
dta |
A list of matrices |
Value
A matrix
simulate continuous data
Description
simulate continuous data
Usage
gen_cont(x, y, TSextra)
Arguments
x |
first data set |
y |
second data set |
TSextra |
extra stuff |
Value
A list of permuted vectors
simulate new discrete data
Description
simulate new discrete data
Usage
gen_disc(x, y, vals_x, vals_y, TSextra)
Arguments
x |
first data set, counts |
y |
second data set, counts |
vals_x |
values of discrete random variable x |
vals_y |
values of discrete random variable y |
TSextra |
extra stuff |
Value
A list of permuted vectors
simulate new data
Description
simulate new data
Usage
gen_sim_data(dta, TSextra)
Arguments
dta |
data set |
TSextra |
extra stuff |
Value
A list of permuted vectors
Finds the empirical distribution function
Description
Finds the empirical distribution function
Usage
mdecdf(dta, pts)
Arguments
dta |
a matrix of data points |
pts |
a matrix of evaluation points |
Value
a numeric vector
Find the power of various continuous tests via simutation or permutation.
Description
Find the power of various continuous tests via simutation or permutation.
Usage
powerC(rxy, xparam, yparam, TS, typeTS, TSextra, B = 1000L)
Arguments
rxy |
a function that generates x and y data. |
xparam |
arguments for r1. |
yparam |
arguments for r2. |
TS |
routine to calculate test statistics for non-chi-square tests |
typeTS |
indicator for type of test statistics |
TSextra |
additional info passed to TS, if necessary |
B |
=1000 number of simulation runs |
Value
A list values of test statistics
This function finds the power for tests that find their own p values
Description
This function finds the power for tests that find their own p values
Usage
power_pvals(
rxy,
avals,
bvals,
TS,
typeTS,
TSextra = list(a = 0),
nbins = c(5, 5),
minexpcount = 5,
Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
alpha = 0.05,
B = 1000
)
Arguments
rxy |
a function to generate data |
avals |
vector of parameter values |
bvals |
vector of parameter values |
TS |
routine that runs the test and returns p values |
typeTS |
type of problem, continuous or discrete |
TSextra |
=list(a=0) a list of things passed to TS, if needed |
nbins |
=c(5, 5), bins for chi square tests. |
minexpcount |
=5, lowest required count for chi-square test |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds |
alpha |
=0.05 type I error probability of test |
B |
=1000 number of simulation runs |
Value
A matrix of power values
power_studies_results
Description
the results of the included power studies
Usage
power_studies_results
Format
'power_studies_results'
A list of matrices with powers
This function creates data for Dalitz plot
Description
This function creates data for Dalitz plot
Usage
rDalitz(nyadd = 0, which = 1, addon, n = 200, nx = n, ny = n)
Arguments
nyadd |
=0 |
which |
type of Daliz plot. |
addon |
what to add to uniform |
n |
=200, sample size of both data sets. |
nx |
=n, sample size of x data set. |
ny |
=n, sample size of y data set. |
Value
a list of functions
Benchmarking for Multivariate Two-Sample Tests
Description
This function runs the case studies included in the package.
Usage
run.studies(
Continuous = TRUE,
study,
TS,
TSextra,
With.p.value = FALSE,
nsample = 200,
alpha = 0.05,
param_alt,
SuppressMessages = FALSE,
B = 1000,
maxProcessor
)
Arguments
Continuous |
=TRUE, run cases for continuous data. |
study |
either the name of the study, or its number in the list. If missing all the studies are run. |
TS |
routine to calculate new test statistics. |
TSextra |
list passed to TS (optional). |
With.p.value |
=FALSE, does user supplied routine return p values? |
nsample |
= 200, desired sample size. 200 is used in included case studies. |
alpha |
=0.05, type I error probability of tests. 0.05 is used in included case studies. |
param_alt |
(list of) values of parameter under the alternative hypothesis. If missing included values are used. |
SuppressMessages |
=FALSE, should informative messages be printed? |
B |
= 1000, number of simulation runs. |
maxProcessor |
number of cores to use. If missing the number of physical cores-1 is used. If set to 1 no parallel processing is done. |
Details
For details consult vignette(package="MD2sample")
Value
A (list of ) matrices of p.values.
Examples
#The new test is a (included) chi square test:
TSextra=list(which="pval", nbins=rbind(c(3,3), c(4,4)))
run.studies(Continuous=TRUE, study=c("NormalD2", "tD2"),
TS=MD2sample::chiTS.cont, TSextra=TSextra, B=100)
This function does some rounding to nice numbers
Description
This function does some rounding to nice numbers
Usage
## S3 method for class 'digits'
signif(x, d = 4)
Arguments
x |
a list of two vectors |
d |
=4 number of digits to round to |
Value
A list with rounded vectors
Helper function to find test statistics of simulated data.
Description
Helper function to find test statistics of simulated data.
Usage
simTS(dta, TS, typeTS, TSextra, B)
Arguments
dta |
a list |
TS |
test statistic routine |
typeTS |
type of routine |
TSextra |
a list |
B |
number of simulation runs |
Value
a matrix
Helper function to find test statistics of simulated data.
Description
Helper function to find test statistics of simulated data.
Usage
simpvals(
dta,
TS,
typeTS,
TSextra,
A,
Continuous,
Ranges,
nbins,
minexpcount,
B
)
Arguments
dta |
a list |
TS |
test statistic routine |
typeTS |
type of routine |
TSextra |
a list |
A |
a matrix |
Continuous |
logical |
Ranges |
a matrix |
nbins |
a vector |
minexpcount |
an integer |
B |
number of simulation runs |
Value
a matrix
run tests using either simulation or permutation.
Description
run tests using either simulation or permutation.
Usage
testC(dta, TS, typeTS, TSextra, B = 5000L)
Arguments
dta |
a list with the data |
TS |
routine to calculate test statistics for non-chi-square tests |
typeTS |
type of a test statistic |
TSextra |
additional info passed to TS, if necessary |
B |
=5000, number of simulation runs. |
Value
A list with test statistics and p values
This function checks whether the correct methods have been requested
Description
This function checks whether the correct methods have been requested
Usage
test_methods(doMethods, Continuous, ReturnMethodNames = FALSE)
Arguments
doMethods |
="all" Which methods should be included? |
Continuous |
is data continuous |
ReturnMethodNames |
= FALSE should names of methods be returned? |
Value
TRUE/FALSE or a character vector
test function
Description
test function
Usage
timecheck(dta, TS, typeTS, TSextra)
Arguments
dta |
data set |
TS |
test statistics |
typeTS |
format of TS |
TSextra |
additional info TS |
Value
Mean computation time
transform data to interval (0,1).
Description
transform data to interval (0,1).
Usage
transform01(dta)
Arguments
dta |
a list. |
Value
a list
Power Estimation for Multivariate Two-Sample Tests
Description
Estimate the power of various two sample tests using Rcpp and parallel computing.
Usage
twosample_power(
f,
...,
TS,
TSextra,
alpha = 0.05,
B = 1000,
nbins = c(5, 5),
minexpcount = 5,
Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
samplingmethod = "Binomial",
rnull,
With.p.value = FALSE,
DoTransform = TRUE,
SuppressMessages = FALSE,
LargeSampleOnly = FALSE,
maxProcessor,
doMethods = "all"
)
Arguments
f |
function to generate a list with data sets x and y for continuous data or a matrix with columns vals_x, vals_y, x and y for discrete data. |
... |
additional arguments passed to f, up to 2. |
TS |
routine to calculate test statistics for new tests. |
TSextra |
additional info passed to TS, if necessary. |
alpha |
=0.05, the type I error probability of the hypothesis test. |
B |
=1000, number of simulation runs. |
nbins |
=c(5, 5), number of bins for chi square test if Dim=2. |
minexpcount |
=5, lowest required count for chi-square test. |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2), a 2x2 matrix with lower and upper bounds. |
samplingmethod |
="Binomial" for Binomial sampling or "independence" for independence sampling in the discrete data case. |
rnull |
function to generate new data sets for parametric bootstrap. |
With.p.value |
=FALSE, does user supplied routine return p values? |
DoTransform |
=TRUE, should data be transformed to to unit hypercube? |
SuppressMessages |
=FALSE, should messages be printed? |
LargeSampleOnly |
=FALSE, should only methods with large sample theories be run? |
maxProcessor |
number of cores to use. If missing the number of physical cores-1 is used. If set to 1 no parallel processing is done. |
doMethods |
="all", which methods should be included? |
Details
For details consult vignette("MD2sample","MD2sample")
Value
A numeric matrix or vector of power values.
Examples
#Note that the resulting power estimates are meaningless because
#of the extremely low number of simulation runs B, required because of CRAN timing rule
#
#Power of tests when one data set comes from a standard normal multivariate distribution function
#and the other data set from a multivariate normal with correlation
#number of simulation runs is ridiculously small because of CRAN submission rules
f=function(a=0) {
S=diag(2)
x=mvtnorm::rmvnorm(100, sigma = S)
S[1,2]=a
S[2,1]=a
y=mvtnorm::rmvnorm(120, sigma = S)
list(x=x, y=y)
}
twosample_power(f, c(0, 0.5), B=10, maxProcessor=1)
#Power of use supplied test. Example is a (included) chi-square test:
TSextra=list(which="statistics", nbins=rbind(c(3,3), c(4,4)))
twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=10, maxProcessor=1)
#Same example, but this time the user supplied routine calculates p values:
TSextra=list(which="pvalues", nbins=c(4,4))
twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=10,
With.p.value=TRUE, maxProcessor=1)
#Example for discrete data
g=function(p1, p2) {
x = table(sample(1:4, size=1000, replace = TRUE))
y = table(sample(1:4, size=500, replace = TRUE, prob=c(p1,p2,1,1)))
cbind(vals_x=rep(1:2,2), vals_y=rep(1:2, each=2), x=x, y=y)
}
twosample_power(g, 1.5, 1.6, B=10, maxProcessor=1)
Multivariate Two-Sample Tests
Description
This function runs a number of two sample tests using Rcpp and parallel computing.
Usage
twosample_test(
x,
y,
vals_x = NA,
vals_y = NA,
TS,
TSextra,
B = 5000,
nbins = c(5, 5),
minexpcount = 5,
Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
DoTransform = TRUE,
samplingmethod = "Binomial",
rnull,
SuppressMessages = FALSE,
LargeSampleOnly = FALSE,
maxProcessor,
doMethods = "all"
)
Arguments
x |
Continuous data: either a matrix of numbers, or a list with two matrices called x and y. if it is a matrix Observations are in different rows. Discrete data: a vector of counts or a matrix with columns named vals_x, vals_y, x and y. |
y |
a matrix of numbers if data is continuous or a vector of counts if data is discrete. |
vals_x |
=NA, a vector of values for discrete random variables, or NA if data is continuous. |
vals_y |
=NA, a vector of values for discrete random variables, or NA if data is continuous. |
TS |
user supplied routine to calculate test statistics for new tests. |
TSextra |
(optional) additional info passed to TS, if necessary. |
B |
=5000, number of simulation runs for permutation test. |
nbins |
=c(5,5), for chi square tests (2D only). |
minexpcount |
=5, lowest required count for chi-square test (2D only). |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2), a 2x2 matrix with lower and upper bounds (2D only). |
DoTransform |
=TRUE, should data be transformed to unit hypercube? |
samplingmethod |
="Binomial" for Binomial sampling or "independence" for independence sampling. |
rnull |
function to generate new data sets for simulation as an alternative to the permutation method. |
SuppressMessages |
=FALSE, should informative messages be printed? |
LargeSampleOnly |
=FALSE, should only methods with large sample theories be run? |
maxProcessor |
number of cores to use. If missing the number of physical cores-1 is used. If set to 1 no parallel processing is done. |
doMethods |
="all", Which methods should be included? |
Details
For details consult vignette("MD2sample","MD2sample")
Value
A list of two numeric vectors, the test statistics and the p values.
Examples
#Two continuous data sets from a multivariate normal:
x = mvtnorm::rmvnorm(100, c(0,0))
y = mvtnorm::rmvnorm(120, c(0,0))
twosample_test(x, y, B=100, maxProcessor=1)
#Using a new test, this one is an (included) chi square test.
#Also enter data as a list:
TSextra=list(which="statistics", nbins=rbind(c(3,3), c(4,4)))
dta=list(x=x, y=y)
twosample_test(dta, TS=chiTS.cont, TSextra=TSextra, B=100, maxProcessor=1)
#Two discrete data sets from some distribution:
x = table(sample(1:4, size=1000, replace = TRUE))
y = table(sample(1:4, size=1000, replace = TRUE, prob=c(1,2,1,1)))
vals_x=rep(1:2,2)
vals_y=rep(1:2, each=2)
twosample_test(x, y, vals_x, vals_y, B=100, maxProcessor=1)
#Run a discrete chi square test and enter the data as a matrix:
TSextra=list(which="statistics")
dta=cbind(x=x, y=y, vals_x=vals_x, vals_y=vals_y)
twosample_test(dta, TS=chiTS.disc, TSextra=TSextra, B=100, maxProcessor=1)
Adjusted p values
Description
This function runs a number of two sample tests using Rcpp and parallel computing and then finds the correct p value for the combined tests.
Usage
twosample_test_adjusted_pvalue(
x,
y,
vals_x = NA,
vals_y = NA,
B = c(5000, 1000),
nbins = c(5, 5),
minexpcount = 5,
samplingmethod = "Binomial",
Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
DoTransform = TRUE,
rnull,
SuppressMessages = FALSE,
maxProcessor,
doMethods
)
Arguments
x |
Continuous data: either a matrix of numbers, or a list with two matrices called x and y. if it is a matrix Observations are in different rows. Discrete data: a vector of counts or a matrix with columns named vals_x, vals_y, x and y. |
y |
a matrix of numbers if data if data is continuous or a vector of counts if data is discrete. |
vals_x |
=NA, a vector of values for discrete random variable, or NA if data is continuous. |
vals_y |
=NA, a vector of values for discrete random variable, or NA if data is continuous. |
B |
=c(5000, 1000), number of simulation runs for permutation test and for estimation of the empirical distribution function. |
nbins |
=c(5, 5), number of bins for chi square tests (2D only). |
minexpcount |
= 5, minimum required expected counts for chi-square tests. |
samplingmethod |
="Binomial" or "independence" for discrete data. |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds. |
DoTransform |
=TRUE, should data be transformed to interval (0,1)? |
rnull |
routine for parametric bootstrap. |
SuppressMessages |
= FALSE, print informative messages? |
maxProcessor |
number of cores for parallel processing. |
doMethods |
Which methods should be included? If missing a small number of methods that generally have good power are used. |
Details
For details consult the vignette("MD2sample","MD2sample")
Value
NULL, results are printed out.
Examples
#Note that the number of simulation runs B is very small to
#satisfy CRAN's run time constraints.
#Two continuous data sets from a multivariate normal:
x = mvtnorm::rmvnorm(100, c(0,0))
y = mvtnorm::rmvnorm(120, c(0,0))
twosample_test_adjusted_pvalue(x, y, maxProcessor=1, B=20)
#Two discrete data sets from some distribution:
x = table(sample(1:4, size=1000, replace = TRUE))
y = table(sample(1:4, size=500, replace = TRUE, prob=c(1, 1.5, 1, 1)))
twosample_test_adjusted_pvalue(x, y, rep(1:2,2), rep(1:2, each=2), maxProcessor=1, B=20)
sort vector y by values in vector x
Description
sort vector y by values in vector x
Usage
xysort(y, x)
Arguments
y |
numeric vector |
x |
numeric vector |
Value
numeric vector