---
title: "MD2sample"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MD2sample}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: [MD2sample.bib]
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(MD2sample)
B=200
```

**Note** that all examples are run without parallel processing and with a small number of simulation runs B to satisfy CRAN submission rules.

This package brings together a number of routines for the two-sample problem for multivariate data. There are two data sets x and y, and we want to test whether they were generated by the same probability distribution.

The highlights of this package are:

- it runs a large number of tests simultaneously.
- the user can provide their own tests.
- it works for continuous and for discrete data.
- some tests use the permutation method to find p values, others use large-sample theory.
- the user can also provide a routine to generate simulated data for the goodness-of-fit/two-sample hybrid problem.
- tests can be combined and a single adjusted p value found.
- it uses Rcpp and parallel programming to be very fast.
- it includes routines that make power studies very simple.
- it has a routine that runs (up to) 50 different case studies, comparing the power of a new method to the methods included in the package. This makes it very easy for someone who has developed a new method to benchmark its performance.

## Continuous Data

### Example 1

We generate two-dimensional data sets of size 100 and 120, respectively, from multivariate normal distributions and run the test:

```{r}
set.seed(123)
```

```{r}
x1 = mvtnorm::rmvnorm(100, c(0,0))
y1 = mvtnorm::rmvnorm(120, c(0,0))
twosample_test(x1, y1, B=B, maxProcessor = 1)
```

**Arguments of twosample_test**

- *x* a matrix of numbers with different observations in different rows, or a list of x and y.
- *y* a matrix of numbers, or missing if x is a list with both data sets.
- *TS* a routine to calculate test statistics or p values for new tests.
- *TSextra* additional info passed to TS, if necessary.
- *B=5000* number of simulation runs for permutation test.
- *nbins=c(5,5)* bins for chi-square tests (2D only).
- *minexpcount=5* lowest required count for chi-square test.
- *Ranges=matrix(c(-Inf, Inf, -Inf, Inf),2,2)* a 2x2 matrix with lower and upper bounds of observations.
- *rnull* routine that generates data under the null hypothesis, for hybrid problem (see below).
- *DoTransform=TRUE* should data be transformed to unit hypercube?
- *SuppressMessages=FALSE* should informative messages be printed?
- *maxProcessor* maximum number of cores to use. If set to 1 no parallel processing is used.
- *doMethods="all"* names of methods to include.

### New Tests

The routine *chiTS.cont* (included in the package) finds either the test statistic or the p value of a chi-square test in two dimensions. First we need to set

```{r}
TSextra=list(which="statistic", nbins=cbind(c(3,3), c(3,4)))
```

This list is passed to *chiTS.cont* and tells the routine to

- return the test statistic (and not the p value).
- run two tests, first with 3 and 3 bins in the x and y directions and then with 3 and 4 bins.

```{r}
twosample_test(x1, y1, TS=chiTS.cont, TSextra=TSextra, B=B, maxProcessor = 1)
```

**Arguments and output of new test routine for continuous data**

The arguments have to be x and y for the two data sets and (optionally) a list called *TSextra* for any additional input needed to find the test statistic. Note that the output vector of the routine has to be a **named** vector. If the routine is written in Rcpp, parallel programming is not available.
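As an illustration, here is a minimal sketch of what such a routine could look like; the name *meanTS* and the choice of statistic are ours, not part of the package. It compares the component-wise means of the two samples and returns a named vector, as required:

```{r eval=FALSE}
# Hypothetical user-supplied test statistic: largest absolute difference
# of the component-wise sample means.
meanTS = function(x, y) {
  TS = max(abs(apply(x, 2, mean) - apply(y, 2, mean)))
  names(TS) = "MeanDiff" # output has to be a named vector
  TS
}
twosample_test(x1, y1, TS=meanTS, B=B, maxProcessor=1)
```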
### Adjusted p values

If several tests are run one can use the routine *twosample_test_adjusted_pvalue* to find a p value that is adjusted for simultaneous testing:

```{r}
twosample_test_adjusted_pvalue(x1, y1, B=c(B,B), maxProcessor = 1)
```

### Power Estimation

The routine **twosample_power** allows us to estimate the power of the tests.

### Example 2

Let's say we want to estimate the power when one data set comes from a multivariate standard normal distribution and the other from a normal distribution with a different covariance. We first need a function that generates these data sets:

```{r}
f=function(a=0) {
  S=diag(2)
  x=mvtnorm::rmvnorm(100, sigma = S)
  S[1,2]=a
  S[2,1]=a
  y=mvtnorm::rmvnorm(120, sigma = S)
  list(x=x, y=y)
}
```

Now we can run

```{r}
twosample_power(f, c(0,0.5), B=B, maxProcessor=1)
```

**Arguments of twosample_power**

- *f* a function that generates data.
- *...* arguments passed to f, up to 2.
- *TS* a routine to calculate test statistics or p values for new tests.
- *TSextra* additional info passed to TS, if necessary.
- *alpha=0.05* desired type I error probability.
- *B=1000* number of simulation runs for power estimation.
- *nbins=c(5,5)* bins for chi-square tests (2D only).
- *minexpcount=5* lowest required count for chi-square test.
- *Ranges=matrix(c(-Inf, Inf, -Inf, Inf),2,2)* a 2x2 matrix with lower and upper bounds of observations.
- *rnull* routine that generates data under the null hypothesis, for hybrid problem (see below).
- *DoTransform=TRUE* should data be transformed to unit hypercube?
- *SuppressMessages=FALSE* should informative messages be printed?
- *LargeSampleOnly=FALSE* should only methods with large sample theories be run?
- *doMethods="all"* names of methods to include.

Again the user can provide their own routine:

```{r}
twosample_power(f, c(0,0.5), TS=chiTS.cont, TSextra=TSextra, B=B, maxProcessor=1)
```

The routine that generates data can also have two arguments:

```{r}
f1=function(a=0, n=100) {
  S=diag(2)
  x=mvtnorm::rmvnorm(100, sigma = S)
  S[1,2]=a
  S[2,1]=a
  y=mvtnorm::rmvnorm(n, sigma = S)
  list(x=x, y=y)
}
```

If the user routine returns p values, run

```{r}
TSextra=list(which="pvalue", nbins=c(5,5))
twosample_power(f1, c(0,0.5), c(100, 120), TS=chiTS.cont, TSextra=TSextra,
                B=B, With.p.value=TRUE, maxProcessor=1)
```

## Discrete Data

First note that tests for discrete data are implemented only in two dimensions.

### Example 3

We consider the case of data from binomial distributions:

```{r}
vals_x = 0:5 # possible values of first random variable
vals_y = 0:6 # possible values of second random variable
a1x = rbinom(1000, 5, 0.5)
a2x = rbinom(1000, 6, 0.5)
a1y = rbinom(1200, 5, 0.5)
a2y = rbinom(1200, 6, 0.5)
x2=matrix(0, 6*7, 4)
colnames(x2)=c("vals_x", "vals_y", "x", "y")
x2[, 1]=rep(vals_x, length(vals_y))
x2[, 2]=rep(vals_y, each=length(vals_x))
for(i in 0:5) {
  for(j in 0:6) {
    x2[x2[,1]==i&x2[,2]==j, 3]=sum(a1x==i&a2x==j)
    x2[x2[,1]==i&x2[,2]==j, 4]=sum(a1y==i&a2y==j)
  }
}
twosample_test(x2, B=B, maxProcessor = 1)
```
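As an aside, the count matrix above can also be built without the double loop using base R's *table* function; a minimal sketch (the name x2_alt is ours):

```{r eval=FALSE}
# Same count matrix via table(); the explicit factor levels keep the
# cells with zero counts.
x2_alt = cbind(vals_x = rep(vals_x, length(vals_y)),
               vals_y = rep(vals_y, each = length(vals_x)),
               x = as.vector(table(factor(a1x, levels=vals_x),
                                   factor(a2x, levels=vals_y))),
               y = as.vector(table(factor(a1y, levels=vals_x),
                                   factor(a2y, levels=vals_y))))
```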
**Arguments of twosample_test**

As in the case of continuous data the arguments include *TS*, *TSextra*, *minexpcount*, *rnull*, *maxProcessor* and *doMethods*. In addition we now have

- *x* either a vector of counts for the first variable, or a matrix with columns vals_x, vals_y, x and y.
- *y* vector with counts of second variable.
- *vals_x* possible values of first variable.
- *vals_y* possible values of second variable.
- *samplingmethod="Binomial"* how the permuted data is found. The alternative is "independence", which is slightly more accurate but also much slower.

### Adjusted p values

Again one can find a p value adjusted for simultaneous testing:

```{r}
twosample_test_adjusted_pvalue(x2, B=c(B,B), maxProcessor=1)
```

### User supplied tests

The routine *chiTS.disc* (included in the package) does a chi-square test for discrete data:

```{r}
TSextra=list(which="statistic")
twosample_test(x2, TS=chiTS.disc, TSextra=TSextra, B=B, maxProcessor = 1)
```

### Power Estimation

Again we need a routine that generates data sets. In the discrete case this has to be a routine whose output is a matrix with columns named vals_x, vals_y, x and y:

```{r}
g=function(a=0) {
  x=cbind(rbinom(1000, 5, 0.5), rpois(1000, 1))
  x[,2][x[,2]>5]=5
  lambda=1+a*x[,1]/5
  y=cbind(rbinom(1200, 5, 0.5), rpois(1200, lambda))
  y[,2][y[,2]>5]=5
  vx=0:5
  vy=0:5
  A=matrix(0,length(vx)*length(vy),4)
  k=0
  for(i in seq_along(vx))
    for(j in seq_along(vy)) {
      k=k+1
      A[k,1]=vx[i]
      A[k,2]=vy[j]
      A[k,3]=sum(x[,1]==vx[i] & x[,2]==vy[j])
      A[k,4]=sum(y[,1]==vx[i] & y[,2]==vy[j])
    }
  colnames(A)=c("vals_x", "vals_y", "x", "y")
  A
}
```

```{r}
twosample_power(g, c(0, 0.25, 0.5), B=200, maxProcessor=1)
```

or using a user supplied test:

```{r}
TSextra=list(which="statistic")
twosample_power(g, c(0, 0.25, 0.5), B=200, TS=chiTS.disc,
                TSextra=TSextra, maxProcessor=1)
TSextra=list(which="pvalue")
twosample_power(g, c(0, 0.25, 0.5), B=200, TS=chiTS.disc,
                TSextra=TSextra, With.p.value = TRUE, maxProcessor=1)
```

## Goodness-of-fit/Two-sample hybrid problem

We have a data set x and we want to test whether it comes from a bivariate normal distribution. Instead of a goodness-of-fit test, however, we generate a Monte Carlo data set y from a bivariate normal random vector with mean and covariance estimated from the real data set x, and then run a two-sample test. In this scenario the two data sets are not independent, and the permutation approach to finding p values is extremely conservative, that is, the true type I error probability is much smaller than the nominal one. This in turn makes the power of the test much lower as well. Instead one can supply a routine that generates new data, just as one would in a goodness-of-fit test. Moreover, all the methods that find their own p values will now fail completely and so should not be run.
```{r}
# generate real and MC data sets:
f=function(mu) {
  x=mvtnorm::rmvnorm(100, c(mu, mu))
  y=mvtnorm::rmvnorm(100, apply(x, 2, mean), cor(x))
  list(x=x, y=y)
}
# True data is a mixture of normal and uniform
g=function(alpha=0) {
  x=rbind(mvtnorm::rmvnorm((1-alpha)*100, c(0, 0)),
          matrix(runif(200*alpha), ncol=2))
  y=mvtnorm::rmvnorm(100, apply(x, 2, mean), cor(x))
  list(x=x, y=y)
}
# generate two-sample data set
rnull=function(dta) {
  x=mvtnorm::rmvnorm(nrow(dta$x), apply(dta$x, 2, mean), cor(dta$x))
  y=mvtnorm::rmvnorm(nrow(x), apply(x, 2, mean), cor(x))
  list(x=x, y=y)
}
```

```{r}
# Only run these methods for hybrid problem
mt=c("KS", "K", "CvM", "AD", "NN1", "NN5", "AZ", "BF", "BG")
# Null hypothesis is true:
twosample_power(f, c(0, 1), doMethods = mt, B=200, maxProcessor = 1)
twosample_power(f, c(0, 1), rnull=rnull, B=200, maxProcessor = 1)
# Null hypothesis is false:
twosample_power(g, c(0, 0.5), doMethods = mt, B=200, maxProcessor = 1)
twosample_power(g, c(0, 0.5), rnull=rnull, B=200, maxProcessor = 1)
```

As we can see, the true type I error using the permutation method is much smaller than the nominal $\alpha=0.05$, and so the power is lower as well.

## Benchmarking

The routine *run.studies* allows the user to quickly compare the power of a new test with the power of the included tests, for 25 different combinations of null hypothesis vs alternative for continuous data and 20 for discrete data. It also allows the user to find the power for those case studies for different sample sizes and type I error probabilities. As an example, let's compare the power of the test based on differences in means to the included ones for two of the studies. Note that these examples are not run so the package can be submitted to CRAN:

```{r eval=FALSE}
run.studies(Continuous=TRUE, TS=newTest_1,
            study=c("NormalD2", "tD2"), B=1000)
```

**Arguments of run.studies**

- *Continuous=TRUE* run cases for continuous data.
- *TS* routine to calculate test statistics.
- *study* either the name of the study, or its number. If missing all the studies are run.
- *TSextra* list passed to TS, if any.
- *With.p.value=FALSE* does user supplied routine return p values?
- *nsample=200* desired sample size.
- *alpha=0.05* type I error.
- *param_alt* (list of) values of parameter under the alternative hypothesis. If missing included values are used.
- *SuppressMessages=FALSE* should messages be printed?
- *B=1000* number of simulation runs.

The data set *power_studies_results* included in MD2sample has the results for the included tests using a sample size of 200, a true type I error probability of 0.05 and two values of the parameter under the alternative, one of which makes the null hypothesis true and one of which makes it false. If the user wants different numbers they can run:

```{r eval=FALSE}
run.studies(Continuous=TRUE, study=c("NormalD2", "tD2"),
            param_alt=cbind(c(0.4, 0.4), c(0.7, 0.7)), alpha=0.1, B=100)
```

Say the new method can find p values without simulation. Note that the routine should return a **named** vector with the p values.

```{r}
TSextra=list(which="pvalue", nbins=cbind(c(3,3), c(4,4)))
run.studies(Continuous=TRUE, study=c("NormalD2", "tD2"),
            TS=chiTS.cont, TSextra=TSextra, With.p.value = TRUE,
            B=500, SuppressMessages = TRUE, maxProcessor=1)
```

## Goodness-of-fit/two-sample hybrid problem

Consider the following situation. We have a data set $x$ and want to test whether it comes from some probability distribution $F$, that is, we have a goodness-of-fit problem. However, it is not possible to calculate probabilities from $F$, perhaps because this would require integration in high dimensions. We are, though, able to generate a second data set $y$ under $F$, and so we can run a two-sample test. It can be shown that if the model $F$ is not fully specified but includes parameters that have to be estimated from $x$, finding p values using the permutation method leads to extremely conservative tests, that is, the true type I error probability is much smaller than the desired one. Instead one can use a parametric bootstrap approach, that is, a new data set $y$ is generated in each simulation run.
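To make the idea concrete, here is a minimal sketch of such a parametric bootstrap for a single test statistic; this is an illustration only, not the package's internal implementation, and the names *pboot_pvalue* and *stat* are ours:

```{r eval=FALSE}
# Parametric bootstrap p value: compare the statistic of the observed
# data to statistics of data sets freshly generated under the null.
# Assumes large values of the statistic indicate a deviation from the null.
pboot_pvalue = function(dta, rnull, stat, B=1000) {
  T0 = stat(dta$x, dta$y)
  TB = replicate(B, {sim = rnull(dta); stat(sim$x, sim$y)})
  mean(TB >= T0)
}
```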
### Example 5

Let's say we wish to test whether the data set $x$ comes from a multivariate normal distribution with unknown mean and covariance:

```{r}
f=function(a) {
  x=mvtnorm::rmvnorm(500, c(0.5, 0.5))
  y=rbind(matrix(runif(a*1000), ncol=2),
          mvtnorm::rmvnorm((1-a)*500, c(0.5,0.5)))
  list(x=x, y=y)
}
rnull=function(dta) {
  muhat=apply(dta$x, 2, mean)
  sigmahat=cor(dta$x)
  list(x=mvtnorm::rmvnorm(nrow(dta$x), muhat, sigmahat),
       y=mvtnorm::rmvnorm(nrow(dta$y), muhat, sigmahat))
}
```

```{r}
dta=f(0) # Null hypothesis is true
twosample_test(dta, rnull=rnull, B=B, maxProcessor = 1)
```

```{r}
dta=f(0.2) # Null hypothesis is false
twosample_test(dta, rnull=rnull, B=B, maxProcessor = 1)
```

Again we can find adjusted p values:

```{r eval=FALSE}
twosample_test_adjusted_pvalue(dta, rnull=rnull, B=B, maxProcessor = 1)
```

and find the power of the tests:

```{r eval=FALSE}
twosample_power(f, c(0, 0.5), rnull=rnull, B=B, maxProcessor=1)
```

## Methods included in MD2sample for continuous data

### Methods whose p values are found via simulation

#### Methods based on empirical distribution functions

Denote by $\mathbf{z}$ the combined data set $x_1,..,x_n,y_1,..,y_m$. Let $\hat{F}$ and $\hat{G}$ be the empirical distribution functions of the two data sets, and let $\hat{H}$ be the empirical distribution function of $\mathbf{z}$.

**Kolmogorov-Smirnov test**

This classic test uses the test statistic

$$\max\{\vert \hat{F}(z_i)-\hat{G}(z_i)\vert;i=1,..,n+m\}$$

It was first proposed in [@Kolmogorov1933] and [@Smirnov1939].

**Kuiper's test**

A variant of Kolmogorov-Smirnov:

$$\max\{\hat{F}(z_i)-\hat{G}(z_i);i=1,..,n+m\}-\min\{\hat{F}(z_i)-\hat{G}(z_i);i=1,..,n+m\}$$

This test was first discussed in [@Kuiper1960].

**Cramér-von Mises test**

$$\sum_{i=1}^{n+m} \left(\hat{F}(z_i)-\hat{G}(z_i)\right)^2$$

The extension of the Cramér-von Mises criterion to the two-sample problem is discussed in [@Anderson1962].

**Anderson-Darling test**

$$\sum_{i=1}^{n+m} \frac{\left(\hat{F}(z_i)-\hat{G}(z_i)\right)^2}{\hat{H}(z_i)(1-\hat{H}(z_i))}$$

It was first proposed in [@anderson1952].

#### Methods based on nearest neighbors

The test statistics are the average number of nearest neighbors of the $\mathbf{x}$ data set that are also from $\mathbf{x}$ plus the average number of nearest neighbors of the $\mathbf{y}$ data set that are also from $\mathbf{y}$. *NN1* uses one nearest neighbor and *NN5* uses five.

#### Methods based on distances between observations

We denote by $||\cdot||$ the Euclidean distance.

**Aslan-Zech test**

This test, discussed in [@aslanzech2005], uses the test statistic

$$
\begin{aligned}
&\frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \log(||x_i-y_j||) -\\
&\frac{1}{n^2}\sum_{i=1}^n \sum_{j=i+1}^n \log(||x_i-x_j||) -\\
&\frac{1}{m^2}\sum_{i=1}^m \sum_{j=i+1}^m \log(||y_i-y_j||)
\end{aligned}
$$
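A direct (if slow) computation of this statistic might look as follows; this is an illustration of the formula above only, not the package's internal implementation, and the name *azStat* is ours:

```{r eval=FALSE}
# Aslan-Zech statistic, computed directly from the formula above.
azStat = function(x, y) {
  n = nrow(x); m = nrow(y)
  dxy = as.matrix(dist(rbind(x, y)))[1:n, n + (1:m)] # cross distances
  dxx = as.numeric(dist(x)) # within-x distances, i < j
  dyy = as.numeric(dist(y)) # within-y distances, i < j
  sum(log(dxy))/(n*m) - sum(log(dxx))/n^2 - sum(log(dyy))/m^2
}
```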