---
title: "MDgof-hybrid"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MDgof-hybrid}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(error=TRUE,
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, include=FALSE}
library(MDgof)
B=200 
st=function() knitr::knit_exit()
examples=MDgof::hybrid.mdgof.vignette
```

**Note** the examples are not actually run  but their results are stored in the included data set *hybrid.mdgof.vignette*. This is done  to satisfy CRAN submission rules regarding the execution time of vignettes. If you want to run the Rmarkdown file yourself, set ReRunExamples=TRUE.

```{r}
ReRunExamples=FALSE
```

In a standard goodness-of-fit problem we have a data set $x$ and a probability model, given in form of the cumulative distribution function $F$, and want to test $H_0: X\sim F$, that is the data set was generated by $F$.

There are however circumstances where evaluation of the function $F(x)$ is difficult or even impossible, especially for high-dimensional data. It is however possible to generate data from $F$. Then a test can be run as follows:

1. Generate a Monte Carlo data set $y$ from $F$.  
2. Run a two-sample test n $x$ and $y$.

In what follows we will call this a hybrid test.

One additional "feature" of this approach is that often we are free to choose the sample size of $y$. Generally on would expect the power of the tests to increase.

The companion R package *MD2sample* has a large number of methods already implemented and is used in routines that follow.

## Example 1: Continuous data without parameter estimation 

We generate a two-dimensional data set of size 100 from a multivariate standard normal distribution and run the test with the null hypothesis $H_0:X\sim N(\mathbf{0}, \mathbf{I_2})$:

```{r}
set.seed(123)
```

```{r ex1}
rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
x=rnull()
```

```{r eval=ReRunExamples}
examples[["ex1a"]]=hybrid_test(x, rnull, B=B)
```

```{r}
examples[["ex1a"]]
```

$B$ is the number of simulation runs to estimated the p-value, $B=5000$ is the default but if the data set is large a smaller value should suffice.

**Arguments of hybrid_test for continuous data**

-  *x* a matrix of numbers with different observations in different rows (for continuous data).  
-  *rnull* routine that generates data under the null hypothesis.    
-  *phat*  routine that finds parameter estimates.  
-  *TS* a routine to calculate test statistics or p values for new tests.  
-  *TSextra* additional info passed to TS, if necessary.  
-  *maxProcessor* maximum number of cores to use. If set to 1 no parallel processing is used.  
-  *doMethods="all"* names of method to include.  
-  *B=5000* number of simulation runs.  

## Example 2: Continuous data with parameter estimation 

In this example we will test whether the data comes from  bivariate normal distribution, with means and variance-covariance estimated from the data.

```{r ex2}
rnull=function(p) mvtnorm::rmvnorm(200, 
      mean=p[1:2],
      sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
phat=function(x) 
   c(apply(x, 2, mean), c(apply(x, 2, var), cov(x[,1],x[,2])))
x=rnull(c(1, 2, 4, 9,0.6))
phat(x)
```

```{r eval=ReRunExamples}
examples[["ex2a"]]=hybrid_test(x, rnull, phat, B=B)
```

```{r}
examples[["ex2a"]]
```

In contrast in the next case we will generate data from a multivariate t distribution with 5 degrees of freedom. We also use a larger sample size.

```{r}
y=mvtnorm::rmvt(300, sigma=diag(2), df=5)
```

```{r eval=ReRunExamples}
examples[["ex2b"]]=hybrid_test(y, rnull, phat, B=B)
```

```{r}
examples[["ex2b"]]
```

## New Tests

*hybrid_test* allows the user to supply their own (two-sample) test method, either one that finds its own p-value or a method that requires simulation.

The routine *chiTS.cont* (included in the *MD2sample* package) finds either the test statistic or the p value of a chi square test in two dimensions. First we need to set

```{r}
TSextra=list(which="statistic", nbins=cbind(c(3,3), c(3,4)))
```

this list is passed to *chiTS.cont* and tells the routine to

- return the test statistic (and not the p value)  
- Run two tests, first with 3 and 3 bins in the x and y direction and then 3 and 4 bins. 

## Example 3

In this example we use the same setup as in example 2. 

```{r eval=ReRunExamples}
examples[["ex3a"]]=hybrid_test(x, rnull, phat, 
         TS=MD2sample::chiTS.cont, TSextra=TSextra,  
         B=B)
```

```{r}
examples[["ex3a"]]
```

```{r, eval=ReRunExamples}
examples[["ex3b"]]=hybrid_test(y, rnull, phat, 
         TS=MD2sample::chiTS.cont, TSextra=TSextra,  
         B=B)
```

```{r}
examples[["ex3b"]]
```

**Arguments and output of new test routine for continuous data**

see the vignette *MD2sample*.


### Power Estimation

The routine **hybrid_power** allows the user to estimate the power of the tests via two-sample methods.

### Example 4

Let's say we want to estimate the power when the null hypothesis specifies a standard bivariate normal distribution, but the data actually comes from a bivariate normal distribution with mean vector $(0,0)$, variances equal to 1 and a covariance $\rho$.  We first need a function that generates these data sets:


```{r ex4}
rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
ralt=function(s) mvtnorm::rmvnorm(100, c(0, 0),
                sigma=matrix(c(1, s, s, 1), 2, 2))
```

Now we can run

```{r, eval=ReRunExamples}
examples[["ex4a"]]=hybrid_power(rnull, ralt, 
      c(0, 0.8), B=B, maxProcessor=1)
```

```{r}
examples[["ex4a"]]
```

**Arguments of hybrid_power**

-  *rnull* routine that generates data under the null hypothesis.  
-  *ralt* routine that generates data under the alternative hypothesis.  
-  *param_alt* vector of parameter values passed to ralt.  
-  *phat*  routine that finds parameter estimates.  
-  *TS* a routine to calculate test statistics or p values for new tests.  
-  *TSextra* additional info passed to TS, if necessary.  
-  *With.p.value=FALSE*  does user supplied routine return p values? 
-  *alpha=0.05* desired type I error probability.  
-  *maxProcessor* maximum number of cores to use. If set to 1 no parallel processing is used.  
-  *B=1000* number of simulation runs.  

Again the user can provide their own routine:

```{r}
TSextra=list(which="pvalue", nbins=c(3,4))
```

```{r eval=ReRunExamples}
examples[["ex4b"]]=hybrid_power(rnull, ralt, c(0, 0.8), TS=MD2sample::chiTS.cont, 
          TSextra=TSextra, B=500, 
          With.p.value=TRUE)
```

```{r}
examples[["ex4b"]]
```


## Discrete Data

### Example 5

We consider the case of data from binomial distributions:

```{r ex5}
rnull=function() {
   x=rbinom(1000, 5, 0.5)
   y=rbinom(1000, 3, 0.5+x/20)
   MDgof::sq2rec(table(x, y))
}
x=rnull()
```

```{r eval=ReRunExamples}
examples[["ex5"]]=hybrid_test(x, rnull, B=B)
```

```{r}
examples[["ex5"]]
```

The MDgof routine *sq2rec* above changes the format of the data from that output by the table command to what is required by *hybrid_test*.

The arguments of hybrid_test for discrete data are the same as those for continuous data. The routines determine that the data is discrete if *x* is a matrix with three columns and the the values in the third column are integers.


### User supplied tests

The routine *chiTS.disc* (included in the *MD2sample* package) does a chi-square test for discrete data:


```{r}
TSextra=list(which="statistic")
```

```{r, eval=ReRunExamples}
examples[["ex6c"]]=hybrid_test(x, rnull,  TS=MD2sample::chiTS.disc, TSextra=TSextra,  B=B)
```

```{r}
examples[["ex6c"]]
```


### Power Estimation

Power estimation for discrete data is done the same way as for continuous data:

### Example 7

We consider again the case of data from binomial distributions as in example 5. As an alternative we change the distribution of Y.

```{r ex7}
rnull=function() {
   x=rbinom(1000, 5, 0.5)
   y=rbinom(1000, 3, 0.5+x/20)
   MDgof::sq2rec(table(x, y))
}
ralt=function(p) {
   x=rbinom(1000, 5, p)
   y=rbinom(1000, 3, 0.5+x/20)
   MDgof::sq2rec(table(x, y))
}
x=ralt(0.475)
```

```{r, eval=ReRunExamples}
examples[["ex7b"]]= hybrid_power(rnull, ralt, c(0.5, 0.475), B=200)
```

```{r}
examples[["ex7b"]]
```

or using a user-supplied test that find a p value:

```{r}
TSextra=list(which="pvalue")
```

```{r eval=ReRunExamples}
examples[["ex7d"]]=hybrid_power(rnull, ralt, c(0.5, 0.475), TS=MD2sample::chiTS.disc, TSextra=TSextra,  
         With.p.value = TRUE, B=B)
```

```{r}
examples[["ex7d"]]
```

```{r}
saveRDS(examples, "hybrid.mdgof.vignette.rds")
```