--- title: "MDgof-hybrid" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{MDgof-hybrid} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(error=TRUE, collapse = TRUE, comment = "#>" ) ``` ```{r setup, include=FALSE} library(MDgof) B=200 st=function() knitr::knit_exit() examples=MDgof::hybrid.mdgof.vignette ``` **Note** the examples are not actually run but their results are stored in the included data set *hybrid.mdgof.vignette*. This is done to satisfy CRAN submission rules regarding the execution time of vignettes. If you want to run the Rmarkdown file yourself, set ReRunExamples=TRUE. ```{r} ReRunExamples=FALSE ``` In a standard goodness-of-fit problem we have a data set $x$ and a probability model, given in form of the cumulative distribution function $F$, and want to test $H_0: X\sim F$, that is the data set was generated by $F$. There are however circumstances where evaluation of the function $F(x)$ is difficult or even impossible, especially for high-dimensional data. It is however possible to generate data from $F$. Then a test can be run as follows: 1. Generate a Monte Carlo data set $y$ from $F$. 2. Run a two-sample test n $x$ and $y$. In what follows we will call this a hybrid test. One additional "feature" of this approach is that often we are free to choose the sample size of $y$. Generally on would expect the power of the tests to increase. The companion R package *MD2sample* has a large number of methods already implemented and is used in routines that follow. ## Example 1: Continuous data without parameter estimation We generate a two-dimensional data set of size 100 from a multivariate standard normal distribution and run the test with the null hypothesis $H_0:X\sim N(\mathbf{0}, \mathbf{I_2})$: ```{r} set.seed(123) ``` ```{r ex1} rnull=function() mvtnorm::rmvnorm(100, c(0, 0)) x=rnull() ``` ```{r eval=ReRunExamples} examples[["ex1a"]]=hybrid_test(x, rnull, B=B) ``` ```{r} examples[["ex1a"]] ``` $B$ is the number of simulation runs to estimated the p-value, $B=5000$ is the default but if the data set is large a smaller value should suffice. **Arguments of hybrid_test for continuous data** - *x* a matrix of numbers with different observations in different rows (for continuous data). - *rnull* routine that generates data under the null hypothesis. - *phat* routine that finds parameter estimates. - *TS* a routine to calculate test statistics or p values for new tests. - *TSextra* additional info passed to TS, if necessary. - *maxProcessor* maximum number of cores to use. If set to 1 no parallel processing is used. - *doMethods="all"* names of method to include. - *B=5000* number of simulation runs. ## Example 2: Continuous data with parameter estimation In this example we will test whether the data comes from bivariate normal distribution, with means and variance-covariance estimated from the data. ```{r ex2} rnull=function(p) mvtnorm::rmvnorm(200, mean=p[1:2], sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2)) phat=function(x) c(apply(x, 2, mean), c(apply(x, 2, var), cov(x[,1],x[,2]))) x=rnull(c(1, 2, 4, 9,0.6)) phat(x) ``` ```{r eval=ReRunExamples} examples[["ex2a"]]=hybrid_test(x, rnull, phat, B=B) ``` ```{r} examples[["ex2a"]] ``` In contrast in the next case we will generate data from a multivariate t distribution with 5 degrees of freedom. We also use a larger sample size. ```{r} y=mvtnorm::rmvt(300, sigma=diag(2), df=5) ``` ```{r eval=ReRunExamples} examples[["ex2b"]]=hybrid_test(y, rnull, phat, B=B) ``` ```{r} examples[["ex2b"]] ``` ## New Tests *hybrid_test* allows the user to supply their own (two-sample) test method, either one that finds its own p-value or a method that requires simulation. The routine *chiTS.cont* (included in the *MD2sample* package) finds either the test statistic or the p value of a chi square test in two dimensions. First we need to set ```{r} TSextra=list(which="statistic", nbins=cbind(c(3,3), c(3,4))) ``` this list is passed to *chiTS.cont* and tells the routine to - return the test statistic (and not the p value) - Run two tests, first with 3 and 3 bins in the x and y direction and then 3 and 4 bins. ## Example 3 In this example we use the same setup as in example 2. ```{r eval=ReRunExamples} examples[["ex3a"]]=hybrid_test(x, rnull, phat, TS=MD2sample::chiTS.cont, TSextra=TSextra, B=B) ``` ```{r} examples[["ex3a"]] ``` ```{r, eval=ReRunExamples} examples[["ex3b"]]=hybrid_test(y, rnull, phat, TS=MD2sample::chiTS.cont, TSextra=TSextra, B=B) ``` ```{r} examples[["ex3b"]] ``` **Arguments and output of new test routine for continuous data** see the vignette *MD2sample*. ### Power Estimation The routine **hybrid_power** allows the user to estimate the power of the tests via two-sample methods. ### Example 4 Let's say we want to estimate the power when the null hypothesis specifies a standard bivariate normal distribution, but the data actually comes from a bivariate normal distribution with mean vector $(0,0)$, variances equal to 1 and a covariance $\rho$. We first need a function that generates these data sets: ```{r ex4} rnull=function() mvtnorm::rmvnorm(100, c(0, 0)) ralt=function(s) mvtnorm::rmvnorm(100, c(0, 0), sigma=matrix(c(1, s, s, 1), 2, 2)) ``` Now we can run ```{r, eval=ReRunExamples} examples[["ex4a"]]=hybrid_power(rnull, ralt, c(0, 0.8), B=B, maxProcessor=1) ``` ```{r} examples[["ex4a"]] ``` **Arguments of hybrid_power** - *rnull* routine that generates data under the null hypothesis. - *ralt* routine that generates data under the alternative hypothesis. - *param_alt* vector of parameter values passed to ralt. - *phat* routine that finds parameter estimates. - *TS* a routine to calculate test statistics or p values for new tests. - *TSextra* additional info passed to TS, if necessary. - *With.p.value=FALSE* does user supplied routine return p values? - *alpha=0.05* desired type I error probability. - *maxProcessor* maximum number of cores to use. If set to 1 no parallel processing is used. - *B=1000* number of simulation runs. Again the user can provide their own routine: ```{r} TSextra=list(which="pvalue", nbins=c(3,4)) ``` ```{r eval=ReRunExamples} examples[["ex4b"]]=hybrid_power(rnull, ralt, c(0, 0.8), TS=MD2sample::chiTS.cont, TSextra=TSextra, B=500, With.p.value=TRUE) ``` ```{r} examples[["ex4b"]] ``` ## Discrete Data ### Example 5 We consider the case of data from binomial distributions: ```{r ex5} rnull=function() { x=rbinom(1000, 5, 0.5) y=rbinom(1000, 3, 0.5+x/20) MDgof::sq2rec(table(x, y)) } x=rnull() ``` ```{r eval=ReRunExamples} examples[["ex5"]]=hybrid_test(x, rnull, B=B) ``` ```{r} examples[["ex5"]] ``` The MDgof routine *sq2rec* above changes the format of the data from that output by the table command to what is required by *hybrid_test*. The arguments of hybrid_test for discrete data are the same as those for continuous data. The routines determine that the data is discrete if *x* is a matrix with three columns and the the values in the third column are integers. ### User supplied tests The routine *chiTS.disc* (included in the *MD2sample* package) does a chi-square test for discrete data: ```{r} TSextra=list(which="statistic") ``` ```{r, eval=ReRunExamples} examples[["ex6c"]]=hybrid_test(x, rnull, TS=MD2sample::chiTS.disc, TSextra=TSextra, B=B) ``` ```{r} examples[["ex6c"]] ``` ### Power Estimation Power estimation for discrete data is done the same way as for continuous data: ### Example 7 We consider again the case of data from binomial distributions as in example 5. As an alternative we change the distribution of Y. ```{r ex7} rnull=function() { x=rbinom(1000, 5, 0.5) y=rbinom(1000, 3, 0.5+x/20) MDgof::sq2rec(table(x, y)) } ralt=function(p) { x=rbinom(1000, 5, p) y=rbinom(1000, 3, 0.5+x/20) MDgof::sq2rec(table(x, y)) } x=ralt(0.475) ``` ```{r, eval=ReRunExamples} examples[["ex7b"]]= hybrid_power(rnull, ralt, c(0.5, 0.475), B=200) ``` ```{r} examples[["ex7b"]] ``` or using a user-supplied test that find a p value: ```{r} TSextra=list(which="pvalue") ``` ```{r eval=ReRunExamples} examples[["ex7d"]]=hybrid_power(rnull, ralt, c(0.5, 0.475), TS=MD2sample::chiTS.disc, TSextra=TSextra, With.p.value = TRUE, B=B) ``` ```{r} examples[["ex7d"]] ``` ```{r} saveRDS(examples, "hybrid.mdgof.vignette.rds") ```