--- title: "MDgof-Examples" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{MDgof-Examples} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(error=TRUE, collapse = TRUE, comment = "#>" ) options(tinytex.clean = FALSE) ``` ```{r setup, include=FALSE} library(MDgof) B=200 st=function() knitr::knit_exit() examples=MDgof::examples.mdgof.vignette ``` **Note** the examples are not actually run but their results are stored in the included data set *examples.mdgof.vignette*. This is done to satisfy CRAN submission rules regarding the execution time of vignettes. If you want to run the Rmarkdown file yourself, set ReRunExamples=TRUE. ```{r} ReRunExamples=FALSE ``` This package brings together a number of routines for the goodness-of-fit problem for multivariate data. There is a data set $\mathbf{x}$ and a cumulative distribution function $F$ (possibly depending on parameters that have to be estimated from x), and we want to test $H_0:\mathbf{X}\sim F$. The highlights of this package are: - it runs a large number of tests simultaneously. - it allows for fully specified distributions as well as those whose parameter(s) need to be estimated from the data. - the user can provide their own tests - it works for continuous and for discrete data (in two dimensions only). - p values are found via simulation. - tests can be combined and a single adjusted p value found - it uses Rcpp and parallel programming to be very fast. - it includes routines that make power studies very simple. - it has a routine that runs (up to) 120 different case studies, comparing the power of a new method to those included in the package. This will make it very easy for someone who has developed a new method to compare its performance (aka power) versus those included in the package. For descriptions of the included method see the vignette *MDgof-Methods*. ## Example 1: Continuous data without parameter estimation We generate a two-dimensional data set of size 100 from a multivariate standard normal distribution and run the test with the null hypothesis $H_0:X\sim N(\mathbf{0}, \mathbf{I_2})$: ```{r} set.seed(123) ``` ```{r ex1} # cumulative distribution function under the null hypothesis pnull=function(x) { f=function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x) if(!is.matrix(x)) return(f(x)) apply(x, 1, f) } # function that generates data under the null hypothesis rnull=function() mvtnorm::rmvnorm(100, c(0, 0)) x=rnull() ``` ```{r eval=ReRunExamples} examples[["ex1a"]]=gof_test(x, pnull, rnull, B=B) ``` ```{r} examples[["ex1a"]] ``` $B$ is the number of simulation runs to estimated the p-value, $B=5000$ is the default but if the data set is large a smaller value should suffice. If the density of the distribution under the null hypothesis is also known it can be included and two more tests can be run: ```{r} dnull=function(x) { f=function(x) mvtnorm::dmvnorm(x) if(!is.matrix(x)) return(f(x)) apply(x, 1, f) } ``` ```{r eval=ReRunExamples} examples[["ex1b"]]=gof_test(x, pnull, rnull, dnull=dnull, B=B) ``` ```{r} examples[["ex1b"]] ``` **Arguments of gof_test for continuous data** - *x* a matrix of numbers with different observations in different rows (for continuous data). - *pnull* a routine to calcuate values of the distribution specified in the null hypothesis. - *rnull* routine that generates data under the null hypothesis. - *phat* routine that finds parameter estimates. - *dnull* routine that calculates the density. 
- *TS* a routine to calculate test statistics or p values for new tests.
- *TSextra* additional info passed to TS, if necessary.
- *rate=0* rate of the Poisson distribution if the sample size is random.
- *nbins=c(5,5)* bins for the chi-square tests (2D only).
- *Ranges=matrix(c(-Inf, Inf, -Inf, Inf),2,2)* a 2xd matrix with the lower and upper bounds of the observations. Each column corresponds to one dimension.
- *minexpcount=5* lowest required expected count for the chi-square test.
- *maxProcessor* maximum number of cores to use. If set to 1 no parallel processing is used.
- *doMethods="all"* names of the methods to include.
- *B=5000* number of simulation runs.
- *ReturnTSextra=FALSE* should setup info be returned? Useful for error checking.

## Example 2: Continuous data with parameter estimation

In this example we test whether the data comes from a bivariate normal distribution, with the means and the variance-covariance matrix estimated from the data.

```{r ex2}
pnull=function(x, p) {
  f=function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x, mean=p[1:2],
        sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function(p) mvtnorm::rmvnorm(200, mean=p[1:2],
        sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
dnull=function(x, p) {
  f=function(x) mvtnorm::dmvnorm(x, mean=p[1:2],
        sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
phat=function(x) c(apply(x, 2, mean), c(apply(x, 2, var), cov(x[,1], x[,2])))
x=rnull(c(1, 2, 4, 9, 0.6))
phat(x)
```

```{r eval=ReRunExamples}
examples[["ex2a"]]=gof_test(x, pnull, rnull, phat, dnull=dnull, B=B)
```

```{r}
examples[["ex2a"]]
```

In contrast, in the next case we generate data from a multivariate t distribution with 5 degrees of freedom:

```{r}
y=mvtnorm::rmvt(200, sigma=diag(2), df=5)
```

```{r eval=ReRunExamples}
examples[["ex2b"]]=gof_test(y, pnull, rnull, phat, dnull=dnull, B=B)
```

```{r}
examples[["ex2b"]]
```

## New Tests

*gof_test* allows the user to supply their own test method, either one that finds its own p value or one that requires simulation. As an example, the routine *newTS* (included in the package) finds either the test statistic or the p value of a chi-square test in two dimensions, for either continuous or discrete data.

## Example 3

In this example we use the same setup as in Example 2. *newTS* requires the following object *TSextra* (a list):

```{r ex3}
TSextra=list(Continuous=TRUE, WithEstimation=TRUE, Withpvalue=TRUE)
```

```{r eval=ReRunExamples}
examples[["ex3a"]]=gof_test(x, pnull, rnull, phat, TS=newTS, TSextra=TSextra, B=B)
```

```{r}
examples[["ex3a"]]
```

```{r, eval=ReRunExamples}
examples[["ex3b"]]=gof_test(y, pnull, rnull, phat, TS=newTS, TSextra=TSextra, B=B)
```

```{r}
examples[["ex3b"]]
```

**Arguments and output of a new test routine for continuous data**

The arguments have to be:

- *x* the data set.
- *pnull* the distribution function under the null hypothesis.
- *p* the vector with parameter estimates, or 0 if no parameters are estimated.
- *TSextra* (optional) a list for any additional input needed to find the test statistic.

The output of the routine has to be a **named** vector. If the routine is written in Rcpp, parallel programming is not available.
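As an illustration, here is a minimal sketch of such a routine. It is not part of the package; the name *myTS* and the statistic itself are purely illustrative. It computes a Kolmogorov-Smirnov type distance between the sorted values of the null distribution function and a uniform grid; since *gof_test* finds p values via simulation, any statistic computed in this way yields a valid test:

```{r}
# a minimal sketch of a user-written test routine (illustrative only)
myTS=function(x, pnull, p) {
  # values of the null cdf at the observations; by the convention above,
  # p is 0 when no parameters are estimated
  Fx=if(identical(p, 0)) pnull(x) else pnull(x, p)
  n=length(Fx)
  TS=max(abs(sort(Fx)-(1:n)/n))  # KS-type distance to the grid i/n
  names(TS)="myKS"  # the output has to be a named vector
  TS
}
```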
### Adjusted p values

If several tests are run, one can use the routine *gof_test_adjusted_pvalue* to find a p value that is adjusted for simultaneous testing:

```{r, eval=ReRunExamples}
examples[["ex3c"]]=gof_test_adjusted_pvalue(x, pnull, rnull,
                      dnull=dnull, phat=phat, B=c(B,B))
```

```{r}
a=examples[["ex3c"]]
for(i in seq_along(a)) print(paste(names(a)[i], ": ", round(a[i], 4)))
```

The *BR* test had the smallest p value ($0.0193$), which is adjusted to $0.1817$ to account for the fact that $10$ tests were run simultaneously.

### Power Estimation

The routine **gof_power** allows the user to estimate the power of the tests.

### Example 4

Let's say we want to estimate the power when the null hypothesis specifies a standard bivariate normal distribution, but the data actually comes from a bivariate normal distribution with mean vector $(0,0)$, variances equal to 1 and covariance $\rho$. We first need a function that generates these data sets:

```{r ex4}
pnull=function(x) {
  f=function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
dnull=function(x) {
  f=function(x) mvtnorm::dmvnorm(x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
ralt=function(s) mvtnorm::rmvnorm(100, c(0, 0), sigma=matrix(c(1, s, s, 1), 2, 2))
```

Now we can run:

```{r, eval=ReRunExamples}
examples[["ex4a"]]=gof_power(pnull, rnull, ralt, c(0, 0.8), dnull=dnull,
                      B=B, maxProcessor=1)
```

```{r}
examples[["ex4a"]]
```

**Arguments of gof_power**

- *pnull* a routine to calculate values of the distribution function specified in the null hypothesis.
- *rnull* routine that generates data under the null hypothesis.
- *ralt* routine that generates data under the alternative hypothesis.
- *param_alt* vector of parameter values passed to ralt.
- *phat* routine that finds parameter estimates.
- *dnull* routine that calculates the density.
- *TS* a routine to calculate test statistics or p values for new tests.
- *TSextra* additional info passed to TS, if necessary.
- *With.p.value=FALSE* does the user-supplied routine return p values?
- *alpha=0.05* desired type I error probability.
- *rate=0* rate of the Poisson distribution if the sample size is random.
- *nbins=c(5,5)* bins for the chi-square tests (2D only).
- *minexpcount=5* lowest required expected count for the chi-square test (2D only).
- *Ranges=matrix(c(-Inf, Inf, -Inf, Inf),2,2)* a 2x2 matrix with the lower and upper bounds of the observations (2D only).
- *maxProcessor* maximum number of cores to use. If set to 1 no parallel processing is used.
- *B=1000* number of simulation runs.

Again the user can provide their own routine:

```{r}
TSextra=list(Continuous=TRUE, WithEstimation=FALSE, Withpvalue=TRUE)
```

```{r eval=ReRunExamples}
examples[["ex4b"]]=gof_power(pnull, rnull, ralt, c(0, 0.8), TS=newTS,
                      TSextra=TSextra, B=500, With.p.value=TRUE)
```

```{r}
examples[["ex4b"]]
```

## Discrete Data

First note that tests for discrete data are implemented only in two dimensions. The data set is a matrix with three columns. Each row contains a value of the x variable, a value of the y variable and the corresponding count.
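For instance, data observed on a $2\times 2$ grid, with made-up counts, would be encoded as:

```{r}
# first column: x values, second column: y values, third column: counts
cbind(x=c(0, 1, 0, 1), y=c(0, 0, 1, 1), counts=c(17, 23, 19, 41))
```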
### Example 5

We consider the case of data from binomial distributions:

```{r ex5}
pnull=function(x) {
  f=function(x) sum(dbinom(0:x[1], 5, 0.5)*pbinom(x[2], 3, 0.5+0:x[1]/20))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function() {
  x=rbinom(1000, 5, 0.5)
  y=rbinom(1000, 3, 0.5+x/20)
  MDgof::sq2rec(table(x, y))
}
x=rnull()
x
```

```{r eval=ReRunExamples}
examples[["ex5"]]=gof_test(x, pnull, rnull, B=B)
```

```{r}
examples[["ex5"]]
```

The MDgof routine *sq2rec* above changes the format of the data from the one output by the *table* command to the one required by *gof_test*. The arguments of gof_test for discrete data are the same as those for continuous data. The routines determine that the data is discrete if *x* is a matrix with three columns and the values in the third column are integers.

### Example 6

One use of the discrete data methods arises when the data is actually continuous but the data set is very large, so that the continuous methods take too long to run. In that case one can discretize the data and run a discrete method.

In this example we have one million observations from a bivariate random variable and we want to test whether they come from a model of dependent Beta distributions, with a parameter to be estimated from the data. The null hypothesis specifies the model $X\sim \text{Beta}(a, a)$, $Y|X=x\sim \text{Beta}(x, x)$, with the parameter $a$. Here is an example of data drawn from this model, using $a=2$:

```{r ex6}
rnull_cont=function(p) {
  x=rbeta(250, p, p)
  y=rbeta(250, x, x)
  cbind(x=x, y=y)
}
dta_cont=rnull_cont(2)
ggplot2::ggplot(data=as.data.frame(dta_cont), ggplot2::aes(x,y)) +
  ggplot2::geom_point()
```

Here we have the joint density

$$
\begin{aligned}
&f(x, y) = f_X(x)f_{Y|X=x}(y|x) = \\
&\frac{\Gamma(2a)}{\Gamma(a)^2}[x(1-x)]^{a-1} \frac{\Gamma(2x)}{\Gamma(x)^2}[y(1-y)]^{x-1}
\end{aligned}
$$

First we need an estimator of $a$. We can find the maximum likelihood estimator using the Newton-Raphson algorithm. Up to additive terms that do not involve $a$, the log-likelihood and its derivatives are

$$
\begin{aligned}
&l(a) = n\left[\log \Gamma(2a)-2\log \Gamma(a)+\frac{a}{n}\sum\log [x_i(1-x_i)]\right]\\
&\frac{dl}{da}=2n\left[di(2a)-di(a)+\frac{1}{2n}\sum\log [x_i(1-x_i)]\right]\\
&\frac{d^2l}{da^2}=2n\left[2\, tri(2a)-tri(a)\right]
\end{aligned}
$$

where $di$ and $tri$ are the digamma and trigamma functions, the first and second derivatives of the log-gamma function. Here is an implementation:

```{r}
phat_cont=function(x) {
  n=nrow(x)
  sx=sum( log(x[,1]*(1-x[,1])) )/2/n
  num=function(a) digamma(2*a)-digamma(a)+sx     # dl/da, divided by 2n
  denom=function(a) 2*trigamma(2*a)-trigamma(a)  # d2l/da2, divided by 2n
  anew=2
  repeat {
    aold=anew
    anew=aold-num(aold)/denom(aold)  # Newton-Raphson step
    if(abs(aold-anew)<0.001) break
  }
  anew
}
phat_cont(dta_cont)
```

For the function $pnull$ we make use of the following general calculation:

$$
\begin{aligned}
&F(x,y) = P(X\le x;Y\le y) =\\
&\int_{-\infty}^x \int_{-\infty}^y f(u,v)\, dvdu = \\
&\int_{-\infty}^x \int_{-\infty}^y f_X(u)f_{Y|X=u}(v|u)\, dvdu = \\
&\int_{-\infty}^x f_X(u) \left\{\int_{-\infty}^y f_{Y|X=u}(v|u)\, dv\right\}du = \\
&\int_{-\infty}^x f_X(u)F_{Y|X=u}(y|u)\, du
\end{aligned}
$$

which is implemented in

```{r}
pnull=function(x, p) {
  f=function(x) {
    g=function(u) dbeta(u, p, p)*pbeta(x[2], u, u)
    integrate(g, 0, x[1])$value
  }
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
```

and with this we can run the test:

```{r, eval=ReRunExamples}
examples[["ex6a"]]=gof_test(dta_cont, pnull, rnull_cont, phat=phat_cont,
        Ranges=matrix(c(0, 1, 0, 1), 2, 2), maxProcessor = 1, B=B)
```

```{r}
examples[["ex6a"]]
```

Now, though, let's assume that the data set has one million observations. In that case the routine above will be much too slow.
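To see why, note that *pnull* above calls *integrate* once for every row of its argument, so its cost grows linearly with the number of observations, and the simulation then repeats this for every one of the $B$ runs. A quick, machine-dependent check (not run here):

```{r, eval=FALSE}
# each row triggers one call to integrate(); with one million rows this
# amounts to one million numerical integrations per evaluation of pnull
system.time(pnull(dta_cont, 2))                   # 250 rows
system.time(pnull(dta_cont[rep(1:250, 4), ], 2))  # 1000 rows, roughly 4 times slower
```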
Instead we can discretize the problem. In order to make this work we have to discretize all the routines except for *pnull*, as the cumulative distribution function is the same for discrete and continuous random variables.

Let's say we decide to discretize the data into a 50x30 grid of equal-size bins. As in Example 5 we can generate the data using the binomial distribution. However, we also have to find the bin probabilities, essentially the density of the discretized random vector. This is done in:

```{r}
rnull_disc=function(p) {
  pnull=function(x, p) {
    f=function(x) {
      g=function(u) dbeta(u, p, p)*pbeta(x[2], u, u)
      integrate(g, 0, x[1])$value
    }
    if(!is.matrix(x)) return(f(x))
    apply(x, 1, f)
  }
  nb=c(50, 30)
  xvals=1:nb[1]/nb[1]
  yvals=1:nb[2]/nb[2]
  z=cbind(rep(xvals, nb[2]), rep(yvals, each=nb[1]), 0)
  prob=c(t(MDgof::p2dC(z, pnull, p)))
  z[, 3]=rbinom(prod(nb), 1e6, prob)
  z
}
dta_disc=rnull_disc(2)
dta_disc[1:10, ]
```

The MDgof routine *p2dC* finds the probabilities of the rectangles defined by the grid from the distribution function. Next we need to rewrite the routine that finds the estimator, now based on the discrete data:

```{r}
phat_disc=function(x) {
  nb=c(50, 30)
  n=sum(x[,3])  # sample size
  valsx=unique(x[,1])-1/nb[1]/2  # x values, centers of the bins
  cntx=tapply(x[,3], x[,1], sum)  # counts for x alone
  sx=sum(cntx*log(valsx*(1-valsx)))/2/n
  num=function(a) digamma(2*a)-digamma(a)+sx
  denom=function(a) 2*trigamma(2*a)-trigamma(a)
  anew=2
  repeat {
    aold=anew
    anew=aold-num(aold)/denom(aold)
    if(abs(aold-anew)<0.00001) break
  }
  anew
}
p=phat_disc(dta_disc)
p
```

We want to apply this test to the original continuous data set, so we need to discretize it as well. This can be done with the MDgof routine *discretize*. Then we can run the test:

```{r disc}
set.seed(111)
x=rbeta(1e6, 2, 2)
y=rbeta(1e6, x, x)
dta_cont=cbind(x=x, y=y)
dta_disc=MDgof::discretize(dta_cont, Range=matrix(c(0,1,0,1),2,2), nbins=c(50, 30))
head(dta_disc)
```

```{r A, eval=ReRunExamples}
examples[["ex6b"]]=gof_test(dta_disc, pnull, rnull_disc, phat=phat_disc, B=B)
```

```{r}
examples[["ex6b"]]
```

### User supplied tests

The routine *newTS* (included in the package) does a chi-square test for discrete data:

```{r}
TSextra=list(Continuous=FALSE, WithEstimation=TRUE, Withpvalue=FALSE)
```

```{r, eval=ReRunExamples}
examples[["ex6c"]]=gof_test(dta_disc, pnull, rnull_disc, phat_disc,
                      TS=newTS, TSextra=TSextra, B=B)
```

```{r}
examples[["ex6c"]]
```

### Power Estimation

Power estimation for discrete data is done the same way as for continuous data.

### Example 7

We consider again the case of data from binomial distributions as in Example 5. As an alternative we change the success probability of X.
```{r ex7}
pnull=function(x) {
  f=function(x) sum(dbinom(0:x[1], 5, 0.5)*pbinom(x[2], 3, 0.5+0:x[1]/20))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function() {
  x=rbinom(1000, 5, 0.5)
  y=rbinom(1000, 3, 0.5+x/20)
  MDgof::sq2rec(table(x, y))
}
ralt=function(p) {
  x=rbinom(1000, 5, p)
  y=rbinom(1000, 3, 0.5+x/20)
  MDgof::sq2rec(table(x, y))
}
x=ralt(0.475)
```

```{r, eval=ReRunExamples}
examples[["ex7a"]]=gof_test(x, pnull, rnull, B=B)
```

```{r}
examples[["ex7a"]]
```

```{r, eval=ReRunExamples}
examples[["ex7b"]]=gof_power(pnull, rnull, ralt, c(0.5, 0.475), B=200)
```

```{r}
examples[["ex7b"]]
```

or using a user-supplied test:

```{r}
TSextra=list(Continuous=FALSE, WithEstimation=FALSE, Withpvalue=TRUE)
```

```{r, eval=ReRunExamples}
examples[["ex7c"]]=gof_test(x, pnull, rnull, TS=newTS, TSextra=TSextra, B=B)
```

```{r}
examples[["ex7c"]]
```

```{r eval=ReRunExamples}
examples[["ex7d"]]=gof_power(pnull, rnull, ralt, c(0.5, 0.475), TS=newTS,
                      TSextra=TSextra, With.p.value = TRUE, B=B)
```

```{r}
examples[["ex7d"]]
```

## Benchmarking

The routine *run.studies* allows the user to quickly compare the power of a new test with the power of the included tests for 30 different combinations of null hypothesis and alternative, for continuous or discrete data, with or without parameter estimation. It also allows the user to find the power for those case studies for different sample sizes and type I error probabilities.

Let's say we want to determine whether method A or method B has better power for a specific combination of null hypothesis and alternative. If we find the powers of the tests for (say) $n=100, \alpha=0.05$ and method A turns out to be better, A will also be better for any other combination of $n$ and $\alpha$. For this reason checking a single combination suffices to determine the ranking of the methods (for this one case).

### Example 8

Say we want to compare the power of *newTS* for continuous data with parameter estimation with the powers of the included tests:

```{r ex8}
TSextra=list(Continuous=TRUE, WithEstimation=TRUE, Withpvalue=TRUE)
```

```{r eval=ReRunExamples}
a=run.studies(study=1:2, # do first two
              Continuous=TRUE,
              WithEstimation = TRUE,
              TS=newTS,
              TSextra=TSextra,
              With.p.value=TRUE,
              B=B)
```

**Arguments of run.studies**

- *study* either the name of the study or its number. If missing, all the studies are run.
- *Continuous=TRUE* run cases for continuous data.
- *WithEstimation=FALSE* run cases with parameter estimation.
- *Dim=2* for continuous data without parameter estimation there are case studies in 2 or 5 dimensions.
- *TS* routine to calculate test statistics.
- *TSextra* list passed to TS, if any.
- *With.p.value=FALSE* does the user-supplied routine return p values?
- *nsample=250* desired sample size.
- *alpha=0.05* desired type I error probability.
- *param_alt* (list of) values of the parameter under the alternative hypothesis. If missing, the included values are used.
- *SuppressMessages=FALSE* should the printing of messages be suppressed?
- *B = 1000* number of simulation runs.
- *maxProcessor* maximum number of cores to use.

MDgof includes five data sets with the results for the included tests, using a sample size of 250, a true type I error probability of 0.05 and two values of the parameter under the alternative, one which makes the null hypothesis true and one which makes it false. They are:

- *power_studies_cont_results* for continuous data without parameter estimation, two-dimensional data.
- *power_studies_cont_D5_results* for continuous data without parameter estimation, five-dimensional data.
- *power_studies_cont_est_results* for continuous data with parameter estimation.
- *power_studies_disc_results* for discrete data without parameter estimation.
- *power_studies_disc_est_results* for discrete data with parameter estimation.

If the user wants different settings, they can run:

```{r, eval=ReRunExamples}
examples[["ex8"]]=run.studies(1:2,
              Continuous=TRUE,
              WithEstimation = FALSE,
              param_alt=cbind(c(0, 0.4), c(0, 0.4)),
              alpha=0.1,
              nsample=500,
              B=B)
```

```{r}
examples[["ex8"]][["Null"]][1:2, ]
examples[["ex8"]][["Pow"]][1:2, ]
```
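Judging by the names and the description above, the elements *Null* and *Pow* of the output hold the estimated rejection rates for the parameter value that makes the null hypothesis true (these should be close to $\alpha$) and for the one that makes it false, respectively. Assuming *Pow* is a studies-by-methods matrix, as the indexing above suggests, one might rank the methods within each study like this:

```{r, eval=FALSE}
# rank methods by estimated power within each study (rank 1 = most powerful);
# assumes one row per study and one column per method
Pow=examples[["ex8"]][["Pow"]]
t(apply(-Pow, 1, rank))
```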