--- title: "R2sample" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{R2sample} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: [R2sample.bib] --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(R2sample) ``` This package brings together a number of routines for the two sample problem for univariate data. There are two data sets x and y and we want to test whether they were generated by the same probability distribution. The highlights of this package are: - it runs a large number of tests simultaneously. - the user can provide their own tests - it works for continuous and for discrete data. - it uses a permutation test to find p values (except for chi square tests). - tests can be combined and a single adjusted p value found - for large continuous data sets p values can be found via large sample theory. - it uses Rcpp and parallel programming to be very fast. - it includes routines that make power studies very simple. ## Example 1 We generate two data sets of size 100 and 120 respectively from standard normal distributions and run the test: ```{r} set.seed(123) ``` **Note** all examples run with arguments *B=500, maxProcessor = 2* in order to pass *devtools::check()* ```{r} x1 = rnorm(10) y1 = rnorm(12) twosample_test(x1, y1, B=500, maxProcessor = 2) ``` In this case the the null hypothesis is true, both data sets were generated by a standard normal distribution. And indeed, all 13 tests have p values much larger than (say) 0.05 and therefore correctly fail to reject the null hypothesis. Alternatively one might wish to carry out a select number of tests, but then find a correct p value for the combined test. This can be done as follows: ```{r} twosample_test_adjusted_pvalue(x1, y1, B=c(500,500)) ``` By default for continuous data this runs four tests (chi square with a small number of bins, ZA, ZK and Wassp1). It the finds the smallest p value (here 0.02) and adjusts it for the multiple testing problem, resulting in an overall p value of 0.05. ### New Tests Say we wish to use two new tests. One based on the difference in standardized means and the based on the difference in standard deviations: $$ \begin{aligned} &TS_1 = sd(x) - sd(y) \\ &TS_2 = \bar{x}/sd(x) - \bar{y}/sd(y) \\ \end{aligned} $$ ```{r} myTS1 = function(x, y) { out = c(0, 0) out[1] = abs(sd(x) - sd(y)) out[2] = abs(mean(x)/sd(x) - mean(y)/sd(y)) names(out) = c("std", "std t test") out } ``` Then we can simply run ```{r} twosample_test(x1, y1, TS=myTS1, B=500, maxProcessor = 2) ``` Note that the output vector of the routine has to be a **named** vector. If the routine is written in Rcpp parallel programming is not available. If needed one can also supply the argument *TSextra*. This has to be a list, which will be passed directly to TS. In this way one can add items that might be needed for calculation of the test statistics. ## Example 2 Here we generate two data sets of size 1000 and 1200 respectively from a binomial distribution with 5 trials and success probabilities of 0.5 and 0.55, respectively. ```{r} x2 = table(c(0:5,rbinom(1000, 5, 0.5)))-1 y2 = table(c(0:5,rbinom(1200, 5, 0.55)))-1 rbind(x2, y2) twosample_test(x2, y2, vals=0:5, B=500, maxProcessor = 2)$p.values ``` In this case the the null hypothesis is false, all nine tests for discrete data have p values much smaller than (say) 0.05 and therefore correctly reject the null hypothesis. Notice, it is the presence of the *vals* argument that tells the *twosample_test* command that the data is discrete. The vectors x and y are the counts. Note that the lengths of the three vectors have to be the same and no value of vals is allowed that has both x and y equal to 0. ### New Test Again we might want to run a different test, say based again on the difference in standardized means. This time we will implement the new test using Rcpp: ```{Rcpp} #include using namespace Rcpp; // [[Rcpp::export]] NumericVector myTS2(IntegerVector x, IntegerVector y, NumericVector vals) { Rcpp::CharacterVector methods=CharacterVector::create("std t test"); int const nummethods=methods.size(); int k=x.size(), n=sum(x), i; double meanx=0.0, meany=0.0, sdx=0.0, sdy=0.0; NumericVector TS(nummethods); TS.names() = methods; for(i=0;i1 parallel computing is used. - **doMethods** a character vector giving the names of the methods to include. Say for example we want to use only the KS and AD tests: ```{r} x=rnorm(10) y=rnorm(12) twosample_test(x, y, B=500, maxProcessor = 2, doMethods=c("KS","AD")) ``` ### The Output A list with vectors statistics (the test statistics) and p.values. ### *shiny app* The tests can also be run via a shiny app. To do so run ```{r eval=FALSE} run_shiny() ``` ### **twosample_power** and **plot_power** The routine *twosample_power* allows the estimation of the power of the various tests, and *plot_power* draws the corresponding power graph with the methods sorted by their mean power. Note that due to the use of permutation tests the power calculations are fairly slow. Because of the time constraints imposed by CRAN the following routines are not run. A full power study with (say) 25 alternatives and fairly large data sets might take several hours to run. New tests can be used in the same way as above. ### Example 3 Say we wish to find the powers of the tests when one data set comes from a standard normal distribution with sample size 100 and the other from a t distribution with 1-10 degrees of freedom and sample size 200. ```{r eval=FALSE} plot_power(twosample_power( f=function(df) list(x=rnorm(100), y=rt(200, df)), df=1:10) ) ``` The arguments of **twosample_power** are - **f** A function that generates data sets x and y, either continuous or the counts of discrete variables. In the discrete case care needs to be taken that the resulting vectors have the same length as *vals*. The output has to be a list with elements x and y. In the case of discrete data the list also has to include a vector **vals**, the values of the discrete variable. - **...** arguments passed to f. The most common case would be a single vector, but two vectors or none is also possible. - **alpha=0.05** type I error probability to use in power calculation - **B=c(1000, 1000)** number of simulation runs for power estimation and estimation of p values. - **TS** a routine to calculate a test statistic. If missing built-in tests are used. - **TSextra** a list to be passed to TS. - **nbins=c(100,10)** Number of bins to use in chi square tests - **maxProcessor** if >1 parallel programming is used The arguments of **plot_power** are a matrix of powers created by **twosample_power** and (optional) Smooth=FALSE if no smoothing is desired and a name of the variable on the x axis. ### Example 4 We wish to find the powers of the tests when the data set x are 100 observations comes from a binomial n=10, p=0.5 and y 120 observations from from a binomial n=10 and a number of different success probabilities p. Note that in either case not all the possible values 0-10 will actually appear in every data set, so we need to take some care in assuring that x, y and vals match every time: ```{r eval=FALSE} plot_power(twosample_power( function(p) { vals=0:10 x=table(c(vals, rbinom(100, 10, 0.5))) - 1 #Make sure each vector has length 11 y=table(c(vals, rbinom(120, 10, p))) - 1 #and names 1-10 vals=vals[x+y>0] # only vals with at least one count list(x=x[as.character(vals)],y=y[as.character(vals)], vals=vals) }, p=seq(0.5, 0.6, length=5) ), "p") ``` ### Example 5 We want to assess the effect of the sample size when one data set comes from a standard normal and the other from a normal with mean 1 and standard deviation 1: ```{r eval=FALSE} plot_power(twosample_power( f=function(n) list(x=rnorm(n), y=rnorm(n, 1)), n=10*1:10), "n") ``` ### Adjusted p values for Several Tests As no single test can be relied upon to consistently have good power, it is reasonable to employ several of them. We would then reject the null hypothesis if any of the tests does so, that is, if the smallest p-value is less than the desired type I error probability $\alpha$. This procedure clearly suffers from the problem of simultaneous inference, and the true type I error probability will be much larger than $\alpha$. It is however possible to adjust the p value so it does achieve the desired $\alpha$. This can be done as follows: We generate a number of data sets under the null hypothesis. Generally about 1000 will be sufficient. Then for each simulated data set we apply the tests we wish to include, and record the smallest p value. Here is an example. . ```{r eval=FALSE} pvals=matrix(0,1000,13) for(i in 1:1000) pvals[i, ]=R2sample::twosample_test(runif(100), runif(100), B=1000)$p.values ``` Next we find the smallest p value in each run for two selections of four methods. One is the selection found to be best above, namely Zhang's ZC and ZK methods, the methods by Kuiper and Wasserstein as well as a chi square test with a small number of bins. As a second selection we use the methods by Kolmogorov-Smirnov, Kuiper, Anderson-Darling, Cramer-vonMises and Lehman-Rosenblatt, which for this null data turn out to be highly correlated. ```{r eval=FALSE} colnames(pvals)=names(R2sample::twosample_test(runif(100), runif(100), B=1000)$p.values) p1=apply(pvals[, c("ZK", "ZC", "Wassp1", "Kuiper", "ES small" )], 1, min) p2=apply(pvals[, c("KS", "Kuiper", "AD", "CvM", "LR")], 1, min) ``` Next we find the empirical distribution function for the two sets of p values and draw their graphs. We also add the curves for the cases of four identical tests and the case of four independent tests, which of course is the Bonferroni correction. The data for the cdfs is in the inst/extdata directory of the package. ```{r} tmp=readRDS("../inst/extdata/pvaluecdf.rds") Tests=factor(c(rep("Identical Tests", nrow(tmp)), rep("Correlated Selection", nrow(tmp)), rep("Best Selection", nrow(tmp)), rep("Independent Tests", nrow(tmp))), levels=c("Identical Tests", "Correlated Selection", "Best Selection", "Independent Tests"), ordered = TRUE) dta=data.frame(x=c(tmp[,1],tmp[,1],tmp[,1],tmp[,1]), y=c(tmp[,1],tmp[,3],tmp[,2],1-(1-tmp[,1])^4), Tests=Tests) ggplot2::ggplot(data=dta, ggplot2::aes(x=x,y=y,col=Tests))+ ggplot2::geom_line(linewidth=1.2)+ ggplot2::labs(x="p value", y="CDF")+ ggplot2::scale_color_manual(values=c("blue","red", "Orange", "green")) ``` # References