Type: | Package |
Title: | Testing Two-Sample Mean in High Dimension |
Version: | 0.1.0 |
Author: | Huaiyu Zhang, Haiyan Wang |
Maintainer: | Huaiyu Zhang <huaiyuzhang1988@gmail.com> |
Description: | Implements the high-dimensional two-sample test proposed by Zhang (2019) <http://hdl.handle.net/2097/40235>. It also implements the test proposed by Srivastava, Katayama, and Kano (2013) <doi:10.1016/j.jmva.2012.08.014>. These tests are particularly suitable for high-dimensional data from two populations, for which the classical multivariate Hotelling's T-square test fails because the sample sizes are smaller than the dimensionality. In this case, the ZWL and ZWLm tests proposed by Zhang (2019) <http://hdl.handle.net/2097/40235>, available through zwl_test() in this package, provide a reliable and powerful test. |
License: | GPL-2 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.1.0 |
Depends: | R (≥ 3.1.0) |
Imports: | stats |
NeedsCompilation: | no |
Packaged: | 2020-06-08 15:35:11 UTC; huaiyu |
Repository: | CRAN |
Date/Publication: | 2020-06-12 10:30:08 UTC |
An example of GO term data
Description
A dataset containing the gene expressions for a Gene Ontology (GO) term
on two phenotype groups: BCR/ABL and NEG.
The id of the GO term is GO:0000003.
The raw dataset is taken from the ALL package.
The data were preprocessed; the details are given in Zhang and Wang (2020).
Usage
GO_example
Format
A list with two subsets of gene expression data.
- X
A matrix containing gene expressions for the BCR/ABL group. Rows correspond to patients and columns to genes.
- Y
A matrix containing gene expressions for the NEG group. Rows correspond to patients and columns to genes.
References
Zhang, H. and Wang, H. (2020). Result consistency of high dimensional two-sample tests applied to gene ontology terms with gene sets. Manuscript in review.
Apply the SKK test to multiple simulated two-sample datasets
Description
This function performs the SKK test of Srivastava, Katayama, and Kano (2013) on multiple high-dimensional two-sample datasets. It is useful for Monte Carlo experiments.
Usage
SKK_sim(DATA)
Arguments
DATA |
The list of dataset lists generated by buildData(). |
Value
A data frame in which each row reports the value of the SKK test statistic and the p-value for one dataset.
References
Srivastava, M. S., Katayama, S., and Kano, Y. (2013). A two sample test in high dimensional data. Journal of Multivariate Analysis, 114:349-358.
Examples
# Generate 3 simulated datasets and apply the SKK test
data <- buildData(n = 45, m = 60, p = 300,
                  muX = rep(0, 300), muY = rep(0, 300),
                  dep = 'IND', S = 3, innov = rnorm)
SKK_sim(data)
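The Monte Carlo pattern described here (apply one test to each simulated dataset and collect one row per dataset) can be sketched in base R. The helper names `apply_test_to_list` and `toy_test` are hypothetical and are not part of the package. `toy_test` is a deliberately simple stand-in and does not compute the SKK statistic.

```r
# Hypothetical stand-in test: NOT the SKK statistic. It compares the two
# samples through the per-observation averages with a Welch t-test.
toy_test <- function(X, Y) {
  tt <- stats::t.test(rowMeans(X), rowMeans(Y))
  list(statistic = unname(tt$statistic), pvalue = tt$p.value)
}

# Apply a two-sample test to every dataset in a list and bind the
# results into a data frame with one row per dataset.
apply_test_to_list <- function(DATA, test_fun) {
  rows <- lapply(DATA, function(d) as.data.frame(test_fun(d$X, d$Y)))
  do.call(rbind, rows)
}

set.seed(1)
# Three toy datasets in a list-of-lists layout (elements X and Y).
toy_data <- lapply(1:3, function(s) {
  list(X = matrix(rnorm(45 * 20), nrow = 45),
       Y = matrix(rnorm(60 * 20), nrow = 60))
})

res <- apply_test_to_list(toy_data, toy_test)
res
# Empirical rejection rate at the 5% level across the simulated datasets.
mean(res$pvalue < 0.05)
```

In a real experiment the number of datasets would be much larger, so that the rejection rate estimates the size or power of the test with reasonable precision.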
High-dimensional two-sample test (SKK) proposed by Srivastava, Katayama, and Kano (2013)
Description
This function implements the two-sample high-dimensional test proposed by Srivastava, Katayama, and Kano (2013).
Usage
SKK_test(X, Y)
Arguments
X |
The data matrix (n by p) from the first population. |
Y |
The data matrix (m by p) from the second population. |
Value
A list consisting of the values of the test statistic and p-value.
References
Srivastava, M. S., Katayama, S., and Kano, Y. (2013). A two sample test in high dimensional data. Journal of Multivariate Analysis, 114:349-358.
Examples
# Generate a simulated dataset and apply the SKK test
data <- buildData(n = 45, m = 60, p = 300,
                  muX = rep(0, 300), muY = rep(0, 300),
                  dep = 'IND', S = 1, innov = rnorm)
SKK_test(data[[1]]$X, data[[1]]$Y)
# Apply the SKK test to the data for a GO term stored in GO_example
SKK_test(GO_example$X, GO_example$Y)
Two-sample datasets generator
Description
This function generates simulated high-dimensional two-sample data from user-specified populations with given mean vectors, covariance structure, sample sizes, and dimension of each observation. It can generate the long-range dependent process proposed by Hall et al. (1998) in addition to some processes provided by arima.sim().
Usage
buildData(
n,
m,
p,
muX,
muY,
dep,
commoncov = TRUE,
VarScaleY = 1,
S = 1,
innov = function(n, ...) stats::rnorm(n, 0, 1),
heteroscedastic = FALSE,
het.diag
)
Arguments
n |
number of observations in the 1st sample. |
m |
number of observations in the 2nd sample. |
p |
the dimensionality of each observation. The samples from both populations should have the same dimension. |
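muX |
the mean vector (of length p) of the 1st population. |
muY |
the mean vector (of length p) of the 2nd population. |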
dep |
dependence structure among the components of each observation: 'IND' for independence; 'SD' for strong dependency, AR(1) with parameter 0.9; 'WD' for weak dependency, ARMA(2, 2) with AR parameters 0.4 and -0.1, and MA parameters 0.2 and 0.3; 'LR' for long-range dependency with parameter 0.7. For more details about the configurations, please refer to Zhang and Wang (2020). |
commoncov |
a logical indicating whether the two populations have equal covariance matrices. If FALSE, the innovations used in generating data for the 2nd population will be scaled by the square root of the value specified in VarScaleY. |
VarScaleY |
constant by which innovations are scaled in generating observations for the 2nd sample when commoncov=FALSE. |
S |
the number of data sets to simulate. |
innov |
a function used to generate the innovations, such as rnorm. |
heteroscedastic |
a logical indicating whether the components will be scaled by the entries in the diagonal matrix specified by het.diag. |
het.diag |
a diagonal matrix whose entries scale the components when heteroscedastic = TRUE. |
Value
A list of S lists, each consisting of an n by p matrix X, an m by p matrix Y, the sample sizes, n and m, for each population, and the dimensionality p.
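The layout of the returned value can be illustrated with a small base R sketch for the independent-component case. The helper `simulate_ind` is hypothetical and is not the package's buildData(); the element names n, m, and p are an assumption based on the description above.

```r
# Hypothetical helper, not the package's buildData(): simulates S
# independent-component datasets in the layout described above.
simulate_ind <- function(n, m, p, muX, muY, S = 1, innov = stats::rnorm) {
  lapply(seq_len(S), function(s) {
    # Each row is one observation: innovations plus the mean vector.
    X <- matrix(innov(n * p), nrow = n, ncol = p) +
      matrix(muX, nrow = n, ncol = p, byrow = TRUE)
    Y <- matrix(innov(m * p), nrow = m, ncol = p) +
      matrix(muY, nrow = m, ncol = p, byrow = TRUE)
    # Element names n, m, p are assumed for illustration.
    list(X = X, Y = Y, n = n, m = m, p = p)
  })
}

set.seed(2)
sims <- simulate_ind(n = 45, m = 60, p = 300,
                     muX = rep(0, 300), muY = rep(0, 300), S = 3)
length(sims)        # number of simulated datasets
dim(sims[[1]]$X)    # n by p
dim(sims[[1]]$Y)    # m by p
```

Each dataset is then addressed as `sims[[s]]$X` and `sims[[s]]$Y`, which matches how the examples in this manual index the output of buildData().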
References
Hall, P., Jing, B.-Y., and Lahiri, S. N. (1998). On the sampling window method for long-range dependent data. Statistica Sinica, 8(4):1189-1204.
Examples
# Generate 3 two-sample datasets of dimensionality 300
# with sample sizes 45 for one sample & 60 for the other.
buildData(n = 45, m = 60, p = 300,
          muX = rep(0, 300), muY = rep(0, 300),
          dep = 'IND', S = 3, innov = rnorm)
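The 'SD' and 'WD' dependence settings and the variance scaling of the second sample can be sketched with stats::arima.sim(). This is a minimal illustration under stated assumptions, not the package's implementation: the helper `simulate_dep_sample` and its `scale` argument are hypothetical, and details such as burn-in or the long-range ('LR') process are not reproduced.

```r
# Hypothetical helper: one sample whose rows are independent series of
# length p, so that dependence runs across the p components.
simulate_dep_sample <- function(n, p, mu, model, scale = 1,
                                innov = stats::rnorm) {
  rows <- replicate(n, {
    # rand.gen supplies the innovations; scaling them by sqrt(scale)
    # multiplies every variance and covariance by `scale`.
    z <- stats::arima.sim(model = model, n = p,
                          rand.gen = function(k, ...) sqrt(scale) * innov(k))
    as.numeric(z) + mu
  })
  t(rows)  # replicate() returns p by n, so transpose to n by p
}

set.seed(3)
p <- 50
ar1 <- list(ar = 0.9)                                 # 'SD' setting
arma22 <- list(ar = c(0.4, -0.1), ma = c(0.2, 0.3))   # 'WD' setting

X <- simulate_dep_sample(n = 45, p = p, mu = rep(0, p), model = ar1)
# Second sample with innovation variance doubled (unequal covariances).
Y <- simulate_dep_sample(n = 60, p = p, mu = rep(0, p), model = ar1, scale = 2)
W <- simulate_dep_sample(n = 45, p = p, mu = rep(0, p), model = arma22)

dim(X); dim(Y); dim(W)
# Neighbouring components are strongly correlated under AR(1) with 0.9.
cor(X[, 1], X[, 2])
```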
highDmean: A package for testing of equal mean for two-sample high dimensional data
Description
This package is an implementation of the high-dimensional two-sample test proposed by Zhang and Wang (2020) "Result consistency of high dimensional two-sample tests applied to gene ontology terms with gene sets". It also implements the SKK test proposed by Srivastava, Katayama, and Kano (2013) "A two sample test in high dimensional data." These tests are particularly suitable for high dimensional data from two populations for which the classical multivariate Hotelling's T-square test fails due to sample sizes smaller than dimensionality. In this case, the ZWL and ZWLm tests proposed by Zhang and Wang (2020), referred to as zwl_test() in this package, provide a reliable and powerful test.
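The failure of Hotelling's T-square test when the dimension exceeds the sample sizes can be checked directly: the pooled sample covariance matrix has rank at most n + m - 2, so it cannot be inverted when p is larger. The sketch below uses toy data and base R only.

```r
set.seed(4)
n <- 10; m <- 12; p <- 50   # dimension exceeds the total sample size
X <- matrix(rnorm(n * p), nrow = n)
Y <- matrix(rnorm(m * p), nrow = m)

# Pooled sample covariance matrix used by Hotelling's T-square.
S_pooled <- ((n - 1) * cov(X) + (m - 1) * cov(Y)) / (n + m - 2)

# Its rank is at most n + m - 2, which is smaller than p here, so the
# matrix is singular and the inverse needed by T-square does not exist.
rank_S <- qr(S_pooled)$rank
c(rank = rank_S, bound = n + m - 2, p = p)
```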
highDmean functions
The function zwl_test() conducts the ZWL and ZWLm test of equal mean for two-sample high dimensional data provided in
matrices of dimension n by p and m by p, which are random samples from two populations. It
returns the value of test statistic and p-value under the null hypothesis of equal means.
The SKK_test() performs the SKK test and returns the value of test statistic and p-value.
The buildData() function generates simulated high-dimensional data in the two-population setting
with specified sample sizes, numbers of components, covariance structure, etc., and
the functions zwl_sim() and SKK_sim() return test statistic values and p-values for lists of simulated data sets generated by buildData().
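A typical workflow combining these functions might look like the following sketch, which estimates the empirical size of the test under equal means. It only runs the package calls when highDmean is installed, and it relies on the documented pvalue element of the list returned by zwl_test(); the variable names `alpha`, `pvals`, and `reject_rate` are illustrative.

```r
alpha <- 0.05
reject_rate <- NA_real_  # stays NA when the package is unavailable

if (requireNamespace("highDmean", quietly = TRUE)) {
  set.seed(5)
  # Simulate 20 datasets under the null hypothesis of equal means.
  sims <- highDmean::buildData(n = 45, m = 60, p = 300,
                               muX = rep(0, 300), muY = rep(0, 300),
                               dep = 'IND', S = 20, innov = rnorm)
  # Collect the p-value reported by zwl_test() for each dataset.
  pvals <- vapply(sims, function(d) {
    as.numeric(highDmean::zwl_test(d$X, d$Y, order = 2)$pvalue)
  }, numeric(1))
  # Proportion of datasets rejected at level alpha (empirical size).
  reject_rate <- mean(pvals < alpha)
}
reject_rate
```

With only 20 replications the estimate is coarse; a larger S gives a more stable rejection rate.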
Random sample from shifted gamma distribution
Description
This function generates random samples from a shifted gamma distribution. That is, random samples are first generated from a gamma distribution with shape parameter shape and scale parameter scale, and then the mean of the gamma distribution, shape * scale, is subtracted from the sample.
Usage
rgammashift(n, shape, scale)
Arguments
n |
number of observations. |
shape |
the shape parameter of the gamma distribution. |
scale |
the scale parameter of the gamma distribution. |
Value
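A vector of n values. It is equivalent to rgamma(n, shape = shape, scale = scale) - shape * scale. Note that scale must be passed by name, because the third positional argument of rgamma() is rate.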
Examples
# Generate a sample of shifted gamma observations with shape parameter 4 and scale parameter 2.
set.seed(10)
rgammashift(n = 5, shape = 4, scale = 2)
# It is equivalent to
set.seed(10)
rgamma(n = 5, shape = 4, scale = 2) - 4 * 2
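The centering can also be checked numerically. The sketch below re-implements the description in base R under the hypothetical name `rgamma_shift_sketch` (it follows the text above, not the package source) and confirms that a large sample has mean near 0 while keeping the gamma variance shape * scale^2.

```r
# Hypothetical re-implementation for illustration; it follows the
# description above rather than the package source.
rgamma_shift_sketch <- function(n, shape, scale) {
  stats::rgamma(n, shape = shape, scale = scale) - shape * scale
}

set.seed(10)
x <- rgamma_shift_sketch(n = 1e5, shape = 4, scale = 2)

mean(x)  # should be close to 0 after centering
var(x)   # should be close to shape * scale^2 = 16
```

Zero-mean, skewed innovations of this kind are useful for checking how a test behaves when the data are not normal.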
Apply the test by Zhang and Wang (2020) to multiple simulated two-sample datasets
Description
Apply the two-sample high-dimensional test by Zhang and Wang (2020) to multiple simulated two-sample high dimensional datasets. This function is useful for Monte Carlo experiments.
Usage
zwl_sim(DATA, order = 0)
Arguments
DATA |
The list of dataset lists generated by buildData(). |
order |
The order of the center correction. Possible choices are 0 and 2.
To use the ZWLm test, set order = 0; to use the ZWL test, set order = 2. |
Value
A data frame in which each row consists of the value of the test statistic, the p-value, Tn, and the estimate of Var(Tn) for one dataset.
References
Zhang, H. and Wang, H. (2020). Result consistency of high dimensional two-sample tests applied to gene ontology terms with gene sets. Manuscript in review.
Examples
# Generate 3 simulated two-sample datasets and apply the ZWL test
data <- buildData(n = 45, m = 60, p = 300,
                  muX = rep(0, 300), muY = rep(0, 300),
                  dep = 'IND', S = 3, innov = rnorm)
zwl_sim(data, order = 2)
High-dimensional two-sample test proposed by Zhang and Wang (2020)
Description
This function implements the test of equal means for two-sample high-dimensional data using the ZWL and ZWLm tests proposed by Zhang and Wang (2020).
Usage
zwl_test(X, Y, order = 0)
Arguments
X |
The data matrix (n by p) from the first population. |
Y |
The data matrix (m by p) from the second population. |
order |
The order of center correction. Possible choices are 0 and 2.
To use the ZWLm test, set order = 0; to use the ZWL test, set order = 2. |
Value
- statistic
The value of the test statistic.
- pvalue
The p-value of the test statistic based on the asymptotic normality established by Zhang and Wang (2020).
- Tn
The average of the squared univariate t-statistics.
- var
The estimated variance of Tn.
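The quantity Tn, described above as the average of the squared univariate t-statistics, can be sketched in base R. The helper `mean_sq_t` is hypothetical, and the use of unpooled (Welch-type) standard errors is an assumption made for illustration; the package may standardize differently, and the center correction and variance estimate are not reproduced here.

```r
# Hypothetical helper: average of squared per-component t-statistics.
# Assumption: unpooled (Welch-type) standard errors are used.
mean_sq_t <- function(X, Y) {
  n <- nrow(X); m <- nrow(Y)
  diff_means <- colMeans(X) - colMeans(Y)
  se2 <- apply(X, 2, var) / n + apply(Y, 2, var) / m
  mean(diff_means^2 / se2)
}

set.seed(6)
X  <- matrix(rnorm(45 * 300), nrow = 45)
Y0 <- matrix(rnorm(60 * 300), nrow = 60)              # equal means
Y1 <- matrix(rnorm(60 * 300, mean = 0.5), nrow = 60)  # shifted means

mean_sq_t(X, Y0)  # roughly 1 when the means are equal
mean_sq_t(X, Y1)  # noticeably larger when the means differ
```

Because each squared t-statistic is roughly 1 on average under equal means, large values of this average indicate a difference between the two mean vectors.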
References
Zhang, H. and Wang, H. (2020). Result consistency of high dimensional two-sample tests applied to gene ontology terms with gene sets. Manuscript in review.
Examples
# Generate a simulated two-sample dataset and apply the ZWL test
data <- buildData(n = 45, m = 60, p = 300,
                  muX = rep(0, 300), muY = rep(0, 300),
                  dep = 'IND', S = 1, innov = rnorm)
zwl_test(data[[1]]$X, data[[1]]$Y, order = 2)
# Apply the ZWLm test to a GO term to see if the two groups are differentially expressed.
# The data for the GO term were stored in GO_example.
zwl_test(GO_example$X, GO_example$Y, order = 0)
# Apply the ZWL test to the GO term
zwl_test(GO_example$X, GO_example$Y, order = 2)