\documentclass{article} \usepackage[round]{natbib} \usepackage{bibentry} \usepackage{amsfonts} \newcommand{\Rz}{\mathbb{R} } \hyphenation{Streit-berg} \renewcommand{\baselinestretch}{1.5} %%\VignetteIndexEntry{Some more or less useful examples for illustration.} %%\VignetteDepends{maxstat,survival} \begin{document} \title{Maximally Selected Rank Statistics in \textsf{R}} \author{by Torsten Hothorn and Berthold Lausen} \date{} \maketitle This document gives some examples on how to use the \texttt{maxstat} package and is basically an extention to \cite{on-maximal:2002}. <>= options(prompt=">", width=60) @ \section{Introduction} The determination of two groups of observations with respect to a simple cutpoint of a predictor is a common problem in medical statistics. For example, the distinction of a low and high risk group of patients is of special interest. The selection of a cutpoint in the predictor %%which separates the responses best leads to a multiple testing problem, cf. Figure \ref{stat}. This has to be taken into account when the effect of the selected cutpoint is evaluated. Maximally selected rank statistics can be used for estimation as well as evaluation of a simple cutpoint model. We show how this problems can be treated with the \texttt{maxstat} package and illustrate the usage of the package by gene expression profiling data. \section{Maximally Selected Rank Statistics} The functional relationship between a quantitative or ordered predictor $X$ and a quantitative, ordered or censored response $Y$ is unknown. As a simple model one can assume that an unknown cutpoint $\mu$ in $X$ determines two groups of observations regarding the response $Y$: the first group with $X$-values less or equal $\mu$ and the second group with $X$-values greater $\mu$. A measure of the difference between two groups with respect to $\mu$ is the absolute value of an appropriate standardized two-sample linear rank statistic of the responses. We give a short overview and follow the notation in \cite{maximally-:1992}. The hypothesis of independence of $X$ and $Y$ can be formulated as \begin{eqnarray*} H_0: P(Y \le y | X \le \mu) = P(Y \le y | X > \mu) \quad \end{eqnarray*} for all $ y $ and $ \mu \in \Rz $. This hypothesis can be tested as follows. For every reasonable cutpoint $\mu$ in $X$ (e.g. cutpoints that provide a reasonable sample size in both groups), the absolute value of the standardized two-sample linear rank statistic $|S_\mu|$ is computed. The maximum of the standardized statistics \begin{eqnarray*} M = \max_\mu |S_\mu| \end{eqnarray*} of all possible cutpoints is used as a test statistic for the hypothesis of independence above. The cutpoint in $X$ that provides the best separation of the responses into two groups, i.e. where the standardized statistics take their maximum, is used as an estimate of the unknown cutpoint. Several approximations for the distribution of the maximum of the standardized statistics $S_\mu$ have been suggested. \cite{maximally-:1992} show that the limiting distribution is the distribution of the supremum of the absolute value of a standardized Brownian bridge and consequently the approximation of \cite{maximally-:1982} can be used. An approximation based on an improved Bonferroni inequality is given by \cite{classifica:1994}. For small sample sizes, \cite{on-the-exa:2001} derive an lower bound of the distribution function based on the exact distribution of simple linear rank statistics. The algorithm by \cite{axact-dist:1986} is used for the computations. The exact distribution of a maximally selected Gau{\ss} statistic can be computed using the algorithms by \cite{numerical-:1992}. Because simple linear rank statistics are asymptotically normal, the results can be applied to approximate the distribution of maximally selected rank statistics \citep[see][]{on-the-exa:2001}. \section{The maxstat Package} The package \texttt{maxstat} implements both cutpoint estimation and the test procedure above with several $P$-value approximations as well as plotting of the empirical process of the standardized statistics. It depends on the packages \texttt{exactRankTests} for the computation of the distribution of linear rank statistics \citep{on-exact-r:2001} and \texttt{mvtnorm} for the computation of the multivariate normal distribution \citep{on-multiva:2001}. The package is available at CRAN. The generic method \texttt{maxstat.test} provides a formula interface for the specification of predictor and response. An object of class \texttt{maxtest} is returned. The methods \texttt{print.maxtest} and \texttt{plot.maxtest} are available for inspectation of the results. \section{Gene Expression Profiling} The distinction of two types of diffuse large B-cell lymphoma by gene expression profiling is studied by \cite{distinct-t:2000}. \cite{on-the-exa:2001} suggest the mean gene expression (MGE) as quantitative factor for the discrimination between two groups of patients with respect to overall survival time. The dataset \texttt{DLBCL} is included in the package. The maximally selected log-rank statistic for cutpoints between the $10\%$ and $90\%$ quantile of MGE using the upper bound of the $P$-value by \cite{on-the-exa:2001} can be computed by \renewcommand{\baselinestretch}{1} <>= library("maxstat") library("survival") data("DLBCL", package="maxstat") mtHL <- maxstat.test(Surv(time, cens) ~ MGE, data=DLBCL, smethod="LogRank", pmethod="HL") mtHL @ \renewcommand{\baselinestretch}{1.5} \begin{figure}[t] \begin{center} <>= mod <- maxstat.test(Surv(time, cens) ~ MGE, data=DLBCL, smethod="LogRank", pmethod="Lau94", alpha=0.05) plot(mod, xlab="Mean gene expression", cex.axis=1.3, cex.lab=1.3) @ \caption{Absolute standardized log-rank statistics and significance bound based on the improved Bonferroni inequality. \label{stat}} \end{center} \end{figure} \normalsize \renewcommand{\baselinestretch}{1.5} For censored responses, the formula interface is similar to the one used in package \texttt{survival}: \texttt{time} specifies the time until an event and \texttt{cens} is the status indicator (\texttt{dead=1}). For quantitative responses $y$, the formula is of the form \texttt{y} $ \sim$ \texttt{x}. Currently it is not possible to specify more than one predictor \texttt{x}. \texttt{smethod} allows the selection of the statistics to be used: \texttt{Gauss, Wilcoxon, Median, NormalQuantil} and \texttt{LogRank} are available. \texttt{pmethod} defines which kind of $P$-value approximation is computed: \texttt{Lau92} means the limiting distribution, \texttt{Lau94} the approximation based on the improved Bonferroni inequality, \texttt{exactGauss} the distribution of a maximally selected Gau{\ss} statistic and \texttt{HL} is the upper bound of the $P$-value by \cite{on-the-exa:2001}. All implemented approximations are known to be conservative and therefore their minimum $P$-value is available by choosing \texttt{pmethod="min"}. \begin{figure}[t] \begin{center} <>= splitMGE <- rep(1, nrow(DLBCL)) DLBCL <- cbind(DLBCL, splitMGE) DLBCL$splitMGE[DLBCL$MGE <= mod$estimate] <- 0 par(mai=c(1.0196235, 1.0196235, 0.8196973, 0.4198450)) plot(survfit(Surv(time, cens) ~ splitMGE, data=DLBCL), xlab = "Survival time in month", ylab="Probability", cex.lab=1.3, cex.axis=1.3, lwd=2) text(80, 0.9, expression("Mean gene expression" > 0.186), cex=1.3) text(80, 0.5, expression("Mean gene expression" <= 0.186 ), cex=1.3) @ \caption{Kaplan-Meier curves of two groups of DLBCL patients separated by the cutpoint \Sexpr{round(mtHL$estimate,3)} mean gene expression. \label{surv}} \end{center} \end{figure} For the overall survival time, the estimated cutpoint is \Sexpr{round(mtHL$estimate,3)} mean gene expression, the maximum of the log-rank statistics is $M =$ \Sexpr{round(mtHL$statistic,3)}. The probability that, under the null hypothesis, the maximally selected log-rank statistic is greater $M =$ \Sexpr{round(mtHL$statistic,3)} is less then than \Sexpr{round(mtHL$p.value,3)}. The empirical process of the standardized statistics together with the $\alpha$-quantile of the null distribution can be plotted using \texttt{plot.maxtest}. If the significance level \texttt{alpha} is specified, the corresponding quantile is computed and drawn as a horizonal red line. The estimated cutpoint is plotted as vertical dashed line, see Figure \ref{stat}. The difference in overall survival time between the two groups determined by a cutpoint of \Sexpr{round(mtHL$estimate,3)} mean gene expression is plotted in Figure \ref{surv}. No event was observed for patients with mean gene expression greater \Sexpr{round(mtHL$estimate,3)}. The exact conditional $p$-value can be simulated via conditional Monte-Carlo. This may be time consuming but is an easy way to check the goodness of the $p$-value approximations. The argument \texttt{B} specifies the number of Monte-Carlo replications to be performed (and defaults to $10000$). <>= maxstat.test(Surv(time, cens) ~ MGE, data=DLBCL, smethod="LogRank", pmethod="condMC", B = 9999) @ \section{More than One Predictor} If the cutpoints in more than one predictor are evalutated, the problem is to test the null hypothesis of independence of the response and any of the predictors under consideration. Furthermore, the ``best'' split in the ``most significant'' predictor needs to be selected. For example, we evaluate both mean gene expression and the International Prognostic Index (IPI) simultaneously: \renewcommand{\baselinestretch}{1} <>= mmax <- maxstat.test(Surv(time, cens) ~ MGE + IPI, data=DLBCL, smethod="LogRank", pmethod="exactGauss", abseps=0.01) mmax @ \renewcommand{\baselinestretch}{1.5} The p-value of the global test for the null hypothesis ``survival is independent from both IPI and MGE'' is \Sexpr{round(mmax$p.value,3)} and IPI provides a better distinction into two groups than MGE does. We can display the standardized statistics using \texttt{plot} (Figure \ref{plotmmax}). The methodology used here is described in \cite{lausenetal:2002}. \begin{figure}[ht] \begin{center} <>= plot(mmax, cex.axis=1.3, cex.lab=1.3) @ \caption{The standardized logrank statistics for the two predictors MGE and IPI. \label{plotmmax}} \end{center} \end{figure} \section{Summary} The package \texttt{maxstat} provides a user-friendly interface and implements standard methods as well as recent suggestions for the approximation of the null distribution of maximally selected rank statistics. \bibliographystyle{plainnat} \bibliography{maxstatlit} \end{document}