--- title: "Introduction to depthR" author: "Jason Parker" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to depthR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 5 ) library(depthR) set.seed(42) ``` ## What is Statistical Depth? The mean and standard deviation are the workhorses of multivariate statistics. But they have a fundamental problem: they are not robust. A single outlier can pull the mean far from the center of the data, and the IQR — the standard tool for flagging outliers — has no natural multivariate generalization. Statistical depth functions solve this. A depth function assigns to every point in $\mathbb{R}^d$ a value measuring how *central* that point is with respect to a distribution. The deepest point is the multivariate median — a genuine robust generalization of the univariate median. Points near the periphery of the distribution have depth near zero. Existing R packages for depth functions (ddalpha, depth, DepthProc) implement these ideas but cap out at $d = 2$ or $d = 3$, or are too slow for practical use at moderate dimension. **depthR** is built from the ground up in C++ to work in arbitrary dimension $d$, with parallel computation and adaptive stopping rules that keep computation time under control. --- ## The Five Depth Functions depthR implements five depth functions covering the main theoretical families in the literature. ### Mahalanobis Depth The simplest depth function. Depth is a monotone decreasing function of the Mahalanobis distance from the estimated mean: $$D(x, F) = \frac{1}{1 + (x - \mu)^\top \Sigma^{-1} (x - \mu)}$$ The deepest point is the sample mean. Mahalanobis depth is included as a baseline — it is fast and closed-form, but assumes elliptical symmetry and has zero breakdown point. ```{r mahalanobis-example} data <- matrix(rnorm(200), nrow = 50, ncol = 4) x <- rbind(colMeans(data), # center colMeans(data) + 3) # outlier direction mahalanobis_depth(x, data) ``` ### Tukey (Halfspace) Depth The canonical depth function, defined as the minimum probability mass in any closed halfspace containing $x$: $$TD(x, F) = \inf_{u \neq 0} P(u^\top X \leq u^\top x)$$ The Tukey median has breakdown point up to $1/(d+1)$ and satisfies all four Zuo-Serfling axioms. Exact computation is $O(n^{d-1})$; depthR uses an adaptive random projection approximation. ```{r tukey-example, eval = FALSE} tukey_depth(x, data, batch_size = 50, min_batches = 3, patience = 2) ``` ``` #> [1] 0.4 0.0 ``` ### Liu Simplicial Depth The probability that a random simplex formed by $d+1$ draws from $F$ contains $x$: $$SD(x, F) = P(x \in S[X_1, \ldots, X_{d+1}])$$ Simplicial depth has a beautiful geometric interpretation and reduces to the univariate median when $d = 1$. depthR uses adaptive Monte Carlo with a Bernoulli standard error stopping rule. ```{r simplicial-example, eval = FALSE} simplicial_depth(x, data, batch_size = 50, min_batches = 3, max_batches = 5) ``` ``` #> [1] 0.104 0.000 ``` ### Projection Depth Based on the Stahel-Donoho outlyingness — the supremum over all directions of the robust univariate Z-score: $$O(x, F) = \sup_{u \neq 0} \frac{|u^\top x - \text{med}(u^\top F)|} {\text{MAD}(u^\top F)}, \quad PD(x, F) = \frac{1}{1 + O(x, F)}$$ Projection depth uses median and MAD rather than mean and SD, giving it a high breakdown point. It is affine invariant. ```{r projection-example, eval = FALSE} projection_depth(x, data, batch_size = 50, min_batches = 3, patience = 2) ``` ``` #> [1] 0.63968225 0.08674677 ``` ### Spatial Depth Defined as one minus the norm of the mean unit vector pointing from the data toward $x$: $$SpD(x, F) = 1 - \left\| E\left[ \frac{x - X}{\|x - X\|} \right] \right\|$$ Spatial depth is the only function in depthR with a closed-form sample estimate — no Monte Carlo needed. This makes it extremely fast even at large $n$ and $d$. ```{r spatial-example} spatial_depth(x, data) ``` --- ## The depth Object All depth functions plug into `compute_depth()`, which returns a `depth` object from which medians, ranks, outliers, and central regions can be extracted without recomputing depth. ```{r compute-depth} dd <- compute_depth(data, depth_fn = mahalanobis_depth) dd ``` ### Depth-Based Median The deepest point — a genuine robust estimator of multivariate location. ```{r depth-median} m <- median(dd) cat("Median index:", m$index, "\n") print(round(m$point, 3)) cat("Depth:", round(m$depth, 4), "\n") ``` ### Depth-Based Ranks Rank 1 is the deepest (most central) observation; rank $n$ is the most outlying. ```{r depth-ranks} r <- rank(dd) cat("5 deepest observations:\n") print(which(r <= 5)) ``` ### Central Region The inner $1 - \alpha$ fraction of the data by depth. ```{r central-region} cr <- central_region(dd, alpha = 0.50) cat("Inner 50% contains", length(cr$indices), "observations\n") ``` --- ## Outlier Detection ```{r outlier-setup} set.seed(1) clean <- matrix(rnorm(200), nrow = 100, ncol = 2) outliers_injected <- matrix(runif(10, min = 4, max = 6), nrow = 5, ncol = 2) contaminated <- rbind(clean, outliers_injected) dd_cont <- compute_depth(contaminated, depth_fn = mahalanobis_depth) out <- outliers(dd_cont, threshold = 0.05) cat("Injected outliers are rows 101-105\n") cat("Detected outlier indices:\n") print(sort(out$indices)) ``` ### Visualising Outliers ```{r outlier-plot} plot(dd_cont, outlier_threshold = 0.05, main = "Mahalanobis depth — injected outliers flagged in red") ``` --- ## The DD-Plot The depth-depth plot is the multivariate analog of the QQ-plot. ```{r dd-same} set.seed(2) x_same <- matrix(rnorm(200), nrow = 100, ncol = 2) y_same <- matrix(rnorm(200), nrow = 100, ncol = 2) dd_plot(x_same, y_same, depth_fn = mahalanobis_depth, main = "DD-Plot: same distribution") ``` ```{r dd-shift} y_shift <- matrix(rnorm(200, mean = 2), nrow = 100, ncol = 2) dd_plot(x_same, y_shift, depth_fn = mahalanobis_depth, main = "DD-Plot: location shift of 2") ``` --- ## High-Dimensional Performance depthR works in high dimension. Here we demonstrate on $d = 10$. ```{r high-d-setup} set.seed(3) n <- 100 d <- 10 data_hd <- matrix(rnorm(n * d), nrow = n, ncol = d) x_hd <- rbind(colMeans(data_hd), colMeans(data_hd) + 4) cat("Dimension d =", d, ", n =", n, "\n") ``` ```{r high-d-mahal} cat("Mahalanobis depth:\n") print(round(mahalanobis_depth(x_hd, data_hd), 4)) ``` ```{r high-d-tukey, eval = FALSE} cat("\nTukey depth:\n") print(round(tukey_depth(x_hd, data_hd, batch_size = 50, min_batches = 3, patience = 2), 4)) ``` ``` #> #> Tukey depth: #> [1] 0.42 0.00 ``` ```{r high-d-spatial} cat("\nSpatial depth:\n") print(round(spatial_depth(x_hd, data_hd), 4)) ``` --- ## References - Liu, R. Y. (1990). On a notion of data depth based on random simplices. *Annals of Statistics*, 18(1), 405--414. - Vardi, Y. & Zhang, C.-H. (2000). The multivariate L1-median and associated data depth. *PNAS*, 97(4), 1423--1426. - Zuo, Y. & Serfling, R. (2000). General notions of statistical depth function. *Annals of Statistics*, 28(2), 461--482. - Serfling, R. (2006). Depth functions in nonparametric multivariate inference. *DIMACS Series in Discrete Mathematics*, 72, 1--16.