---
title: "Introduction to depthR"
author: "Jason Parker"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to depthR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  fig.width  = 6,
  fig.height = 5
)
library(depthR)
set.seed(42)
```

## What is Statistical Depth?

The mean and standard deviation are the workhorses of multivariate statistics.
But they have a fundamental problem: they are not robust. A single outlier can
pull the mean far from the center of the data, and the IQR — the standard tool
for flagging outliers — has no natural multivariate generalization.

Statistical depth functions solve this. A depth function assigns to every point
in $\mathbb{R}^d$ a value measuring how *central* that point is with respect to
a distribution. The deepest point is the multivariate median — a genuine robust
generalization of the univariate median. Points near the periphery of the
distribution have depth near zero.

Existing R packages for depth functions (ddalpha, depth, DepthProc) implement
these ideas but cap out at $d = 2$ or $d = 3$, or are too slow for practical
use at moderate dimension. **depthR** is built from the ground up in C++ to
work in arbitrary dimension $d$, with parallel computation and adaptive stopping
rules that keep computation time under control.

---

## The Five Depth Functions

depthR implements five depth functions covering the main theoretical families
in the literature.

### Mahalanobis Depth

The simplest depth function. Depth is a monotone decreasing function of the
Mahalanobis distance from the estimated mean:

$$D(x, F) = \frac{1}{1 + (x - \mu)^\top \Sigma^{-1} (x - \mu)}$$

The deepest point is the sample mean. Mahalanobis depth is included as a
baseline — it is fast and closed-form, but assumes elliptical symmetry and has
zero breakdown point.

```{r mahalanobis-example}
data <- matrix(rnorm(200), nrow = 50, ncol = 4)
x    <- rbind(colMeans(data),        # center
              colMeans(data) + 3)    # outlier direction

mahalanobis_depth(x, data)
```

### Tukey (Halfspace) Depth

The canonical depth function, defined as the minimum probability mass in any
closed halfspace containing $x$:

$$TD(x, F) = \inf_{u \neq 0} P(u^\top X \leq u^\top x)$$

The Tukey median has breakdown point up to $1/(d+1)$ and satisfies all four
Zuo-Serfling axioms. Exact computation is $O(n^{d-1})$; depthR uses an
adaptive random projection approximation.

```{r tukey-example, eval = FALSE}
tukey_depth(x, data, batch_size = 50, min_batches = 3, patience = 2)
```
```
#> [1] 0.4 0.0
```

### Liu Simplicial Depth

The probability that a random simplex formed by $d+1$ draws from $F$ contains
$x$:

$$SD(x, F) = P(x \in S[X_1, \ldots, X_{d+1}])$$

Simplicial depth has a beautiful geometric interpretation and reduces to the
univariate median when $d = 1$. depthR uses adaptive Monte Carlo with a
Bernoulli standard error stopping rule.

```{r simplicial-example, eval = FALSE}
simplicial_depth(x, data, batch_size = 50, min_batches = 3, max_batches = 5)
```
```
#> [1] 0.104 0.000
```

### Projection Depth

Based on the Stahel-Donoho outlyingness — the supremum over all directions of
the robust univariate Z-score:

$$O(x, F) = \sup_{u \neq 0} \frac{|u^\top x - \text{med}(u^\top F)|}
                                    {\text{MAD}(u^\top F)}, \quad
PD(x, F) = \frac{1}{1 + O(x, F)}$$

Projection depth uses median and MAD rather than mean and SD, giving it a high
breakdown point. It is affine invariant.

```{r projection-example, eval = FALSE}
projection_depth(x, data, batch_size = 50, min_batches = 3, patience = 2)
```
```
#> [1] 0.63968225 0.08674677
```

### Spatial Depth

Defined as one minus the norm of the mean unit vector pointing from the data
toward $x$:

$$SpD(x, F) = 1 - \left\| E\left[ \frac{x - X}{\|x - X\|} \right] \right\|$$

Spatial depth is the only function in depthR with a closed-form sample
estimate — no Monte Carlo needed. This makes it extremely fast even at large
$n$ and $d$.

```{r spatial-example}
spatial_depth(x, data)
```

---

## The depth Object

All depth functions plug into `compute_depth()`, which returns a `depth` object
from which medians, ranks, outliers, and central regions can be extracted
without recomputing depth.

```{r compute-depth}
dd <- compute_depth(data, depth_fn = mahalanobis_depth)
dd
```

### Depth-Based Median

The deepest point — a genuine robust estimator of multivariate location.

```{r depth-median}
m <- median(dd)
cat("Median index:", m$index, "\n")
print(round(m$point, 3))
cat("Depth:", round(m$depth, 4), "\n")
```

### Depth-Based Ranks

Rank 1 is the deepest (most central) observation; rank $n$ is the most
outlying.

```{r depth-ranks}
r <- rank(dd)
cat("5 deepest observations:\n")
print(which(r <= 5))
```

### Central Region

The inner $1 - \alpha$ fraction of the data by depth.

```{r central-region}
cr <- central_region(dd, alpha = 0.50)
cat("Inner 50% contains", length(cr$indices), "observations\n")
```

---

## Outlier Detection

```{r outlier-setup}
set.seed(1)
clean             <- matrix(rnorm(200), nrow = 100, ncol = 2)
outliers_injected <- matrix(runif(10, min = 4, max = 6), nrow = 5, ncol = 2)
contaminated      <- rbind(clean, outliers_injected)

dd_cont <- compute_depth(contaminated, depth_fn = mahalanobis_depth)
out     <- outliers(dd_cont, threshold = 0.05)

cat("Injected outliers are rows 101-105\n")
cat("Detected outlier indices:\n")
print(sort(out$indices))
```

### Visualising Outliers

```{r outlier-plot}
plot(dd_cont, outlier_threshold = 0.05,
     main = "Mahalanobis depth — injected outliers flagged in red")
```

---

## The DD-Plot

The depth-depth plot is the multivariate analog of the QQ-plot.

```{r dd-same}
set.seed(2)
x_same <- matrix(rnorm(200), nrow = 100, ncol = 2)
y_same <- matrix(rnorm(200), nrow = 100, ncol = 2)

dd_plot(x_same, y_same,
        depth_fn = mahalanobis_depth,
        main     = "DD-Plot: same distribution")
```

```{r dd-shift}
y_shift <- matrix(rnorm(200, mean = 2), nrow = 100, ncol = 2)

dd_plot(x_same, y_shift,
        depth_fn = mahalanobis_depth,
        main     = "DD-Plot: location shift of 2")
```

---

## High-Dimensional Performance

depthR works in high dimension. Here we demonstrate on $d = 10$.

```{r high-d-setup}
set.seed(3)
n       <- 100
d       <- 10
data_hd <- matrix(rnorm(n * d), nrow = n, ncol = d)
x_hd    <- rbind(colMeans(data_hd), colMeans(data_hd) + 4)
cat("Dimension d =", d, ", n =", n, "\n")
```

```{r high-d-mahal}
cat("Mahalanobis depth:\n")
print(round(mahalanobis_depth(x_hd, data_hd), 4))
```

```{r high-d-tukey, eval = FALSE}
cat("\nTukey depth:\n")
print(round(tukey_depth(x_hd, data_hd,
            batch_size = 50, min_batches = 3, patience = 2), 4))
```
```
#> 
#> Tukey depth:
#> [1] 0.42 0.00
```

```{r high-d-spatial}
cat("\nSpatial depth:\n")
print(round(spatial_depth(x_hd, data_hd), 4))
```

---

## References

- Liu, R. Y. (1990). On a notion of data depth based on random simplices.
  *Annals of Statistics*, 18(1), 405--414.

- Vardi, Y. & Zhang, C.-H. (2000). The multivariate L1-median and associated
  data depth. *PNAS*, 97(4), 1423--1426.

- Zuo, Y. & Serfling, R. (2000). General notions of statistical depth function.
  *Annals of Statistics*, 28(2), 461--482.

- Serfling, R. (2006). Depth functions in nonparametric multivariate inference.
  *DIMACS Series in Discrete Mathematics*, 72, 1--16.