2. Wide Correlation Workflows

Scope

This vignette covers the basic wide-data correlation estimators. These methods start from one numeric matrix or data frame, treat columns as variables, and return a square matrix indexed by those columns.

The main functions in this group are:

They answer related but not identical questions, so method choice should be driven by the structure of the data rather than by habit alone.

A common input pattern

library(matrixCorr)

set.seed(10)
z <- rnorm(80)
u <- rnorm(80)
X <- data.frame(
  x1 = z + rnorm(80, sd = 0.35),
  x2 = 0.85 * z + rnorm(80, sd = 0.45),
  x3 = 0.25 * z + 0.70 * u + rnorm(80, sd = 0.45),
  x4 = rnorm(80)
)

All four estimators accept this same wide input format.

R_pear <- pearson_corr(X)
R_spr  <- spearman_rho(X)
R_ken  <- kendall_tau(X)
R_dcor <- dcor(X)

print(R_pear, digits = 2)
#> Pearson correlation matrix
#>   method      : pearson
#>   dimensions  : 4 x 4
#> 
#>      x1    x2   x3    x4
#> x1 1.00  0.78 0.13  0.08
#> x2 0.78  1.00 0.17 -0.02
#> x3 0.13  0.17 1.00  0.22
#> x4 0.08 -0.02 0.22  1.00
summary(R_spr)
#> Correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 6
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.0023 to 0.7766
#> 
#>  item1 item2 estimate
#>  x1    x2     0.7766 
#>  x2    x3     0.1866 
#>  x3    x4     0.1750 
#>  x1    x3     0.1318 
#>  x1    x4     0.1047 
#>  x2    x4    -0.0023

This toy dataset is intentionally structured so that x1 and x2 form a clear linear pair, x3 is only moderately related to that first block, and x4 is close to null. That makes the shared output structure easier to interpret than a pure-noise example.

Pearson correlation

Pearson correlation targets linear association on the original measurement scale. It is the natural first choice when variables are continuous, the relationship is approximately linear, and there is no strong concern about outlier sensitivity.

plot(R_pear)

If confidence intervals are required, they can be requested directly.

R_pear_ci <- pearson_corr(X, ci = TRUE)
summary(R_pear_ci)
#> Pearson correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 6
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.0239 to 0.7773
#>   ci          : 95%
#>   ci_method   : fisher_z
#>   ci_width    : 0.179 to 0.439
#>   cross_zero  : 4 pair(s)
#> 
#>  item1 item2 estimate n  95% CI            p_value
#>  x1    x2     0.7773  80 [0.6724, 0.8516]  0.0000 
#>  x3    x4     0.2249  80 [0.0054, 0.4236]  0.0447 
#>  x2    x3     0.1741  80 [-0.0474, 0.3793] 0.1227 
#>  x1    x3     0.1250  80 [-0.0974, 0.3355] 0.2701 
#>  x1    x4     0.0770  80 [-0.1452, 0.2918] 0.4984 
#>  x2    x4    -0.0239  80 [-0.2424, 0.1968] 0.8337
estimate(R_pear_ci)
#>           x1          x2        x3          x4
#> x1 1.0000000  0.77729444 0.1250327  0.07700520
#> x2 0.7772944  1.00000000 0.1741272 -0.02392893
#> x3 0.1250327  0.17412725 1.0000000  0.22485359
#> x4 0.0770052 -0.02392893 0.2248536  1.00000000
confint(R_pear_ci)
#>   item1 item2          lwr       upr
#> 1    x1    x2  0.672415903 0.8515753
#> 2    x1    x3 -0.097358788 0.3355320
#> 3    x2    x3 -0.047403236 0.3793314
#> 4    x1    x4 -0.145167838 0.2917853
#> 5    x2    x4 -0.242371535 0.1968228
#> 6    x3    x4  0.005403655 0.4236409
ci(R_pear_ci)
#> $est
#>           x1          x2        x3          x4
#> x1 1.0000000  0.77729444 0.1250327  0.07700520
#> x2 0.7772944  1.00000000 0.1741272 -0.02392893
#> x3 0.1250327  0.17412725 1.0000000  0.22485359
#> x4 0.0770052 -0.02392893 0.2248536  1.00000000
#> 
#> $lwr.ci
#>             x1          x2           x3           x4
#> x1         NaN  0.67241590 -0.097358788 -0.145167838
#> x2  0.67241590         NaN -0.047403236 -0.242371535
#> x3 -0.09735879 -0.04740324          NaN  0.005403655
#> x4 -0.14516784 -0.24237154  0.005403655          NaN
#> 
#> $upr.ci
#>           x1        x2        x3        x4
#> x1       NaN 0.8515753 0.3355320 0.2917853
#> x2 0.8515753       NaN 0.3793314 0.1968228
#> x3 0.3355320 0.3793314       NaN 0.4236409
#> x4 0.2917853 0.1968228 0.4236409       NaN
#> 
#> $conf.level
#> [1] 0.95
tidy(R_pear_ci)
#>   item1 item2    estimate          lwr       upr
#> 1    x1    x2  0.77729444  0.672415903 0.8515753
#> 2    x1    x3  0.12503273 -0.097358788 0.3355320
#> 3    x2    x3  0.17412725 -0.047403236 0.3793314
#> 4    x1    x4  0.07700520 -0.145167838 0.2917853
#> 5    x2    x4 -0.02392893 -0.242371535 0.1968228
#> 6    x3    x4  0.22485359  0.005403655 0.4236409

Spearman and Kendall

Spearman’s rho and Kendall’s tau are rank-based estimators. They are useful when monotone association is of interest and the analysis should be less sensitive to departures from strict linearity. Both functions also support optional large-sample confidence intervals through ci = TRUE.

set.seed(11)
x <- sort(rnorm(60))
y <- x^3 + rnorm(60, sd = 0.5)
dat_mon <- data.frame(x = x, y = y)

pearson_corr(dat_mon)
#> Pearson correlation matrix
#>   method      : pearson
#>   dimensions  : 2 x 2
#> 
#>        x      y
#> x 1.0000 0.7839
#> y 0.7839 1.0000
spearman_rho(dat_mon)
#> Spearman correlation matrix
#>   method      : spearman
#>   dimensions  : 2 x 2
#> 
#>        x      y
#> x 1.0000 0.8007
#> y 0.8007 1.0000
kendall_tau(dat_mon)
#> Kendall correlation matrix
#>   method      : kendall
#>   dimensions  : 2 x 2
#> 
#>        x      y
#> x 1.0000 0.6395
#> y 0.6395 1.0000

In this setting the relationship is monotone but not linear, so a rank-based summary is often the clearer first description.

When interval estimation is required, the same matrix-style interface is kept.

fit_spr_ci <- spearman_rho(X, ci = TRUE)
fit_ken_ci <- kendall_tau(X, ci = TRUE)

summary(fit_spr_ci)
#> Spearman correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 6
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.0023 to 0.7766
#>   ci          : 95%
#>   ci_method   : jackknife_euclidean_likelihood
#>   ci_width    : 0.197 to 0.476
#>   cross_zero  : 5 pair(s)
#> 
#>  item1 item2 estimate n  95% CI            p_value
#>  x1    x2     0.7766  80 [0.6776, 0.8750]  0.0000 
#>  x2    x3     0.1866  80 [-0.0400, 0.4130] 0.0975 
#>  x3    x4     0.1750  80 [-0.0525, 0.4023] 0.1207 
#>  x1    x3     0.1318  80 [-0.0975, 0.3609] 0.2447 
#>  x1    x4     0.1047  80 [-0.1335, 0.3427] 0.3565 
#>  x2    x4    -0.0023  80 [-0.2312, 0.2267] 0.9841
summary(fit_ken_ci)
#> Kendall correlation summary
#>   output      : matrix
#>   dimensions  : 4 x 4
#>   retained_pairs: 6
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : -0.0032 to 0.5861
#>   ci          : 95%
#>   ci_method   : fieller
#>   ci_width    : 0.195 to 0.295
#>   cross_zero  : 5 pair(s)
#> 
#>  item1 item2 estimate n  95% CI            p_value
#>  x1    x2     0.5861  80 [0.4800, 0.6752]  0.0000 
#>  x3    x4     0.1152  80 [-0.0329, 0.2583] 0.3100 
#>  x2    x3     0.1133  80 [-0.0348, 0.2565] 0.3181 
#>  x1    x3     0.0880  80 [-0.0603, 0.2325] 0.4389 
#>  x1    x4     0.0741  80 [-0.0743, 0.2192] 0.5151 
#>  x2    x4    -0.0032  80 [-0.1506, 0.1444] 0.9778

Distance correlation

Distance correlation addresses a broader target. It is designed to detect general dependence rather than only linear or monotone structure. The function also supports optional hypothesis testing through p_value = TRUE.

set.seed(12)
x <- runif(100, -2, 2)
y <- x^2 + rnorm(100, sd = 0.2)
dat_nonlin <- data.frame(x = x, y = y)

pearson_corr(dat_nonlin)
#> Pearson correlation matrix
#>   method      : pearson
#>   dimensions  : 2 x 2
#> 
#>        x      y
#> x 1.0000 0.0045
#> y 0.0045 1.0000
dcor(dat_nonlin)
#> Distance correlation (dCor) matrix
#>   method      : distance_correlation
#>   dimensions  : 2 x 2
#> 
#>        x      y
#> x 1.0000 0.2064
#> y 0.2064 1.0000

This is a typical situation where Pearson correlation can be close to zero even though the variables are clearly dependent.

If a formal inferential summary is needed, p-values can be requested directly.

fit_dcor_p <- dcor(dat_nonlin, p_value = TRUE)
summary(fit_dcor_p)
#> Distance correlation summary
#>   output      : matrix
#>   dimensions  : 2 x 2
#>   retained_pairs: 1
#>   threshold   : 0.0000
#>   diag        : included
#>   estimate    : 0.2064
#>   inference   : dcor_t_test
#> 
#>  item1 item2 estimate n   p_value
#>  x     y     0.2064   100 0.0000

Missing values

The default wide-data behaviour is strict validation. Missing values are rejected unless the function explicitly supports a relaxed mode through na_method = "pairwise".

X_miss <- X
X_miss$x2[c(3, 7)] <- NA

try(pearson_corr(X_miss))
#> Error in eval(expr, envir) : 
#>   Missing values are not allowed when na_method = "error"; non-finite values are not allowed either. Use na_method = "pairwise" to use pairwise complete observations, or na_method = "complete" to drop rows with missing or non-finite values before computing correlations.
pearson_corr(X_miss, na_method = "pairwise")
#> Pearson correlation matrix
#>   method      : pearson
#>   dimensions  : 4 x 4
#> 
#>        x1      x2     x3      x4
#> x1 1.0000  0.7742 0.1250  0.0770
#> x2 0.7742  1.0000 0.1829 -0.0285
#> x3 0.1250  0.1829 1.0000  0.2249
#> x4 0.0770 -0.0285 0.2249  1.0000

When na_method = "pairwise", the package uses pairwise complete observations for the affected estimator. That is convenient, but it also means different pairs may be based on different effective sample sizes.

Practical guidance

In ordinary wide-data work, the following sequence is usually defensible.

The next vignette addresses settings where this basic family is still not sufficient because the data are contaminated by outliers or the number of variables is large relative to sample size.