SignalY: Comprehensive Signal Extraction from Panel Data

José Mauricio Gómez Julián

2026-01-24

Introduction

The SignalY package provides a comprehensive framework for signal extraction from panel data, integrating multiple complementary methodologies:

  1. Spectral Decomposition: Wavelets, Empirical Mode Decomposition (EMD), and Bayesian HP filters
  2. Bayesian Variable Selection: Regularized Horseshoe regression for sparse signal identification
  3. Dimensionality Reduction: PCA with bootstrap significance testing and Dynamic Factor Models
  4. Stationarity Testing: Integrated battery of unit root tests

The package distinguishes between latent structure (the underlying data-generating process) and phenomenological dynamics (observed variability), providing automated technical interpretation of results.

Installation

# Install from CRAN (when available)
install.packages("SignalY")

# Install development version from GitHub
# remotes::install_github("username/SignalY")

# For Bayesian methods (Horseshoe, HP-GC), install cmdstanr:
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
cmdstanr::install_cmdstan()

Quick Start

library(SignalY)

# Generate example data
set.seed(42)
n <- 100
p <- 20

# Create correlated predictors with factor structure
factors <- matrix(rnorm(n * 2), n, 2)
loadings <- matrix(runif(p * 2, -1, 1), p, 2)
X <- factors %*% t(loadings) + matrix(rnorm(n * p, 0, 0.5), n, p)
colnames(X) <- paste0("X", 1:p)

# True signal depends on 5 predictors
true_beta <- c(rep(1, 5), rep(0, 15))
Y <- X %*% true_beta + rnorm(n, 0, 0.5)

# Combine into data frame
data <- data.frame(Y = Y, X)

# Run comprehensive analysis
result <- signal_analysis(
  data = data,
  y_formula = "Y",
  methods = "all",
  verbose = TRUE
)

# View results
print(result)
summary(result)
plot(result)

Methodological Framework

Epistemological Foundation

SignalY operationalizes the distinction between latent structure and phenomenological dynamics, which maps onto familiar statistical concepts:

Concept       Statistical Implementation
-------       --------------------------
Signal        Trend component, common factors
Noise         Idiosyncratic shocks, measurement error
Sparsity      Few variables carry information
Persistence   Unit roots, long memory
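
A toy series makes the mapping concrete; below, a deterministic trend plays the role of signal and i.i.d. shocks play the role of noise (illustration only, not package code):

# Illustration only: signal = latent trend, noise = idiosyncratic shocks
set.seed(1)
tt    <- 1:200
trend <- 0.05 * tt              # latent structure (signal)
noise <- rnorm(200)             # phenomenological variability (noise)
y     <- trend + noise
plot(y, type = "l", col = "gray", main = "Signal vs. Noise")
lines(trend, col = "blue", lwd = 2)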

Method Selection Guidance

Question                              Recommended Method
--------                              ------------------
What are the dominant frequencies?    Wavelets, EMD
Which predictors matter?              Horseshoe regression
Is there common factor structure?     PCA, DFM
Is the series stationary?             Unit root battery
What’s the trend vs cycle?            HP-GC filter

Spectral Decomposition

Wavelet Multi-Resolution Analysis

# Apply wavelet decomposition to a simple test series
# (named y_sig so it does not clobber Y from the Quick Start)
y_sig <- sin(seq(0, 4*pi, length.out = 128)) + rnorm(128, 0, 0.3)

wav_result <- filter_wavelet(
  x = y_sig,
  filter = "la8",      # Daubechies least asymmetric filter, length 8
  levels = c(3, 4),    # Combine D3 + D4 (business cycle frequencies)
  first_diff = FALSE
)

# Examine results
names(wav_result)
# [1] "combined" "smooth" "details" "level_variance" "filter" "levels"

# Combined D3+D4 signal captures cycles of 8-32 periods (years, for annual data)
plot(wav_result$combined, type = "l", main = "Wavelet-Filtered Signal")

The choice of wavelet filter involves a trade-off: longer filters such as "la8" have better frequency localization, while shorter ones such as "haar" suffer less boundary distortion.

Level selection corresponds to periodicities; level j captures cycles of roughly \(2^j\) to \(2^{j+1}\) observations:

Level   Period Range (annual data)
-----   --------------------------
D1      2-4 years
D2      4-8 years
D3      8-16 years
D4      16-32 years
D5      32-64 years
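
A small helper (hypothetical, not part of SignalY) makes the dyadic mapping explicit:

# Hypothetical helper: period range (in observations) captured by DWT level j
wavelet_period_range <- function(level) c(lower = 2^level, upper = 2^(level + 1))
wavelet_period_range(3)   # 8, 16 -> D3 spans 8-16 year cycles for annual data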

Empirical Mode Decomposition

EMD is data-adaptive, requiring no basis function specification:

emd_result <- filter_emd(
  x = y_sig,
  max_imf = 10,
  boundary = "wave"
)

# IMFs are ordered from highest to lowest frequency
matplot(emd_result$imf[, 1:3], type = "l", 
        main = "First 3 Intrinsic Mode Functions")

# Residue captures the trend
plot(emd_result$residue, type = "l", main = "EMD Residue (Trend)")
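
EMD is a complete decomposition: the IMFs plus the residue reconstruct the input exactly. A quick sanity check, assuming filter_emd() returns components on the original scale:

# Completeness check: sum of IMFs + residue should equal the input series
reconstructed <- rowSums(emd_result$imf) + emd_result$residue
all.equal(as.vector(y_sig), as.vector(reconstructed))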

Bayesian HP-GC Filter

The Grant-Chan formulation embeds the HP filter in a Bayesian state-space model, yielding posterior uncertainty for the trend and data-driven selection of the smoothing parameter:

# Requires cmdstanr
hpgc_result <- filter_hpgc(
  x = y_sig,
  lambda = NULL,  # Auto-select via DIC
  n_chains = 4,
  n_iter = 2000
)

# Compare to standard HP filter
plot(y_sig, type = "l", col = "gray")
lines(hpgc_result$trend, col = "blue", lwd = 2)
legend("topright", c("Original", "Bayesian Trend"), 
       col = c("gray", "blue"), lwd = c(1, 2))
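
The cycle component is simply the deviation from the estimated trend:

# Cycle = deviation from the Bayesian trend
cycle <- y_sig - hpgc_result$trend
plot(cycle, type = "l", main = "HP-GC Cycle Component")
abline(h = 0, lty = 2)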

Bayesian Variable Selection

Regularized Horseshoe Prior

The regularized Horseshoe prior shrinks noise coefficients aggressively toward zero while leaving genuine signals largely unshrunk, giving sparse estimates without a discrete search over candidate models:

# Fit Horseshoe regression
hs_fit <- fit_horseshoe(
  y = Y,
  X = X,
  p0 = 5,               # Expected non-zeros (can be NULL for auto)
  n_chains = 4,
  n_iter = 2000,
  n_warmup = 1000,
  adapt_delta = 0.95,   # Target acceptance (increase if divergences)
  use_qr = TRUE         # QR decomposition for multicollinearity
)

# Examine shrinkage
print(hs_fit)

# Key outputs:
# - beta: Coefficient estimates
# - kappa: Shrinkage factors (0 = no shrinkage, 1 = full shrinkage)
# - m_eff: Effective number of non-zeros

# Select variables by shrinkage threshold
selection <- select_by_shrinkage(hs_fit, threshold = 0.5)
which(selection$selected)  # Indices of selected predictors
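
Because the Quick Start data were simulated, selection can be checked against the known truth:

# Cross-tabulate selection against the simulated truth (first 5 betas non-zero)
table(selected = selection$selected, truly_nonzero = true_beta != 0)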

Interpreting Shrinkage Factors

The shrinkage factor \(\kappa_j\) quantifies how much coefficient \(j\) is pulled toward zero:

\[\kappa_j = \frac{1}{1 + \tau^2 \lambda_j^2 / c^2}\]

\(\kappa\)   Interpretation
----------   --------------
0.0-0.1      Strong signal, minimal shrinkage
0.1-0.3      Moderate signal
0.3-0.5      Weak signal
0.5-0.7      Likely noise
0.7-1.0      Strong shrinkage, essentially zero
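
A quick numerical illustration of this mapping, with \(\tau\) and \(c\) fixed at arbitrary values for display:

# Shrinkage factor as a function of the local scale lambda_j (illustration)
kappa <- function(lambda, tau = 0.1, c = 2) 1 / (1 + tau^2 * lambda^2 / c^2)
curve(kappa(x), from = 0, to = 100,
      xlab = expression(lambda[j]), ylab = expression(kappa[j]))
# Small lambda_j -> kappa near 1 (shrunk to zero); large lambda_j -> kappa near 0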

Dimensionality Reduction

PCA with Bootstrap

pca_result <- pca_bootstrap(
  X = X,
  n_components = NULL,      # Auto-select
  rotation = "varimax",     # Or "none", "oblimin"
  n_boot = 1000,
  block_length = NULL,      # Auto: sqrt(T)
  significance_level = 0.05
)

# Key outputs
pca_result$variance_explained  # Proportion by component
pca_result$loadings            # Variable loadings
pca_result$entropy             # Entropy of loadings (concentration measure)
pca_result$loadings_significant  # Bootstrap significance

# Interpretation
# Low entropy = concentrated loadings (few variables dominate)
# High entropy = diffuse loadings (many variables contribute)
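
The entropy measure can be reproduced by hand; a sketch assuming it is the Shannon entropy of the squared, normalized loadings (the package's exact definition may differ):

# Sketch: Shannon entropy of squared, normalized PC1 loadings
w <- pca_result$loadings[, 1]^2
w <- w / sum(w)
-sum(w * log(w))   # low = concentrated, high = diffuse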

Dynamic Factor Models

DFMs capture dynamic common factor structure:

dfm_result <- estimate_dfm(
  X = X,
  n_factors = NULL,         # Auto-select via Bai-Ng IC
  max_factors = 10,
  var_lags = 1,             # VAR lags for factor dynamics
  ic_criterion = "bai_ng_2"
)

# Key outputs
dfm_result$n_factors         # Optimal number
dfm_result$factors           # Estimated factors (T x r)
dfm_result$loadings          # Factor loadings (p x r)
dfm_result$variance_explained
dfm_result$var_coefficients  # VAR transition matrix
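
Estimated factors are identified only up to sign and rotation, so they are best compared with known factors (here, the simulated ones from the Quick Start) via correlations:

# Correlate estimated factors with the simulated factors from the Quick Start
# (columns may be re-ordered, rotated, or sign-flipped)
round(cor(dfm_result$factors, factors), 2)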

Unit Root Testing

Test Battery

ur_result <- test_unit_root(
  x = Y,
  tests = c("adf", "ers", "kpss", "pp")
)

# Synthesized conclusion
ur_result$synthesis
# $conclusion: "stationary", "trend_stationary", "difference_stationary", or "inconclusive"
# $confidence: "high", "medium", "low"
# $evidence: Detailed reasoning

# Individual test results
ur_result$tests$adf_none
ur_result$tests$ers_dfgls
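
The synthesis exploits the opposing null hypotheses of ADF (null: unit root) and KPSS (null: stationarity). A sketch of the combination logic, using hypothetical field names (the package's actual test objects and rules may differ):

# Hypothetical field names, for illustration of the decision logic only
adf_rejects  <- ur_result$tests$adf_none$p_value < 0.05    # rejects unit root
kpss_rejects <- ur_result$tests$kpss_level$p_value < 0.05  # rejects stationarity
conclusion <- if (adf_rejects && !kpss_rejects) {
  "stationary"             # tests agree: no unit root
} else if (!adf_rejects && kpss_rejects) {
  "difference_stationary"  # tests agree: unit root
} else {
  "inconclusive"           # tests disagree, or both fail to reject
}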

Interpretation Framework

Conclusion              Meaning                             Implications
----------              -------                             ------------
Stationary              Mean-reverting, transient shocks    Use levels in regression
Trend-stationary        Deterministic trend + stationary    Detrend before analysis
Difference-stationary   Stochastic trend (unit root)        First-difference or model cointegration
Inconclusive            Mixed evidence                      Use theory, consider both
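
The conclusion feeds naturally into preprocessing; for example:

# Transform Y according to the synthesized conclusion (illustration)
y_lvl <- as.vector(Y)
y_use <- switch(ur_result$synthesis$conclusion,
  difference_stationary = diff(y_lvl),
  trend_stationary      = residuals(lm(y_lvl ~ seq_along(y_lvl))),
  y_lvl   # "stationary" or "inconclusive": keep levels
)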

The Master Function

signal_analysis() orchestrates all methods:

result <- signal_analysis(
  data = data,
  y_formula = "Y",              # Or formula: Y ~ X1 + X2 + X3
  time_var = NULL,              # Time variable name
  group_var = NULL,             # Panel group variable
  methods = "all",              # Or c("wavelet", "horseshoe", "pca")
  
  # Method-specific configuration
  filter_config = list(
    wavelet_filter = "la8",
    wavelet_levels = c(3, 4),
    hpgc_n_chains = 4
  ),
  
  horseshoe_config = list(
    p0 = NULL,                  # Auto-calibrate
    n_chains = 4,
    adapt_delta = 0.95,
    kappa_threshold = 0.5
  ),
  
  pca_config = list(
    rotation = "none",
    n_boot = 1000
  ),
  
  dfm_config = list(
    ic_criterion = "bai_ng_2"
  ),
  
  # General options
  na_action = "interpolate",    # Or "omit", "fail"
  standardize = TRUE,
  first_difference = FALSE,
  verbose = TRUE,
  seed = 42
)

# Access components
result$filters$wavelet
result$horseshoe
result$pca
result$dfm
result$unitroot
result$interpretation

Automated Interpretation

The interpretation component synthesizes results:

result$interpretation$signal_characteristics
# $smoothness: Variance of second differences
# $smoothness_interpretation: "Very smooth", "Moderately volatile", etc.
# $trend_share: Proportion of variance from trend

result$interpretation$variable_selection
# $sparsity_ratio: Proportion shrunk to zero
# $n_selected: Number of selected predictors
# $top_predictors: Data frame of top variables

result$interpretation$factor_structure
# $pc1_entropy: Shannon entropy of PC1 loadings
# $topology_interpretation: "Concentrated", "Diffuse", etc.

result$interpretation$persistence
# $conclusion: Stationarity type
# $interpretation: Plain-language description

result$interpretation$overall_summary
# Combined narrative synthesis
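
The smoothness statistic is simple to reproduce by hand:

# Smoothness as defined above: variance of second differences of the series
var(diff(as.vector(Y), differences = 2))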

Best Practices

1. Start with Exploration

# Visualize the data first
plot(Y, type = "l")
acf(Y)
pacf(Y)

# Check for obvious non-stationarity
ur_quick <- test_unit_root(Y, tests = "adf")

2. Choose Methods Thoughtfully

Not every analysis needs every method:

# For trend extraction only
result <- signal_analysis(data, "Y", methods = c("wavelet", "hpgc"))

# For variable selection only
result <- signal_analysis(data, "Y", methods = "horseshoe")

# For factor analysis only
result <- signal_analysis(data, "Y", methods = c("pca", "dfm"))

3. Validate with Domain Knowledge

Statistical significance ≠ theoretical validity: a predictor that survives shrinkage should still make substantive sense, and a theoretically central variable should not be dropped on statistical evidence alone.

4. Handle Diagnostics

# Check Horseshoe convergence
result$horseshoe$diagnostics
# Look for: n_divergent = 0, max_rhat < 1.01, ESS > 400

# Check LOO-CV
result$horseshoe$loo
# Pareto k > 0.7 indicates problematic observations
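
If $loo is a standard loo-package object (an assumption), problematic observations can be counted directly:

# Count observations with unreliable importance weights (assumes loo object)
sum(loo::pareto_k_values(result$horseshoe$loo) > 0.7)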

Computational Considerations

Memory and Time

Method      Complexity                   Memory    Time (n=1000, p=100)
------      ----------                   ------    --------------------
Wavelets    O(n log n)                   Low       < 1 sec
EMD         O(n × IMFs)                  Low       1-5 sec
HP-GC       O(n × iter × chains)         Medium    30-60 sec
Horseshoe   O(n × p × iter × chains)     High      1-5 min
PCA         O(n × p²)                    Medium    < 1 sec
DFM         O(n × p × factors)           Medium    1-10 sec

Parallelization

Stan-based methods (Horseshoe, HP-GC) automatically parallelize across chains.

Memory Management

For large datasets:

# Use fewer bootstrap replications
pca_config = list(n_boot = 500)

# Use fewer MCMC iterations
horseshoe_config = list(n_iter = 1000, n_warmup = 500)

# Analyze subsets
result <- signal_analysis(data[1:500, ], ...)

References

  1. Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 5018-5051.

  2. Bai, J., & Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. Econometrica, 70(1), 191-221.

  3. Grant, A. L., & Chan, J. C. C. (2017). A Bayesian Model Comparison for Trend-Cycle Decompositions of Output. Journal of Money, Credit and Banking, 49(2-3), 525-552.

  4. Percival, D. B., & Walden, A. T. (2000). Wavelet Methods for Time Series Analysis. Cambridge University Press.

  5. Huang, N. E., et al. (1998). The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society A, 454, 903-995.