SignalY: Comprehensive Signal Extraction from Panel Data

José Mauricio Gómez Julián

2026-01-24

Introduction

The SignalY package provides a comprehensive framework for signal extraction from panel data, integrating multiple complementary methodologies:

  1. Spectral Decomposition: Wavelets, Empirical Mode Decomposition (EMD), and Bayesian HP filters
  2. Bayesian Variable Selection: Regularized Horseshoe regression for sparse signal identification
  3. Dimensionality Reduction: PCA with bootstrap significance testing and Dynamic Factor Models
  4. Stationarity Testing: Integrated battery of unit root tests

The package distinguishes between latent structure (the underlying data-generating process) and phenomenological dynamics (observed variability), providing automated technical interpretation of results.

Installation

# Install from CRAN (when available)
install.packages("SignalY")

# Install development version from GitHub
# remotes::install_github("username/SignalY")

# For Bayesian methods (Horseshoe, HP-GC), install cmdstanr:
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
cmdstanr::install_cmdstan()

Quick Start

library(SignalY)

# Generate example data
set.seed(42)
n <- 100
p <- 20

# Create correlated predictors with factor structure
factors <- matrix(rnorm(n * 2), n, 2)
loadings <- matrix(runif(p * 2, -1, 1), p, 2)
X <- factors %*% t(loadings) + matrix(rnorm(n * p, 0, 0.5), n, p)
colnames(X) <- paste0("X", 1:p)

# True signal depends on 5 predictors
true_beta <- c(rep(1, 5), rep(0, 15))
Y <- X %*% true_beta + rnorm(n, 0, 0.5)

# Combine into data frame
data <- data.frame(Y = Y, X)

# Run comprehensive analysis
result <- signal_analysis(
  data = data,
  y_formula = "Y",
  methods = "all",
  verbose = TRUE
)

# View results
print(result)
summary(result)
plot(result)

Methodological Framework

Epistemological Foundation

SignalY operationalizes the distinction between latent structure and phenomenological dynamics, which maps onto familiar statistical concepts:

Concept       Statistical Implementation
-------       --------------------------
Signal        Trend component, common factors
Noise         Idiosyncratic shocks, measurement error
Sparsity      Few variables carry information
Persistence   Unit roots, long memory
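
A toy series makes the mapping concrete; below, a deterministic trend plays the role of signal and i.i.d. shocks play the role of noise (illustration only, not package code):

# Illustration only: signal = latent trend, noise = idiosyncratic shocks
set.seed(1)
tt    <- 1:200
trend <- 0.05 * tt              # latent structure (signal)
noise <- rnorm(200)             # phenomenological variability (noise)
y     <- trend + noise
plot(y, type = "l", col = "gray", main = "Signal vs. Noise")
lines(trend, col = "blue", lwd = 2)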

Method Selection Guidance

Question                              Recommended Method
--------                              ------------------
What are the dominant frequencies?    Wavelets, EMD
Which predictors matter?              Horseshoe regression
Is there common factor structure?     PCA, DFM
Is the series stationary?             Unit root battery
What’s the trend vs cycle?            HP-GC filter

Spectral Decomposition

Wavelet Multi-Resolution Analysis

# Apply wavelet decomposition to a simple test series
# (named y_sig so it does not clobber Y from the Quick Start)
y_sig <- sin(seq(0, 4*pi, length.out = 128)) + rnorm(128, 0, 0.3)

wav_result <- filter_wavelet(
  x = y_sig,
  filter = "la8",      # Daubechies least asymmetric filter, length 8
  levels = c(3, 4),    # Combine D3 + D4 (business cycle frequencies)
  first_diff = FALSE
)

# Examine results
names(wav_result)
# [1] "combined" "smooth" "details" "level_variance" "filter" "levels"

# Combined D3+D4 signal captures cycles of 8-32 periods (years, for annual data)
plot(wav_result$combined, type = "l", main = "Wavelet-Filtered Signal")

The choice of wavelet filter involves a trade-off: longer filters such as "la8" have better frequency localization, while shorter ones such as "haar" suffer less boundary distortion.

Level selection corresponds to periodicities; level j captures cycles of roughly \(2^j\) to \(2^{j+1}\) observations:

Level   Period Range (annual data)
-----   --------------------------
D1      2-4 years
D2      4-8 years
D3      8-16 years
D4      16-32 years
D5      32-64 years
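
A small helper (hypothetical, not part of SignalY) makes the dyadic mapping explicit:

# Hypothetical helper: period range (in observations) captured by DWT level j
wavelet_period_range <- function(level) c(lower = 2^level, upper = 2^(level + 1))
wavelet_period_range(3)   # 8, 16 -> D3 spans 8-16 year cycles for annual data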

Empirical Mode Decomposition

EMD is data-adaptive, requiring no basis function specification:

emd_result <- filter_emd(
  x = y_sig,
  max_imf = 10,
  boundary = "wave"
)

# IMFs are ordered from highest to lowest frequency
matplot(emd_result$imf[, 1:3], type = "l", 
        main = "First 3 Intrinsic Mode Functions")

# Residue captures the trend
plot(emd_result$residue, type = "l", main = "EMD Residue (Trend)")
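
EMD is a complete decomposition: the IMFs plus the residue reconstruct the input exactly. A quick sanity check, assuming filter_emd() returns components on the original scale:

# Completeness check: sum of IMFs + residue should equal the input series
reconstructed <- rowSums(emd_result$imf) + emd_result$residue
all.equal(as.vector(y_sig), as.vector(reconstructed))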

Bayesian HP-GC Filter

The Grant-Chan formulation embeds the HP filter in a Bayesian state-space model, yielding posterior uncertainty for the trend and data-driven selection of the smoothing parameter:

# Requires cmdstanr
hpgc_result <- filter_hpgc(
  x = y_sig,
  lambda = NULL,  # Auto-select via DIC
  n_chains = 4,
  n_iter = 2000
)

# Compare to standard HP filter
plot(y_sig, type = "l", col = "gray")
lines(hpgc_result$trend, col = "blue", lwd = 2)
legend("topright", c("Original", "Bayesian Trend"), 
       col = c("gray", "blue"), lwd = c(1, 2))
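
The cycle component is simply the deviation from the estimated trend:

# Cycle = deviation from the Bayesian trend
cycle <- y_sig - hpgc_result$trend
plot(cycle, type = "l", main = "HP-GC Cycle Component")
abline(h = 0, lty = 2)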

Bayesian Variable Selection

Regularized Horseshoe Prior

The regularized Horseshoe prior shrinks noise coefficients aggressively toward zero while leaving genuine signals largely unshrunk, giving sparse estimates without a discrete search over candidate models:

# Fit Horseshoe regression
hs_fit <- fit_horseshoe(
  y = Y,
  X = X,
  p0 = 5,               # Expected non-zeros (can be NULL for auto)
  n_chains = 4,
  n_iter = 2000,
  n_warmup = 1000,
  adapt_delta = 0.95,   # Target acceptance (increase if divergences)
  use_qr = TRUE         # QR decomposition for multicollinearity
)

# Examine shrinkage
print(hs_fit)

# Key outputs:
# - beta: Coefficient estimates
# - kappa: Shrinkage factors (0 = no shrinkage, 1 = full shrinkage)
# - m_eff: Effective number of non-zeros

# Select variables by shrinkage threshold
selection <- select_by_shrinkage(hs_fit, threshold = 0.5)
which(selection$selected)  # Indices of selected predictors
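
Because the Quick Start data were simulated, selection can be checked against the known truth:

# Cross-tabulate selection against the simulated truth (first 5 betas non-zero)
table(selected = selection$selected, truly_nonzero = true_beta != 0)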

Interpreting Shrinkage Factors

The shrinkage factor \(\kappa_j\) quantifies how much coefficient \(j\) is pulled toward zero:

\[\kappa_j = \frac{1}{1 + \tau^2 \lambda_j^2 / c^2}\]

\(\kappa\)   Interpretation
----------   --------------
0.0-0.1      Strong signal, minimal shrinkage
0.1-0.3      Moderate signal
0.3-0.5      Weak signal
0.5-0.7      Likely noise
0.7-1.0      Strong shrinkage, essentially zero
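
A quick numerical illustration of this mapping, with \(\tau\) and \(c\) fixed at arbitrary values for display:

# Shrinkage factor as a function of the local scale lambda_j (illustration)
kappa <- function(lambda, tau = 0.1, c = 2) 1 / (1 + tau^2 * lambda^2 / c^2)
curve(kappa(x), from = 0, to = 100,
      xlab = expression(lambda[j]), ylab = expression(kappa[j]))
# Small lambda_j -> kappa near 1 (shrunk to zero); large lambda_j -> kappa near 0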

Dimensionality Reduction

PCA with Bootstrap

pca_result <- pca_bootstrap(
  X = X,
  n_components = NULL,      # Auto-select
  rotation = "varimax",     # Or "none", "oblimin"
  n_boot = 1000,
  block_length = NULL,      # Auto: sqrt(T)
  significance_level = 0.05
)

# Key outputs
pca_result$variance_explained  # Proportion by component
pca_result$loadings            # Variable loadings
pca_result$entropy             # Entropy of loadings (concentration measure)
pca_result$loadings_significant  # Bootstrap significance

# Interpretation
# Low entropy = concentrated loadings (few variables dominate)
# High entropy = diffuse loadings (many variables contribute)
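
The entropy measure can be reproduced by hand; a sketch assuming it is the Shannon entropy of the squared, normalized loadings (the package's exact definition may differ):

# Sketch: Shannon entropy of squared, normalized PC1 loadings
w <- pca_result$loadings[, 1]^2
w <- w / sum(w)
-sum(w * log(w))   # low = concentrated, high = diffuse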

Dynamic Factor Models

DFMs capture dynamic common factor structure:

dfm_result <- estimate_dfm(
  X = X,
  n_factors = NULL,         # Auto-select via Bai-Ng IC
  max_factors = 10,
  var_lags = 1,             # VAR lags for factor dynamics
  ic_criterion = "bai_ng_2"
)

# Key outputs
dfm_result$n_factors         # Optimal number
dfm_result$factors           # Estimated factors (T x r)
dfm_result$loadings          # Factor loadings (p x r)
dfm_result$variance_explained
dfm_result$var_coefficients  # VAR transition matrix
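
Estimated factors are identified only up to sign and rotation, so they are best compared with known factors (here, the simulated ones from the Quick Start) via correlations:

# Correlate estimated factors with the simulated factors from the Quick Start
# (columns may be re-ordered, rotated, or sign-flipped)
round(cor(dfm_result$factors, factors), 2)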

Unit Root Testing

Test Battery

ur_result <- test_unit_root(
  x = Y,
  tests = c("adf", "ers", "kpss", "pp")
)

# Synthesized conclusion
ur_result$synthesis
# $conclusion: "stationary", "trend_stationary", "difference_stationary", or "inconclusive"
# $confidence: "high", "medium", "low"
# $evidence: Detailed reasoning

# Individual test results
ur_result$tests$adf_none
ur_result$tests$ers_dfgls
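
The synthesis exploits the opposing null hypotheses of ADF (null: unit root) and KPSS (null: stationarity). A sketch of the combination logic, using hypothetical field names (the package's actual test objects and rules may differ):

# Hypothetical field names, for illustration of the decision logic only
adf_rejects  <- ur_result$tests$adf_none$p_value < 0.05    # rejects unit root
kpss_rejects <- ur_result$tests$kpss_level$p_value < 0.05  # rejects stationarity
conclusion <- if (adf_rejects && !kpss_rejects) {
  "stationary"             # tests agree: no unit root
} else if (!adf_rejects && kpss_rejects) {
  "difference_stationary"  # tests agree: unit root
} else {
  "inconclusive"           # tests disagree, or both fail to reject
}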

Interpretation Framework

Conclusion              Meaning                             Implications
----------              -------                             ------------
Stationary              Mean-reverting, transient shocks    Use levels in regression
Trend-stationary        Deterministic trend + stationary    Detrend before analysis
Difference-stationary   Stochastic trend (unit root)        First-difference or model cointegration
Inconclusive            Mixed evidence                      Use theory, consider both
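
The conclusion feeds naturally into preprocessing; for example:

# Transform Y according to the synthesized conclusion (illustration)
y_lvl <- as.vector(Y)
y_use <- switch(ur_result$synthesis$conclusion,
  difference_stationary = diff(y_lvl),
  trend_stationary      = residuals(lm(y_lvl ~ seq_along(y_lvl))),
  y_lvl   # "stationary" or "inconclusive": keep levels
)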

The Master Function

signal_analysis() orchestrates all methods:

result <- signal_analysis(
  data = data,
  y_formula = "Y",              # Or formula: Y ~ X1 + X2 + X3
  time_var = NULL,              # Time variable name
  group_var = NULL,             # Panel group variable
  methods = "all",              # Or c("wavelet", "horseshoe", "pca")
  
  # Method-specific configuration
  filter_config = list(
    wavelet_filter = "la8",
    wavelet_levels = c(3, 4),
    hpgc_n_chains = 4
  ),
  
  horseshoe_config = list(
    p0 = NULL,                  # Auto-calibrate
    n_chains = 4,
    adapt_delta = 0.95,
    kappa_threshold = 0.5
  ),
  
  pca_config = list(
    rotation = "none",
    n_boot = 1000
  ),
  
  dfm_config = list(
    ic_criterion = "bai_ng_2"
  ),
  
  # General options
  na_action = "interpolate",    # Or "omit", "fail"
  standardize = TRUE,
  first_difference = FALSE,
  verbose = TRUE,
  seed = 42
)

# Access components
result$filters$wavelet
result$horseshoe
result$pca
result$dfm
result$unitroot
result$interpretation

Automated Interpretation

The interpretation component synthesizes results:

result$interpretation$signal_characteristics
# $smoothness: Variance of second differences
# $smoothness_interpretation: "Very smooth", "Moderately volatile", etc.
# $trend_share: Proportion of variance from trend

result$interpretation$variable_selection
# $sparsity_ratio: Proportion shrunk to zero
# $n_selected: Number of selected predictors
# $top_predictors: Data frame of top variables

result$interpretation$factor_structure
# $pc1_entropy: Shannon entropy of PC1 loadings
# $topology_interpretation: "Concentrated", "Diffuse", etc.

result$interpretation$persistence
# $conclusion: Stationarity type
# $interpretation: Plain-language description

result$interpretation$overall_summary
# Combined narrative synthesis
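
The smoothness statistic is simple to reproduce by hand:

# Smoothness as defined above: variance of second differences of the series
var(diff(as.vector(Y), differences = 2))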

Best Practices

1. Start with Exploration

# Visualize the data first
plot(Y, type = "l")
acf(Y)
pacf(Y)

# Check for obvious non-stationarity
ur_quick <- test_unit_root(Y, tests = "adf")

2. Choose Methods Thoughtfully

Not every analysis needs every method:

# For trend extraction only
result <- signal_analysis(data, "Y", methods = c("wavelet", "hpgc"))

# For variable selection only
result <- signal_analysis(data, "Y", methods = "horseshoe")

# For factor analysis only
result <- signal_analysis(data, "Y", methods = c("pca", "dfm"))

3. Validate with Domain Knowledge

Statistical significance ≠ theoretical validity: a predictor that survives shrinkage should still make substantive sense, and a theoretically central variable should not be dropped on statistical evidence alone.

4. Handle Diagnostics

# Check Horseshoe convergence
result$horseshoe$diagnostics
# Look for: n_divergent = 0, max_rhat < 1.01, ESS > 400

# Check LOO-CV
result$horseshoe$loo
# Pareto k > 0.7 indicates problematic observations
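
If $loo is a standard loo-package object (an assumption), problematic observations can be counted directly:

# Count observations with unreliable importance weights (assumes loo object)
sum(loo::pareto_k_values(result$horseshoe$loo) > 0.7)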

Computational Considerations

Memory and Time

Method      Complexity                   Memory    Time (n=1000, p=100)
------      ----------                   ------    --------------------
Wavelets    O(n log n)                   Low       < 1 sec
EMD         O(n × IMFs)                  Low       1-5 sec
HP-GC       O(n × iter × chains)         Medium    30-60 sec
Horseshoe   O(n × p × iter × chains)     High      1-5 min
PCA         O(n × p²)                    Medium    < 1 sec
DFM         O(n × p × factors)           Medium    1-10 sec

Parallelization

Stan-based methods (Horseshoe, HP-GC) automatically parallelize across chains.

Memory Management

For large datasets:

# Use fewer bootstrap replications
pca_config = list(n_boot = 500)

# Use fewer MCMC iterations
horseshoe_config = list(n_iter = 1000, n_warmup = 500)

# Analyze subsets
result <- signal_analysis(data[1:500, ], ...)

References

  1. Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 5018-5051.

  2. Bai, J., & Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. Econometrica, 70(1), 191-221.

  3. Grant, A. L., & Chan, J. C. C. (2017). A Bayesian Model Comparison for Trend-Cycle Decompositions of Output. Journal of Money, Credit and Banking, 49(2-3), 525-552.

  4. Percival, D. B., & Walden, A. T. (2000). Wavelet Methods for Time Series Analysis. Cambridge University Press.

  5. Huang, N. E., et al. (1998). The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society A, 454, 903-995.