--- title: "SignalY: Comprehensive Signal Extraction from Panel Data" author: "José Mauricio Gómez Julián" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{SignalY: Comprehensive Signal Extraction from Panel Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Introduction The **SignalY** package provides a comprehensive framework for signal extraction from panel data, integrating multiple complementary methodologies: 1. **Spectral Decomposition**: Wavelets, Empirical Mode Decomposition (EMD), and Bayesian HP filters 2. **Bayesian Variable Selection**: Regularized Horseshoe regression for sparse signal identification 3. **Dimensionality Reduction**: PCA with bootstrap significance testing and Dynamic Factor Models 4. **Stationarity Testing**: Integrated battery of unit root tests The package distinguishes between **latent structure** (the underlying data-generating process) and **phenomenological dynamics** (observed variability), providing automated technical interpretation of results. ## Installation ```{r install, eval=FALSE} # Install from CRAN (when available) install.packages("SignalY") # Install development version from GitHub # remotes::install_github("username/SignalY") # For Bayesian methods (Horseshoe, HP-GC), install cmdstanr: install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos"))) cmdstanr::install_cmdstan() ``` ## Quick Start ```{r quickstart, eval=FALSE} library(SignalY) # Generate example data set.seed(42) n <- 100 p <- 20 # Create correlated predictors with factor structure factors <- matrix(rnorm(n * 2), n, 2) loadings <- matrix(runif(p * 2, -1, 1), p, 2) X <- factors %*% t(loadings) + matrix(rnorm(n * p, 0, 0.5), n, p) colnames(X) <- paste0("X", 1:p) # True signal depends on 5 predictors true_beta <- c(rep(1, 5), rep(0, 15)) Y <- X %*% true_beta + rnorm(n, 0, 0.5) # Combine into data frame data <- data.frame(Y = Y, X) # Run comprehensive analysis result <- signal_analysis( data = data, y_formula = "Y", methods = "all", verbose = TRUE ) # View results print(result) summary(result) plot(result) ``` ## Methodological Framework ### Epistemological Foundation SignalY operationalizes the distinction between **latent structure** and **phenomenological dynamics**: - **Latent Structure**: The underlying generative processes that produce observable patterns - **Phenomenological Dynamics**: The surface-level variability we measure This distinction maps onto statistical concepts: | Concept | Statistical Implementation | |---------|---------------------------| | Signal | Trend component, common factors | | Noise | Idiosyncratic shocks, measurement error | | Sparsity | Few variables carry information | | Persistence | Unit roots, long memory | ### Method Selection Guidance | Question | Recommended Method | |----------|-------------------| | What are the dominant frequencies? | Wavelets, EMD | | Which predictors matter? | Horseshoe regression | | Is there common factor structure? | PCA, DFM | | Is the series stationary? | Unit root battery | | What's the trend vs cycle? 
## Spectral Decomposition

### Wavelet Multi-Resolution Analysis

```{r wavelet, eval=FALSE}
# Apply wavelet decomposition
Y <- sin(seq(0, 4 * pi, length.out = 128)) + rnorm(128, 0, 0.3)

wav_result <- filter_wavelet(
  x          = Y,
  filter     = "la8",    # Daubechies least asymmetric filter of length 8
  levels     = c(3, 4),  # Combine D3 + D4 (business-cycle frequencies)
  first_diff = FALSE
)

# Examine results
names(wav_result)
# [1] "combined" "smooth" "details" "level_variance" "filter" "levels"

# The combined signal captures 8-32 period cycles (for annual data)
plot(wav_result$combined, type = "l", main = "Wavelet-Filtered Signal")
```

The choice of wavelet filter matters:

- `"la8"`: least asymmetric Daubechies filter of length 8 (recommended for economic data)
- `"d4"`: Daubechies 4, more compact support
- `"haar"`: the simplest filter, suited to discontinuous signals

Level selection corresponds to periodicities: detail level $j$ captures cycles of roughly $2^j$ to $2^{j+1}$ observations.

| Level | Period Range (annual data) |
|-------|----------------------------|
| D1    | 2-4 years   |
| D2    | 4-8 years   |
| D3    | 8-16 years  |
| D4    | 16-32 years |
| D5    | 32-64 years |

### Empirical Mode Decomposition

EMD is **data-adaptive**, requiring no basis function specification:

```{r emd, eval=FALSE}
emd_result <- filter_emd(
  x        = Y,
  max_imf  = 10,
  boundary = "wave"
)

# IMFs are ordered from highest to lowest frequency
matplot(emd_result$imf[, 1:3], type = "l",
        main = "First 3 Intrinsic Mode Functions")

# The residue captures the trend
plot(emd_result$residue, type = "l", main = "EMD Residue (Trend)")
```

### Bayesian HP-GC Filter

The Grant-Chan Bayesian HP filter provides:

- Posterior distributions for trend and cycle
- Principled uncertainty quantification
- Model comparison via DIC

```{r hpgc, eval=FALSE}
# Requires cmdstanr
hpgc_result <- filter_hpgc(
  x        = Y,
  lambda   = NULL,  # Auto-select via DIC
  n_chains = 4,
  n_iter   = 2000
)

# Compare to the standard HP filter
plot(Y, type = "l", col = "gray")
lines(hpgc_result$trend, col = "blue", lwd = 2)
legend("topright", c("Original", "Bayesian Trend"),
       col = c("gray", "blue"), lwd = c(1, 2))
```

## Bayesian Variable Selection

### Regularized Horseshoe Prior

The Horseshoe prior provides:

- **Adaptive shrinkage**: large signals are preserved, small signals are shrunk
- **Sparsity**: most coefficients are pushed toward zero
- **Regularization**: the slab component prevents unbounded estimates

```{r horseshoe, eval=FALSE}
# Fit the Horseshoe regression
hs_fit <- fit_horseshoe(
  y           = Y,
  X           = X,
  p0          = 5,     # Expected number of non-zeros (NULL for auto)
  n_chains    = 4,
  n_iter      = 2000,
  n_warmup    = 1000,
  adapt_delta = 0.95,  # Target acceptance (increase if divergences occur)
  use_qr      = TRUE   # QR decomposition to handle multicollinearity
)

# Examine shrinkage
print(hs_fit)
# Key outputs:
# - beta:  Coefficient estimates
# - kappa: Shrinkage factors (0 = no shrinkage, 1 = full shrinkage)
# - m_eff: Effective number of non-zeros

# Select variables by shrinkage threshold
selection <- select_by_shrinkage(hs_fit, threshold = 0.5)
which(selection$selected)  # Indices of the selected predictors
```

### Interpreting Shrinkage Factors

The shrinkage factor $\kappa_j$ quantifies how strongly coefficient $j$ is pulled toward zero:

$$\kappa_j = \frac{1}{1 + \tau^2 \lambda_j^2 / c^2}$$

| $\kappa$ | Interpretation |
|----------|----------------|
| 0.0-0.1  | Strong signal, minimal shrinkage |
| 0.1-0.3  | Moderate signal |
| 0.3-0.5  | Weak signal |
| 0.5-0.7  | Likely noise |
| 0.7-1.0  | Strong shrinkage, essentially zero |
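The formula can be evaluated directly. The sketch below plugs illustrative values of the global scale $\tau$, local scales $\lambda_j$, and slab scale $c$ into the expression (the numbers are hypothetical, not output from a fitted model) to reproduce the ranges in the table:

```{r kappa-sketch, eval=FALSE}
# Illustrative only: hypothetical values for tau, lambda_j, and c^2,
# not quantities extracted from a fitted SignalY model.
tau    <- 0.1                             # global scale
c2     <- 4                               # slab variance c^2
lambda <- c(0.1, 30, 300)                 # small, moderate, large local scales
kappa  <- 1 / (1 + tau^2 * lambda^2 / c2)
round(kappa, 3)
# ~1.000 -> essentially zero; ~0.308 -> weak signal; ~0.004 -> strong signal
```

The local scale $\lambda_j$ does the work: a coefficient with a large local scale escapes shrinkage almost entirely, while one with a small local scale is shrunk to zero regardless of the global scale.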
rotation = "varimax", # Or "none", "oblimin" n_boot = 1000, block_length = NULL, # Auto: sqrt(T) significance_level = 0.05 ) # Key outputs pca_result$variance_explained # Proportion by component pca_result$loadings # Variable loadings pca_result$entropy # Entropy of loadings (concentration measure) pca_result$loadings_significant # Bootstrap significance # Interpretation # Low entropy = concentrated loadings (few variables dominate) # High entropy = diffuse loadings (many variables contribute) ``` ### Dynamic Factor Models DFMs capture dynamic common factor structure: ```{r dfm, eval=FALSE} dfm_result <- estimate_dfm( X = X, n_factors = NULL, # Auto-select via Bai-Ng IC max_factors = 10, var_lags = 1, # VAR lags for factor dynamics ic_criterion = "bai_ng_2" ) # Key outputs dfm_result$n_factors # Optimal number dfm_result$factors # Estimated factors (T x r) dfm_result$loadings # Factor loadings (p x r) dfm_result$variance_explained dfm_result$var_coefficients # VAR transition matrix ``` ## Unit Root Testing ### Test Battery ```{r unitroot, eval=FALSE} ur_result <- test_unit_root( x = Y, tests = c("adf", "ers", "kpss", "pp") ) # Synthesized conclusion ur_result$synthesis # $conclusion: "stationary", "trend_stationary", "difference_stationary", or "inconclusive" # $confidence: "high", "medium", "low" # $evidence: Detailed reasoning # Individual test results ur_result$tests$adf_none ur_result$tests$ers_dfgls ``` ### Interpretation Framework | Conclusion | Meaning | Implications | |------------|---------|-------------| | Stationary | Mean-reverting, transient shocks | Use levels in regression | | Trend-stationary | Deterministic trend + stationary | Detrend before analysis | | Difference-stationary | Stochastic trend (unit root) | First-difference or cointegration | | Inconclusive | Mixed evidence | Use theory, consider both | ## The Master Function `signal_analysis()` orchestrates all methods: ```{r master, eval=FALSE} result <- signal_analysis( data = data, y_formula = "Y", # Or formula: Y ~ X1 + X2 + X3 time_var = NULL, # Time variable name group_var = NULL, # Panel group variable methods = "all", # Or c("wavelet", "horseshoe", "pca") # Method-specific configuration filter_config = list( wavelet_filter = "la8", wavelet_levels = c(3, 4), hpgc_n_chains = 4 ), horseshoe_config = list( p0 = NULL, # Auto-calibrate n_chains = 4, adapt_delta = 0.95, kappa_threshold = 0.5 ), pca_config = list( rotation = "none", n_boot = 1000 ), dfm_config = list( ic_criterion = "bai_ng_2" ), # General options na_action = "interpolate", # Or "omit", "fail" standardize = TRUE, first_difference = FALSE, verbose = TRUE, seed = 42 ) # Access components result$filters$wavelet result$horseshoe result$pca result$dfm result$unitroot result$interpretation ``` ### Automated Interpretation The interpretation component synthesizes results: ```{r interp, eval=FALSE} result$interpretation$signal_characteristics # $smoothness: Variance of second differences # $smoothness_interpretation: "Very smooth", "Moderately volatile", etc. # $trend_share: Proportion of variance from trend result$interpretation$variable_selection # $sparsity_ratio: Proportion shrunk to zero # $n_selected: Number of selected predictors # $top_predictors: Data frame of top variables result$interpretation$factor_structure # $pc1_entropy: Shannon entropy of PC1 loadings # $topology_interpretation: "Concentrated", "Diffuse", etc. 
### Dynamic Factor Models

DFMs capture dynamic common factor structure:

```{r dfm, eval=FALSE}
dfm_result <- estimate_dfm(
  X            = X,
  n_factors    = NULL,  # Auto-select via Bai-Ng IC
  max_factors  = 10,
  var_lags     = 1,     # VAR lags for the factor dynamics
  ic_criterion = "bai_ng_2"
)

# Key outputs
dfm_result$n_factors          # Optimal number of factors
dfm_result$factors            # Estimated factors (T x r)
dfm_result$loadings           # Factor loadings (p x r)
dfm_result$variance_explained
dfm_result$var_coefficients   # VAR transition matrix
```

## Unit Root Testing

### Test Battery

```{r unitroot, eval=FALSE}
ur_result <- test_unit_root(
  x     = Y,
  tests = c("adf", "ers", "kpss", "pp")
)

# Synthesized conclusion
ur_result$synthesis
# $conclusion: "stationary", "trend_stationary", "difference_stationary",
#              or "inconclusive"
# $confidence: "high", "medium", "low"
# $evidence:   Detailed reasoning

# Individual test results
ur_result$tests$adf_none
ur_result$tests$ers_dfgls
```

### Interpretation Framework

| Conclusion | Meaning | Implications |
|------------|---------|--------------|
| Stationary | Mean-reverting, transient shocks | Use levels in regression |
| Trend-stationary | Deterministic trend + stationary fluctuations | Detrend before analysis |
| Difference-stationary | Stochastic trend (unit root) | First-difference or model cointegration |
| Inconclusive | Mixed evidence | Use theory; consider both specifications |

## The Master Function

`signal_analysis()` orchestrates all methods:

```{r master, eval=FALSE}
result <- signal_analysis(
  data      = data,
  y_formula = "Y",    # Or a formula: Y ~ X1 + X2 + X3
  time_var  = NULL,   # Time variable name
  group_var = NULL,   # Panel group variable
  methods   = "all",  # Or c("wavelet", "horseshoe", "pca")

  # Method-specific configuration
  filter_config = list(
    wavelet_filter = "la8",
    wavelet_levels = c(3, 4),
    hpgc_n_chains  = 4
  ),
  horseshoe_config = list(
    p0              = NULL,  # Auto-calibrate
    n_chains        = 4,
    adapt_delta     = 0.95,
    kappa_threshold = 0.5
  ),
  pca_config = list(
    rotation = "none",
    n_boot   = 1000
  ),
  dfm_config = list(
    ic_criterion = "bai_ng_2"
  ),

  # General options
  na_action        = "interpolate",  # Or "omit", "fail"
  standardize      = TRUE,
  first_difference = FALSE,
  verbose          = TRUE,
  seed             = 42
)

# Access components
result$filters$wavelet
result$horseshoe
result$pca
result$dfm
result$unitroot
result$interpretation
```

### Automated Interpretation

The interpretation component synthesizes the results:

```{r interp, eval=FALSE}
result$interpretation$signal_characteristics
# $smoothness:                Variance of second differences
# $smoothness_interpretation: "Very smooth", "Moderately volatile", etc.
# $trend_share:               Proportion of variance from the trend

result$interpretation$variable_selection
# $sparsity_ratio: Proportion of coefficients shrunk to zero
# $n_selected:     Number of selected predictors
# $top_predictors: Data frame of the top variables

result$interpretation$factor_structure
# $pc1_entropy:             Shannon entropy of the PC1 loadings
# $topology_interpretation: "Concentrated", "Diffuse", etc.

result$interpretation$persistence
# $conclusion:     Stationarity type
# $interpretation: Plain-language description

result$interpretation$overall_summary
# Combined narrative synthesis
```

## Best Practices

### 1. Start with Exploration

```{r explore, eval=FALSE}
# Visualize the data first
plot(Y, type = "l")
acf(Y)
pacf(Y)

# Check for obvious non-stationarity
ur_quick <- test_unit_root(Y, tests = "adf")
```

### 2. Choose Methods Thoughtfully

Not every analysis needs every method:

```{r selective, eval=FALSE}
# For trend extraction only
result <- signal_analysis(data, "Y", methods = c("wavelet", "hpgc"))

# For variable selection only
result <- signal_analysis(data, "Y", methods = "horseshoe")

# For factor analysis only
result <- signal_analysis(data, "Y", methods = c("pca", "dfm"))
```

### 3. Validate with Domain Knowledge

Statistical significance ≠ theoretical validity:

- SignalY provides **methodology**, not **theory**
- Variable selection criteria must be justified by domain knowledge
- Significant loadings require theoretical interpretation

### 4. Handle Diagnostics

```{r diag, eval=FALSE}
# Check Horseshoe convergence
result$horseshoe$diagnostics
# Look for: n_divergent = 0, max_rhat < 1.01, ESS > 400

# Check LOO-CV
result$horseshoe$loo
# Pareto k > 0.7 indicates problematic observations
```

## Computational Considerations

### Memory and Time

| Method    | Complexity               | Memory | Time (n = 1000, p = 100) |
|-----------|--------------------------|--------|--------------------------|
| Wavelets  | O(n log n)               | Low    | < 1 sec   |
| EMD       | O(n × IMFs)              | Low    | 1-5 sec   |
| HP-GC     | O(n × iter × chains)     | Medium | 30-60 sec |
| Horseshoe | O(n × p × iter × chains) | High   | 1-5 min   |
| PCA       | O(n × p²)                | Medium | < 1 sec   |
| DFM       | O(n × p × factors)       | Medium | 1-10 sec  |

### Parallelization

Stan-based methods (Horseshoe, HP-GC) automatically parallelize across chains.

### Memory Management

For large datasets:

```{r memory, eval=FALSE}
# Use fewer bootstrap replications
pca_config = list(n_boot = 500)

# Use fewer MCMC iterations
horseshoe_config = list(n_iter = 1000, n_warmup = 500)

# Analyze subsets
result <- signal_analysis(data[1:500, ], ...)
```

## References

1. Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. *Electronic Journal of Statistics*, 11(2), 5018-5051.
2. Bai, J., & Ng, S. (2002). Determining the number of factors in approximate factor models. *Econometrica*, 70(1), 191-221.
3. Grant, A. L., & Chan, J. C. C. (2017). A Bayesian model comparison for trend-cycle decompositions of output. *Journal of Money, Credit and Banking*, 49(2-3), 525-552.
4. Percival, D. B., & Walden, A. T. (2000). *Wavelet Methods for Time Series Analysis*. Cambridge University Press.
5. Huang, N. E., et al. (1998). The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. *Proceedings of the Royal Society A*, 454, 903-995.