SignalY: Technical Documentation

Comprehensive Signal Extraction from Panel Data

José Mauricio Gómez Julián

24 January 2026 - 13:18

Abstract

This document provides comprehensive technical documentation for the SignalY R package, a unified framework for signal extraction from panel data. The package integrates spectral decomposition methods (wavelets, EMD, Bayesian HP filters), Bayesian sparse regression (regularized Horseshoe), dimensionality reduction (PCA, Dynamic Factor Models), and stationarity testing into a coherent analytical pipeline. We detail the mathematical foundations, algorithmic implementations, and interpretation frameworks underlying each methodology.


Introduction and Epistemological Framework

Motivation

The analysis of multivariate time series data—particularly panel data with cross-sectional and temporal dimensions—presents fundamental challenges:

  1. Dimensionality: Many candidate variables may influence the signal of interest
  2. Frequency mixing: Observed dynamics conflate trends, cycles, and noise
  3. Sparsity: Few variables may carry genuine predictive information
  4. Persistence: Non-stationarity complicates inference

SignalY addresses these challenges through a unified pipeline distinguishing latent structure from phenomenological dynamics.

Conceptual Mapping

Definition 1.1 (Latent Structure). The underlying data-generating process \(\mathcal{M}\) that produces the observable patterns in the data.

Definition 1.2 (Phenomenological Dynamics). Surface-level variability \(Y_{\text{obs}} = f(\mathcal{M}) + \varepsilon\), where \(\varepsilon\) captures idiosyncratic shocks and measurement error.

The methodological components of SignalY operationalize this distinction:

Method           Extracts              Distinguishes From
Wavelets/EMD     Trend, cycles         High-frequency noise
Horseshoe        Relevant predictors   Noise variables
PCA/DFM          Common factors        Idiosyncratic variation
Unit root tests  Persistence type      Stationarity properties

Spectral Decomposition Methods

Wavelet Multi-Resolution Analysis

Theoretical Foundation

The Discrete Wavelet Transform (DWT) decomposes a signal into frequency bands via dilations and translations of mother wavelet \(\psi(t)\) and scaling function \(\phi(t)\):

\[ \psi_{j,k}(t) = 2^{-j/2} \psi(2^{-j} t - k) \]

\[ \phi_{j,k}(t) = 2^{-j/2} \phi(2^{-j} t - k) \]

so that larger \(j\) corresponds to coarser scales, consistent with the scale-period correspondence given below.

For a signal \(x(t)\) of length \(N = 2^J\):

\[ x(t) = \sum_k s_{J,k} \phi_{J,k}(t) + \sum_{j=1}^{J} \sum_k d_{j,k} \psi_{j,k}(t) \]

where \(s_{J,k}\) are smooth coefficients and \(d_{j,k}\) are detail coefficients at scale \(j\).

MODWT Implementation

SignalY uses the Maximal Overlap Discrete Wavelet Transform (MODWT), which:

  • Is shift-invariant (unlike DWT)
  • Works for arbitrary sample sizes (not just powers of 2)
  • Produces aligned multi-resolution analysis (MRA)

Algorithm 1: Wavelet Multi-Resolution Analysis

Require: Signal x, filter type, levels to combine L
1: Apply MODWT: (W₁, ..., W_J, V_J) ← MODWT(x)
2: Compute MRA details: D_j ← MRA(W_j) for j = 1, ..., J
3: Compute smooth: S_J ← MRA(V_J)
4: Combine selected levels: C ← Σ_{j∈L} D_j
5: return (C, S_J, {D_j})

Filter Selection

The package supports several wavelet filters, including the Daubechies least asymmetric (LA) family; a Daubechies filter of length \(L\) has \(L/2\) vanishing moments:

Filter  Length  Recommended Use
la8     8       Economic/financial data
la16    16      Very smooth signals
d4      4       Compact support needed
haar    2       Discontinuous signals

Scale-Period Correspondence

For sampling period \(\Delta t\), level \(j\) captures periods:

\[ T_j \in [2^j \Delta t, 2^{j+1} \Delta t] \]

For annual data (\(\Delta t = 1\) year):

  • \(D_3 + D_4\): 8–32 year cycles (business cycle range)
  • \(D_5 + D_6\): 32–128 year cycles (long waves)
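
As an illustration of Algorithm 1 outside the package, the following minimal sketch uses the waveslim package (an assumption about tooling; SignalY's internal implementation may differ) to run the decompose-and-combine steps on annual data:

library(waveslim)

# Illustrative series: stochastic trend plus a 16-year cycle
set.seed(1)
x <- cumsum(rnorm(128, sd = 0.3)) + sin(2 * pi * (1:128) / 16)

# Steps 1-3: MODWT-based MRA with the LA(8) filter, J = 5 levels
mra_x <- mra(x, wf = "la8", J = 5, method = "modwt", boundary = "periodic")

# Step 4: combine levels L = {3, 4} to extract 8-32 year cycles
cycle <- mra_x$D3 + mra_x$D4
trend <- mra_x$S5               # smooth component S_J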

Empirical Mode Decomposition

Sifting Algorithm

EMD decomposes \(x(t)\) into Intrinsic Mode Functions (IMFs) adaptively:

Definition 2.1 (Intrinsic Mode Function). A function \(c(t)\) is an IMF if:

  1. The number of extrema and zero-crossings differ by at most one
  2. The mean of upper and lower envelopes is zero at all points

Reconstruction Property

\[ x(t) = \sum_{k=1}^{K} c_k(t) + r_K(t) \]

Algorithm 2: EMD Sifting Process

Require: Signal x, max IMFs K
1: r₀ ← x
2: for k = 1, ..., K do
3:     h ← r_{k-1}
4:     repeat
5:         Identify local maxima and minima of h
6:         Interpolate upper envelope e_max, lower envelope e_min
7:         Compute mean envelope: m ← (e_max + e_min)/2
8:         Update: h ← h - m
9:     until h satisfies IMF criteria
10:    c_k ← h
11:    r_k ← r_{k-1} - c_k
12: end for
13: return IMFs {c_k} and residue r_K
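
A corresponding sketch with the EMD package (again an assumption about tooling, not the package's internal code) runs the sifting procedure and verifies the reconstruction property:

library(EMD)

set.seed(2)
n <- 200
x <- sin(2 * pi * (1:n) / 20) + 0.5 * sin(2 * pi * (1:n) / 5) + 0.01 * (1:n)

# Sifting with at most K = 5 IMFs
dec <- emd(xt = x, tt = 1:n, max.imf = 5)

dec$nimf                                   # number of extracted IMFs
recon <- rowSums(dec$imf) + dec$residue    # sum of IMFs plus residue
max(abs(recon - x))                        # reconstruction error (near zero)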

Bayesian HP-GC Filter

State Space Formulation

The Grant-Chan HP filter embeds trend-cycle decomposition in a state space model:

\[ y_t = \tau_t + \psi_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma_\varepsilon^2) \]

\[ \Delta^2 \tau_t = \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma_\tau^2) \]

\[ \psi_t = \phi_1 \psi_{t-1} + \phi_2 \psi_{t-2} + \omega_t, \quad \omega_t \sim \mathcal{N}(0, \sigma_\psi^2) \]

where \(\tau_t\) is trend, \(\psi_t\) is cycle, and \(\varepsilon_t\) is irregular.
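
To make the roles of the three components concrete, the following simulation sketch (illustrative parameter values, not package defaults) generates data from this state space:

# Simulate y_t = tau_t + psi_t + eps_t with Delta^2 tau_t = eta_t
set.seed(3)
n_obs <- 200
eta <- rnorm(n_obs, sd = 0.05)                # trend innovations eta_t
tau <- cumsum(cumsum(eta))                    # double integration: Delta^2 tau = eta
psi <- as.numeric(arima.sim(list(ar = c(1.4, -0.5)), n = n_obs, sd = 0.5))
y   <- tau + psi + rnorm(n_obs, sd = 0.2)     # add irregular component eps_t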

Prior Configurations

The package implements three prior configurations:

1. Weakly Informative:

\[ \sigma_\tau^2, \sigma_\psi^2, \sigma_\varepsilon^2 \sim \text{Inv-Gamma}(0.01, 0.01) \]

\[ \phi_1, \phi_2 \sim \mathcal{N}(0, 10^2) \mathbf{1}_{[\text{stationary}]} \]

2. Informative (centered on the annual-data analogue of the HP-1600 convention, \(\lambda = 100\)):

\[ \sigma_\tau^2 / \sigma_\psi^2 \sim \text{Log-Normal}(\log(1/100), 0.5) \]

3. Empirical Bayes: Hyperparameters estimated from data

Model selection across prior configurations uses the Deviance Information Criterion (DIC):

\[ \text{DIC} = \bar{D} + p_D = D(\bar{\theta}) + 2 p_D, \qquad \bar{D} = -2\,\mathbb{E}_{\theta \mid y}[\log p(y \mid \theta)] \]

where \(p_D = \bar{D} - D(\bar{\theta})\) is the effective number of parameters.


Bayesian Sparse Regression: Regularized Horseshoe

Model Specification

For response \(y \in \mathbb{R}^n\) and predictors \(X \in \mathbb{R}^{n \times p}\):

\[ y_i = \alpha + x_i^\top \beta + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \]

Horseshoe Prior

The standard Horseshoe places:

\[ \beta_j | \lambda_j, \tau \sim \mathcal{N}(0, \lambda_j^2 \tau^2) \]

\[ \lambda_j \sim \mathcal{C}^+(0, 1) \]

\[ \tau \sim \mathcal{C}^+(0, \tau_0) \]

where \(\mathcal{C}^+(0, s)\) denotes half-Cauchy with scale \(s\).

Regularized Horseshoe

Piironen & Vehtari introduce slab regularization via:

\[ \tilde{\lambda}_j^2 = \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2} \]

with \(c^2 \sim \text{Inv-Gamma}(\nu/2, \nu s^2/2)\) providing a soft ceiling on coefficient magnitude.

Shrinkage Factors

The shrinkage factor \(\kappa_j\) quantifies the degree of shrinkage applied to \(\beta_j\); for standardized predictors,

\[ \kappa_j = \frac{1}{1 + n \sigma^{-2} \tau^2 \tilde{\lambda}_j^2} \]

Proposition 3.1 (Shrinkage Interpretation). As \(\kappa_j \to 0\), \(\beta_j\) is essentially unshrunken (signal). As \(\kappa_j \to 1\), \(\beta_j\) is strongly shrunk toward zero (noise).

Effective Number of Non-Zeros

\[ m_e = \sum_{j=1}^{p} (1 - \kappa_j) \]

Global Shrinkage Calibration

Following Piironen & Vehtari, calibrate \(\tau_0\) based on expected sparsity:

\[ \tau_0 = \frac{p_0}{p - p_0} \cdot \frac{\sigma}{\sqrt{n}} \]

where \(p_0\) is the expected number of relevant predictors.
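
A small sketch of this calibration and of the shrinkage summaries above (placeholder values stand in for posterior draws of \(\tau\) and \(\tilde{\lambda}_j\), which in practice come from the fitted model):

# Calibrate tau_0 for n = 120 observations, p = 40 predictors, p_0 = 5 signals
n <- 120; p <- 40; p0 <- 5; sigma_hat <- 1
tau0 <- p0 / (p - p0) * sigma_hat / sqrt(n)

# Shrinkage factors and effective number of non-zeros (placeholder scales)
set.seed(4)
tau          <- tau0
lambda_tilde <- abs(rcauchy(p))
kappa <- 1 / (1 + n / sigma_hat^2 * tau^2 * lambda_tilde^2)
m_e   <- sum(1 - kappa)             # effective number of non-zero coefficients
selected <- which(kappa < 0.5)      # selection threshold kappa* = 0.5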

Stan Implementation

The package uses non-centered parameterization for efficient NUTS sampling:

\[ z_j \sim \mathcal{N}(0, 1) \]

\[ \beta_j = \tau \cdot \tilde{\lambda}_j \cdot z_j \]

Key diagnostics monitored include split-\(\hat{R}\), bulk and tail effective sample size (ESS), divergent transitions, and E-BFMI.
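
For reference, a comparable model can be fitted outside the package with brms, which implements the regularized Horseshoe prior; this is a hedged sketch on simulated data, not SignalY's own Stan program:

library(brms)

# Simulated sparse regression: p_0 = 5 true signals among p = 40 predictors
set.seed(5)
n <- 120; p <- 40; p0 <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(rnorm(p0), rep(0, p - p0))
df_xy <- data.frame(y = as.vector(X %*% beta + rnorm(n)), X)

# par_ratio = p0 / (p - p0) encodes the expected sparsity; df = 1 gives
# half-Cauchy local scales
fit <- brm(
  y ~ ., data = df_xy,
  prior = set_prior(horseshoe(df = 1, par_ratio = p0 / (p - p0))),
  chains = 4, cores = 4, seed = 5
)
summary(fit)   # inspect Rhat, Bulk_ESS, Tail_ESS; check divergent transitions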


Dimensionality Reduction

PCA with Bootstrap Significance

Standard PCA

For centered data matrix \(X \in \mathbb{R}^{n \times p}\), PCA solves:

\[ \max_w w^\top X^\top X w \quad \text{s.t.} \quad \|w\| = 1 \]

yielding loadings \(W = [w_1, \ldots, w_k]\) and scores \(Z = XW\).

Bootstrap Significance Testing

Algorithm 3: Block Bootstrap for PCA Loadings

Require: Data X, block length b, replications B
1: Compute original loadings W^(0)
2: for r = 1, ..., B do
3:     Generate bootstrap sample X^(r) via block resampling
4:     Compute loadings W^(r)
5:     Apply Procrustes rotation to align with W^(0)
6: end for
7: For each loading w_jk, compute p-value: (1/B) Σ_r 𝟙[|w^(r)_jk| > |w^(0)_jk|]
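
A condensed sketch of Algorithm 3 in base R (sign alignment stands in for the full Procrustes rotation, a deliberate simplification):

# Moving-block bootstrap p-values for the first k PCA loadings
block_boot_pca <- function(X, b = 8, B = 500, k = 2) {
  n <- nrow(X)
  W0 <- prcomp(X, scale. = TRUE)$rotation[, 1:k, drop = FALSE]
  reps <- replicate(B, {
    starts <- sample(1:(n - b + 1), ceiling(n / b), replace = TRUE)
    idx <- as.vector(sapply(starts, function(s) s:(s + b - 1)))[1:n]
    W <- prcomp(X[idx, ], scale. = TRUE)$rotation[, 1:k, drop = FALSE]
    W %*% diag(sign(colSums(W * W0)), k)   # align signs with original loadings
  })
  # Step 7: proportion of bootstrap loadings exceeding the original in magnitude
  pval <- apply(abs(reps) > array(abs(W0), dim(reps)), c(1, 2), mean)
  list(loadings = W0, pval = pval)
}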

Entropy of Loadings

Shannon entropy quantifies loading concentration:

\[ H_k = -\sum_{j=1}^{p} w_{jk}^2 \log(w_{jk}^2) \]

where loadings are normalized so \(\sum_j w_{jk}^2 = 1\).

  • Low \(H_k\): Concentrated loadings (few variables dominate)
  • High \(H_k\): Diffuse loadings (many variables contribute)
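
In code, the entropy computation applies column-wise to the loading matrix; a sketch using prcomp output (X below is a placeholder data matrix):

# Shannon entropy of squared, normalized loadings for each component
loading_entropy <- function(w) {
  w2 <- w^2 / sum(w^2)                       # enforce sum_j w_jk^2 = 1
  -sum(w2[w2 > 0] * log(w2[w2 > 0]))
}

set.seed(7)
X  <- matrix(rnorm(60 * 6), 60, 6)
pc <- prcomp(scale(X))
H  <- apply(pc$rotation, 2, loading_entropy)
H / log(nrow(pc$rotation))                   # normalized entropy H* in [0, 1]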

Dynamic Factor Models

Model

For panel \(X_t \in \mathbb{R}^p\) at time \(t\):

\[ X_t = \Lambda F_t + e_t, \quad e_t \sim \mathcal{N}(0, \Sigma_e) \]

\[ F_t = A_1 F_{t-1} + \cdots + A_q F_{t-q} + \eta_t, \quad \eta_t \sim \mathcal{N}(0, I_r) \]

where \(F_t \in \mathbb{R}^r\) are latent factors, \(\Lambda \in \mathbb{R}^{p \times r}\) are loadings.

Factor Selection via Information Criteria

Bai & Ng propose criteria for selecting \(r\):

\[ IC_1(r) = \log(V(r)) + r \cdot \frac{n + p}{np} \log\left(\frac{np}{n + p}\right) \]

\[ IC_2(r) = \log(V(r)) + r \cdot \frac{n + p}{np} \log(\min(n, p)) \]

\[ IC_3(r) = \log(V(r)) + r \cdot \frac{\log(\min(n, p))}{\min(n, p)} \]

where \(V(r)\) is the average squared idiosyncratic residual from a model with \(r\) factors.
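
The criteria are straightforward to compute from an SVD-based factor estimate; a self-contained sketch for \(IC_1\) and \(IC_2\):

# Bai-Ng criteria over candidate factor counts r = 1, ..., r_max
bai_ng <- function(X, r_max = 8) {
  X <- scale(X); n <- nrow(X); p <- ncol(X)
  sv <- svd(X)
  sapply(seq_len(r_max), function(r) {
    fit <- sv$u[, 1:r, drop = FALSE] %*%
           diag(sv$d[1:r], r) %*% t(sv$v[, 1:r, drop = FALSE])
    V <- mean((X - fit)^2)                   # average squared residual
    c(IC1 = log(V) + r * (n + p) / (n * p) * log(n * p / (n + p)),
      IC2 = log(V) + r * (n + p) / (n * p) * log(min(n, p)))
  })
}
# Select r minimizing the chosen criterion, e.g. which.min(bai_ng(X)["IC2", ])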


Unit Root Testing

Test Battery

Test          \(H_0\)           \(H_1\)      Specification
ADF (none)    Unit root         Stationary   No constant, no trend
ADF (drift)   Unit root         Stationary   Constant only
ADF (trend)   Unit root         Stationary   Constant + trend
ERS (DF-GLS)  Unit root         Stationary   GLS-detrended
ERS (P-test)  Unit root         Stationary   Point optimal
KPSS (level)  Stationary        Unit root    Level stationarity
KPSS (trend)  Trend-stationary  Unit root    Trend stationarity
PP            Unit root         Stationary   Non-parametric correction

ADF Test

\[ \Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{j=1}^{k} \delta_j \Delta y_{t-j} + \varepsilon_t \]

Test \(H_0: \gamma = 0\) vs \(H_1: \gamma < 0\).

ERS Tests

DF-GLS

Apply GLS transformation with \(\bar{\alpha} = 1 - 7/T\) (demeaning) or \(\bar{\alpha} = 1 - 13.5/T\) (detrending), then run ADF on transformed series.

Point Optimal Test

\[ P_T = \frac{S(\bar{\alpha}) - \bar{\alpha} S(1)}{s_{AR}^2} \]

where \(S(a)\) is the sum of squared residuals from GLS detrending at \(a\) and \(s_{AR}^2\) is an autoregressive estimate of the spectral density at frequency zero.

KPSS Test

\[ \eta = \frac{1}{T^2} \sum_{t=1}^{T} S_t^2 / \hat{\sigma}_\infty^2 \]

where \(S_t = \sum_{i=1}^{t} e_i\) is the partial sum of residuals and \(\hat{\sigma}_\infty^2\) is a consistent estimate of their long-run variance.
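
The full battery maps directly onto the urca package; the calls below are a plausible sketch, not necessarily SignalY's internals (y is a placeholder series):

library(urca)

y <- cumsum(rnorm(100))                                   # placeholder: random walk
adf_trend <- ur.df(y, type = "trend", selectlags = "AIC") # ADF, constant + trend
adf_drift <- ur.df(y, type = "drift", selectlags = "AIC") # ADF, constant only
ers_gls   <- ur.ers(y, type = "DF-GLS", model = "trend")  # GLS-detrended DF
ers_pt    <- ur.ers(y, type = "P-test", model = "trend")  # point optimal
kpss_lvl  <- ur.kpss(y, type = "mu")                      # level stationarity
kpss_trd  <- ur.kpss(y, type = "tau")                     # trend stationarity
pp_test   <- ur.pp(y, type = "Z-tau", model = "trend")    # Phillips-Perron
summary(adf_trend)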

Synthesis Algorithm

Algorithm 4: Unit Root Test Synthesis

Require: Test results with p-values
1: Count ADF/ERS rejections at α = 0.05: n_reject
2: Count KPSS non-rejections at α = 0.05: n_accept
3: if n_reject ≥ 2 AND ADF-trend rejects AND ERS confirms then
4:     Conclude: trend_stationary
5: else if n_reject ≥ 2 AND KPSS non-rejects then
6:     Conclude: stationary
7: else if ADF/ERS fail to reject AND KPSS rejects then
8:     Conclude: difference_stationary
9: else
10:    Conclude: inconclusive
11: end if
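
A direct R translation of Algorithm 4 (the p-value names are illustrative, keyed to the battery above):

# pv: named vector of p-values from the test battery
synthesize_unit_root <- function(pv, alpha = 0.05) {
  adf_ers  <- c("adf_none", "adf_drift", "adf_trend", "ers_dfgls", "ers_p")
  n_reject <- sum(pv[adf_ers] < alpha)
  if (n_reject >= 2 && pv["adf_trend"] < alpha && pv["ers_dfgls"] < alpha) {
    "trend_stationary"
  } else if (n_reject >= 2 && pv["kpss_level"] >= alpha) {
    "stationary"
  } else if (n_reject == 0 && pv["kpss_level"] < alpha) {
    "difference_stationary"
  } else {
    "inconclusive"
  }
}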

Master Function: signal_analysis()

Pipeline Architecture

  1. Data Preparation
    • Validation and type checking
    • Missing value handling (interpolation/omission)
    • Standardization
    • Optional first-differencing
  2. Spectral Decomposition
    • Wavelet MRA
    • EMD
    • HP-GC filter
  3. Sparse Regression
    • Horseshoe prior estimation
    • Shrinkage-based variable selection
  4. Factor Analysis
    • PCA with bootstrap
    • DFM estimation
  5. Persistence Analysis
    • Unit root test battery
    • Synthesis
  6. Interpretation Generation

Output Structure

signal_analysis object:
├── call                    # Function call
├── data                    # Processed data
│   ├── Y, X               # Original and standardized
│   └── standardization    # Parameters
├── filters
│   ├── wavelet            # Wavelet decomposition
│   ├── emd                # EMD results
│   └── hpgc               # HP-GC trend/cycle
├── horseshoe
│   ├── summary            # Posterior summaries
│   ├── selection          # Variable selection
│   └── diagnostics        # MCMC diagnostics
├── pca
│   ├── loadings           # With bootstrap p-values
│   ├── scores
│   └── entropy
├── dfm
│   ├── factors
│   ├── loadings
│   └── ic_values
├── unitroot
│   ├── tests              # Individual results
│   └── synthesis          # Combined conclusion
├── interpretation
│   ├── signal_characteristics
│   ├── variable_selection
│   ├── factor_structure
│   ├── persistence
│   └── overall_summary
└── config                  # Configuration used
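
Given a fitted object res (the result of signal_analysis() on a prepared panel), the documented components can be accessed directly; the field names below are taken from the structure above:

res$horseshoe$selection                # variables retained by the shrinkage criterion
res$pca$loadings                       # loadings with bootstrap p-values
res$unitroot$synthesis                 # combined persistence conclusion
res$interpretation$overall_summary     # narrative summary of all stages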

Interpretation Framework

Signal Characteristics

Definition 7.1 (Signal Smoothness). Measured by variance of second differences:

\[ \sigma_{\Delta^2 Y}^2 = \text{Var}(\Delta^2 Y_t) \]

\(\sigma_{\Delta^2 Y}^2\)  Interpretation
< 0.01                     Very smooth (strong trend dominance)
0.01 – 0.1                 Moderately smooth
0.1 – 0.5                  Moderately volatile
> 0.5                      Highly volatile (noise-dominated)

Sparsity Assessment

Definition 7.2 (Sparsity Ratio).

\[ \rho = 1 - \frac{|\{j : \kappa_j < \kappa^*\}|}{p} \]

where \(\kappa^*\) is the selection threshold (default 0.5).

Information Topology

Definition 7.3 (Normalized Entropy).

\[ H^* = \frac{H}{\log p} \]

where \(H\) is Shannon entropy and \(\log p\) is maximum possible entropy.

\(H^*\)    Topology
< 0.3      Concentrated (few variables dominate)
0.3 – 0.6  Moderately distributed
> 0.6      Diffuse (many variables contribute)
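
The three diagnostics of this section reduce to a few lines of R (placeholder inputs; in practice \(\kappa_j\) comes from the horseshoe fit and \(H\) from the PCA loadings):

set.seed(6)
Y     <- cumsum(rnorm(100))             # placeholder signal
kappa <- runif(40); p <- length(kappa)  # placeholder shrinkage factors
H     <- 2.1                            # placeholder loading entropy

smoothness <- var(diff(Y, differences = 2))   # Definition 7.1
rho        <- 1 - sum(kappa < 0.5) / p        # Definition 7.2, kappa* = 0.5
H_star     <- H / log(p)                      # Definition 7.3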

Computational Considerations

Complexity Analysis

Method            Time Complexity   Space Complexity
Wavelet (MODWT)   \(O(n \log n)\)   \(O(nJ)\)
EMD               \(O(nKI)\)        \(O(nK)\)
HP-GC (MCMC)      \(O(nMC)\)        \(O(nC)\)
Horseshoe (MCMC)  \(O(npMC)\)       \(O(np + pC)\)
PCA               \(O(np^2)\)       \(O(p^2)\)
DFM               \(O(npr)\)        \(O(nr + pr)\)

Where: \(J\) = wavelet levels, \(K\) = IMFs, \(I\) = sifting iterations, \(M\) = MCMC iterations, \(C\) = chains, \(r\) = factors.

Parallelization


CRAN Compliance Notes

Soft Dependency: cmdstanr

Since cmdstanr is not on CRAN:

  1. Listed in Suggests, not Imports
  2. Additional repository: https://mc-stan.org/r-packages/
  3. All Stan-dependent code wrapped in:
if (!requireNamespace("cmdstanr", quietly = TRUE)) {
    stop("Package 'cmdstanr' required for this function.")
}
  4. Examples use \dontrun{} for Stan functions

Namespace Hygiene


References

Bai, J., & Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1), 191-221.

Elliott, G., Rothenberg, T. J., & Stock, J. H. (1996). Efficient tests for an autoregressive unit root. Econometrica, 64(4), 813-836.

Grant, A. L., & Chan, J. C. C. (2017). Reconciling output gaps: Unobserved components model and Hodrick-Prescott filter. Journal of Economic Dynamics and Control, 75, 114-121.

Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., & Shin, Y. (1992). Testing the null hypothesis of stationarity against the alternative of a unit root. Journal of Econometrics, 54(1-3), 159-178.

Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 5018-5051.