---
title: "Introduction to factorselect"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to factorselect}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(factorselect)
```

## Introduction

The `factorselect` package implements six estimators for determining the number of factors in large-dimensional approximate factor models. The estimators differ in their theoretical assumptions, computational approach, and finite-sample performance.

The recommended estimator for most applications is the Ahn and Horenstein (2013) eigenvalue ratio estimator, which is robust to perturbations in the eigenvalue spectrum and performs well when only one of $N$ or $T$ is large.

## Simulating Factor Model Data

The package includes a helper function for simulating data from a static approximate factor model:

$$X = F \Lambda' + E$$

where $F$ is a $T \times k$ matrix of factors, $\Lambda$ is an $N \times k$ matrix of loadings, and $E$ is a $T \times N$ matrix of idiosyncratic errors, conformable with $F \Lambda'$.

```{r simulate}
set.seed(42)
X <- simulate_factor_model(N = 100, TT = 200, k = 3, sd = 0.5)
dim(X)
```

## The Recommended Estimator: Ahn & Horenstein (2013)

The eigenvalue ratio (ER) estimator of Ahn and Horenstein (2013) maximizes the ratio of adjacent eigenvalues of the sample covariance matrix, and the companion growth ratio (GR) estimator maximizes a related ratio built from the growth rates of the residual variance. Working with ratios rather than eigenvalue levels makes both estimators robust to perturbations in the eigenvalue spectrum.

A key advantage over Bai and Ng (2002) is that the Ahn-Horenstein estimator works well when only one of $N$ or $T$ is large; it does not require both dimensions to grow simultaneously.

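
As a rough illustration of the eigenvalue ratio idea described above, the following base R sketch computes the ratios by hand on a small simulated panel. The helper `er_sketch()` and the `*_demo` objects are hypothetical names introduced here and are not part of `factorselect`. The sketch also omits refinements from Ahn and Horenstein (2013), such as the treatment of the zero-factor case, so it need not agree with `select_factors()`.

```r
# Hypothetical helper (not part of factorselect): a bare-bones
# eigenvalue ratio rule using base R only.
er_sketch <- function(X, kmax) {
  TT <- nrow(X)
  N <- ncol(X)
  # Eigenvalues of X'X / (N * T); eigen() returns them in decreasing order
  ev <- eigen(crossprod(X) / (N * TT), symmetric = TRUE,
              only.values = TRUE)$values
  # Ratio of the k-th to the (k + 1)-th eigenvalue, for k = 1, ..., kmax
  ratios <- ev[seq_len(kmax)] / ev[seq_len(kmax) + 1]
  list(k = which.max(ratios), ratios = ratios)
}

# Small demo panel (assumed design): T = 200, N = 50, three factors
set.seed(1)
TT_demo <- 200
N_demo <- 50
k_demo <- 3
F_demo <- matrix(rnorm(TT_demo * k_demo), TT_demo, k_demo)
L_demo <- matrix(rnorm(N_demo * k_demo), N_demo, k_demo)
E_demo <- matrix(rnorm(TT_demo * N_demo, sd = 0.5), TT_demo, N_demo)
X_demo <- F_demo %*% t(L_demo) + E_demo

er_sketch(X_demo, kmax = 8)$k
```

The ratio is largest where the spectrum drops from the factor eigenvalues to the idiosyncratic ones, which is why the maximizer serves as an estimate of the number of factors.
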
```{r ahn_horenstein}
result <- select_factors(X, method = "ahn_horenstein", kmax = 8)
print(result)
```

## Comparing All Estimators

All six estimators can be run in a single call by passing a vector of method names:

```{r all_methods}
result_all <- select_factors(
  X,
  method = c("ahn_horenstein", "bai_ng", "abc", "lam_yao",
             "onatski_2009", "onatski_2010"),
  kmax = 8
)
print(result_all)
```

## Scree Plot

The `plot` method produces a scree plot of the leading eigenvalues with the selected number of factors marked for each estimator:

```{r scree, fig.width = 6, fig.height = 4}
result_ah <- select_factors(X, method = "ahn_horenstein", kmax = 8)
plot(result_ah, main = "Scree Plot: Ahn & Horenstein (2013)")
```

## Finite Sample Performance

To illustrate the finite-sample performance of the estimators, we run a small simulation study with 100 replications for each of three sample size configurations.

```{r simulation, cache = TRUE}
set.seed(123)
n_reps <- 100
k_true <- 3

# Three designs: both dimensions large, small N, and small T
configs <- list(
  large_both = list(N = 100, TT = 200),
  small_N    = list(N = 25,  TT = 200),
  small_T    = list(N = 200, TT = 25)
)

results <- lapply(configs, function(cfg) {
  # One column of estimates per replication, one row per method
  estimates <- replicate(n_reps, {
    X <- simulate_factor_model(N = cfg$N, TT = cfg$TT, k = k_true, sd = 0.5)
    res <- select_factors(
      X,
      method = c("ahn_horenstein", "bai_ng", "onatski_2010"),
      kmax = 8
    )
    res$k
  })
  # Share of replications in which each method recovers k_true
  rowMeans(estimates == k_true)
})

# Percentage correct for each configuration
# ([[ ]] drops element names so the rows are not labelled by method)
do.call(rbind, lapply(names(results), function(nm) {
  data.frame(
    config         = nm,
    ahn_horenstein = round(results[[nm]][["ahn_horenstein"]] * 100),
    bai_ng         = round(results[[nm]][["bai_ng"]] * 100),
    onatski_2010   = round(results[[nm]][["onatski_2010"]] * 100)
  )
}))
```

The simulation confirms that Ahn and Horenstein (2013) performs well across all three configurations, including when only one dimension is large. Bai and Ng (2002) tends to be less reliable in the asymmetric sample size cases.

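
Because each hit rate above is based on 100 replications, it carries some Monte Carlo error. A minimal sketch of the usual binomial standard error for an estimated proportion follows. `mc_se()` is a hypothetical helper, not a `factorselect` function, and the input proportions are placeholders rather than results from the simulation.

```r
# Hypothetical helper: binomial standard error of an estimated hit rate
mc_se <- function(p_hat, n_reps) {
  sqrt(p_hat * (1 - p_hat) / n_reps)
}

# Placeholder hit rates (not simulation output), with 100 replications
mc_se(c(0.50, 0.90, 0.99), n_reps = 100)
```

A hit rate near 50% has a standard error of about 5 percentage points at this replication count, so small differences between estimators should not be over-interpreted.
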
## Notes on Individual Estimators

### Bai & Ng (2002) and ABC (2010)

The Bai and Ng (2002) and Alessi, Barigozzi and Capasso (2010, "ABC") estimators use unstandardized data internally. The `select_factors` function handles this automatically, so users do not need to preprocess data differently when requesting these methods.

### Lam & Yao (2012)

This estimator uses lagged auto-covariance matrices rather than the contemporaneous covariance matrix. The number of lags `h` defaults to 1 but can be adjusted:

```{r lam_yao}
result_ly <- select_factors(X, method = "lam_yao", kmax = 8, h = 1)
print(result_ly)
```

### Onatski (2009)

This estimator performs a sequential hypothesis test. The significance level `alpha` defaults to 0.05 but can be adjusted:

```{r onatski_2009}
result_o09 <- select_factors(X, method = "onatski_2009", kmax = 8, alpha = 0.05)
print(result_o09)
```

### Onatski (2010)

The edge distribution estimator uses an iterative calibration procedure to estimate the threshold separating systematic from idiosyncratic eigenvalues:

```{r onatski_2010}
result_o10 <- select_factors(X, method = "onatski_2010", kmax = 8)
print(result_o10)
```

## References

Ahn, S.C. and Horenstein, A.R. (2013). Eigenvalue Ratio Test for the Number of Factors. *Econometrica*, 81(3), 1203-1227.

Alessi, L., Barigozzi, M. and Capasso, M. (2010). Improved Penalization for Determining the Number of Factors in Approximate Factor Models. *Statistics and Probability Letters*, 80, 1806-1813.

Bai, J. and Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. *Econometrica*, 70(1), 191-221.

Lam, C. and Yao, Q. (2012). Factor Modelling for High-Dimensional Time Series: Inference for the Number of Factors. *The Annals of Statistics*, 40(2), 694-726.

Onatski, A. (2009). Testing Hypotheses About the Number of Factors in Large Factor Models. *Econometrica*, 77(5), 1447-1479.

Onatski, A. (2010). Determining the Number of Factors From Empirical Distribution of Eigenvalues. *The Review of Economics and Statistics*, 92(4), 1004-1016.