--- title: "Introduction to sparsecommunity" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to sparsecommunity} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4, fig.align = "center" ) set.seed(42) ``` ## Overview **sparsecommunity** implements spectral clustering for community detection in sparse networks. It provides: - `community_detect()`: main estimation function, supporting both the stochastic block model (`"sbm"`) and the degree-corrected stochastic block model (`"dcsbm"`). - `simulate_sbm()` / `simulate_dcsbm()`: simulation utilities for both models. - `misclustering_rate()`: best-permutation evaluation metric. The methods implement the regularized normalized Laplacian spectral clustering of Lei and Rinaldo (2015), which is consistent for community recovery when the maximum expected degree grows as slowly as $\log(n)$ — the sparse regime where many standard spectral methods break down. ```{r load-package} library(sparsecommunity) ``` --- ## 1. Stochastic block model (SBM) ### 1.1 Simulating SBM data A stochastic block model with $K$ communities is specified by a block probability matrix $B \in [0,1]^{K \times K}$. Nodes in community $k$ and community $\ell$ are connected independently with probability $B_{k\ell}$. ```{r sim-sbm} # Two balanced communities, n = 300 nodes # Within-community edge probability: 0.25; between: 0.04 B_sbm <- matrix(c(0.25, 0.04, 0.04, 0.25), nrow = 2) sim_sbm <- simulate_sbm(n = 300, K = 2, B = B_sbm, seed = 1) print(sim_sbm) ``` The returned object contains the sparse adjacency matrix `A` and the true community labels: ```{r sbm-structure} # Mean degree (note: sparse regime ~ log(n)/n * n = log(n)) mean(Matrix::rowSums(sim_sbm$A)) ``` ### 1.2 Fitting the SBM model For the standard SBM, `community_detect()` embeds the network via the regularized normalized Laplacian and applies $k$-means: ```{r fit-sbm} fit_sbm <- community_detect(sim_sbm$A, K = 2, model = "sbm", seed = 1) print(fit_sbm) ``` The fitted object contains the community labels, the spectral embedding, the top eigenvalues, and the clustering objective: ```{r sbm-components} # Top-K eigenvalues of the regularized Laplacian fit_sbm$eigenvalues # Community sizes table(fit_sbm$labels) ``` ### 1.3 Evaluating accuracy `misclustering_rate()` computes the fraction of misclassified nodes under the best alignment of estimated and true labels (Hungarian algorithm): ```{r sbm-accuracy} misclustering_rate(sim_sbm$labels, fit_sbm$labels) ``` ### 1.4 Examining the spectral embedding The two-dimensional embedding reveals the cluster structure directly: ```{r sbm-embedding, fig.cap="Spectral embedding for SBM. Points are colored by true community."} U <- fit_sbm$embedding plot(U[, 1], U[, 2], col = sim_sbm$labels + 1, pch = 19, cex = 0.6, xlab = "Eigenvector 1", ylab = "Eigenvector 2", main = "SBM: spectral embedding") legend("topright", legend = c("Community 1", "Community 2"), col = 2:3, pch = 19, bty = "n") ``` --- ## 2. Degree-corrected stochastic block model (DCSBM) In many real networks, nodes within the same community have very different degrees. The DCSBM captures this by introducing per-node degree parameters $\theta_i > 0$: the edge probability between nodes $i$ and $j$ is $\theta_i \theta_j B_{z_i z_j}$. Under degree heterogeneity, standard spectral methods fail because high-degree nodes concentrate together in the embedding regardless of their community. The spherical $k$-median algorithm (Lei and Rinaldo 2015, Theorem 4.2) corrects for this by row-normalizing the embedding before clustering. ### 2.1 Simulating DCSBM data ```{r sim-dcsbm} # Three communities with strong degree heterogeneity B_dcsbm <- matrix(c(0.5, 0.04, 0.04, 0.04, 0.5, 0.04, 0.04, 0.04, 0.5), nrow = 3) # Degree parameters: Uniform(0.3, 1.7), creating substantial heterogeneity set.seed(2) theta <- runif(400, min = 0.3, max = 1.7) sim_dcsbm <- simulate_dcsbm(n = 400, K = 3, B = B_dcsbm, theta = theta, seed = 2) print(sim_dcsbm) ``` ### 2.2 Why standard SBM clustering fails under degree heterogeneity Applying the SBM method to DCSBM data shows the problem: ```{r dcsbm-sbm-fail} fit_wrong <- community_detect(sim_dcsbm$A, K = 3, model = "sbm", seed = 2) cat("Misclustering rate (SBM method on DCSBM data):", misclustering_rate(sim_dcsbm$labels, fit_wrong$labels), "\n") ``` ### 2.3 Fitting the DCSBM model The `"dcsbm"` model row-normalizes the embedding and applies spherical $k$-median clustering: ```{r fit-dcsbm} fit_dcsbm <- community_detect(sim_dcsbm$A, K = 3, model = "dcsbm", seed = 2) print(fit_dcsbm) cat("Misclustering rate (DCSBM method):", misclustering_rate(sim_dcsbm$labels, fit_dcsbm$labels), "\n") ``` ### 2.4 Embedding comparison The row-normalized embedding maps all nodes to the unit sphere. Degree heterogeneity is absorbed by the normalization, and clusters separate by angle: ```{r dcsbm-embedding, fig.cap="Row-normalized spectral embedding for DCSBM. Colors indicate true communities."} U_dc <- fit_dcsbm$embedding plot(U_dc[, 1], U_dc[, 2], col = sim_dcsbm$labels + 1, pch = 19, cex = 0.5, xlab = "Eigenvector 1 (normalized)", ylab = "Eigenvector 2 (normalized)", main = "DCSBM: row-normalized spectral embedding") legend("topright", legend = paste("Community", 1:3), col = 2:4, pch = 19, bty = "n") ``` --- ## 3. Real-data example: Zachary's karate club Zachary's (1977) karate club network is a benchmark for community detection. The network has 34 members (nodes) and 78 friendships (edges). A conflict led to a split into two factions, providing known ground-truth community membership. ```{r karate-data, message=FALSE} if (!requireNamespace("igraphdata", quietly = TRUE)) { message("igraphdata not installed; skipping real-data example.") knitr::knit_exit() } library(igraph) data("karate", package = "igraphdata") # Extract adjacency matrix and true community labels A_karate <- igraph::as_adjacency_matrix(karate, sparse = TRUE) true_comm <- igraph::V(karate)$Faction cat("Nodes:", vcount(karate), "| Edges:", ecount(karate), "| Communities:", length(unique(true_comm)), "\n") cat("Community sizes:", table(true_comm), "\n") cat("Mean degree:", round(mean(degree(karate)), 2), "\n") ``` ```{r karate-fit} fit_karate <- community_detect(A_karate, K = 2, model = "sbm", n_init = 30, seed = 42) summary(fit_karate) cat("Misclustering rate:", misclustering_rate(true_comm, fit_karate$labels), "\n") ``` ```{r karate-plot, fig.cap="Karate club network. Node color = detected community; node shape = true faction."} # Plot network colored by detected community shape_map <- ifelse(true_comm == 1, "circle", "square") igraph::plot.igraph( karate, vertex.color = fit_karate$labels + 1, vertex.shape = shape_map, vertex.size = 8, vertex.label = NA, main = "Karate club: detected vs. true communities" ) legend("bottomleft", legend = c("Detected: 1", "Detected: 2"), fill = 2:3, bty = "n", cex = 0.9) legend("bottomright", legend = c("True: faction 1", "True: faction 2"), pch = c(19, 15), bty = "n", cex = 0.9) ``` --- ## 4. Real-data example: NCAA college football (Girvan & Newman 2002) The NCAA football network is a standard benchmark for community detection with $K > 2$ communities. The network has 115 Division I-A college football teams (nodes) connected by regular-season games during Fall 2000 (613 edges). Teams are divided into 12 athletic conferences, providing unambiguous ground-truth community labels (Girvan & Newman 2002). The network is bundled directly in **sparsecommunity**: ```{r football-data} data("football") cat("Nodes:", nrow(football_A), "| Edges:", sum(football_A) / 2, "\n") cat("Mean degree:", round(mean(Matrix::rowSums(football_A)), 2), " log(n):", round(log(nrow(football_A)), 2), "\n") table(football_labels) # 12 conferences ``` The mean degree (10.66) is well above $\log(115) = 4.74$, placing the network in the sparse-but-identifiable regime. We first use `estimate_K()` to check whether the number of conferences can be recovered from the data alone: ```{r football-estimate-K} estimate_K(football_A, K_max = 15) # true K = 12 ``` The estimator returns 10, slightly underestimating due to two very small conferences (sizes 5 and 7) that are hard to distinguish spectrally. We proceed with the known $K = 12$: ```{r football-fit} fit_football <- community_detect(football_A, K = 12, model = "sbm", n_init = 30, seed = 1) misclustering_rate(football_labels, fit_football$labels) ``` **sparsecommunity** misclassifies 10 of 115 teams (8.7%). These are primarily independent teams and teams from small conferences that schedule games across conference lines. The spectral embedding confirms the 12-cluster structure: ```{r football-plot, fig.cap="Spectral embedding of the football network. Colors indicate detected community; the 12 athletic conferences are largely separated."} plot(fit_football) ``` --- ## 5. Summary of the `community_detect()` interface | Argument | Default | Description | |------------|-----------|-------------| | `A` | — | Adjacency matrix (sparse or dense) | | `K` | — | Number of communities | | `model` | `"dcsbm"` | `"sbm"` for k-means; `"dcsbm"` for spherical k-median | | `n_init` | `20` | Random restarts for clustering | | `max_iter` | `100` | Max iterations per restart | | `reg` | `TRUE` | Degree regularization in Laplacian embedding | | `seed` | `NULL` | Random seed | The returned `"sparsecommunity"` object contains: | Component | Description | |---------------|-------------| | `labels` | Integer vector of community assignments (length $n$) | | `embedding` | $n \times K$ spectral embedding matrix | | `eigenvalues` | Top-$K$ eigenvalues of the regularized Laplacian | | `centers` | $K \times K$ cluster centers in embedding space | | `objective` | Within-cluster sum of distances at convergence | --- ## References Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. *Annals of Statistics*, **43**(1), 215–237. https://doi.org/10.1214/14-AOS1274 Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. *Proceedings of the National Academy of Sciences*, **99**(12), 7821–7826. https://doi.org/10.1073/pnas.122653799 Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. *Journal of Anthropological Research*, **33**(4), 452–473.