--- title: "Classical Diversity Indices: Shannon and Simpson" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Classical Diversity Indices: Shannon and Simpson} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 5 ) ``` ## What Are Classical Diversity Indices? Shannon and Simpson indices are the two most widely used diversity measures in ecology. They quantify how species abundances are distributed within a community --- essentially answering: **"How diverse is this community based on how individuals are distributed among species?"** Both indices consider only the abundance distribution. They do not account for taxonomic, phylogenetic, or functional relationships between species. A community of 10 species from the same genus receives the same score as a community of 10 species spanning 10 different orders. This is why taxdiv pairs them with taxonomic measures --- classical indices capture the **abundance structure**, while taxonomic indices capture the **hierarchical structure**. ```{r setup} library(taxdiv) # Example community community <- c( Quercus_coccifera = 25, Quercus_infectoria = 18, Pinus_brutia = 30, Pinus_nigra = 12, Juniperus_excelsa = 8, Juniperus_oxycedrus = 6, Arbutus_andrachne = 15, Styrax_officinalis = 4, Cercis_siliquastrum = 3, Olea_europaea = 10 ) ``` ## Shannon-Wiener Index (H') ### The idea Shannon entropy, borrowed from information theory (Shannon, 1948), measures the **uncertainty** in predicting the species identity of a randomly chosen individual. High uncertainty means high diversity --- if species are evenly distributed, it is hard to guess which species the next individual belongs to. ### The formula $$H' = -\sum_{i=1}^{S} p_i \ln(p_i)$$ where $p_i$ is the proportion of species $i$ and $S$ is the total number of species. ### Key properties - **Minimum**: $H' = 0$ when there is only one species (no uncertainty) - **Maximum**: $H' = \ln(S)$ when all species have equal abundance (maximum uncertainty) - **Units**: Measured in "nats" when using natural logarithm, "bits" when using log base 2 - **Sensitivity**: Moderately sensitive to both rare and abundant species ### Usage in taxdiv ```{r shannon} # Default: natural logarithm H <- shannon(community) cat("Shannon H':", round(H, 4), "\n") cat("Maximum possible H' for", length(community), "species:", round(log(length(community)), 4), "\n") cat("Evenness (H'/H'max):", round(H / log(length(community)), 4), "\n") ``` ### Bias Correction When sample sizes are small, the observed Shannon index underestimates the true value because rare species are likely missing from the sample. taxdiv provides three correction methods: ```{r bias} cat("Uncorrected: ", round(shannon(community), 4), "\n") cat("Miller-Madow: ", round(shannon(community, correction = "miller_madow"), 4), "\n") cat("Grassberger: ", round(shannon(community, correction = "grassberger"), 4), "\n") cat("Chao-Shen: ", round(shannon(community, correction = "chao_shen"), 4), "\n") ``` **Which correction to use?** - **No correction**: When sample size is large relative to species richness (N >> S). This is the standard approach used in most published studies. - **Miller-Madow**: Simple first-order correction. Adds $(S-1) / 2N$ to the estimate. Appropriate when you want a lightweight adjustment. - **Grassberger**: Uses the digamma function for a more accurate correction. Performs well across a range of sample sizes. - **Chao-Shen**: Uses Horvitz-Thompson estimation to account for unseen species. Best when you suspect many rare species are missing from the sample. ## Simpson Index ### The idea Simpson's index (Simpson, 1949) measures the **probability that two randomly chosen individuals belong to the same species**. A community dominated by one species has a high probability (low diversity); an even community has a low probability (high diversity). ### Three variants taxdiv provides all three common Simpson variants: ```{r simpson} # Dominance (D): probability of same-species pair D <- simpson(community, type = "dominance") cat("Simpson dominance (D): ", round(D, 4), "\n") # Gini-Simpson (1-D): probability of different-species pair GS <- simpson(community, type = "gini_simpson") cat("Gini-Simpson (1-D): ", round(GS, 4), "\n") # Inverse Simpson (1/D): effective number of species inv <- simpson(community, type = "inverse") cat("Inverse Simpson (1/D): ", round(inv, 4), "\n") ``` ### Understanding the variants | Variant | Formula | Range | Interpretation | |---------|---------|-------|----------------| | **Dominance (D)** | $\sum p_i^2$ | 0 to 1 | Higher = less diverse (one species dominates) | | **Gini-Simpson (1-D)** | $1 - \sum p_i^2$ | 0 to 1 | Higher = more diverse (common choice) | | **Inverse Simpson (1/D)** | $1 / \sum p_i^2$ | 1 to S | Effective number of equally abundant species | The **inverse Simpson** is often the most intuitive: a value of 6.5 means the community is as diverse as one with 6.5 perfectly even species. ## Shannon vs Simpson: When to Use Which? ```{r comparison} # Even community even <- c(sp1 = 20, sp2 = 20, sp3 = 20, sp4 = 20, sp5 = 20) # Uneven community (same species, different abundances) uneven <- c(sp1 = 90, sp2 = 4, sp3 = 3, sp4 = 2, sp5 = 1) cat("=== Even community ===\n") cat("Shannon:", round(shannon(even), 4), "\n") cat("Simpson (1-D):", round(simpson(even, type = "gini_simpson"), 4), "\n\n") cat("=== Uneven community ===\n") cat("Shannon:", round(shannon(uneven), 4), "\n") cat("Simpson (1-D):", round(simpson(uneven, type = "gini_simpson"), 4), "\n") ``` **Key difference**: Shannon is more sensitive to **rare species** (because of the logarithm), while Simpson is more sensitive to **dominant species** (because of the squaring). When a community has many rare species, Shannon will detect them; Simpson may not. | Scenario | Better index | |----------|-------------| | Comparing sites with different rare species | Shannon | | Detecting dominance shifts | Simpson | | Need sample-size independence | Neither (use AvTD) | | Need taxonomic information | Neither (use pTO or Delta) | ## The Limitation: Why You Need Taxonomic Indices Too Classical indices treat all species as interchangeable. Consider: ```{r limitation} # Community A: 5 species from 5 different orders comm_A <- c(sp1 = 20, sp2 = 20, sp3 = 20, sp4 = 20, sp5 = 20) # Community B: 5 species from the same genus comm_B <- c(sp6 = 20, sp7 = 20, sp8 = 20, sp9 = 20, sp10 = 20) cat("Community A (5 orders) - Shannon:", round(shannon(comm_A), 4), "\n") cat("Community B (1 genus) - Shannon:", round(shannon(comm_B), 4), "\n") cat("Identical scores, yet A is far more taxonomically diverse.\n") ``` This is exactly why taxdiv includes Clarke & Warwick and Ozkan pTO indices --- they incorporate the taxonomic hierarchy to distinguish between these communities. See the [Clarke & Warwick](clarke-warwick.html) and [Ozkan pTO](ozkan-pto.html) articles for details. ## References - Shannon, C.E. (1948). A mathematical theory of communication. *Bell System Technical Journal*, 27(3), 379-423. - Simpson, E.H. (1949). Measurement of diversity. *Nature*, 163, 688. - Chao, A. & Shen, T.-J. (2003). Nonparametric estimation of Shannon's index of diversity when there are unseen species in sample. *Environmental and Ecological Statistics*, 10, 429-443.