---
title: "Phenological Analysis"
author:
- Matthias Templ^[University of Applied Sciences Northwestern Switzerland (FHNW)]
- Barbara Templ^[Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)]
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Phenological Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

*Vignette 3 of 4 for the **pep725** R package. The most current version is available on [GitHub](https://github.com/matthias-da/pep725) and [CRAN](https://CRAN.R-project.org/package=pep725).*

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

This vignette teaches you how to perform core phenological analyses using the **pep725** package. By the end, you'll be able to:

1. Calculate **phenological normals** (long-term baselines)
2. Detect **anomalies** (years that deviate from normal)
3. Assess **data quality** before analysis
4. Analyze and visualize **trends** over time
5. Link phenology to **climate variables**

These analyses form the foundation of phenological research, whether you're studying climate change impacts, agricultural timing, or ecosystem dynamics.

### What You'll Learn

| Analysis | What It Tells You | Key Function |
|----------|-------------------|--------------|
| Normals | "What is typical?" | `pheno_normals()` |
| Anomalies | "Which years are unusual?" | `pheno_anomaly()` |
| Quality | "Can I trust this data?" | `pep_quality()` |
| Trends | "Is timing changing?" | `pheno_trend_turning()` |
| Climate links | "Why is it changing?" | `pheno_regional()` |

### Prerequisites

This vignette assumes you've read the **Getting Started** vignette and understand:

- How to load phenological data
- The structure of the `pep` object
- Basic BBCH codes and DOY (day of year)

## Setup

Let's load the package and prepare our data:

```{r setup, message=FALSE, warning=FALSE}
library(pep725)

# Download the synthetic dataset
pep <- pep_download()

# For this vignette, we'll focus on two well-represented species:
# 1. Apple (Malus domestica) - excellent coverage across Europe
# 2. Grapevine (Vitis vinifera) - longest historical records
apple <- pep[species == "Malus domestica"]
vine <- pep[species == "Vitis vinifera"]

# Verify subsets contain data
stopifnot(nrow(apple) > 0, nrow(vine) > 0)

# Quick overview of our data
cat("=== Apple (Malus domestica) ===\n")
cat("Observations:", nrow(apple), "\n")
cat("Year range:", min(apple$year), "-", max(apple$year), "\n")
cat("Stations:", length(unique(apple$s_id)), "\n\n")

cat("=== Grapevine (Vitis vinifera) ===\n")
cat("Observations:", nrow(vine), "\n")
cat("Year range:", min(vine$year), "-", max(vine$year), "\n")
cat("Stations:", length(unique(vine$s_id)), "\n")
```

**Why these species?**

- *Malus domestica* (apple) has excellent coverage across Europe with clear phenological phases (bud burst, flowering, fruit development)
- *Vitis vinifera* (grapevine) has the longest historical records (back to 1775) and is economically important for wine production

## Part 1: Phenological Normals

### What are Phenological Normals?

In climatology, "climate normals" are defined as 30-year averages of temperature or precipitation used to characterise typical climatic conditions. By analogy, we propose the term **phenological normals** to describe long-term averages of the timing of recurring biological events.

**Why are normals important?**

1. **Baseline for comparison**: Describing an event as “early” or “late” is only meaningful relative to when it typically occurs.
2. **Detecting change**: Comparing normals from different reference periods (e.g., 1961-1990 vs. 1991-2020) highlights long-term shifts in phenological timing.
3. **Spatial patterns**: Normals vary across regions, reflecting differences in climate, latitude and elevation.
4. **Model validation**: Phenological models predict normals; observations validate them.

### Standard Reference Periods

The World Meteorological Organization (WMO) defines standard 30-year periods for calculating normals:

| Period | Use Case |
|--------|----------|
| **1961-1990** | Historical baseline, often used as "pre-warming" reference |
| **1991-2020** | Current WMO standard normal period |
| **Custom** | Any period relevant to your research question |

**Important**: The period you choose affects your conclusions! If you compare 2023 to a 1961-1990 baseline, you might find larger anomalies than comparing to a 1991-2020 baseline (because the recent period already includes warming).

### Calculating Normals with pheno_normals()

Let's calculate normals for apple:

```{r normals-basic}
# Calculate normals for apple across countries and phases
# Using 1990-2015 for illustration (ensures enough data in synthetic dataset)
normals <- pheno_normals(
  pep = apple,
  period = 1990:2015,
  by = c("country", "phase_id"),
  min_years = 10
)

print(normals)
```

**Understanding the parameters:**

| Parameter | What it does | Our choice |
|-----------|--------------|------------|
| `pep` | Input data | Apple observations |
| `period` | Years to include in normal | 1990-2015 (26 years) |
| `by` | Grouping variables | Country and phase (separate normals for each) |
| `min_years` | Minimum years required | 10 (groups with fewer years get NA) |

### Understanding the Output Statistics

The function returns several statistics for each group.
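These statistics are ordinary summaries of the annual DOY values within each group, so they are easy to reason about. A minimal base-R sketch on toy data (the values are invented for illustration; this is not the package's internal code):

```r
# Toy series: annual mean flowering DOYs for one country/phase group
doy <- c(110, 108, 115, 105, 112, 99, 118, 107, 111, 104)

# Roughly the summaries reported per group
normal <- c(
  n_years    = length(doy),
  mean_doy   = mean(doy),
  median_doy = median(doy),                  # robust central tendency
  sd_doy     = sd(doy),
  iqr_doy    = IQR(doy),                     # robust spread
  mad_doy    = mad(doy),                     # scaled median absolute deviation
  q05        = unname(quantile(doy, 0.05)),
  q95        = unname(quantile(doy, 0.95))
)
normal
```

Note that R's `mad()` applies a scaling constant of 1.4826 by default, which makes it comparable to an SD under normality.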
Here's what these mean and when to use each:

| Statistic | Description | When to Use |
|-----------|-------------|-------------|
| `n_years` | Years with data | Check data availability |
| `mean_doy` | Arithmetic mean DOY | Standard measure, but sensitive to outliers |
| `median_doy` | Middle value | **Recommended** - robust to outliers |
| `sd_doy` | Standard deviation | Describes year-to-year variability |
| `iqr_doy` | Interquartile range | Robust measure of spread |
| `mad_doy` | Median absolute deviation | Most robust spread measure |
| `q05`-`q95` | Percentiles | Define "early" (q05) and "late" (q95) thresholds |

**Which measure should you use?**

- For **central tendency**: Use `median_doy` (robust to outliers)
- For **variability**: Use `mad_doy` or `iqr_doy` (robust)
- For **defining extreme events**: Use percentiles (e.g., q05, q95)

```{r normals-summary}
# Get a summary of the normals
summary(normals)
```

### Visualizing Normals

The `plot()` method shows the median DOY with interquartile range (Q25-Q75) for each group:

```{r normals-plot, fig.width=7, fig.height=5}
plot(normals)
```

A bar chart variant is also available with `plot(normals, which = "bar")`.

### Filtering to Specific Phases

Often you want normals for a specific phenological phase:

```{r normals-filtered}
# Normals for apple first flowering (BBCH 60) only
flowering_normals <- pheno_normals(
  pep = pep,                    # Can use full dataset
  period = 1990:2015,
  species = "Malus domestica",  # Filter by species
  phase_id = 60,                # Filter by phase (first flowers)
  by = "country",               # Group only by country
  min_years = 5                 # Lower threshold for more countries
)

print(flowering_normals)
```

**Note on NA values:** You'll notice some countries show `NA` for all statistics. This happens when a country has fewer than `min_years` years of data. This is intentional - calculating a "normal" from just 2-3 years would be misleading.
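The `min_years` guard can be mimicked in base R: compute a median per group, but return `NA` when too few years are available. A toy sketch (the data and threshold below are made up for illustration):

```r
# Toy data: annual DOYs for two countries; "SI" has only 3 years of data
obs <- data.frame(
  country = c(rep("AT", 10), rep("SI", 3)),
  doy     = c(112, 108, 115, 105, 110, 99, 118, 107, 111, 104,
              108, 112, 115)
)
min_years <- 5

# Median per country, NA when fewer than min_years years are available
med_or_na <- function(x) if (length(x) < min_years) NA_real_ else median(x)
res <- aggregate(doy ~ country, data = obs, FUN = med_or_na)
res
```

The "SI" row survives in the output but carries `NA`, mirroring how under-sampled groups appear in the normals table.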
The print output tells you "Groups with valid normals: X / Y" so you know how many groups had sufficient data.

### Comparing Two Time Periods

A powerful application is comparing normals between periods to detect shifts:

```{r normals-compare, warning=FALSE, message=FALSE}
# Calculate normals for two periods

# Period 1: Historical (1961-1990)
period1 <- pheno_normals(
  pep,
  period = 1961:1990,
  species = "Malus domestica",
  phase_id = 60,
  by = "country",
  min_years = 10
)

# Period 2: Recent (1991-2020)
period2 <- pheno_normals(
  pep,
  period = 1991:2020,
  species = "Malus domestica",
  phase_id = 60,
  by = "country",
  min_years = 10
)

# Compare the two periods by merging on country
# Negative shift = flowering got earlier
# Positive shift = flowering got later
if (nrow(period1) > 0 && nrow(period2) > 0) {
  comparison <- merge(
    period1[, .(country, mean_doy_early = mean_doy)],
    period2[, .(country, mean_doy_recent = mean_doy)],
    by = "country"
  )
  comparison[, shift := mean_doy_recent - mean_doy_early]

  cat("Countries compared:", nrow(comparison), "\n")
  cat("Mean shift:", round(mean(comparison$shift, na.rm = TRUE), 1), "days\n")
  cat("Range of shifts:", round(min(comparison$shift, na.rm = TRUE), 1), "to",
      round(max(comparison$shift, na.rm = TRUE), 1), "days\n")

  if (mean(comparison$shift, na.rm = TRUE) < 0) {
    cat("\nInterpretation: On average, flowering has shifted EARLIER\n")
  } else {
    cat("\nInterpretation: On average, flowering has shifted LATER\n")
  }
}
```

**Ecological interpretation**: A shift of -5 days (earlier) over 30 years corresponds to about 1.7 days per decade.
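The merge-and-shift logic above reduces to a few lines of base R. A standalone sketch with hypothetical country-level normals (all values invented for illustration):

```r
# Hypothetical median flowering DOYs for two 30-year reference periods
early  <- data.frame(country = c("AT", "DE", "CH"), doy = c(118, 121, 116))
recent <- data.frame(country = c("AT", "DE", "CH"), doy = c(112, 117, 113))

shift <- merge(early, recent, by = "country", suffixes = c("_early", "_recent"))
shift$days <- shift$doy_recent - shift$doy_early   # negative = earlier

# The midpoints of 1961-1990 and 1991-2020 lie 30 years apart,
# so dividing by 3 gives a days-per-decade rate
shift$per_decade <- round(shift$days / 3, 2)
shift
```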
### Comparing Species: Grapevine with Longer History

Grapevine records extend much further back, enabling comparison across longer time periods:

```{r normals-vine, warning=FALSE, message=FALSE}
# Calculate normals for grapevine flowering
vine_normals <- pheno_normals(
  pep = vine,
  period = 1990:2015,
  phase_id = 65,  # Full flowering
  by = "country",
  min_years = 5
)

# Compare with apple for countries that have both
if (nrow(vine_normals) > 0) {
  cat("Grapevine flowering normals (BBCH 65):\n")
  print(vine_normals[!is.na(mean_doy), .(country, n_years, mean_doy, sd_doy)])
}
```

## Part 2: Phenological Anomalies

### What are Anomalies?

An **anomaly** is the deviation of a single year from the long-term normal. Anomalies help you:

1. **Identify extreme years**: Which years had unusually early or late phenology?
2. **Link to climate**: Do years with early phenology correspond to warm springs?
3. **Assess trends**: Are anomalies becoming more frequent or extreme?

### How Anomalies are Calculated

For each year, the anomaly is calculated as:

```
Anomaly = Observed DOY - Normal DOY
```

- **Negative anomaly** = Earlier than normal
- **Positive anomaly** = Later than normal

The anomaly can be expressed in:

- **Days**: Direct difference (e.g., "10 days early")
- **Z-score**: Standardized by variability (e.g., "2 standard deviations early")
- **Percentile**: Position in historical distribution (e.g., "5th percentile")

### Calculating Anomalies with pheno_anomaly()

```{r anomalies-basic}
# Calculate anomalies for apple
# Using 1990-2010 as baseline period
anomalies <- pheno_anomaly(
  pep = apple,
  baseline_period = 1990:2010,
  by = c("country", "phase_id"),
  min_years = 5
)

print(anomalies)
```

### Understanding Anomaly Metrics

The function returns several ways to express the anomaly:

| Metric | Description | Interpretation |
|--------|-------------|----------------|
| `anomaly_days` | Difference in days | Negative = early, positive = late |
| `z_score` | Standardized anomaly | \|z\| > 2 is unusual, > 3 is extreme |
| `percentile` | Position in distribution | <10 very early, >90 very late |
| `is_extreme` | Boolean flag | TRUE if outside normal range |
| `direction` | Text label | "early", "late", or "normal" |

**Which metric to use?**

- **For communication**: Use `anomaly_days` ("flowering was 10 days early")
- **For statistics**: Use `z_score` (allows comparison across regions/phases)
- **For frequency analysis**: Use `percentile` ("a 1-in-20 year event")

### Identifying Extreme Years

Let's find the most extreme years in our data:

```{r anomalies-extreme}
# Find extreme years (is_extreme == TRUE)
extreme <- anomalies[is_extreme == TRUE]

if (nrow(extreme) > 0) {
  cat("Total extreme years found:", nrow(extreme), "\n\n")

  # Show the most extreme early years
  cat("Most extreme EARLY years:\n")
  early <- extreme[direction == "early"][order(anomaly_days)]
  if (nrow(early) > 0) {
    print(early[1:min(5, nrow(early)),
                .(country, phase_id, year, anomaly_days, z_score)])
  }

  # Show the most extreme late years
  cat("\nMost extreme LATE years:\n")
  late <- extreme[direction == "late"][order(-anomaly_days)]
  if (nrow(late) > 0) {
    print(late[1:min(5, nrow(late)),
               .(country, phase_id, year, anomaly_days, z_score)])
  }
} else {
  cat("No extreme years detected in this dataset\n")
}
```

**What makes a year "extreme"?** By default, years with |z-score| > 2 are flagged as extreme. This corresponds to roughly the most unusual 5% of years (2.5% early + 2.5% late).

### Summary of Anomalies

```{r anomalies-summary}
summary(anomalies)
```

### Visualizing Anomalies

The `plot()` method shows anomalies over time, coloured by direction (early vs. late), with extreme events marked:

```{r anomalies-plot, fig.width=7, fig.height=5}
plot(anomalies)
```

A histogram of anomaly magnitudes is available with `plot(anomalies, which = "histogram")`.

### Robust vs. Classical Methods

Phenological data often contains outliers - observations that are clearly erroneous or affected by unusual local conditions. The anomaly function offers two approaches.

The **robust method** (`robust = TRUE`, default and recommended) uses the median as the baseline and the MAD (median absolute deviation) for standardization. This approach is less affected by outliers and should be preferred in most situations.

The **classical method** (`robust = FALSE`) uses the mean as the baseline and the standard deviation (SD) for standardization. This approach is more affected by outliers.

**When to use each:**

- **Robust (default)**: Almost always. Phenological data commonly has outliers.
- **Classical**: Only if you've already cleaned outliers, or for comparison with older literature that used classical methods.

## Part 3: Data Quality Assessment

### Why Quality Matters

Before running any analysis, you should understand your data's quality. Poor quality data can lead to:

- **False trends**: A few erroneous observations can create apparent shifts
- **Inflated variability**: Outliers increase standard deviations
- **Spatial bias**: If quality varies by region, comparisons are compromised

### Quality Dimensions

The `pep_quality()` function assesses three dimensions:

| Dimension | Question | Metric |
|-----------|----------|--------|
| **Completeness** | How many years have data? | % of period covered |
| **Continuity** | Are there gaps? | Maximum consecutive missing years |
| **Consistency** | Are there outliers? | % of observations flagged |

### Running Quality Assessment

```{r quality-basic}
# Assess quality at the station level
quality <- pep_quality(
  pep = apple,
  by = c("s_id", "phase_id")
)

print(quality)
```

### Understanding Quality Grades

The function assigns letter grades based on thresholds:

| Grade | Completeness | Max Gap | Outliers | Interpretation |
|-------|--------------|---------|----------|----------------|
| **A** | ≥ 80% | ≤ 2 years | < 2% | Excellent - use confidently |
| **B** | ≥ 60% | ≤ 5 years | < 5% | Good - suitable for most analyses |
| **C** | ≥ 40% | ≤ 10 years | < 10% | Fair - use with caution |
| **D** | Below C | - | - | Poor - consider excluding |

**What grade do you need?**

- **Trend analysis**: Grade A or B recommended (need continuous data)
- **Normal calculation**: Grade B or C acceptable (gaps less critical)
- **Spatial analysis**: Grade C may suffice (using many stations)

```{r quality-summary}
# Summary of quality across all stations
summary(quality)
```

### Visualizing Quality Assessment

With many stations to assess, visual inspection is essential.
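The grade thresholds tabulated above can be written down directly as a small decision function. This is a sketch of the logic, not the package's internal implementation:

```r
# Assign a letter grade from completeness (%), maximum gap (years)
# and outlier share (%), using the thresholds from the grade table
quality_grade <- function(completeness, max_gap, outlier_pct) {
  if (completeness >= 80 && max_gap <= 2  && outlier_pct < 2)  return("A")
  if (completeness >= 60 && max_gap <= 5  && outlier_pct < 5)  return("B")
  if (completeness >= 40 && max_gap <= 10 && outlier_pct < 10) return("C")
  "D"  # fails at least one grade-C threshold
}

quality_grade(85, 1, 0.5)  # meets every grade-A threshold
quality_grade(45, 8, 12)   # outlier share fails even the grade-C cut-off
```

Note that a station must pass all three criteria of a grade; failing any single one drops it to the next grade down.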
The `plot()` method provides an overview of quality metrics:

```{r quality-plot, fig.height=5, fig.width=11}
# Overview plot: grade distribution + station map (requires pep for coordinates)
plot(quality, pep = apple)
```

**Reading the plot:**

- **Left panel**: Grade distribution shows how many stations fall into each category
- **Right panel**: Map shows station locations colored by their quality grade, revealing spatial patterns in data quality

You can also generate individual plots:

```{r quality-plot-grades, fig.height=4}
# Just the grade distribution (no pep data needed)
plot(quality, which = "grades")
```

```{r quality-plot-map, fig.height=5}
# Just the map of station quality
plot(quality, which = "map", pep = apple)
```

### Filtering Data by Quality

A critical step before analysis is filtering to high-quality data:

```{r quality-filter}
# Get high-quality stations
high_quality <- quality[quality_grade %in% c("A", "B")]

cat("Quality distribution:\n")
print(table(quality$quality_grade))

cat("\nHigh quality (A or B):", nrow(high_quality), "of", nrow(quality),
    "(", round(100 * nrow(high_quality) / nrow(quality), 1), "%)\n")

# Filter original data to these stations
if (nrow(high_quality) > 0) {
  good_stations <- unique(high_quality$s_id)
  apple_hq <- apple[s_id %in% good_stations]

  cat("\nOriginal apple observations:", nrow(apple), "\n")
  cat("After quality filter:", nrow(apple_hq),
      "(", round(100 * nrow(apple_hq) / nrow(apple), 1), "%)\n")
}
```

### Assessing Quality at Different Levels

You can assess quality at various aggregation levels:

```{r quality-country}
# Country-level quality (coarser assessment)
country_quality <- pep_quality(
  pep = pep,
  by = c("country", "species", "phase_id")
)

print(country_quality)
```

## Part 4: Complete Analytical Workflow

Now let's put it all together in a realistic workflow:

```{r workflow}
cat("=== STEP 1: Data Quality Assessment ===\n\n")

# Assess quality
quality <- pep_quality(apple, by = c("s_id", "phase_id"))
cat("Quality grades:\n")
print(table(quality$quality_grade))

cat("\n=== STEP 2: Filter to Good Quality Data ===\n\n")

# Keep grades A, B, and C (exclude only grade D)
good_stations <- quality[quality_grade %in% c("A", "B", "C"), s_id]
apple_clean <- apple[s_id %in% good_stations]

cat("Original observations:", nrow(apple), "\n")
cat("After quality filter:", nrow(apple_clean), "\n")
cat("Retained:", round(100 * nrow(apple_clean) / nrow(apple), 1), "%\n")
stopifnot(nrow(apple_clean) > 0)

cat("\n=== STEP 3: Calculate Baseline Normals ===\n\n")

normals <- pheno_normals(
  apple_clean,
  period = 1990:2010,
  by = c("country", "phase_id"),
  min_years = 5
)
cat("Normals calculated for", sum(!is.na(normals$mean_doy)), "groups\n")
cat("Mean DOY range:", round(min(normals$mean_doy, na.rm = TRUE), 0), "to",
    round(max(normals$mean_doy, na.rm = TRUE), 0), "\n")

cat("\n=== STEP 4: Calculate Anomalies ===\n\n")

anomalies <- pheno_anomaly(
  apple_clean,
  baseline_period = 1990:2010,
  by = c("country", "phase_id"),
  min_years = 5
)

# Convert to plain data.table so grouped aggregation works correctly
# (the pheno_anomaly S3 class can interfere with [.data.table dispatch)
anomalies_dt <- as.data.table(anomalies)

cat("Anomalies calculated for", sum(!is.na(anomalies_dt$anomaly_days)), "year-groups\n")
cat("Extreme years:", sum(anomalies_dt$is_extreme, na.rm = TRUE), "\n")

cat("\n=== STEP 5: Analyze Trends by Decade ===\n\n")

decade_summary <- anomalies_dt[
  !is.na(anomaly_days),
  .(
    mean_anomaly = round(mean(anomaly_days, na.rm = TRUE), 1),
    sd_anomaly = round(sd(anomaly_days, na.rm = TRUE), 1),
    n_obs = .N,
    pct_extreme = round(100 * sum(is_extreme, na.rm = TRUE) / .N, 1)
  ),
  by = .(decade = floor(year / 10) * 10)
][order(decade)]

print(decade_summary)

cat("\nInterpretation:\n")
first_decade <- decade_summary$mean_anomaly[1]
last_decade <- decade_summary$mean_anomaly[nrow(decade_summary)]
change <- last_decade - first_decade

if (change < -3) {
  cat("- Strong shift toward EARLIER phenology\n")
} else if (change < 0) {
  cat("- Moderate shift toward earlier phenology\n")
} else if (change > 3) {
  cat("- Strong shift toward LATER phenology\n")
} else {
  cat("- No clear directional change\n")
}
```

## Part 5: Visualizing Trends

### Time Series Plots

The `pheno_plot_timeseries()` function creates publication-ready time series:

```{r timeseries-basic, fig.height=5}
# Prepare aggregated data for visualization
apple_annual <- apple[, .(
  day = mean(day, na.rm = TRUE),
  n = .N
), by = .(year, species, phase_id)]

# Add human-readable phase labels
apple_annual[, phase := factor(phase_id,
                               levels = c(60, 65, 100),
                               labels = c("First flowers", "Full flowering", "Harvest"))]

# Remove rows where phase mapping failed
apple_annual <- apple_annual[!is.na(phase)]

# Plot time series with trend lines
if (nrow(apple_annual) > 0) {
  pheno_plot_timeseries(
    apple_annual,
    color_by = "phase",
    smooth = TRUE,
    title = "Apple Phenology Trends"
  )
}
```

**Reading the plot:**

- Each line represents a phenological phase
- The smoothed line shows the underlying trend
- Downward trends indicate earlier phenology
- The spread around the line shows year-to-year variability

### Detecting Trend Turning Points

Sometimes trends aren't linear - they may accelerate, slow, or even reverse.
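Turning-point detection builds on the ordinary Mann-Kendall statistic, which is simple enough to compute by hand. A minimal sketch without a ties correction (the package's implementation may differ in detail):

```r
# Mann-Kendall S statistic with normal approximation (no ties correction)
mk_z <- function(x) {
  n <- length(x)
  s <- 0
  for (i in seq_len(n - 1)) {
    # concordant minus discordant pairs involving x[i]
    s <- s + sum(sign(x[(i + 1):n] - x[i]))
  }
  var_s <- n * (n - 1) * (2 * n + 5) / 18
  (s - sign(s)) / sqrt(var_s)  # continuity-corrected z
}

# A mostly decreasing series (flowering getting earlier) gives z < -1.96
mk_z(c(120, 118, 119, 115, 114, 112, 110, 111, 108, 105))
```

The sequential variant applies this statistic progressively from the start and retrogradely from the end of the series, and looks for crossings of the two curves.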
The `pheno_trend_turning()` function uses the sequential Mann-Kendall test to detect such changes:

```{r trends-turning, fig.height=5}
# Get apple flowering data
apple_flowering <- pep[species == "Malus domestica" & phase_id == 60]

# Detect turning points
turning <- pheno_trend_turning(apple_flowering, min_years = 10)
print(turning)

# Visualize the analysis
plot(turning)
```

**Understanding the plot:**

- **Blue line (Progressive tau)**: Trend calculated from the start forward
- **Orange line (Retrograde tau)**: Trend calculated from the end backward
- **Crossing points**: Where trends may have changed direction
- **Dashed lines**: Statistical significance thresholds (±1.96 for 95%, ±2.58 for 99%)

### Simple Trend Statistics

For a single trend value, use `kendall_tau()`:

```{r kendall-simple}
# Aggregate to annual means
annual_doy <- apple_flowering[, .(
  mean_doy = mean(day, na.rm = TRUE)
), by = year][order(year)]

# Calculate Mann-Kendall test statistic
# Note: kendall_tau() returns the standardized z-statistic, not Kendall's
# tau correlation. Values beyond ±1.96 indicate significant trends.
tau <- kendall_tau(annual_doy$mean_doy)
cat("Mann-Kendall z-statistic:", round(tau, 3), "\n")

cat("\nInterpretation:\n")
if (tau < -2.58) {
  cat("- Highly significant earlier trend (p < 0.01)\n")
} else if (tau < -1.96) {
  cat("- Significant earlier trend (p < 0.05)\n")
} else if (tau > 2.58) {
  cat("- Highly significant later trend (p < 0.01)\n")
} else if (tau > 1.96) {
  cat("- Significant later trend (p < 0.05)\n")
} else {
  cat("- No statistically significant trend\n")
}
```

**Why Kendall's tau?**

- Non-parametric (doesn't assume normality)
- Robust to outliers
- Works well with ordinal data
- Widely used in phenological research

### Linking Phenology to Climate

A central goal of phenological research is to understand how climate drives the timing of biological events.
The `pheno_regional()` function compiles multi-station phenological data into regional time series and links them to global temperature anomalies, enabling you to assess **climate sensitivity** - how many days earlier (or later) phenology occurs per degree of warming.

```{r regional-climate, eval=FALSE}
# Compile regional data and link to temperature anomalies
regional_data <- pheno_regional(
  pep = pep,
  species_name = "Malus domestica",
  year_min = 1961
)

# Visualize phenology alongside temperature anomalies
pheno_plot(regional_data)
```

**What the output shows:**

- **Regional median DOY** per year, aggregated from individual stations
- **Global temperature anomaly** for context (GISS dataset)
- **Correlation** between phenological timing and temperature

**Typical climate sensitivities** for European spring phenology:

| Phase | Sensitivity | Meaning |
|-------|-------------|---------|
| Leaf unfolding | -2 to -4 days/°C | Moderate response to spring warming |
| Flowering | -3 to -5 days/°C | Strong response, especially for fruit trees |
| Harvest | -1 to -3 days/°C | Weaker response, partly compensated by summer processes |

These values mean that for every 1°C increase in spring temperature, flowering tends to occur 3-5 days earlier. Over the past 50 years, this has translated to an advance of roughly 2-5 days per decade across Central Europe.

## Best Practices Summary

### Before Analysis

1. **Check data quality** with `pep_quality()` - always!
2. **Filter to appropriate quality grades** - usually A, B, or C
3. **Verify sample sizes** - are there enough years/stations?

### Calculating Normals

4. **Choose reference period carefully** - it affects your conclusions
5. **Use median_doy** for robust central tendency
6. **Set appropriate min_years** - WMO recommends 30, but 10-20 often works

### Calculating Anomalies

7. **Use robust method** (default) - handles outliers automatically
8. **Consider z-scores** for cross-region/phase comparisons
9. **Check for extreme years** - investigate them!

### Analyzing Trends

10. **Use non-parametric methods** like Kendall's tau
11. **Look for trend changes** with `pheno_trend_turning()`
12. **Consider multiple phases** - they may show different patterns

### Reporting

13. **Document your methods** - period, thresholds, filters
14. **Report sample sizes** - readers need to assess reliability
15. **Include uncertainty** - confidence intervals or standard errors

## Next Steps

You now understand the core analytical functions. Continue to:

- **Spatial Phenological Patterns**: How does phenology vary with elevation and latitude? Use `pheno_gradient()` and `pheno_synchrony()`.

  ```{r eval = FALSE}
  vignette("spatial-patterns", package = "pep725")
  ```

- **Data Quality Assessment**: How to detect outliers and validate data completeness.

  ```{r eval = FALSE}
  vignette("data-quality", package = "pep725")
  ```

## Session Info

```{r session}
sessionInfo()
```