---
title: "Getting Started with pep725"
author:
  - Matthias Templ^[University of Applied Sciences Northwestern Switzerland (FHNW)]
  - Barbara Templ^[Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)]
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Getting Started with pep725}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

*Vignette 1 of 4 for the **pep725** R package.
The most current version is available on
[GitHub](https://github.com/matthias-da/pep725) and
[CRAN](https://CRAN.R-project.org/package=pep725).*

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## What is Phenology?
Phenology is the study of cyclic and seasonal natural phenomena, especially in
relation to climate and plant life cycles. In plant science, phenology focuses
on the timing of recurring biological events such as:

- **Budburst**: When dormant buds begin to open in spring
- **Leaf unfolding**: The emergence of new leaves
- **Flowering**: When flowers open (anthesis)
- **Fruit ripening**: The maturation of fruits
- **Leaf senescence**: Autumn coloring and leaf fall

These phenological phases are strongly temperature-dependent and respond sensitively to interannual climate variability as well as long-term climate change. As a result, phenological records provide some of the most direct biological evidence of climate impacts on terrestrial ecosystems.

## The PEP725 Database

The [PEP725 Pan European Phenology Database](http://www.pep725.eu/) is one of
the world's most comprehensive collections of plant phenological observations. It brings together long-term records from national phenological monitoring networks across Europe into a harmonized and openly accessible research infrastructure.
Key characteristics of PEP725 include [@templ2018; @templ2026]:

- **Over 12 million observations** from across over 30 European countries
- **Records spanning 250+ years** (some stations have data back to the 1750s)
- **Thousands of monitoring stations** spanning a wide range of elevations
- **Broad taxonomic coverage** including agricultural crops, fruit trees, wild plant and forest tree species

Observations in PEP725 are based on standardized phase definitions (largely aligned with BBCH-type schemes) and consistent observation protocols within national networks. This standardization enables comparative analyses across regions, species, and time periods, thus addresses key data integration and reusability challenges commonly encountered in large, heterogeneous phenological datasets [@templ2025].

## Why Use the pep725 R Package?

Working with phenological data presents several challenges:

1. **Data volume**: Millions of observations require efficient data handling
2. **Data quality**: Observations may contain errors or outliers
3. **Standardization**: Different countries use different formats
4. **Analysis complexity**: Calculating trends, normals, and anomalies requires
   specialized methods

The **pep725** package addresses these challenges by providing:

- Efficient data import and handling using `data.table`
- Quality assessment and outlier detection tools
- Functions for calculating phenological normals and anomalies
- Spatial analysis (gradients, synchrony)
- Publication-ready visualizations


## Installation

Before you begin, install the package:

```{r install, eval = FALSE}
# Install from CRAN (when available)
install.packages("pep725")

# Or install development version from GitHub
# First install devtools if you don't have it:
# install.packages("devtools")
devtools::install_github("matthias-da/pep725")
```

## Loading the Package

Once installed, load the package at the start of your R session:

```{r setup, message=FALSE, warning=FALSE}
library(pep725)
```

When the package loads, you'll see a startup message with tips on how to get
data.

## Getting Phenological Data

The **pep725** package offers five ways to obtain data. Understanding which
option to use is important for your workflow:

| Option | Best For | Data Size | Internet Required? |
|--------|----------|-----------|-------------------|
| **1. Download synthetic data** | Learning, tutorials, reproducible examples | ~7.24M rows | First download only |
| **2. Seed dataset** | Quick offline tests, minimal examples | ~1,300 rows | No |
| **3. Generate synthetic data** | Creating shareable versions of your data | Variable | No |
| **4. Import real PEP725 data** | Actual research and publications | ~7.24M rows | For initial download |
| **5. Use your own phenological data** | Regional networks, field trials, citizen science datasets | Variable | No |

### Why Synthetic Data?

You might wonder why we use synthetic data instead of real observations. The
original PEP725 database has usage restrictions that prevent redistribution -
you must register on the pep725.eu website and accept the data use policy. This creates a problem for
tutorials and examples that need to be reproducible.

**Synthetic data** solves this problem. It is generated to preserve the
statistical properties of real data:

- Same distributions of phenological events across seasons
- Similar correlations between variables (e.g. latitude and flowering date)
- Realistic spatial patterns across Europe
- Comparable year-to-year variability

This means you can learn all the analysis techniques using synthetic data, then
apply them confidently to real data when you're ready for actual research.

### Option 1: Download Synthetic Data (Recommended for Learning)

To learn the package, download the pre-generated synthetic dataset:

```{r download, message=FALSE}
# Download synthetic data (cached locally after first download)
pep <- pep_download()
```

**What happens here?**

1. The first time you run this, the package downloads a ~64 MB file from GitHub
2. The file is cached in a local directory (typically in your user folder)
3. Subsequent calls load from the cache instantly - no re-download needed

You can check your cache status or clear it if needed:

```{r cachestatus, message=FALSE}
# Check cache status - shows file location and size
pep_cache_info()

# If you need to clear the cache (e.g., to get an updated version):
# pep_cache_clear()
```

### Option 2: Use the Small Seed Dataset

For quick tests when you don't need the full dataset, use the included seed
data:

```{r load-seed, eval = FALSE}
# Load the included seed dataset (small, for quick tests only)
data(pep_seed)

# This is a small subset (~1,300 rows) useful for:
# - Testing code quickly
# - Working offline
# - Running examples in documentation
```

**When to use this**: If you're developing code and want fast iteration without
waiting for the full dataset to load, the seed dataset is perfect. However,
it's too small for meaningful statistical analyses.

### Option 3: Generate Your Own Synthetic Data

If you have real PEP725 data and want to create a shareable synthetic version
(e.g., for teaching or supplementary materials), use `pep_simulate()`:

```{r simulate, eval = FALSE}
# Generate synthetic data based on your real data
# This preserves statistical properties while anonymizing observations
pep_synthetic <- pep_simulate(your_real_pep_data)

# You can then share pep_synthetic freely
```

**Use case**: You've done an analysis with real data for a paper, and you want
to provide reproducible examples in supplementary materials without violating
data sharing restrictions.

### Option 4: Import Real PEP725 Data (For Research)

For actual research and publications, you'll want real data. Here's the process:

1. **Register** at [pep725.eu](http://www.pep725.eu/) (free for research)
2. **Download** data for your species/regions of interest
3. **Import** using `pep_import()`

```{r import, eval = FALSE}
# Import from downloaded PEP725 files
# Point to the folder containing your downloaded CSV files
pep <- pep_import("path/to/pep725_data/")
```

The `pep_import()` function automatically cleans and formats the data:

- Converts columns to appropriate types (factors, dates, etc.)
- Handles missing values (e.g., -9999 altitude codes become NA)
- Removes problematic columns
- Optionally adds country names based on coordinates


### Option 5: Use Your Own Plant Phenological Data

Although **pep725** is designed around the structure and conventions of the PEP725 database, it is not limited to PEP725 data.

If your dataset consists of station-based phenological observations with:

- spatial coordinates (longitude and latitude),
- observation year,
- species identifiers,
- phenophase identifiers,
- and day-of-year (DOY) information,

you can use most of the functionality in **pep725** with minimal adaptation.

The key requirement is a *PEP725-like tabular structure* — one row per observation. This structure can usually be achieved with straightforward preprocessing (e.g., renaming columns and ensuring consistent DOY formatting).

Once your data has the required columns, convert it to a `pep` object:

```{r own-data, eval = FALSE}
# Prepare your data with the required columns
my_data <- data.frame(
  s_id = c("MY001", "MY001", "MY002"),
  lon = c(10.1, 10.1, 11.3),
  lat = c(47.5, 47.5, 48.1),
  genus = c("Malus", "Malus", "Malus"),
  species = c("Malus domestica", "Malus domestica", "Malus domestica"),
  phase_id = c(60, 65, 60),
  year = c(2020, 2020, 2020),
  day = c(110, 125, 108)
)

# Convert to a pep object — validates required columns automatically
my_pep <- as.pep(my_data)
```


> **Important Note for Research**
>
> All examples in these vignettes use **synthetic data**. The patterns you see
> are realistic but not actual scientific observations.
>
> **For publishable research, always use real PEP725 data!** After registering
> at [pep725.eu](http://www.pep725.eu/), simply replace `pep_download()` with
> `pep_import("your/data/path/")`. All the analysis techniques work identically.


## Understanding the Data Structure

Now that we have data loaded, let's understand what we're working with. The
`pep` object is a special type of data frame optimized for large datasets.

```{r structure}
# View the data - the print method shows a summary
print(pep)
```

Let's get a quick visual overview. The `plot()` method for `pep` objects
provides three built-in views — a station map, a time series, and a DOY
histogram:

```{r first-look, fig.height = 4}
# Where are the stations?
plot(pep, type = "map")

# How are phenological events distributed across the year?
plot(pep, type = "histogram")
```

### What You See in the Output

The output shows you:

- **Dimensions**: Number of rows (observations) and columns
- **Column names and types**: What variables are available
- **First few rows**: A preview of actual values

### Key Columns Explained

Each row in the dataset represents **one phenological observation** - a record
of when a specific plant at a specific location reached a particular
developmental stage.

| Column | Description | Example |
|--------|-------------|---------|
| `s_id` | **Station identifier** - A unique code for each monitoring location | "AT0001" |
| `lon`, `lat` | **Geographic coordinates** - Longitude and latitude in decimal degrees | 15.5, 48.2 |
| `alt` | **Altitude** - Elevation above sea level in meters | 350 |
| `genus` | **Plant genus** - The genus name (first part of scientific name) | "Malus" |
| `species` | **Full species name** - Genus + specific epithet | "Malus domestica" |
| `phase_id` | **BBCH code** - Standardized phenological stage (explained below) | 60 |
| `year` | **Observation year** | 2015 |
| `day` | **Day of year (DOY)** - The day when the phase was observed (1-365) | 145 |
| `country` | **Country name** | "Austria" |

### Understanding Day of Year (DOY)

Phenological data uses **Day of Year (DOY)** rather than calendar dates. This
simplifies calculations and comparisons:

- DOY 1 = January 1
- DOY 91 = April 1 (approximately)
- DOY 182 = July 1 (approximately)
- DOY 274 = October 1 (approximately)
- DOY 365/366 = December 31

**Why DOY?** It makes it easy to calculate things like "flowering advanced by
10 days" without dealing with month boundaries.

## The `pep` Class

The package provides a custom class system that extends `data.table` with
phenology-specific methods. Let's verify what class our data has:

```{r class}
# Check the class hierarchy
class(pep)
```

You'll see three classes:

1. `pep` - Our custom class with phenology-specific methods
2. `data.table` - Fast data manipulation (from the data.table package)
3. `data.frame` - Base R compatibility

This means you can use:

- **pep725 functions** designed for phenological analysis
- **data.table syntax** for fast filtering and aggregation
- **Base R functions** that work with data frames

### The Summary Method

The `summary()` function is your first tool for understanding a phenological
dataset. It provides different views depending on the `by` parameter:

```{r summary}
# Default summary - shows top species by observation count
summary(pep)
```

**What this tells you**: Which species have the most data, how many stations
observe each species, and the year range of observations.

```{r summary-phase}
# Summary by phenological phase
summary(pep, by = "phase")
```

**What this tells you**: Which developmental stages are recorded, how many
observations of each, and the typical DOY (useful for understanding the
seasonal distribution of your data).

```{r summary-country}
# Summary by country (showing top 5)
summary(pep, by = "country", top = 5)
```

**What this tells you**: Geographic coverage - which countries contribute most
data, how many stations they have, and species diversity.

```{r summary-year}
# Temporal coverage summary
summary(pep, by = "year")
```

**What this tells you**: How observations are distributed across years - are
there gaps? Is coverage increasing or decreasing over time?

### Subsetting the Data

You'll often want to focus on specific subsets of the data. The package
preserves the `pep` class when you subset, so all methods continue to work:

```{r subset}
# Filter to apple data using data.table syntax
# The syntax is: dataset[row_filter, column_selection]
apple <- pep[species == "Malus domestica"]
print(apple)

# Filter by year range
recent <- pep[year >= 2000 & year <= 2015]
print(recent)

# The summary method works on subsets too
summary(recent)
```

**Tip for beginners**: The `data.table` syntax `pep[condition]` is similar to
base R's `subset()` function but much faster for large datasets. You can
combine multiple conditions with `&` (and) or `|` (or).

## Understanding BBCH Codes

The **BBCH scale** [from Biologische Bundesanstalt, Bundessortenamt und
CHemische Industrie; @meier2018] is a standardized system for describing plant
developmental stages. It's used worldwide, ensuring that "flowering" means the
same thing whether observed in Germany or Spain.

### How BBCH Codes Work

BBCH codes are two-digit numbers where:

- The **first digit** (0-9) indicates the principal growth stage
- The **second digit** indicates the secondary stage within that principal stage

| First Digit | Principal Growth Stage |
|-------------|------------------------|
| 0 | Germination / sprouting |
| 1 | Leaf development |
| 2 | Formation of side shoots / tillering |
| 3 | Stem elongation |
| 4 | Development of harvestable vegetative parts |
| 5 | Inflorescence emergence / heading |
| 6 | Flowering |
| 7 | Development of fruit |
| 8 | Ripening of fruit and seed |
| 9 | Senescence / dormancy |

### Looking Up BBCH Codes

The `bbch_description()` function translates codes to human-readable
descriptions:

```{r bbch}
# Get descriptions for BBCH codes present in your data
codes <- unique(pep$phase_id)
bbch_description(codes)
```

### Commonly Used BBCH Codes in Phenological Research

| Code | Stage | What to Look For |
|------|-------|------------------|
| 10 | First leaves | Cotyledons or first true leaves visible |
| 11 | First leaf unfolded | First true leaf fully unfolded |
| 60 | First flowers open | Beginning of flowering (anthesis) |
| 65 | Full flowering | 50% of flowers open |
| 69 | End of flowering | All petals fallen |
| 85 | Fruit coloring | Softening begins |
| 87 | Fruit ripe | Ready for harvest |
| 89 | Full ripeness | Fruit fully mature |
| 91 | Shoot growth complete | Leaves fully colored |
| 95 | Leaf fall | 50% of leaves fallen |

### Filtering by Phenological Phase

When analyzing phenology, you typically focus on specific phenophases. Here's how:

```{r filter-phase}
# Get all flowering observations (BBCH 60-69)
flowering <- pep[phase_id >= 60 & phase_id <= 69]
cat("Flowering observations:", nrow(flowering), "\n")

# Get only full flowering (BBCH 65)
full_flowering <- pep[phase_id == 65]
cat("Full flowering observations:", nrow(full_flowering), "\n")

# Get harvest-related observations
harvest <- pep[phase_id %in% c(87, 89)]
cat("Harvest observations:", nrow(harvest), "\n")
```

## Exploring Data Coverage

Before diving into analysis, it's crucial to understand what your dataset
contains. The `pep_coverage()` function provides a comprehensive overview.

### Full Coverage Report

```{r coverage-all}
# Get a complete coverage report
pep_coverage(pep)
```

This report shows you three dimensions of coverage:

1. **Temporal**: What years are covered? Any gaps?
2. **Geographical**: Which countries/regions? How many stations?
3. **Species**: Which plants are observed? How many observations each?

### Focused Coverage Analysis

You can focus on specific aspects and optionally generate plots:

```{r coverage-temporal}
# Temporal coverage with a visualization
pep_coverage(pep, kind = "temporal", plot = TRUE)
```

The temporal plot shows observation density over time. Look for:

- **Gaps**: Years with few or no observations
- **Trends**: Is data coverage increasing or decreasing over time?
- **Breakpoints**: Sudden changes (e.g. due to methodology changes)

```{r coverage-geo}
# Geographical coverage - which countries have most data
pep_coverage(pep, kind = "geographical", top = 5)
```

Geographic coverage tells you where your data comes from. This is important
because:

- Different regions have different climates
- Some countries have longer observation histories
- Network density affects spatial analyses

```{r coverage-species}
# Species coverage - which species are best represented
pep_coverage(pep, kind = "species", top = 5)
```

Species coverage shows which plants have enough data for robust analyses. As a
rule of thumb:

- **>10,000 observations**: Excellent for most analyses
- **1,000-10,000 observations**: Good for national/regional analyses
- **<1,000 observations**: Limited; may need to combine with related species

### Coverage by Groups

You can break down coverage by grouping variables:

```{r coverage-by-country}
# Temporal coverage broken down by country
cov_by_country <- pep_coverage(pep, kind = "temporal", by = "country")
# Show observations by group
cov_by_country$temporal$obs_by_group
```

This helps identify which countries have continuous records versus sporadic
observations.

## Your First Analysis: Tracking Flowering Trends

Let's put everything together with a simple but meaningful analysis: tracking
how apple flowering dates have changed over time.

```{r quick-analysis}
# Step 1: Filter to apple flowering (BBCH 60 = first flowers)
apple_flowering <- pep[species == "Malus domestica" & phase_id == 60]

cat("Apple flowering observations:", nrow(apple_flowering), "\n")
cat("Year range:", min(apple_flowering$year), "-", max(apple_flowering$year), "\n")
cat("Countries:", length(unique(apple_flowering$country)), "\n")

# Step 2: Calculate annual mean DOY
# The data.table syntax here means:
#   - Group by year
#   - Calculate mean DOY and count observations
#   - Order by year
annual_mean <- apple_flowering[, .(
  mean_doy = mean(day, na.rm = TRUE),
  n_obs = .N
), by = year][order(year)]

# Step 3: Look at the results
print(annual_mean)
```

Now let's visualize the trend:

```{r quick-analysis-plot}
# Step 4: Plot the trend with ggplot2
library(ggplot2)

p <- ggplot(annual_mean, aes(x = year, y = mean_doy)) +
  geom_point(aes(size = n_obs), color = "steelblue", alpha = 0.6) +
  geom_line(color = "steelblue", alpha = 0.4) +
  geom_smooth(method = "lm", color = "red", linetype = 2, se = FALSE) +
  labs(
    title = "Apple Flowering Date Over Time",
    x = "Year", y = "Mean Day of Year",
    size = "Observations"
  ) +
  theme_minimal()
print(p)

# Step 5: Quantify the trend
trend <- lm(mean_doy ~ year, data = annual_mean)
slope <- coef(trend)[2]
cat("\nTrend:", round(slope * 10, 2), "days per decade\n")
if (slope < 0) {
  cat("Interpretation: Flowering is getting EARLIER\n")
} else {
  cat("Interpretation: Flowering is getting LATER\n")
}
```

**Interpreting the results**: A negative slope means flowering is occurring
earlier in the year - a common finding for spring phenology in a warming
climate. A shift of -2 days per decade means that over 50 years, flowering
has advanced by about 10 days.

### Comparing with Another Species: Grapevine

Let's compare apple with grapevine (*Vitis vinifera*), which has the longest
time series in the database (records back to 1775 in some regions):

```{r quick-analysis-vine}
# Step 1: Filter to grapevine flowering (BBCH 65 = full flowering)
vine_flowering <- pep[species == "Vitis vinifera" & phase_id == 65]

cat("Grapevine flowering observations:", nrow(vine_flowering), "\n")
cat("Year range:", min(vine_flowering$year), "-", max(vine_flowering$year), "\n")
cat("Countries:", length(unique(vine_flowering$country)), "\n")

# Step 2: Calculate annual mean DOY
vine_annual <- vine_flowering[, .(
  mean_doy = mean(day, na.rm = TRUE),
  n_obs = .N
), by = year][order(year)]

# Step 3: Plot the trend
p <- ggplot(vine_annual, aes(x = year, y = mean_doy)) +
  geom_point(aes(size = n_obs), color = "purple", alpha = 0.6) +
  geom_line(color = "purple", alpha = 0.4) +
  geom_smooth(method = "lm", color = "red", linetype = 2, se = FALSE) +
  labs(
    title = "Grapevine Flowering Date Over Time",
    x = "Year", y = "Mean Day of Year",
    size = "Observations"
  ) +
  theme_minimal()
print(p)

trend <- lm(mean_doy ~ year, data = vine_annual)
slope <- coef(trend)[2]
cat("\nTrend:", round(slope * 10, 2), "days per decade\n")
```

**Why grapevine?** *Vitis vinifera* is economically important (wine production),
and its phenology is closely monitored. The long historical records make it
ideal for studying long-term climate trends.

## Common Pitfalls and Tips

As you start working with phenological data, keep these points in mind:

### Data Quality Matters

- **Always check for outliers**: A flowering date of DOY 300 (late October) for
  apple is almost certainly an error
- **Consider sample sizes**: Trends from 5 observations are not reliable
- **Check spatial coverage**: Trends might differ between regions


### Species Considerations

- **Check taxonomy**: "Malus" might include multiple apple species
- **Consider subspecies**: Some records have subspecies/variety information
- **Wild vs. cultivated**: Agricultural species may behave differently than
  wild populations

## Next Steps

Now that you understand the basics, explore the other vignettes for more
advanced analyses:

### Phenological Analysis Vignette

Learn to calculate phenological normals (long-term averages), detect anomalies
(unusual years), and analyze trends using:

- `pheno_normals()` - Calculate climatological baselines
- `pheno_anomaly()` - Identify deviations from normal
- `pheno_trend_turning()` - Detect changes in trend direction

```{r eval = FALSE}
vignette("phenological-analysis", package = "pep725")
```

### Spatial Phenological Patterns Vignette

Analyze how phenology varies across landscapes:

- `pheno_gradient()` - Quantify elevation and latitude effects
- `pheno_synchrony()` - Measure spatial coherence
- `pheno_map()` - Create maps of phenological patterns

```{r eval = FALSE}
vignette("spatial-patterns", package = "pep725")
```

### Data Quality Assessment Vignette

Ensure your data is reliable and investigate unusual observations:

- `pep_flag_outliers()` - Identify suspicious observations
- `pep_plot_outliers()` - Visualize outliers in context

```{r eval = FALSE}
vignette("data-quality", package = "pep725")
```

## Getting Help

If you encounter problems:

1. **Check the function documentation**: `?function_name` (e.g., `?pheno_normals`)
2. **Read the vignettes**: They contain worked examples
3. **Report issues**: [github.com/matthias-da/pep725/issues](https://github.com/matthias-da/pep725/issues)

## Session Info

For reproducibility, here's the R environment used for this vignette:

```{r session}
sessionInfo()
```

## References