--- title: "Data Quality Assessment" author: - Matthias Templ^[University of Applied Sciences Northwestern Switzerland (FHNW)] - Barbara Templ^[Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)] date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: references.bib vignette: > %\VignetteIndexEntry{Data Quality Assessment} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- *Vignette 2 of 4 for the **pep725** R package. The most current version is available on [GitHub](https://github.com/matthias-da/pep725) and [CRAN](https://CRAN.R-project.org/package=pep725).* ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Introduction Data quality assessment is a critical but often overlooked step in phenological analysis. Before calculating trends, normals, or climate sensitivities, you need to understand what's in your data and make informed decisions about unusual observations. ### The Data Quality Challenge Long-term phenological datasets such as PEP725 are compiled from many observers over decades. This richness enables large-scale analyses, but it also introduces data quality challenges that need to be assessed: | Issue Type | Examples | Consequence if Ignored | |------------|----------|------------------------| | **Recording errors** | Typographical errors (DOY 150 instead of 105), incorrect year | Outliers bias statistics | | **Identification errors** | Confusion between morphologically similar species | Inconsistent or mixed time series | | **Protocol changes** | Changes in the definition of phenological phases over time | Artificial shifts or trends | | **Biological anomalies** | Irregular phenological events as a result of climate extremes | May represent real biological signals | The following sections introduce practical tools in pep725 package for diagnosing and handling these issues in a transparent and reproducible way. 
### What You'll Learn | Section | Topic | Key Functions | |---------|-------|---------------| | Part 1 | Detecting statistical outliers | `pep_flag_outliers()`, `pep_plot_outliers()` | | Part 2 | Temporal coverage and completeness | `pep_completeness()` | | Part 3 | Phase presence validation | `pep_check_phases()` | | Part 4 | Integrated workflow | Combining all approaches | ### Prerequisites This vignette assumes you are familiar with: - Basic phenological concepts (from "Getting Started" vignette) - Day of Year (DOY) as a timing metric - BBCH phenological phase codes ## Setup ```{r setup, message=FALSE, warning=FALSE} library(pep725) library(data.table) # Download the synthetic dataset pep <- pep_download() # For this vignette, we'll focus on flowering phases flowering <- pep[phase_id %in% c(60, 65)] cat("Flowering observations:", nrow(flowering), "\n") cat("Species:", length(unique(flowering$species)), "\n") cat("Year range:", min(flowering$year), "-", max(flowering$year), "\n") ``` --- # Part 1: Outlier Detection ## Why Detect Outliers? An **outlier** is an observation that differs substantially from the expected pattern. In phenological data, outliers may indicate: 1. **Recording errors** that should be excluded from analysis - A typo: DOY 250 instead of DOY 150 (roughly 3 months difference) - Wrong year entered: observation assigned to incorrect season 2. **Unusual weather events** worth investigating - Very warm winter causing early flowering - Late frost delaying spring phenology 3. **Second flowering** or other abnormal phenological events - Plants flowering again in autumn/winter after spring flowering - Increasingly observed with climate change **The challenge**: You can't simply delete all outliers. Some are errors, but others are scientifically valuable observations of unusual events. ## Statistical Methods for Outlier Detection The `pep_flag_outliers()` function provides four statistical methods for identifying potential outliers. 
Each method has different assumptions and strengths: ### Method 1: 30-Day Rule (Simple Threshold) The simplest approach: flag any observation more than 30 days from the median for that species/phase combination. ```{r flag-basic} # Flag outliers using the default 30-day rule outliers <- pep_flag_outliers( pep = flowering, method = "30day", by = c("species", "phase_id") ) print(outliers) ``` **How it works:** ``` Distribution of flowering dates ← Early Late → ----|----[=======|=======]----|----- ^ ^ ^ -30 days median +30 days Anything outside the brackets is flagged as an outlier ``` ### Understanding the Output The function adds new columns to your original data: | Column | What It Contains | How to Use It | |--------|------------------|---------------| | `is_outlier` | TRUE/FALSE flag | Filter data: `data[is_outlier == FALSE]` | | `deviation` | Days from expected | Prioritize investigation: larger = more extreme | | `expected_doy` | Reference DOY (median) | Understand what "normal" looks like | ### Method 2: MAD (Median Absolute Deviation) - Recommended The MAD method is **robust to the presence of multiple outliers**. This is important because if your data contains many outliers, they can distort mean-based methods. 
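Before applying it to real data, this robustness claim is easy to verify with a few toy values in base R. The DOY values below are invented for illustration; `mad()` uses its default scaling constant (1.4826), which makes the result comparable to a standard deviation:

```{r mad-robust-demo}
# Ten made-up flowering dates (DOY): seven plausible spring values plus
# three gross errors (a wrong season, a wrong month, a January typo)
doy <- c(115, 118, 120, 121, 119, 122, 117, 250, 245, 12)

# The mean and SD are dragged around by the three bad values ...
c(mean = mean(doy), sd = sd(doy))

# ... while the median and MAD barely move
med <- median(doy)   # robust centre: 119.5
s   <- mad(doy)      # robust spread (scaled MAD)

# Flag observations more than 3 MADs from the median
flagged <- abs(doy - med) / s > 3
doy[flagged]         # 250 245 12 -- exactly the three errors
```

This is the same rule `pep_flag_outliers(method = "mad", threshold = 3)` applies, computed separately within each species/phase group.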
```{r flag-methods, eval=FALSE} # MAD method: flag if > 3 MAD from median (robust) outliers_mad <- pep_flag_outliers(flowering, method = "mad", threshold = 3) ``` **Why MAD is recommended:** | Scenario | Mean-based method | MAD method | |----------|-------------------|------------| | 1-2 outliers | Works reasonably | Works well | | Many outliers | Outliers inflate SD, may miss more | Stays robust | | Asymmetric data | Assumes symmetry | Handles asymmetry | ### Method 3: IQR (Interquartile Range) - Also Robust The IQR method uses quartiles and is familiar from boxplot "whiskers": ```{r iqr-method, eval=FALSE} # IQR method: flag if outside 1.5 * IQR (standard boxplot rule) outliers_iqr <- pep_flag_outliers(flowering, method = "iqr", threshold = 1.5) ``` **How it works:** - Q1 = 25th percentile, Q3 = 75th percentile - IQR = Q3 - Q1 - Outlier if: value < Q1 - 1.5×IQR OR value > Q3 + 1.5×IQR ### Method 4: Z-Score - Sensitive but Less Robust The z-score method assumes normally distributed data: ```{r zscore-method, eval=FALSE} # Z-score method: flag if |z| > 3 (assumes normal distribution) outliers_zscore <- pep_flag_outliers(flowering, method = "zscore", threshold = 3) ``` **Caution**: The z-score method is sensitive to existing outliers. If your data has many outliers, they will inflate the standard deviation, making the z-score threshold less effective at detecting them. 
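This masking effect can be demonstrated directly. In the invented example below, a cluster of five very late records (think suspected second flowering) inflates the mean and standard deviation so much that the z-score rule flags nothing, while both robust rules catch all five:

```{r masking-demo}
# 30 ordinary flowering dates plus 5 very late records (all values invented)
doy <- c(rep(118:122, times = 6), 240, 245, 250, 255, 260)

# z-score rule: the late cluster inflates mean and SD, nothing exceeds |z| > 3
z <- abs(doy - mean(doy)) / sd(doy)
sum(z > 3)       # 0 -- masking in action

# MAD rule: robust centre and spread, all five are caught
m <- abs(doy - median(doy)) / mad(doy)
sum(m > 3)       # 5

# IQR rule: standard boxplot fences also catch all five
q <- quantile(doy, c(0.25, 0.75))
fences <- c(q[1] - 1.5 * diff(q), q[2] + 1.5 * diff(q))
sum(doy < fences[1] | doy > fences[2])   # 5
```
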
### Choosing a Method | Method | Best For | Threshold Meaning | |--------|----------|-------------------| | `30day` | Simple, interpretable threshold | Absolute days from median | | `mad` | Most situations (recommended) | Number of MADs from median | | `iqr` | Familiar boxplot-style detection | IQR multiplier | | `zscore` | Clean data, normal distribution | Standard deviations | ### Summary Statistics ```{r flag-summary} summary(outliers) ``` ### Comparing Outlier Rates Across Species Different species may have different outlier rates due to observation difficulty or data quality: ```{r flag-compare} # Check outliers for grapevine vine_flowering <- pep[species == "Vitis vinifera" & phase_id %in% c(60, 65)] if (nrow(vine_flowering) > 0) { outliers_vine <- pep_flag_outliers( pep = vine_flowering, method = "30day", by = c("species", "phase_id") ) cat("Outlier comparison:\n") cat("All flowering species: ", round(100 * mean(outliers$is_outlier), 2), "%\n") cat("Grapevine only: ", round(100 * mean(outliers_vine$is_outlier), 2), "%\n") } else { cat("No grapevine flowering data available for comparison.\n") } ``` ## Visualizing Outliers Visual inspection is essential for understanding outlier patterns. The `pep_plot_outliers()` function provides four visualization types: ### 1. Overview Plot Shows the big picture: how many outliers, which species/phases are affected: ```{r plot-overview, fig.height=6} # Overview of outlier patterns pep_plot_outliers(outliers, type = "overview") ``` **What to look for:** - Which species have the most outliers? (Might indicate data quality issues) - Are outliers concentrated in certain phases? (Might indicate definition problems) - What's the overall outlier rate? (Typical: 1-5% for well-curated data) ### 2. Seasonal Distribution When in the year do outliers occur? ```{r plot-seasonal, fig.height=5} # When do outliers occur in the year? 
pep_plot_outliers(outliers, type = "seasonal") ``` **Interpretation guide:** | Outlier Timing | Likely Explanation | |----------------|-------------------| | Very early (DOY < 50) | Possible data errors (flowering in January/February unlikely for most species) | | Somewhat early | Warm winters or Mediterranean locations | | Somewhat late | Cool springs or high-altitude stations | | Very late (DOY > 250) | Possible **second flowering** events - investigate! | ### 3. Detailed Context See outliers alongside normal observations: ```{r plot-detail, fig.height=5} # See outliers in context of all observations pep_plot_outliers(outliers, type = "detail", n_top = 15) ``` **This plot shows:** - Full distribution of observations (gray) - Flagged outliers (highlighted) - The `n_top` parameter controls how many species/phases to show ### 4. Geographic Distribution Where are outliers located? ```{r plot-map, fig.height=5} # Where are outliers located? pep_plot_outliers(outliers, type = "map") ``` **What to look for:** - Clustered outliers in one region? Might indicate local data quality issues - Outliers at network edges? Might be at environmental limits - Random scatter? Suggests individual observation errors --- # Part 2: Data Completeness ## Why Check Completeness? **Data completeness** refers to how many years of observations exist for each station/phase combination. This matters because: 1. **Trend analysis requires continuous data**: Gaps can bias trend estimates 2. **Normals require representative coverage**: WMO recommends 24+ years in a 30-year period 3. 
**Missing data isn't random**: Stations often drop out or are added systematically ### Visualizing Completeness Issues ``` Station A: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● <- 100% complete 1990 2020 Station B: ●●●●●●●●●●○○○○○○○○○○●●●●●●●●●● <- Gap in middle 1990 2020 Station C: ○○○○○○○○○○○○○○○○○○○○●●●●●●●●●● <- Only recent data 1990 2020 Station D: ●●●●●●●●●●●●●●●●●●●●○○○○○○○○○○ <- Discontinued 1990 2020 ``` Different completeness patterns have different implications: | Pattern | Implications for Analysis | |---------|---------------------------| | Full coverage | Ideal - reliable trends and normals | | Gap in middle | Careful - may miss important changes | | Only recent | Cannot compare to historical period | | Discontinued | Cannot assess recent changes | ## Assessing Completeness ```{r completeness-basic} # Check completeness by station and phase # Use year_range to focus on a specific period completeness <- pep_completeness( pep = flowering, by = c("s_id", "phase_id"), year_range = c(1990, 2020) ) print(completeness) ``` ### Understanding Completeness Metrics The function calculates several metrics: | Metric | What It Measures | Interpretation | |--------|------------------|----------------| | `n_years` | Total years with data | More = better | | `completeness_pct` | % of year span covered | 70%+ for trends, 80%+ for normals | | `year_span` | Total span (year_max - year_min + 1) | Context for completeness | | `year_min` | Earliest observation | Needed for historical comparison | | `year_max` | Most recent observation | Needed for current status | ### Summary Statistics ```{r completeness-summary} summary(completeness) ``` ## Filtering by Completeness Use completeness information to select stations appropriate for your analysis: ```{r completeness-filter} # Get stations with good coverage (>= 70%) good_coverage <- completeness[completeness_pct >= 70] cat("Stations with >= 70% coverage:", nrow(good_coverage), "\n") # Use these for trend analysis good_stations <- 
unique(good_coverage$s_id) flowering_complete <- flowering[s_id %in% good_stations] cat("Observations from complete stations:", nrow(flowering_complete), "\n") ``` ### Completeness Thresholds for Different Analyses | Analysis Type | Minimum Completeness | Reasoning | |---------------|----------------------|-----------| | Trend detection | 50% (15+ years) | Need enough points for trend fitting | | Climate sensitivity | 60% | Need variability across climate conditions | | 30-year normals | 80% (24+ years) | WMO standard requirement | | Station comparison | Same time period | Avoid bias from different eras | ## Visualizing Completeness ```{r completeness-plot, fig.height=5} plot(completeness) ``` --- # Part 3: Phase Presence Validation ## Why Check Phase Presence? Before starting an analysis, it's important to verify that your data contains the phenological phases you need. The `pep_check_phases()` function validates that expected BBCH phase codes are present in your data. **Common questions this helps answer:** - Does my dataset have flowering observations (phase 60)? - Are flowering and fruit maturity phases both available for apple? - Which phases are missing for my species of interest? 
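Conceptually, a presence check is just a set comparison between the phases you expect and the phases that actually occur in your data. A minimal base-R sketch (with invented phase codes) looks like this:

```{r phase-presence-sketch}
# Invented phase codes observed for one species
observed <- c(60, 60, 65, 65, 65, 11)
expected <- c(60, 65, 87)   # flowering, full flowering, fruit maturity

present  <- intersect(expected, observed)   # 60 65
missing  <- setdiff(expected, observed)     # 87
complete <- length(missing) == 0            # FALSE

table(observed)   # observation counts per phase code
```

`pep_check_phases()` wraps this logic and additionally reports observation counts per phase (`n_obs`), as described below.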
## Checking Phase Presence ```{r check-basic} # Check if expected phases are present for apple apple <- pep[species == "Malus domestica"] phase_check <- pep_check_phases( pep = apple, expected = c(60, 65, 87) # flowering, full flowering, fruit maturity ) print(phase_check) ``` ### Understanding the Output The function returns information about phase coverage: | Output | What It Shows | |--------|---------------| | `expected` | The phases you asked for | | `present` | Which expected phases were found | | `missing` | Which expected phases are absent | | `complete` | TRUE if all expected phases are present | | `n_obs` | Number of observations per phase | ## Checking Multiple Species To check phase presence across multiple species at once: ```{r check-multi} # Check phases for multiple species multi_check <- pep_check_phases_multi( pep = pep, species_list = c("Malus domestica", "Vitis vinifera"), expected = c(60, 65, 87) ) print(multi_check) ``` ### Common Phases to Check | Plant Type | Common Phases | BBCH Codes | |------------|---------------|------------| | Cereals | Heading, Flowering, Harvest | 60, 65, 100 | | Fruit trees | Flowering, Full flowering, Fruit maturity | 60, 65, 87 | | Deciduous trees | Leaf unfolding, Flowering, Leaf fall | 11, 60, 95 | --- # Part 4: Integrated Quality Workflow ## Putting It All Together Here's a recommended workflow for data quality assessment that combines all the tools covered in this vignette: ```{r workflow} # ══════════════════════════════════════════════════════════════════════════════ # STEP 1: Assess temporal completeness # ══════════════════════════════════════════════════════════════════════════════ # Why: Incomplete stations can bias trend estimates and normals completeness <- pep_completeness(flowering, by = c("s_id", "phase_id")) good_stations <- completeness[completeness_pct >= 50, s_id] fl_filtered <- flowering[s_id %in% good_stations] cat("Kept", length(good_stations), "stations with >= 50% completeness\n") # 
══════════════════════════════════════════════════════════════════════════════ # STEP 2: Flag statistical outliers # ══════════════════════════════════════════════════════════════════════════════ # Why: Identify observations that deviate from expected patterns outliers_wf <- pep_flag_outliers(fl_filtered, method = "mad", threshold = 3) cat("Flagged", sum(outliers_wf$is_outlier), "outliers", "(", round(100 * mean(outliers_wf$is_outlier), 1), "% )\n") # ══════════════════════════════════════════════════════════════════════════════ # STEP 3: Make informed decisions about exclusion # ══════════════════════════════════════════════════════════════════════════════ # Key principle: Document your decisions! # Option A: Strict cleaning (for normals calculation) fl_strict <- outliers_wf[is_outlier == FALSE] # Option B: Moderate cleaning (keep moderate outliers) fl_moderate <- outliers_wf[is_outlier == FALSE | abs(deviation) < 60] cat("Strict cleaning keeps:", nrow(fl_strict), "obs\n") cat("Moderate cleaning keeps:", nrow(fl_moderate), "obs\n") # ══════════════════════════════════════════════════════════════════════════════ # STEP 4: Proceed with analysis on cleaned data # ══════════════════════════════════════════════════════════════════════════════ normals <- pheno_normals(fl_moderate, period = 1991:2020, min_years = 5) cat("Normals calculated for", sum(!is.na(normals$mean_doy)), "groups\n") ``` ## Documentation Template When publishing research, document your quality control decisions: ``` Data Quality Control: 1. Completeness filtering: Excluded stations with < 50% temporal coverage (reduced dataset from N to M stations) 2. Outlier detection: Used MAD method with threshold = 3 (flagged X observations as outliers, Y% of total) 3. Outlier disposition: - Excluded: Z observations with deviations > 60 days (likely errors) - Retained: W observations identified as abnormal events (e.g., second flowering) 4. 
Phase validation: Checked that required phases were present
   for [species list] (identified V cases for manual review)
```

---

# Best Practices Summary

## For Outlier Detection

1. **Use robust methods**: MAD or IQR methods handle multiple outliers better than z-scores. The MAD method is generally recommended.

2. **Check seasonally**: Late-season outliers (DOY > 250) may be biologically meaningful events (e.g. second flowering) rather than errors.

3. **Verify extremes**: For observations with very large deviations (> 60 days), try to check original data sources or contact data providers.

4. **Document decisions**: Record which outliers you exclude and why. This is essential for reproducibility and peer review.

5. **Don't delete automatically**: Human judgment is needed to distinguish errors from genuine unusual events.

## For Abnormal Event Detection

1. **Consider species biology**: Some species (e.g., certain Rosaceae) are more prone to second flowering than others.

2. **Check geographic patterns**: If multiple nearby stations report late events in the same year, this suggests a real regional phenomenon.

3. **Look at weather context**: Link abnormal events to weather extremes - drought is a common trigger.

4. **Distinguish from errors**: Very isolated observations (single station, single year) need verification before treating them as genuine second-flowering events.

5. **Keep for separate analysis**: Abnormal phenological events are scientifically interesting (evidence of climate change impacts) - don't just delete them!

## For Completeness Assessment

1. **Set appropriate thresholds**: Use 80%+ coverage for calculating normals, 50%+ for trend analysis.

2. **Consider gap patterns**: Many small gaps are usually better than one long gap that might coincide with important climate changes.

3. **Check temporal bias**: Ensure you're not comparing stations with data only from early periods to stations with only recent data.

4. 
**Document station selection**: Report how many stations met your completeness criteria and how this affected your sample. ## For Phase Presence Checking 1. **Check early**: Run `pep_check_phases()` at the start of your analysis to verify required phases are available in your data. 2. **Know required phases**: Different analyses need different phases - know which BBCH codes you need before starting. 3. **Check across species**: Use `pep_check_phases_multi()` to verify data availability for all species you plan to analyze. --- ## Summary This vignette covered data quality tools for phenological analysis: | Function | Purpose | Key Output | |----------|---------|------------| | `pep_flag_outliers()` | Identify unusual observations | `is_outlier` flag, `deviation` in days | | `pep_plot_outliers()` | Visualize outlier patterns | Four plot types: overview, seasonal, detail, map | | `pep_completeness()` | Assess temporal coverage | Completeness %, gaps, year range | | `pep_check_phases()` | Check phase presence | Missing phases, observation counts | ### Key Take-Home Messages 1. **Data quality assessment is not optional** - it's a critical step that affects the validity of all downstream analyses. 2. **Not all outliers are errors** - some represent genuine biological phenomena like second flowering that deserve further study. 3. **Document your decisions** - quality control choices should be transparent and reproducible. 4. **Use robust methods** - the MAD method is recommended for phenological data because it handles multiple outliers well. 5. **Visual inspection matters** - plots often reveal patterns that summary statistics miss. 
## Next Steps Explore the other vignettes for complementary analyses: - **Getting Started**: Data access, the `pep` class, and basic exploration ```{r eval = FALSE} vignette("getting-started", package = "pep725") ``` - **Phenological Analysis**: Normals, anomalies, quality grading, and trends ```{r eval = FALSE} vignette("phenological-analysis", package = "pep725") ``` - **Spatial Phenological Patterns**: Gradients, synchrony, and mapping ```{r eval = FALSE} vignette("spatial-patterns", package = "pep725") ``` --- ## Session Info ```{r session} sessionInfo() ```