---
title: "Spatial prior + OSSL spectra pipeline (Modules 3 & 4)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Spatial prior + OSSL spectra pipeline (Modules 3 & 4)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>"
)
library(soilKey)
```

Modules 3 and 4 sit *alongside* the deterministic key, never inside it:

* **Module 3 (`spatial-*`)** -- pulls a probabilistic prior over RSGs from a regional or global map (SoilGrids, an Embrapa pedological map, or any raster the user supplies) and runs a **consistency check** that warns when the deterministic classification disagrees with the prior. The key is never overwritten -- the prior is purely advisory.
* **Module 4 (`spectra-*`, `vlm-*`)** -- gap-fills horizon attributes (clay, CEC, BS, OC, pH, ...) from Vis-NIR / SWIR or MIR spectra via the OSSL library. Predicted values are recorded with `source = "predicted_spectra"` so the evidence grade tracks the substitution.

This vignette walks both modules end-to-end on the canonical Ferralsol fixture, with all external dependencies (raster files, OSSL parquet libraries, ellmer chats) replaced by inline synthetic objects so the example is fully reproducible.

# 1. Start from a partially-described pedon

Take the canonical Ferralsol (Latossolo) fixture and *intentionally erase* the CEC and base-saturation values from horizons 3 to 5. We will fill them back in with simulated OSSL predictions in §3.

```{r partial-pedon}
pr_full   <- make_ferralsol_canonical()
pr_partial <- pr_full$clone(deep = TRUE)
pr_partial$horizons[3:5, c("cec_cmol", "bs_pct") := NA]

pr_partial$horizons[, .(top_cm, bottom_cm, designation,
                        clay_pct, cec_cmol, bs_pct, oc_pct)]
```

Classifying this incomplete pedon already gives a different picture from the full fixture:

```{r classify-partial}
res_partial <- classify_wrb2022(pr_partial, on_missing = "silent")
res_partial$rsg_or_order
res_partial$evidence_grade
```

The evidence grade reflects the missing data -- the per-RSG trace records which RSGs returned NA because of attributes we erased.

# 2. Module 3 -- spatial prior consistency check

The prior is a probability vector over RSGs from any source -- a regional SoilGrids extract, a national Embrapa map, an interpolated kriging surface, etc. For this vignette we build it inline so the example runs without network access:

```{r build-prior}
# Synthetic prior consistent with the gneiss-Mata-Atlantica context:
# Ferralsols dominate, with a tail of Acrisols, Cambisols and Alisols.
prior <- data.table::data.table(
  rsg_code    = c("FR", "AC", "CM", "AL"),
  probability = c(0.62, 0.20, 0.12, 0.06)
)
prior
```

`prior_consistency_check()` confirms the deterministic call (`FR`) is supported by the prior:

```{r consistency-check}
chk <- prior_consistency_check(rsg_code = "FR", prior = prior, threshold = 0.05)
chk
```

Now suppose the deterministic key had instead landed on Cambisols (`CM`). The same prior would flag the disagreement: Cambisols carry a probability of 0.12 against 0.62 for the dominant Ferralsols, and the gap between the two is the inconsistency margin.

```{r inconsistent}
prior_consistency_check(rsg_code = "CM", prior = prior, threshold = 0.05)
```
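
As a rough illustration of what such a check might compute, the base-R sketch below reads `threshold` as the largest tolerated gap between the dominant class and the class assigned by the key. Both the helper `sketch_consistency()` and that reading of `threshold` are assumptions made for this example; the rule actually applied by `prior_consistency_check()` may differ.

```r
# Hypothetical helper, NOT soilKey's prior_consistency_check().
# Assumption: `threshold` is the largest tolerated gap between the
# dominant class in the prior and the class assigned by the key.
sketch_consistency <- function(rsg_code, prior, threshold = 0.05) {
  # Probability the prior gives to the assigned RSG (0 if it is absent).
  hit        <- prior$probability[prior$rsg_code == rsg_code]
  p_assigned <- if (length(hit) == 0) 0 else hit[1]

  # Dominant class in the prior and its gap to the assigned class.
  top    <- which.max(prior$probability)
  margin <- prior$probability[top] - p_assigned

  list(
    rsg_code   = rsg_code,
    p_assigned = p_assigned,
    dominant   = prior$rsg_code[top],
    margin     = margin,
    consistent = margin <= threshold
  )
}

prior_df <- data.frame(
  rsg_code    = c("FR", "AC", "CM", "AL"),
  probability = c(0.62, 0.20, 0.12, 0.06),
  stringsAsFactors = FALSE
)

fr <- sketch_consistency("FR", prior_df)  # assigned class is the dominant one
cm <- sketch_consistency("CM", prior_df)  # assigned class trails the dominant one

fr$consistent
cm$consistent
cm$margin
```

Under this reading the Ferralsol call has a margin of zero and passes, while the Cambisol call sits 0.50 below the dominant class and is flagged.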

The deterministic key is never overwritten by the prior. The check only flags cases where a manual review is warranted; the user remains in charge of the final assignment. Real production runs would source the prior from `spatial_prior_soilgrids()` (a live SoilGrids-WCS request) or `spatial_prior_embrapa()`:

```{r soilgrids-call, eval = FALSE}
prior <- spatial_prior_soilgrids(pr_partial, buffer_m = 250)
```

# 3. Module 4 -- OSSL gap-filling

The OSSL workflow is:

1. Pre-process the raw spectra (SNV / Savitzky-Golay 1st derivative).
2. Send each horizon's spectrum through one of three predictors:
   * `predict_ossl_mbl()`        -- memory-based learning (recommended);
   * `predict_ossl_plsr_local()` -- partial-least-squares with a local subset;
   * `predict_ossl_pretrained()` -- a pre-trained Cubist or RF model.
3. Convert each property's prediction-interval width to an A--D confidence grade via `pi_to_confidence()`.
4. `fill_from_spectra()` writes each predicted value into the horizons table AND adds a provenance entry with `source = "predicted_spectra"`.
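
Step 1 can be sketched in base R as follows. This is a minimal illustration, not the package's preprocessing code: it assumes SNV is applied before the derivative, uses a 5-point quadratic Savitzky-Golay window (real pipelines often use wider windows), and runs on made-up spectra. The helper names `snv()` and `sg_first_derivative()` are invented for the example.

```r
# Standard normal variate: centre and scale each spectrum (row) separately.
snv <- function(spectra) {
  t(apply(spectra, 1, function(x) (x - mean(x)) / sd(x)))
}

# Savitzky-Golay first derivative, 5-point window, quadratic fit.
# The weights for offsets -2..2 are (-2, -1, 0, 1, 2) / 10;
# stats::filter() applies its weights in reversed order, hence rev().
# The result is a derivative per band step, with NA at the two edge bands
# on each side.
sg_first_derivative <- function(x) {
  w <- c(-2, -1, 0, 1, 2) / 10
  as.numeric(stats::filter(x, rev(w), method = "convolution", sides = 2))
}

# Two made-up spectra (rows) sampled at 20 equally spaced bands.
set.seed(1)
bands   <- seq(0, 1, length.out = 20)
spectra <- rbind(
  0.30 + 0.20 * bands   + rnorm(20, sd = 0.01),
  0.45 + 0.10 * bands^2 + rnorm(20, sd = 0.01)
)

spectra_snv <- snv(spectra)
spectra_d1  <- t(apply(spectra_snv, 1, sg_first_derivative))

dim(spectra_d1)        # same shape as the input matrix
rowMeans(spectra_snv)  # approximately zero after SNV
```

SNV removes per-spectrum offset and scale differences, and the derivative then emphasises band-to-band shape rather than absolute reflectance.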

A *production* call would look like this (skipped in this vignette because OSSL is a multi-GB dataset that would have to be downloaded):

```{r ossl-call, eval = FALSE}
fill_from_spectra(
  pr_partial,
  library     = "ossl",
  region      = "south_america",
  properties  = c("clay_pct", "cec_cmol", "bs_pct", "oc_pct"),
  method      = "mbl",
  preprocess  = "snv+sg1",
  k_neighbors = 100L,
  ossl_library = "/path/to/ossl-soilsite-vnir.parquet"
)
```

For the vignette, simulate the predicted values directly through `pedon$add_measurement()`:

```{r mock-predict}
preds <- list(
  list(idx = 3, attribute = "cec_cmol", value = 5.5, confidence = 0.78),
  list(idx = 3, attribute = "bs_pct",   value = 14,  confidence = 0.72),
  list(idx = 4, attribute = "cec_cmol", value = 4.9, confidence = 0.79),
  list(idx = 4, attribute = "bs_pct",   value = 13,  confidence = 0.74),
  list(idx = 5, attribute = "cec_cmol", value = 4.7, confidence = 0.70),
  list(idx = 5, attribute = "bs_pct",   value = 13,  confidence = 0.71)
)

pr_filled <- pr_partial$clone(deep = TRUE)
for (p in preds) {
  pr_filled$add_measurement(
    horizon_idx = p$idx,
    attribute   = p$attribute,
    value       = p$value,
    source      = "predicted_spectra",
    confidence  = p$confidence,
    overwrite   = TRUE
  )
}

pr_filled$horizons[, .(top_cm, bottom_cm, cec_cmol, bs_pct)]
```

The predicted values are now in the horizons table, *and* the provenance log records each as `predicted_spectra`:

```{r show-provenance}
prov <- pr_filled$provenance
prov[source == "predicted_spectra", .(horizon_idx, attribute, source, confidence)]
```

# 4. Re-classify with the gap-filled pedon

Once the missing CEC and BS values are filled (here with simulated OSSL predictions), the deterministic key has a complete dataset and the classification's evidence grade reflects the predicted source.

```{r classify-filled}
res_filled <- classify_wrb2022(pr_filled, on_missing = "silent")
res_filled$rsg_or_order
res_filled$evidence_grade
res_filled$name
```

The evidence grade ladder for the same profile across the four workflows:

| Workflow                                  | Evidence grade |
|-------------------------------------------|----------------|
| Lab-only (full canonical fixture)         | A              |
| Spectra-filled (OSSL `predicted_spectra`) | B              |
| VLM-extracted only                        | C              |
| User-assumed                              | D              |
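
One way to read the ladder is that a classification is only as strong as its weakest input. The sketch below encodes the table as a lookup and applies that rule. The rule itself is an assumption, and only `"predicted_spectra"` is a source label used in this vignette; the other three labels are placeholders invented for the example.

```r
# Source -> grade lookup taken from the table above. Only
# "predicted_spectra" appears in this vignette; the other three
# source names are hypothetical placeholders.
grade_by_source <- c(
  measured          = "A",
  predicted_spectra = "B",
  extracted_vlm     = "C",
  user_assumed      = "D"
)

# Assumed rule: the overall grade is the weakest grade among the inputs.
weakest_grade <- function(sources) {
  grades <- grade_by_source[sources]
  if (anyNA(grades)) stop("unknown source label")
  # "A" < "B" < "C" < "D" alphabetically, so max() picks the weakest grade.
  max(unname(grades))
}

g_lab     <- weakest_grade(c("measured", "measured"))
g_spectra <- weakest_grade(c("measured", "predicted_spectra"))
g_assumed <- weakest_grade(c("measured", "predicted_spectra", "user_assumed"))
```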

# 5. Combining priors and posteriors

`combine_priors()` merges multiple sources (SoilGrids + Embrapa + a custom map) with weights, returning a single normalised prior that you can feed into `prior_consistency_check()`:

```{r combine-priors}
combined <- combine_priors(
  priors = list(
    soilgrids = data.table::data.table(rsg_code = c("FR", "AC", "CM"),
                                          probability = c(0.62, 0.20, 0.18)),
    embrapa   = data.table::data.table(rsg_code = c("FR", "AC", "NT"),
                                          probability = c(0.55, 0.30, 0.15))
  ),
  weights = c(soilgrids = 0.6, embrapa = 0.4)
)
combined
```
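
For intuition, a weighted merge of this kind can be written as linear pooling: take the union of RSG codes, treat a code missing from a source as probability 0 for that source, sum the weighted probabilities, and renormalise. The helper `pool_priors()` below is a hypothetical base-R sketch under that assumption; `combine_priors()` may pool differently.

```r
# Hypothetical helper, NOT soilKey's combine_priors(): weighted linear
# pooling of priors defined over possibly different sets of RSG codes.
# Assumes each source lists every RSG code at most once.
pool_priors <- function(priors, weights) {
  stopifnot(setequal(names(priors), names(weights)))
  codes  <- sort(unique(unlist(lapply(priors, function(p) p$rsg_code))))
  pooled <- setNames(numeric(length(codes)), codes)

  for (src in names(priors)) {
    p <- priors[[src]]
    # An RSG missing from a source contributes 0 for that source.
    pooled[p$rsg_code] <- pooled[p$rsg_code] + weights[[src]] * p$probability
  }

  pooled <- pooled / sum(pooled)  # renormalise so the result sums to 1
  out <- data.frame(rsg_code = names(pooled), probability = unname(pooled),
                    stringsAsFactors = FALSE)
  out[order(-out$probability), ]
}

pooled <- pool_priors(
  priors = list(
    soilgrids = data.frame(rsg_code = c("FR", "AC", "CM"),
                           probability = c(0.62, 0.20, 0.18),
                           stringsAsFactors = FALSE),
    embrapa   = data.frame(rsg_code = c("FR", "AC", "NT"),
                           probability = c(0.55, 0.30, 0.15),
                           stringsAsFactors = FALSE)
  ),
  weights = c(soilgrids = 0.6, embrapa = 0.4)
)
pooled
```

Under linear pooling the two inputs above give FR 0.592, AC 0.240, CM 0.108 and NT 0.060.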

`posterior_classify()` would then take a `ClassificationResult` and a prior, returning a posterior probability over RSGs (the deterministic key contributes a sharply peaked likelihood). This is useful for ranking ambiguous fixtures or for active-learning loops.
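
The update described here is Bayes' rule: the posterior is proportional to likelihood times prior. The sketch below illustrates it with a likelihood that puts a fixed mass on the key's RSG and shares the remainder equally among the others. The helper `sketch_posterior()`, the `peak` parameter, and its default of 0.9 are all assumptions for illustration, not the package's implementation.

```r
# Hypothetical helper, NOT soilKey's posterior_classify(): Bayes' rule with
# a peaked likelihood centred on the RSG returned by the deterministic key.
sketch_posterior <- function(key_rsg, prior, peak = 0.9) {
  stopifnot(key_rsg %in% prior$rsg_code, peak > 0, peak < 1)
  n_other <- nrow(prior) - 1
  # Likelihood: `peak` on the key's RSG, the remainder shared equally.
  lik    <- ifelse(prior$rsg_code == key_rsg, peak, (1 - peak) / n_other)
  unnorm <- lik * prior$probability
  data.frame(rsg_code  = prior$rsg_code,
             posterior = unnorm / sum(unnorm),
             stringsAsFactors = FALSE)
}

prior_df <- data.frame(
  rsg_code    = c("FR", "AC", "CM", "AL"),
  probability = c(0.62, 0.20, 0.12, 0.06),
  stringsAsFactors = FALSE
)

post_fr <- sketch_posterior("FR", prior_df)  # key agrees with the prior
post_cm <- sketch_posterior("CM", prior_df)  # key disagrees with the prior
post_fr
post_cm
```

With a likelihood this sharp the key's RSG stays on top in both cases, but when the key disagrees with the prior its posterior mass is visibly lower, which is the signal a ranking or active-learning loop could use.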

# Summary

* The deterministic key always runs first and is never overwritten.
* Module 3 gives a probabilistic *sanity check* against external maps and warns on disagreement.
* Module 4 fills missing horizon attributes from spectra, with full provenance, so the evidence grade tracks the substitution.
* Putting them together, you can take a partially-described pedon, fill its gaps from spectra, classify it deterministically, and cross-check the result against a global map -- all in one pipeline.

The next vignette (`v06_wosis_benchmark`) shows how to run this whole stack at scale against the WoSIS global pedon archive for paper-grade validation.