---
title: "Spatial prior + OSSL spectra pipeline (Modules 3 & 4)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Spatial prior + OSSL spectra pipeline (Modules 3 & 4)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(soilKey)
```

Modules 3 and 4 sit *alongside* the deterministic key, never inside it:

* **Module 3 (`spatial-*`)** -- pulls a probabilistic prior over RSGs from a
  regional or global map (SoilGrids, an Embrapa pedological map, or any raster
  the user supplies) and runs a **consistency check** that warns when the
  deterministic classification disagrees with the prior. The key is never
  overwritten -- the prior is purely advisory.
* **Module 4 (`spectra-*`, `vlm-*`)** -- gap-fills horizon attributes (clay,
  CEC, BS, OC, pH, ...) from Vis-NIR / SWIR or MIR spectra via the OSSL
  library. Predicted values are recorded with `source = "predicted_spectra"`
  so the evidence grade tracks the substitution.

This vignette walks both modules end-to-end on the canonical Ferralsol
fixture, with all external dependencies (raster files, OSSL parquet libraries,
ellmer chats) replaced by inline synthetic objects so the example is fully
reproducible.

# 1. Start from a partially-described pedon

Take the canonical Latossolo and *intentionally erase* the CEC and
base-saturation values from the lower horizons. We will fill them back in with
simulated OSSL predictions in §3.

```{r partial-pedon}
pr_full <- make_ferralsol_canonical()
pr_partial <- pr_full$clone(deep = TRUE)

# Blank out CEC and base saturation in horizons 3 to 5 (data.table by-reference update).
pr_partial$horizons[3:5, c("cec_cmol", "bs_pct") := NA]

pr_partial$horizons[, .(top_cm, bottom_cm, designation,
                        clay_pct, cec_cmol, bs_pct, oc_pct)]
```

The classification of this incomplete pedon already differs from that of the
full fixture:

```{r classify-partial}
res_partial <- classify_wrb2022(pr_partial, on_missing = "silent")
res_partial$rsg_or_order
res_partial$evidence_grade
```

The evidence grade reflects the missing data -- the per-RSG trace records which
RSGs returned NA because of the attributes we erased.

# 2. Module 3 -- spatial prior consistency check

The prior is a probability vector over RSGs from any source -- a regional
SoilGrids extract, a national Embrapa map, an interpolated kriging surface,
etc. For this vignette we build it inline so the example runs without network
access:

```{r build-prior}
# Synthetic prior consistent with the gneiss-Mata-Atlantica context:
# Ferralsols dominate, with a tail of Acrisols, Cambisols and Alisols.
prior <- data.table::data.table(
  rsg_code    = c("FR", "AC", "CM", "AL"),
  probability = c(0.62, 0.20, 0.12, 0.06)
)
prior
```

`prior_consistency_check()` confirms that the deterministic call (`FR`) is
supported by the prior:

```{r consistency-check}
chk <- prior_consistency_check(rsg_code = "FR", prior = prior,
                               threshold = 0.05)
chk
```

Now suppose the deterministic key had instead landed on Cambisols. The same
prior would flag the disagreement (Cambisols at probability 0.12 versus the
dominant Ferralsols at 0.62; the difference between the two is the
inconsistency margin):

```{r inconsistent}
prior_consistency_check(rsg_code = "CM", prior = prior,
                        threshold = 0.05)
```

The deterministic key is never overwritten by the prior. The check only flags
cases where a manual review is warranted; the user remains in charge of the
final assignment.

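To make the logic of the check concrete, here is a minimal base-R sketch of one way such a comparison could work. It is not soilKey's implementation: the helper name `sketch_prior_check()`, the returned fields, and in particular the decision rule (flag when the assigned RSG trails the dominant RSG by more than `threshold`) are assumptions made for illustration only.

```{r sketch-prior-check}
# Hypothetical illustration only; prior_consistency_check() may use a different rule.
sketch_prior_check <- function(rsg_code, prior, threshold = 0.05) {
  stopifnot(all(c("rsg_code", "probability") %in% names(prior)))

  # Probability the prior assigns to the deterministic RSG (0 if it is absent).
  p_assigned <- sum(prior$probability[prior$rsg_code == rsg_code])

  # Most probable RSG according to the prior.
  top <- which.max(prior$probability)
  margin <- prior$probability[top] - p_assigned

  list(
    rsg_code          = rsg_code,
    prior_probability = p_assigned,
    dominant_rsg      = as.character(prior$rsg_code[top]),
    margin            = margin,
    # Assumed rule: consistent when the assigned RSG is within `threshold`
    # of the dominant RSG's probability.
    consistent        = margin <= threshold
  )
}

# Toy prior with the same numbers as the synthetic prior used in this section.
toy_prior <- data.frame(
  rsg_code    = c("FR", "AC", "CM", "AL"),
  probability = c(0.62, 0.20, 0.12, 0.06),
  stringsAsFactors = FALSE
)

sketch_prior_check("FR", toy_prior)$consistent
sketch_prior_check("CM", toy_prior)$consistent
```

Under this assumed rule the function only reports; it never changes the deterministic assignment, which mirrors the advisory role of the prior described above.
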
Real production runs would source the prior from `spatial_prior_soilgrids()`
(a live SoilGrids-WCS request) or `spatial_prior_embrapa()`:

```{r soilgrids-call, eval = FALSE}
prior <- spatial_prior_soilgrids(pr_partial, buffer_m = 250)
```

# 3. Module 4 -- OSSL gap-filling

The OSSL workflow is:

1. Pre-process the raw spectra (SNV / Savitzky-Golay 1st derivative).
2. Send each horizon's spectrum through one of three predictors:
   * `predict_ossl_mbl()` -- memory-based learning (recommended);
   * `predict_ossl_plsr_local()` -- partial-least-squares with a local subset;
   * `predict_ossl_pretrained()` -- a pre-trained Cubist or RF model.
3. Convert each property's prediction-interval width to an A--D confidence
   grade via `pi_to_confidence()`.
4. `fill_from_spectra()` writes each predicted value into the horizons table
   AND adds a provenance entry with `source = "predicted_spectra"`.

A *production* call would look like this (skipped in this vignette because OSSL
is a multi-GB dataset that would have to be downloaded):

```{r ossl-call, eval = FALSE}
fill_from_spectra(
  pr_partial,
  library      = "ossl",
  region       = "south_america",
  properties   = c("clay_pct", "cec_cmol", "bs_pct", "oc_pct"),
  method       = "mbl",
  preprocess   = "snv+sg1",
  k_neighbors  = 100L,
  ossl_library = "/path/to/ossl-soilsite-vnir.parquet"
)
```

For the vignette, simulate the predicted values directly through
`pedon$add_measurement()`:

```{r mock-predict}
preds <- list(
  list(idx = 3, attribute = "cec_cmol", value = 5.5, confidence = 0.78),
  list(idx = 3, attribute = "bs_pct",   value = 14,  confidence = 0.72),
  list(idx = 4, attribute = "cec_cmol", value = 4.9, confidence = 0.79),
  list(idx = 4, attribute = "bs_pct",   value = 13,  confidence = 0.74),
  list(idx = 5, attribute = "cec_cmol", value = 4.7, confidence = 0.70),
  list(idx = 5, attribute = "bs_pct",   value = 13,  confidence = 0.71)
)

pr_filled <- pr_partial$clone(deep = TRUE)
for (p in preds) {
  pr_filled$add_measurement(
    horizon_idx = p$idx,
    attribute = p$attribute,
    value =
      p$value,
    source = "predicted_spectra",
    confidence = p$confidence,
    overwrite = TRUE
  )
}
pr_filled$horizons[, .(top_cm, bottom_cm, cec_cmol, bs_pct)]
```

The predicted values are now in the horizons table, *and* the provenance log
records each as `predicted_spectra`:

```{r show-provenance}
prov <- pr_filled$provenance
prov[source == "predicted_spectra",
     .(horizon_idx, attribute, source, confidence)]
```

# 4. Re-classify with the gap-filled pedon

After OSSL fills the missing CEC/BS, the deterministic key has a complete
dataset and the classification's evidence grade reflects the predicted source.

```{r classify-filled}
res_filled <- classify_wrb2022(pr_filled, on_missing = "silent")
res_filled$rsg_or_order
res_filled$evidence_grade
res_filled$name
```

The evidence grade ladder for the same profile across the four workflows:

| Workflow                                  | Evidence grade |
|-------------------------------------------|----------------|
| Lab-only (full canonical fixture)         | A              |
| Spectra-filled (OSSL `predicted_spectra`) | B              |
| VLM-extracted only                        | C              |
| User-assumed                              | D              |

# 5. Combining priors and posteriors

`combine_priors()` merges multiple sources (SoilGrids + Embrapa + a custom map)
with weights, returning a single normalised prior that you can feed into
`prior_consistency_check()`:

```{r combine-priors}
combined <- combine_priors(
  priors = list(
    soilgrids = data.table::data.table(rsg_code = c("FR", "AC", "CM"),
                                       probability = c(0.62, 0.20, 0.18)),
    embrapa   = data.table::data.table(rsg_code = c("FR", "AC", "NT"),
                                       probability = c(0.55, 0.30, 0.15))
  ),
  weights = c(soilgrids = 0.6, embrapa = 0.4)
)
combined
```

`posterior_classify()` would then take a `ClassificationResult` and a prior,
returning a posterior probability over RSGs (the deterministic key contributes
a sharply peaked likelihood). It is intended for ranking ambiguous fixtures or
for active-learning loops.

# Summary

* The deterministic key always runs first and is never overwritten.
* Module 3 gives a probabilistic *sanity check* against external maps and
  warns on disagreement.
* Module 4 fills missing horizon attributes from spectra, with full
  provenance, so the evidence grade tracks the substitution.
* Putting them together, you can take a partially-described pedon, fill its
  gaps from spectra, classify it deterministically, and cross-check the result
  against a global map -- all in one pipeline.

The next vignette (`v06_wosis_benchmark`) shows how to run this whole stack at
scale against the WoSIS global pedon archive for paper-grade validation.
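
As a closing note on section 5, the weighted merge of several priors can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the code behind `combine_priors()`: the helper name `sketch_combine_priors()` is hypothetical, RSGs missing from a source are assumed to contribute zero probability, and the result is assumed to be renormalised to sum to 1.

```{r sketch-combine-priors}
# Hypothetical helper; soilKey's combine_priors() may behave differently.
sketch_combine_priors <- function(priors, weights) {
  stopifnot(setequal(names(priors), names(weights)))

  # Union of all RSG codes that appear in any source.
  codes <- sort(unique(unlist(lapply(priors, function(p) as.character(p$rsg_code)))))
  mixed <- setNames(numeric(length(codes)), codes)

  for (src in names(priors)) {
    p <- priors[[src]]
    idx <- as.character(p$rsg_code)
    # Assumption: an RSG absent from a source adds nothing for that source.
    mixed[idx] <- mixed[idx] + weights[[src]] * p$probability
  }

  out <- data.frame(
    rsg_code    = names(mixed),
    probability = unname(mixed) / sum(mixed),  # renormalise to sum to 1
    stringsAsFactors = FALSE
  )
  out[order(-out$probability), ]
}

# Same toy numbers as the combine-priors chunk in section 5.
toy_priors <- list(
  soilgrids = data.frame(rsg_code = c("FR", "AC", "CM"),
                         probability = c(0.62, 0.20, 0.18),
                         stringsAsFactors = FALSE),
  embrapa   = data.frame(rsg_code = c("FR", "AC", "NT"),
                         probability = c(0.55, 0.30, 0.15),
                         stringsAsFactors = FALSE)
)
toy_weights <- c(soilgrids = 0.6, embrapa = 0.4)

sketch_combine_priors(toy_priors, toy_weights)
```

With these inputs, Ferralsols receive 0.6 * 0.62 + 0.4 * 0.55 = 0.592 before normalisation, and RSGs reported by only one source (CM, NT) are kept with their down-weighted probability rather than dropped.
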