--- title: "End-to-end pipeline: Gemma 4 + spatial + spectral + key + GIS export" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{End-to-end pipeline: Gemma 4 + spatial + spectral + key + GIS export} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(soilKey) ``` This vignette walks the **complete soilKey pipeline** on a real Brazilian soil profile, hitting every public entry point in canonical order: 1. **Spatial guide** — `soil_classes_at_location()` returns ranked likely classes at the field GPS coordinate before any pedon data is collected. 2. **Multimodal extraction** — `classify_from_documents()` runs Gemma 4 (local Ollama) on a soil-description PDF and a profile-wall photograph, extracts horizons + Munsell + site metadata, and feeds everything into a `PedonRecord`. 3. **Spectral analogy** — `classify_by_spectral_neighbours()` consumes a Vis-NIR scan of the surface horizon, finds the K most similar OSSL profiles within a regional radius, and returns a probabilistic class prediction. 4. **Deterministic classification** — `classify_wrb2022()`, `classify_sibcs(include_familia = TRUE)`, `classify_usda()` walk the canonical YAML rules and produce the final names with full key trace + provenance + evidence grade. 5. **Reports** — `report()` writes a self-contained HTML pedologist report. 6. **GIS export** — `report_to_qgis()` produces a multi-layer GeoPackage that QGIS opens natively. The whole pipeline runs offline once the Ollama Gemma 4 model is pulled; the only network hit is the optional SoilGrids fetch in step 1. # 1. Set the scene We use a canonical Latossolo Vermelho Distrocoeso from the Mata Atlântica around Seropédica, RJ, parent material gneiss. The fixture mimics a real Embrapa survey profile. ```{r scene, eval = FALSE} # Field GPS coordinates of the planned profile pit. field_lat <- -22.7 field_lon <- -43.7 ``` # 2. Spatial guide -- before any pedon data `soil_classes_at_location()` queries SoilGrids 2.0 (or any WRB-coded raster the user provides) and returns a ranked list of likely classes plus the canonical attribute thresholds that distinguish them. ```{r spatial-guide, eval = FALSE} guide <- soil_classes_at_location( lat = field_lat, lon = field_lon, system = "wrb2022", source_url = "https://files.isric.org/soilgrids/latest/data/wrb/MostProbable.vrt" ) guide$distribution #> # Ranked candidate classes: #> # rsg_code rsg_name probability #> # FR Ferralsols 0.62 #> # AC Acrisols 0.21 #> # NT Nitisols 0.12 #> # CM Cambisols 0.05 guide$typical_attributes #> # Per-class diagnostic thresholds to confirm in the field. ``` The function does **not** classify -- it tells the pedologist "you are most likely standing on a Ferralsol; here is what to look for to confirm". # 3. Multimodal extraction with local Gemma 4 The pedologist arrives at the pit, photographs the wall against a Munsell chart, scans the field sheet, and exports the survey report PDF. `classify_from_documents()` chains the entire downstream pipeline -- VLM extraction, all three classifications, optional report rendering -- in a single call. The default provider is local Gemma 4 edge (`gemma4:e4b`, ~3 GB, multimodal text + image + audio) via [Ollama](https://ollama.com) -- no API key, no data leaving the laptop. Pull the model once: ```bash ollama pull gemma4:e4b ollama serve ``` ```{r vlm, eval = FALSE} res <- classify_from_documents( pdf = "perfil_042_descricao.pdf", image = "perfil_042_parede.jpg", report = "perfil_042.html", provider = "ollama" # default; uses gemma4:e4b ) res$classifications$wrb$name #> [1] "Geric Ferric Rhodic Chromic Ferralsol (Clayic, Humic, Dystric, Ochric, Rubic)" res$classifications$sibcs$name #> [1] "Latossolos Vermelhos Distroficos tipicos, argilosa, moderado" res$classifications$usda$name #> [1] "Rhodic Hapludox" ``` Every extracted attribute is stamped `source = "extracted_vlm"` in the `PedonRecord`'s provenance log; the deterministic key is consumed by the `PedonRecord` unaware of how each value got there. The architectural invariant -- **the key is never delegated to a model** -- holds. For the rest of the vignette we keep working with the populated pedon `res$pedon`. ```{r pedon-from-canonical} # For a runnable demo without Ollama / a real PDF, reuse the # canonical Ferralsol fixture -- the downstream code is the same. pedon <- make_ferralsol_canonical() ``` # 4. Spectral analogy If a Vis-NIR scan is available for the surface horizon, `classify_by_spectral_neighbours()` adds another evidence layer. It finds the K most spectrally similar OSSL profiles within a regional radius and returns a probabilistic class prediction. ```{r spectral-analogy, eval = FALSE} # Hypothetical: a real OSSL South-America library with WRB labels # obtained via `download_ossl_subset_with_labels()`. ossl_lib <- download_ossl_subset_with_labels( region = "south_america", max_distance_km = 10 ) # Pull the surface-horizon Vis-NIR scan from the populated pedon. query_spectrum <- pedon$spectra$vnir[1, ] spectral <- classify_by_spectral_neighbours( spectrum = query_spectrum, ossl_library = ossl_lib, k = 25, region = list(lat = field_lat, lon = field_lon, radius_km = 500) ) spectral$distribution #> # class n_neighbours probability #> # FR 22 0.88 #> # AC 2 0.08 #> # NT 1 0.04 spectral$neighbours #> # The 25 closest OSSL profiles + their distances + labels. ``` The biome-aware regional filter prevents the analogy from drifting to non-tropical reference soils. # 5. Deterministic classification The canonical step. `classify_wrb2022()` / `classify_sibcs()` / `classify_usda()` walk the canonical YAML rules over the populated `PedonRecord`. ```{r classify} cls_wrb <- classify_wrb2022(pedon, on_missing = "silent") cls_sibcs <- classify_sibcs(pedon, include_familia = TRUE) cls_usda <- classify_usda(pedon) cls_wrb$name cls_sibcs$name cls_usda$name # Each ClassificationResult carries the full key trace, the per- # attribute provenance, and an evidence grade A/B/C/D. cls_wrb$evidence_grade length(cls_wrb$trace) # number of RSGs tested before assignment ``` # 6. HTML report `report()` writes a self-contained HTML one-pager with the cross-system summary, full key trace, evidence grade, qualifiers, ambiguities, missing-data hints, the horizons table, and the per-source provenance summary. ```{r report-html, eval = FALSE} results <- list(wrb = cls_wrb, sibcs = cls_sibcs, usda = cls_usda) report(results, file = file.path(tempdir(), "perfil_042.html"), pedon = pedon) ``` The output is a single HTML file with inline CSS -- no external network requests, suitable for emailing to a colleague or attaching to a laudo. # 7. GIS export `report_to_qgis()` produces a multi-layer GeoPackage (`.gpkg`) that QGIS reads natively. ```{r report-qgis, eval = FALSE} results <- list(wrb = cls_wrb, sibcs = cls_sibcs, usda = cls_usda) report_to_qgis( pedon = pedon, classifications = results, file = file.path(tempdir(), "perfil_042.gpkg"), report_html = file.path(tempdir(), "perfil_042.html") ) ``` The GeoPackage carries three layers: * **`pedon_point`** -- POINT geometry at the profile coordinates with all classification metadata as attributes (WRB / SiBCS / USDA names, RSG / Ordem / Order codes, evidence grades, principal qualifiers, supplementary qualifiers, hyperlink to the rendered HTML report). * **`horizons_table`** -- one row per horizon, with the canonical horizon-schema attributes. Joined to `pedon_point` by `site_id`. * **`provenance_log`** -- per-`(horizon, attribute, source)` provenance rows for downstream auditing. In QGIS: **Layer → Add Layer → Add Vector Layer → `perfil_042.gpkg`**. The point appears on the canvas with all classification metadata in the feature pop-up; styling rules can map symbol colour to the evidence grade or the assigned RSG. # 8. The complete picture ```{r diagram, eval = FALSE} # Pipeline summary: # # field GPS -> soil_classes_at_location() "what to expect" # | # v # PDF + photo -> classify_from_documents() (Gemma 4) populates PedonRecord # | # v # Vis-NIR scan -> classify_by_spectral_neighbours() spectral prior # | # v # -> classify_wrb2022() + classify_sibcs() + classify_usda() # | (the deterministic step -- canonical) # v # -> report() / report_to_qgis() deliverables ``` Each step's output carries explicit provenance into the next; the final `evidence_grade` reflects the worst-source rule applied to the attributes that were decisive in the assigned name. Two pedologists running this pipeline on the same documents get the same output bit-for-bit. # Summary soilKey separates four distinct stages: 1. **Spatial guides** (`soil_classes_at_location`) -- expectations from a soil-class raster. 2. **Extraction** (`classify_from_documents`, `extract_*`) -- VLM populates a `PedonRecord`, never classifies. 3. **Spectral analogy** (`classify_by_spectral_neighbours`) -- OSSL nearest-neighbour analogy as a prior. 4. **Deterministic classification** (`classify_wrb2022 / classify_sibcs / classify_usda`) -- the canonical step. Plus two delivery formats: HTML reports (`report`) and GeoPackage exports (`report_to_qgis`). All four stages preserve provenance and evidence grading; the deterministic key remains the only thing that *assigns* a class.