---
title: "Database / indexing layer"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Database / indexing layer}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)
```

```{r minimal-example, eval = TRUE}
# Minimal executable example: selectRecords() works entirely in memory
library(gmsp)
library(data.table)

master <- data.table(
  RecordID       = c("aabbccdd00112233", "aabbccdd00112233", "eeff00112233aabb"),
  OwnerID        = c("NGAW", "NGAW", "CESMD"),
  EventID        = c("20100227T063452Z", "20100227T063452Z", "20110311T054624Z"),
  StationID      = c("ANTU", "ANTU", "MYG004"),
  DIR            = c("H1", "H2", "H1"),
  EventMagnitude = c(8.8, 8.8, 9.1),
  Repi           = c(90, 90, 140)
)

sel <- selectRecords(master[EventMagnitude > 8 & DIR == "H1"])
print(sel)
```

`gmsp` ships an optional layer for managing a local strong-motion record archive. It is **separate** from the signal-processing core (`AT2TS`, `TS2IMF`, `TSL2PS`, `getIntensity`); you can use the core without ever touching the indexing layer.

The indexing layer assumes records on disk in a fixed directory structure. The base paths are yours to choose; functions that touch disk take explicit `path`, `path.records`, or `path.index` arguments.

## Expected file layout

```
/                                   ← you choose this
  /                                 e.g. "NGAW", "CESMD", "ESM"
    /                               e.g. "20060803T030800Z"
      /                             e.g. "NTYB"
        raw.owner/                  provider files as downloaded
          record.json               owner-supplied metadata
          .AT2 / .v2 / .ac / .tr / ...
        raw/                        gmsp output of extractRecord()
          AT..csv                   WIDE: provider OCID columns (scaled to mm)
          AT..json                  DIR / OCID / NP / PGA / dt / Fs / Units

/                                   ← you choose this
  RawFileTable..csv                 provider file inventory
  RawRecordTable..csv               one row per RecordID
  RawIntensityTable..csv            per (RecordID, DIR), 20 IM scalars
  EventTable..csv                   event metadata
  StationTable..csv                 station metadata

/                                   ← you choose this
  .csv                              writeSelection() output
  .json                             sidecar with audit metadata
```

## Provider formats supported

| `OwnerID` | Format | Parser | Quantity | Notes |
|---|---|---|---|---|
| `NGAW` | AT2 | `readAT2()` | AT | PEER NGA-West2 (4-line header, NPTS/DT) |
| `CESMD` | V2 / V2c | `readV2()` | AT | multi-channel V2 or single-channel V2c |
| `NWZ` | V2A | `readV2A()` | AT | NWZ-flavoured V2 |
| `GSC` | TR (A/B/C/Z) | `readTR()` | AT | Geological Survey of Canada |
| `IGP` | ACA / LIS | `readAC()` | AT | Instituto Geofísico del Perú |
| `UCR` | ACB | `readAC()` | AT | Universidad de Costa Rica |
| Generic | two-col | `readTwoCol()` | AT | (t, s) ASCII columns; used by CAL, CENA, etc. |
| `ISEE` | ISEE | `readISEE()` | VT | Micromate / ISEE blasting seismograph (mm/s velocity, MicL dropped) |

Each parser returns a LONG `data.table(t, OCID, s)` for one component file. `parseRecord()` is the dispatcher that consults `.OWNER_FORMAT` and calls the right parser for the owner.

## Extraction pipeline

```
parseRecord()      ── reads raw.owner/* via the owner's parser
   │                  returns LONG (t, OCID, s) for all components
   ▼
mapComponents()    ── derives DIR labels H1 / H2 / UP from provider OCIDs
   │                  H1/H2 are derived processing directions
   │                  `extractRecord()` uses rotate = FALSE
   │                  Returns NULL for arrays or 2-comp records
   ▼
alignComponents()  ── pads (or truncates) to equal NP across components
   │
   ▼
extractRecord()    ── scales to canonical mm via .parseUnits + .getSF
                      writes raw/..csv + ..json
                      CSV columns remain provider OCID values;
                      the JSON sidecar stores the DIR -> OCID mapping.
                      KIND ∈ {AT, VT, DT} -- derived from the Units suffix by
                      .parseKind(), or forced by the `kind = "VT"` argument
                      (e.g. for blasting records whose Units may be missing).
                      Sidecar peak field is named accordingly:
                      PGA (KIND=AT) / PGV (KIND=VT) / PGD (KIND=DT).
                      RecordID = first 16 hex chars of md5(CSV).
```

`extractRecord()` is the orchestrator; parsers and `mapComponents()` are public so they can be reused or audited. Public calls use `parseRecord(.x, path)` and `extractRecord(.x, path)`, where `.x` is the one-record master subset and `path` is the records root.

## Indexing tables

After `extractRecord()` has produced `raw/` outputs for some records, the indexing functions scan the records tree and emit per-owner CSVs to `/`:

* `buildRawFileTable()`: provider-file inventory (one row per `ComponentID × FileID`); reads `raw.owner/record.json` or `raw.owner.tar.gz` (post-archive safe).
* `buildRawRecordTable()`: one row per `RecordID` (`NP = max(post-align)`, `pad = max NP − min NP`, `Fs`).
* `buildRawIntensityTable()`: calls `getRawIntensities()` per station; emits three rows per record (one per `DIR`), each carrying the 20 AT-derivable scalars from `getIntensity()`.

The provider-flatfile + USGS catalog join (`buildEventTable()`) is under development and ships in `inst/dev/`; it is not yet part of the exported API.

## Master record catalog

`buildMaster()` joins, per owner:

* `RawRecordTable..csv` (record list),
* `EventTable..csv` (event scalars, merged via `fcoalesce` with source precedence `*.owner` > `*.USGS` > `*.ISC`),
* `StationTable..csv` (station scalars including Vs30),

and emits a `data.table` keyed at `(RecordID, DIR)`. It adds:

* `Repi`: epicentral distance (haversine, km),
* `Rhyp`: hypocentral distance, $\sqrt{\mathrm{Repi}^2 + \mathrm{EventDepth}^2}$ (km).
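
The two distance definitions above can be sketched in base R. This is an illustration only, not the `gmsp` implementation: `haversine_km()` is a hypothetical helper, the mean Earth radius of 6371 km is an assumption (the radius used by `buildMaster()` is not stated here), and the coordinates and depth are made-up values.

```r
# Hypothetical helper (not part of gmsp): great-circle distance in km.
# R = 6371 km is an assumed mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, R = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * R * asin(pmin(1, sqrt(a)))
}

# Illustrative event and station coordinates (degrees) and depth (km)
Repi       <- haversine_km(-36.0, -73.0, -35.0, -72.0)
EventDepth <- 30

# Hypocentral distance as defined above
Rhyp <- sqrt(Repi^2 + EventDepth^2)
```

Because `Rhyp` adds a squared depth under the root, it can never be smaller than `Repi`, which is the geometric check that a distance audit can rely on.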

After `buildMaster()`, filter the master and pass the subset to `selectRecords()` to produce a `(RecordID, OwnerID, EventID, StationID)` selection. That selection is the input contract for the `readTS()` family and for `writeSelection()`, which persists the selection to disk for orchestration. `readAT()` / `readVT()` / `readDT()` are KIND-specific wrappers around `readTS(.x, path, kind = ...)`.

## Composing with the processing core

The natural composition for acceleration records is:

```r
# Build the master catalog from the index tables
M <- buildMaster(path = "")

# Keep large, near-source events; one row per record via the H1 direction
Selection <- selectRecords(M[EventMagnitude > 7 & Repi < 100 & DIR == "H1"])

# Read the extracted acceleration time series for the selection
TS <- readAT(.x = Selection, path = "")

# Run the processing core once per record
ATS <- TS[, AT2TS(.SD, units.source = "mm", Fmax = 25),
          by = .(RecordID, OwnerID, EventID, StationID)]
```

The output of `readAT()` is a wide table keyed by `(RecordID, OwnerID, EventID, StationID, t)` with one column per provider `OCID`. `AT2TS()` consumes it per record. The shape is identical for `readVT()` and `readDT()`; pair them with `VT2TS()` / `DT2TS()`. Blasting records (e.g. ISEE) typically flow through `readVT()` + `VT2TS()`.

## Audit helpers

* `auditSite(M)`: flags rows with missing or out-of-range `StationVs30`.
* `auditDistances(M)`: flags `lat/lon` that are NA or out of range, negative depths, large `Repi`, and geometric impossibility (`Rhyp < Repi`).
* `auditParsers(.x = M, owner = "NGAW", path = ...)`: dry-runs `parseRecord()` per `(EventID, StationID)` of one owner and reports OK / FAIL / WARN with a reason.

## Maintenance

`archiveRawOwner(path)` compresses `raw.owner/` to `raw.owner.tar.gz` after extraction has succeeded, verifies the archive is readable, and only then unlinks the original.

## Notes

* The package does **not** download data. Bringing raw provider files to `raw.owner/` is the user's responsibility. Examples under `examples/maintenance/` in the source repository show a pattern for ingestion (USGS catalog matching, staging / promote / rollback, etc.).
* `RecordID` is a 16-character hex hash (`openssl::md5` of the WIDE CSV body, truncated). It is stable across re-extraction of the same record.
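
The idea behind a content-derived `RecordID` can be sketched as follows. This is a minimal illustration under stated assumptions: `record_id_sketch()` is a hypothetical helper, base R's `tools::md5sum()` is swapped in for `openssl::md5` to keep the sketch dependency-free, and the exact bytes that `gmsp` hashes are not specified here, so the IDs produced below will not necessarily match those written by `extractRecord()`.

```r
# Hypothetical helper (not part of gmsp): hash a CSV file with MD5
# and keep the first 16 hex characters of the 32-character digest.
record_id_sketch <- function(csv_file) {
  full <- unname(tools::md5sum(csv_file))
  substr(full, 1, 16)
}

# Made-up WIDE table: time plus two provider channel columns
wide <- data.frame(
  t   = c(0, 0.01, 0.02),
  HNE = c(0.1, -0.2, 0.3),
  HNN = c(0, 0.05, -0.1)
)

# Writing the same table twice yields byte-identical files
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(wide, f1, row.names = FALSE)
write.csv(wide, f2, row.names = FALSE)

id1 <- record_id_sketch(f1)
id2 <- record_id_sketch(f2)

# Changing a single sample changes the file content and hence the ID
wide$HNE[1] <- 0.11
f3 <- tempfile(fileext = ".csv")
write.csv(wide, f3, row.names = FALSE)
id3 <- record_id_sketch(f3)
```

In this sketch `id1` and `id2` are identical because the files are byte-identical, which mirrors why re-extracting an unchanged record keeps its ID.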