---
title: "palimpsestr: Probabilistic Decomposition of Archaeological Palimpsests Using Stratigraphic Entanglement Fields"
author: "Enzo Cocca"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{palimpsestr: Probabilistic Decomposition of Archaeological Palimpsests}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5.5, dpi = 150,
                      warning = FALSE, message = FALSE)
```

*Archaeological palimpsests --- deposits in which material from multiple occupation phases is superimposed and partially intermixed --- represent one of the most persistent analytical challenges in field archaeology. The **palimpsestr** R package implements the Stratigraphic Entanglement Field (SEF) framework, a probabilistic method for decomposing palimpsests into latent depositional phases by jointly modelling four evidence domains: spatial proximity, vertical distribution, chronological overlap, and cultural similarity. The package provides a diagonal-covariance Gaussian mixture model fitted via Expectation-Maximisation, augmented with taphonomic weighting and stratigraphic entanglement penalties. Three interpretable diagnostics --- the Stratigraphic Entanglement Index (SEI), Excavation Stratigraphic Energy (ESE), and Palimpsest Dissolution Index (PDI) --- allow practitioners to assess deposit coherence, detect intrusive finds, and evaluate overall phase separability. This article describes the methodology and demonstrates a complete analytical workflow on a realistic multi-period Roman villa dataset.*

## Introduction

Archaeological palimpsests arise when successive episodes of human occupation leave material traces within the same sedimentary matrix, producing deposits in which artefacts, ecofacts, and features from chronologically distinct phases co-occur in close spatial and vertical proximity. The phenomenon is ubiquitous: from deeply stratified Near Eastern tells to shallow open-air Palaeolithic sites where millennia of activity are compressed into a few centimetres of deposit. Disentangling the depositional history of such contexts is essential for any behavioural, technological, or settlement-pattern inference, yet the analytical toolkit available to excavators has remained surprisingly limited.

The Harris Matrix (Harris, 1989) provides a rigorous formalism for recording observed stratigraphic relationships, but its relationships are deterministic and binary --- a unit either overlies another or it does not. It cannot express the gradual, probabilistic mixing that characterises many palimpsests. Conversely, spatial clustering methods such as k-means (Kintigh and Ammerman, 1982) operate on coordinate data alone, ignoring vertical stratigraphy, chronological evidence from radiocarbon or ceramic seriation, and cultural-typological information. Neither approach offers a unified probabilistic framework, and neither produces quantitative diagnostics for deposit integrity or mixing intensity.

**palimpsestr** addresses these limitations by integrating four evidence domains --- horizontal coordinates, vertical elevation, chronological range, and cultural class --- into a single model that produces soft (probabilistic) phase assignments together with three diagnostic statistics. The Stratigraphic Entanglement Index (SEI) quantifies pairwise depositional coherence; Excavation Stratigraphic Energy (ESE) captures local disruption; and the Palimpsest Dissolution Index (PDI) summarises global phase separability. All three are interpretable by field practitioners and can be exported to GIS for spatial interrogation.

## Methodology

### The Stratigraphic Entanglement Index (SEI)

The SEI quantifies how strongly two finds $i$ and $j$ are "entangled" --- that is, how much evidence supports their co-occurrence within the same depositional event. It is defined as:

$$SEI_{ij} = \frac{w_s}{d_{xy}(i,j) + \epsilon} + \frac{w_z}{\max(|z_i - z_j|, z_{floor})} + w_t \cdot O_t(i,j) + w_c \cdot \mathbb{1}[c_i = c_j]$$

where:

- $d_{xy}(i,j)$ is the Euclidean distance in the horizontal plane, and $\epsilon$ is a small constant preventing division by zero. The first term captures spatial proximity: finds close together in plan are more likely to derive from the same depositional event.
- $|z_i - z_j|$ is the absolute vertical separation, with $z_{floor}$ a minimum threshold. The second term captures stratigraphic proximity: finds at the same depth are more likely contemporaneous.
- $O_t(i,j)$ is the chronological overlap ratio, computed from the intersection of the temporal ranges $[t_{min,i}, t_{max,i}]$ and $[t_{min,j}, t_{max,j}]$ relative to their union. Finds whose date ranges overlap strongly receive higher entanglement scores.
- $\mathbb{1}[c_i = c_j]$ is an indicator function equal to 1 when both finds belong to the same cultural or typological class, and 0 otherwise.
- $w_s, w_z, w_t, w_c$ are domain weights (defaulting to equal weighting) that allow the analyst to encode prior knowledge about which evidence domains are most reliable for a given site.

Each SEI component is normalised to [0, 1] by dividing by its within-dataset maximum, making weights directly interpretable as relative importance. The SEI matrix is row-normalised and used both as a regularisation penalty during model fitting and as the basis for the ESE diagnostic.

### The Diagonal Gaussian Mixture EM

Phase decomposition proceeds via a Gaussian mixture model fitted by Expectation-Maximisation (EM). The feature matrix for each find is constructed by concatenating: scaled horizontal coordinates $(x, y)$, vertical coordinate $(z)$, temporal midpoint and temporal span of the date range, and a one-hot encoding of the cultural class variable. This multi-domain feature representation ensures that the clustering draws on all available evidence simultaneously.

The algorithm is initialised with K-means, and the resulting hard assignments are converted to softmax probabilities. The EM procedure then iterates between an E-step (computing posterior phase membership probabilities given current parameters) and an M-step (updating mixture weights, means, and diagonal covariance matrices given current memberships). Two modifications distinguish SEF from a standard Gaussian mixture:

1. **Taphonomic weighting**: if a taphonomic integrity score is available for each find, it is used to down-weight poorly preserved or disturbed items during parameter estimation, reducing their influence on cluster centroids.
2. **Stratigraphic entanglement penalties**: the SEI matrix is incorporated as a soft constraint, encouraging finds with high mutual entanglement to receive similar phase assignments.

Convergence is assessed by monitoring the log-likelihood; the algorithm terminates when the relative change falls below a tolerance threshold or a maximum number of iterations is reached.

### Diagnostic Statistics

Three statistics summarise different aspects of the decomposition:

- **Excavation Stratigraphic Energy (ESE)**: for each find, ESE measures its dissimilarity from its spatial neighbours in terms of phase assignment. A high ESE value indicates that a find is surrounded by items assigned to different phases --- a signature of depositional disruption, bioturbation, or post-depositional displacement.

- **Palimpsest Dissolution Index (PDI)**: a global statistic defined as $PDI = 1 - \bar{H} / \log(K)$, where $\bar{H}$ is the mean Shannon entropy of the phase-probability vectors across all finds and $K$ is the number of phases. PDI ranges from 0 (complete mixing; every find is equally likely to belong to any phase) to 1 (perfect separation; every find is assigned to exactly one phase with certainty). Values above 0.7 generally indicate well-resolved phases.

- **Entropy**: the Shannon entropy of each find's phase-probability vector, $H_i = -\sum_k p_{ik} \log p_{ik}$. Low entropy means confident assignment; high entropy signals ambiguity, typically at phase boundaries or within heavily mixed zones.

- **Intrusion score**: a composite metric combining high entropy, high ESE, and low SEI connectivity. Finds exceeding a threshold on this composite are flagged as candidate intrusions --- items that are spatially or stratigraphically anomalous relative to their assigned phase.

## The Dataset: Villa Romana

We demonstrate the complete analytical workflow using `villa_romana`, a realistic multi-period Roman villa dataset included in the package. It simulates 300 finds from a villa with 4 occupation phases across 18 stratigraphic units, incorporating realistic post-depositional disturbances.

```{r load-data}
library(palimpsestr)
data(villa_romana)
cat("Finds:", nrow(villa_romana), "\n")
cat("Stratigraphic units:", length(unique(villa_romana$context)), "\n")
cat("Material classes:", length(unique(villa_romana$class)), "\n")
cat("Date range:", min(villa_romana$date_min), "to", max(villa_romana$date_max), "\n")
```

The four phases represent:

| Phase | Period | Chronology |
|:------|:-------|:-----------|
| 1 | Republican | 2nd--1st c. BCE |
| 2 | Early Imperial | 1st--2nd c. CE |
| 3 | Late Imperial | 3rd--4th c. CE |
| 4 | Late Antique | 5th--6th c. CE |

The dataset includes realistic disturbances: bioturbation (8% vertical displacement), construction cuts (5% stratigraphic intrusions), and residual pottery (3% old material in younger contexts).

```{r data-summary}
cat("Material classes:\n")
sort(table(villa_romana$class), decreasing = TRUE)
```

```{r data-structure}
str(villa_romana)
```

## Phase Count Selection

When the number of phases is unknown, `compare_k()` fits models across a range of $K$ values and reports BIC, ICL, and PDI for each. The optimal $K$ minimises BIC while maximising PDI.

```{r compare-k}
ck <- compare_k(
  villa_romana,
  k_values = 2:7,
  tafonomy = "taf_score",
  context  = "context",
  seed     = 42
)
print(ck)
```

```{r gg-compare-k, fig.cap="Model selection diagnostics: BIC, PDI, and mean entropy across candidate phase counts. The dashed line marks the optimal K."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_compare_k(ck)
}
```

## Fitting the Model (K = 4)

Based on the selection diagnostics and archaeological knowledge of the site (4 occupation periods), we fit with K = 4 and multiple random initialisations.

```{r fit-model}
fit <- fit_sef(
  villa_romana,
  k        = 4,
  tafonomy = "taf_score",
  context  = "context",
  n_init   = 10,
  seed     = 42
)
print(fit)
```

```{r model-summary}
summary(fit)
```

### Model Overview

```{r model-stats}
cat("PDI:", pdi(fit), "\n")
sef_summary(fit)
```

### EM Convergence

```{r gg-convergence, fig.cap="EM convergence trace showing log-likelihood stabilisation across iterations."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_convergence(fit)
}
```

## Spatial Diagnostics

### Phase-field Map

```{r gg-phasefield, fig.cap="Dominant phase assignment. Point size reflects assignment confidence (inverse entropy). Larger points indicate higher confidence."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_phasefield(fit)
}
```

The phase-field map reveals the spatial structure of the decomposition. The model assigns each find to one of four chronologically coherent phases despite the presence of post-depositional disturbance.

### Entropy Map

```{r gg-entropy, fig.cap="Spatial distribution of Shannon entropy. High-entropy zones (bright) mark areas of uncertain phase assignment, typically at phase boundaries or in heavily mixed deposits."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_entropy(fit)
}
```

Entropy mapping is particularly useful for identifying the spatial extent of mixing. Zones of elevated entropy correspond to areas where multiple depositional phases are interleaved and cannot be cleanly separated.

### Energy Map (ESE)

```{r gg-energy, fig.cap="Excavation Stratigraphic Energy field. Elevated ESE values (bright) highlight zones of depositional disruption where neighbouring finds belong to different phases."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_energy(fit)
}
```

The energy map highlights local disruptions. Clusters of high-ESE finds may indicate bioturbation channels, pit cuts, or other post-depositional disturbances.

### Intrusion Detection Map

```{r gg-intrusions, fig.cap="Candidate intrusions overlaid on the phase-field map. Circled finds exhibit high entropy, high stratigraphic energy, and low entanglement with their neighbours. Top 10 suspects are labelled."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_intrusions(fit, top_n = 10)
}
```

### Vertical Phase Profile

```{r gg-phase-profile, fig.cap="Vertical phase profile: depth vs horizontal position. Phase 1 (deepest) corresponds to the earliest occupation; Phase 4 (shallowest) to the most recent."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_phase_profile(fit)
}
```

The vertical profile confirms the expected stratigraphic ordering: earlier phases occupy deeper strata, later phases are shallower.

## Intrusion Analysis

The `detect_intrusions()` function computes a composite intrusion probability for each find, combining entropy, ESE, and SEI connectivity into a single score.

```{r intrusions}
di <- detect_intrusions(fit)
suspects <- di[di$intrusion_prob > 0.5, ]
cat("Suspected intrusions:", nrow(suspects), "out of", nrow(villa_romana), "finds",
    sprintf("(%.1f%%)\n", 100 * nrow(suspects) / nrow(villa_romana)))
```

```{r intrusion-table}
if (nrow(suspects) > 0) {
  # Merge with source data and phase assignments
  phase_tab <- as_phase_table(fit)
  susp_data <- merge(suspects, villa_romana[, c("id", "context", "class")], by = "id")
  susp_data <- merge(susp_data, phase_tab[, c("id", "dominant_phase")], by = "id")
  susp_data <- susp_data[order(susp_data$intrusion_prob, decreasing = TRUE), ]
  susp_show <- susp_data[, c("id", "context", "class", "dominant_phase", "intrusion_prob")]
  names(susp_show) <- c("Find", "US", "Class", "Phase", "Intrusion Prob.")
  knitr::kable(head(susp_show, 15), row.names = FALSE, digits = 3,
               caption = "Top suspected intrusions ranked by intrusion probability.")
}
```

Flagged finds warrant closer examination: they may represent genuinely intrusive material (e.g., items displaced by animal burrowing or construction cuts), or they may sit at genuine phase boundaries where assignment is inherently ambiguous.

## Stratigraphic Unit Purity

The `us_summary_table()` function aggregates diagnostics per stratigraphic unit, reporting dominant phase, purity, mean entropy, energy, and intrusion count.

```{r us-purity}
us_tab <- us_summary_table(fit)
knitr::kable(us_tab, row.names = FALSE, digits = 3,
             caption = "Per-US diagnostics: dominant phase, purity, mean entropy, mean ESE, and number of intrusions.")
```

Units with purity > 90% indicate consistent phase assignment within the context. Lower purity and higher mean entropy correspond to disturbed contexts where bioturbation or construction cuts have introduced material from adjacent phases.

```{r purity-stats}
n_pure <- sum(us_tab$purity >= 0.9)
cat(n_pure, "out of", nrow(us_tab), "units have purity >= 90%\n")
```

## Phase Transition Matrix

Shows how often finds from phase $i$ sit directly above finds from phase $j$, revealing spatial adjacency between phases:

```{r transition-matrix}
tm <- phase_transition_matrix(fit)
print(tm)
```

High off-diagonal values indicate zones of contact between phases. This is expected at the boundary between successive occupation periods.

## Model Validation

When true phase assignments are known (e.g., from simulation or independently dated contexts), palimpsestr provides two validation metrics:

- **Adjusted Rand Index (ARI)**: measures agreement between estimated and true phases, corrected for chance. Values near 1 indicate perfect agreement; values near 0 indicate random agreement.
- **Confusion matrix**: cross-tabulates estimated versus true phases, with automatic reordering to maximise diagonal agreement.

```{r validation}
ari <- adjusted_rand_index(fit, villa_romana$true_phase)
cat("Adjusted Rand Index:", round(ari, 3), "\n")
```

```{r confusion}
confusion_matrix(fit, villa_romana$true_phase)
```

```{r gg-confusion, fig.cap="Confusion matrix heatmap. Strong diagonal = correct phase recovery. Off-diagonal cells indicate misattributed finds, often corresponding to bioturbated or redeposited material."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_confusion(fit, villa_romana$true_phase)
}
```

## Sensitivity Analysis of SEI Weights

The SEI combines four evidence domains (spatial, vertical, temporal, cultural) with configurable weights. To assess how weight choices affect the results, we compare three configurations:

```{r sensitivity}
configs <- list(
  equal    = c(ws = 1, wz = 1, wt = 1, wc = 1),
  spatial  = c(ws = 2, wz = 1, wt = 0.5, wc = 0.5),
  temporal = c(ws = 0.5, wz = 0.5, wt = 2, wc = 1)
)

sens <- do.call(rbind, lapply(names(configs), function(nm) {
  w <- configs[[nm]]
  f <- fit_sef(villa_romana, k = 4, weights = w, seed = 42,
               tafonomy = "taf_score", context = "context")
  data.frame(
    config = nm,
    pdi = pdi(f),
    ari = adjusted_rand_index(f, villa_romana$true_phase),
    mean_entropy = mean(f$entropy, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}))
knitr::kable(sens, digits = 4,
             caption = "Sensitivity of PDI, ARI, and mean entropy to weight configuration.")
```

The results show that the model is reasonably robust to weight variation, though emphasising the most informative domain for a given site can improve performance. The `optimize_weights()` function provides a data-driven approach to weight selection via cross-validated log-likelihood.

## Bootstrap Confidence Intervals

Uncertainty on key statistics can be quantified via bootstrap resampling:

```{r bootstrap}
bs <- bootstrap_sef(fit, n_boot = 50, true_labels = villa_romana$true_phase, verbose = FALSE)
knitr::kable(bs, digits = 4,
             caption = "Bootstrap confidence intervals (50 replicates) for key model statistics.")
```

```{r gg-bootstrap, fig.cap="Bootstrap confidence intervals for PDI, mean entropy, mean energy, log-likelihood, and ARI."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_bootstrap(bs)
}
```

## Harris Matrix Integration

palimpsestr can incorporate stratigraphic information from a Harris Matrix as a penalty during model fitting. The `harris_from_contexts()` function auto-generates a penalty matrix from the mean depth of finds in each context. Alternatively, `read_harris()` imports an external Harris Matrix from a CSV edge list.

```{r harris}
H <- harris_from_contexts(villa_romana, z_col = "z", context_col = "context")
cat("Harris penalty matrix:", nrow(H), "x", ncol(H), "\n")
```

```{r harris-fit}
fit_h <- fit_sef(villa_romana, k = 4, harris = H,
                 tafonomy = "taf_score", context = "context",
                 n_init = 5, seed = 42)
cat("PDI without Harris:", round(pdi(fit), 4), "\n")
cat("PDI with Harris:   ", round(pdi(fit_h), 4), "\n")
cat("ARI without Harris:", round(adjusted_rand_index(fit, villa_romana$true_phase), 4), "\n")
cat("ARI with Harris:   ", round(adjusted_rand_index(fit_h, villa_romana$true_phase), 4), "\n")
```

### Validate phases against stratigraphy

```{r harris-validate}
validate_phases_harris(fit)
```

The `consistent` column flags whether the dominant phase ordering respects stratigraphic depth. `FALSE` indicates an inversion that may warrant investigation.

## Interpretive Report

The `report_sef()` function generates a plain-language summary of the analysis, suitable for inclusion in excavation reports. It is available in English and Italian.

```{r report-en}
report_sef(fit, lang = "en")
```

```{r report-it}
report_sef(fit, lang = "it")
```

## Structured Exports

All results can be exported to CSV for integration with external databases and GIS:

```{r export, eval=FALSE}
export_results(fit, dir = "results/", format = "csv", prefix = "villa_romana")
```

This creates 4 files:

- `villa_romana_phases.csv` --- phase assignments and diagnostics per find
- `villa_romana_intrusions.csv` --- intrusion probabilities
- `villa_romana_us_summary.csv` --- per-US aggregated statistics
- `villa_romana_model_summary.csv` --- model-level metrics (n, k, PDI, BIC, etc.)

## GIS Export

For integration with GIS workflows, palimpsestr can export results as `sf` spatial objects:

```{r gis-export, eval=FALSE}
if (requireNamespace("sf", quietly = TRUE)) {
  sf_pts   <- as_sf_phase(fit, crs = 32632L, dims = "XYZ")
  sf_links <- as_sf_links(fit, quantile_threshold = 0.90, crs = 32632L)
  sf::st_write(sf_pts,   "villa_romana.gpkg", layer = "phases")
  sf::st_write(sf_links, "villa_romana.gpkg", layer = "sei_links")
}
```

## Interactive Visualization

When **plotly** is available, any `gg_*` plot can be converted to an interactive visualization using `as_plotly()`:

```{r interactive, eval=FALSE}
as_plotly(gg_phasefield(fit))
as_plotly(gg_intrusions(fit))
```

## Comparison with Traditional Approaches

| Aspect | Harris Matrix | k-means | palimpsestr (SEF) |
|:-------|:-------------|:--------|:------------------|
| Evidence domains | Stratigraphy only | Spatial only | Spatial + vertical + temporal + cultural |
| Assignment type | Deterministic | Hard | Probabilistic (soft) |
| Mixing detection | No | No | Yes (entropy, ESE) |
| Intrusion detection | Manual | No | Automatic |
| Scalability | Manual | Good | Good |
| Interpretability | High | Low | High (PDI, reports) |

The SEF framework does not replace the Harris Matrix. Rather, it extends it. The Harris Matrix remains indispensable for documenting observed physical relationships between depositional units. What palimpsestr adds is a quantitative, probabilistic layer of analysis that operates on find-level data within those units --- precisely the scale at which palimpsest formation is most problematic.

## Model Limitations and Methodological Notes

### Diagonal Covariance Assumption

The EM uses diagonal Gaussians, assuming conditional independence between features within each phase. This is a deliberate parsimony choice: for $p$ features and $k$ phases, diagonal covariance requires $k(2p+1)-1$ parameters versus $k(p(p+3)/2+1)-1$ for full covariance. With typical archaeological datasets (n = 50--500, p = 7--10), full covariance risks overfitting.

### SEI Component Normalisation

Each SEI component is normalised to [0, 1] by dividing by its within-dataset maximum. This makes weights interpretable but means **absolute SEI values are not comparable across datasets**. Cross-site comparisons should use derived statistics (PDI, mean entropy) which are scale-invariant.

### PDI Interpretation

The PDI is a descriptive measure, not a formal test statistic. The interpretive guideline (PDI > 0.7 = well-resolved) is empirically derived from simulation studies and should not be treated as a formal decision boundary. Bootstrap confidence intervals provide uncertainty quantification.

### Intrusion Detection

The intrusion probability score is a heuristic combining rescaled entropy, ESE, and inverse local SEI. The 0.5 threshold is an exploratory guideline. Users should examine flagged observations individually using domain knowledge.

### Phase Labelling

Phase numbers are arbitrary in mixture models (label switching). Use `reorder_phases()` to ensure phase 1 = deepest (oldest) stratum. The bootstrap procedure applies phase reordering to each replicate.

### Cross-Validation

The `cv_sef()` and `optimize_weights()` functions standardise test data using the training set's center and scale, ensuring consistent feature scales across folds.

### Scalability

The SEI and ESE matrices are $O(n^2)$ in memory. For datasets exceeding ~2000 finds, use `sei_sparse()` or the `max_dist` parameter in `sei_matrix()` to limit computation to spatial neighbours.

## Conclusion

The **palimpsestr** package offers a unified probabilistic framework for decomposing archaeological palimpsests, addressing a long-standing gap between qualitative stratigraphic reasoning and quantitative spatial analysis. By jointly modelling spatial proximity, vertical distribution, chronological overlap, and cultural similarity, the Stratigraphic Entanglement Field approach produces soft phase assignments that acknowledge --- rather than suppress --- the inherent uncertainty of mixed deposits. The three diagnostic statistics (SEI, ESE, PDI) provide interpretable, actionable summaries at the find, deposit, and assemblage levels respectively.

The package is available on GitHub at <https://github.com/enzococca/palimpsestr>.

```{r eval=FALSE}
# install.packages("remotes")
remotes::install_github("enzococca/palimpsestr")
```