---
title: "Evaluation and Evidence"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Evaluation and Evidence}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = TRUE)
```

```{css, echo = FALSE, eval = TRUE}
.llmshieldr-info-box {
  border-left: 4px solid #2f80ed;
  background: #f3f8ff;
  padding: 1rem 1.15rem;
  margin: 1.5rem 0;
  border-radius: 0.35rem;
}

.llmshieldr-info-box h2,
.llmshieldr-info-box h3,
.llmshieldr-info-box h4 {
  margin-top: 0;
}

.llmshieldr-info-box p:last-child,
.llmshieldr-info-box ul:last-child,
.llmshieldr-info-box ol:last-child {
  margin-bottom: 0;
}
```

`llmshieldr` includes a small starter corpus and an evaluation helper so teams can measure behavior before adopting a policy. The corpus is intentionally small. It is meant to start a repeatable process, not prove production-grade security.

```{r}
library(llmshieldr)
```

## Corpus

The packaged corpus lives at `inst/extdata/security_eval_cases.csv`. It covers:

- benign prompts,
- direct and indirect prompt injection,
- delimiter, invisible-text, Unicode, and encoded evasions,
- PII, PHI, and secrets,
- unsafe code,
- excessive agency,
- system-prompt extraction,
- medical and financial misinformation,
- clinical, finance, education, developer, and URL false-positive cases.

Each row includes:

- `id`: stable case identifier.
- `stage`: prompt, context, or output.
- `category`: human-readable risk type.
- `owasp`: mapped OWASP LLM category, or `none`.
- `label`: benign, sensitive, or malicious.
- `text`: input text to scan.
- `expected_action`: expected scanner action.
- `notes`: why the case exists.

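
Because every row is expected to carry the eight columns listed above, a quick structural check can catch a malformed extension of the corpus before it is evaluated. The sketch below is not part of `llmshieldr`: `check_corpus()` and the toy data frame are hypothetical, the `expected_action` values are placeholders, and the allowed `stage` and `label` values assume the corpus stores them as the lowercase words used in the column descriptions.

```{r}
# Hypothetical helper (not part of llmshieldr): checks that a corpus data
# frame has the documented columns and only the documented stage/label values.
check_corpus <- function(cases) {
  required <- c("id", "stage", "category", "owasp", "label",
                "text", "expected_action", "notes")
  missing_cols <- setdiff(required, names(cases))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }

  # Collect human-readable problems instead of failing on the first one.
  problems <- character(0)
  if (anyDuplicated(cases$id) > 0) {
    problems <- c(problems, "duplicate id values")
  }
  if (!all(cases$stage %in% c("prompt", "context", "output"))) {
    problems <- c(problems, "unexpected stage values")
  }
  if (!all(cases$label %in% c("benign", "sensitive", "malicious"))) {
    problems <- c(problems, "unexpected label values")
  }
  problems
}

# Toy rows for illustration only; they are not taken from the packaged corpus,
# and the expected_action values are assumed placeholders.
toy_cases <- data.frame(
  id = c("toy-001", "toy-002"),
  stage = c("prompt", "output"),
  category = c("benign prompt", "secrets"),
  owasp = c("none", "none"),
  label = c("benign", "sensitive"),
  text = c("Summarize this meeting agenda.", "The API key is <placeholder>."),
  expected_action = c("allow", "redact"),
  notes = c("plain request", "secret-like string in output"),
  stringsAsFactors = FALSE
)

# An empty character vector means no structural problems were found.
check_corpus(toy_cases)
```

Returning the problems as a character vector, rather than stopping on the first one, makes it easier to review all issues in an organization-specific extension in a single pass.
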

Inspect it before running benchmarks:

```{r}
path <- system.file("extdata", "security_eval_cases.csv", package = "llmshieldr")
cases <- read.csv(path, stringsAsFactors = FALSE)
cases[, c("id", "stage", "category", "expected_action")]
```

## Run the Evaluation

```{r}
results <- evaluate_security_cases(
  cases = cases,
  policy = "comprehensive",
  checks = "rules"
)
results
```

Useful headline metrics:

```{r}
data.frame(
  cases = nrow(results),
  action_accuracy = mean(results$matched),
  median_latency_ms = median(results$latency_ms),
  p95_latency_ms = as.numeric(quantile(results$latency_ms, 0.95))
)
```

For release notes, report the package version, R version, optional dependency versions, policy name, check mode, and reviewer model when `checks = "llm"` or `checks = "both"`.

## Interpret Results

Recommended reporting:

- Detection rate for malicious cases.
- Redaction rate for sensitive cases.
- False-positive rate for benign cases.
- Action accuracy against `expected_action`.
- Median and p95 scan latency.
- False positives and false negatives by case id.

Keep deterministic rules, NLP checks, and semantic reviewer checks separate. Semantic reviewer behavior depends on the model, prompt wrapper, temperature, endpoint behavior, and JSON reliability.

Do not present OWASP taxonomy mapping as proof of effective protection. Include false positives and false negatives in release notes when they affect documented behavior.

Keep the packaged corpus compact enough for tests, and keep larger benchmarks in separate scripts or long-running external reports.

## Opt-In Benchmark Script

The repository also includes:

```text
inst/scripts/benchmark-security-eval.R
```

Run it locally before releases or adoption reviews. It prints action accuracy, median latency, p95 latency, package version, R version, and per-case results.

::: {.llmshieldr-info-box}

## Caveats

The starter corpus is deliberately transparent and compact.
It should be extended with organization-specific benign and risky examples before production use. Do not present OWASP category mapping or action accuracy on this corpus as proof that a workflow is secure, compliant, jailbreak-proof, or complete for PII/PHI discovery.

:::

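
The per-label rates recommended under "Interpret Results" can be sketched on a toy results table. This is a minimal illustration under stated assumptions, not the package's own reporting code: `matched` and `latency_ms` mirror the columns used in the headline metrics, while the `label` and `action` columns, the action names `allow`, `redact`, and `block`, and the helper `rate_for()` are hypothetical and should be adapted to whatever `evaluate_security_cases()` actually returns.

```{r}
# Toy results for illustration only. `label`, `action`, and the action names
# ("allow", "redact", "block") are assumptions, not documented output.
toy_results <- data.frame(
  id = c("t1", "t2", "t3", "t4", "t5", "t6"),
  label = c("benign", "benign", "sensitive", "sensitive",
            "malicious", "malicious"),
  expected_action = c("allow", "allow", "redact", "redact", "block", "block"),
  action = c("allow", "block", "redact", "allow", "block", "block"),
  latency_ms = c(2.1, 2.4, 3.0, 2.8, 3.5, 3.2),
  stringsAsFactors = FALSE
)
toy_results$matched <- toy_results$action == toy_results$expected_action

# Hypothetical helper: share of rows with a given label for which `hit` is TRUE.
rate_for <- function(results, label, hit) {
  mean(hit[results$label == label])
}

# A case counts as flagged when the scanner did anything other than allow it.
flagged <- toy_results$action != "allow"

report <- data.frame(
  detection_rate = rate_for(toy_results, "malicious", flagged),
  redaction_rate = rate_for(toy_results, "sensitive",
                            toy_results$action == "redact"),
  false_positive_rate = rate_for(toy_results, "benign", flagged),
  action_accuracy = mean(toy_results$matched)
)
report

# Case ids whose action differed from the expected one, for release notes.
toy_results$id[!toy_results$matched]
```

Reporting the mismatched ids next to the rates keeps individual false positives and false negatives visible instead of hiding them behind one accuracy figure.
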