---
title: "Policy Design"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Policy Design}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = TRUE)
```

```{css, echo = FALSE, eval = TRUE}
.llmshieldr-info-box {
  border-left: 4px solid #2f80ed;
  background: #f3f8ff;
  padding: 1rem 1.15rem;
  margin: 1.5rem 0;
  border-radius: 0.35rem;
}

.llmshieldr-info-box h2,
.llmshieldr-info-box h3,
.llmshieldr-info-box h4 {
  margin-top: 0;
}

.llmshieldr-info-box p:last-child,
.llmshieldr-info-box ul:last-child,
.llmshieldr-info-box ol:last-child {
  margin-bottom: 0;
}
```

This vignette explains how `llmshieldr` policies are assembled, what sources the
built-in policies draw from, and how the numeric scores become actions.

```{r}
library(llmshieldr)
```

## Design Goal

An LLM safety policy should be easy to inspect, easy to test, and easy to
extend. `llmshieldr` therefore uses explicit rule objects rather than a hidden
classifier as its default layer.

A policy is a list with these fields:

```text
shieldr_policy
  name             policy identifier stored in reports
  rules            list of shieldr_rule objects
  thresholds       redact_at and block_at numeric cutoffs
  rate_guard       optional shieldr_rate_guard environment
  trusted_sources  optional allowlist used by scan_context()
  controls         secure_chat() block/refuse/escalate/drop behavior
```

Each rule is similarly explicit:

```text
shieldr_rule
  id           stable rule identifier
  pattern      regex pattern, or NULL
  fn           R predicate function, or NULL
  owasp        OWASP LLM category
  severity     low, medium, high, or critical
  action       allow, redact, or block
  description  human-readable explanation
```

Exactly one of `pattern` or `fn` must be supplied. This keeps each rule
deterministic and makes redaction spans possible for regex rules.

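The "exactly one of `pattern` or `fn`" constraint, and the reason regex rules can yield spans, can be illustrated in base R. This is a minimal sketch and not the package's constructor: `new_rule_sketch()`, the email pattern, and the rule ids are hypothetical names chosen for illustration, and the real `shieldr_rule` validation may differ.

```r
# Hypothetical helper (not part of llmshieldr): builds a plain list shaped
# like the shieldr_rule fields above and enforces the pattern/fn constraint.
new_rule_sketch <- function(id, pattern = NULL, fn = NULL,
                            owasp, severity, action, description = "") {
  # xor() is TRUE only when exactly one of pattern/fn is NULL,
  # which means exactly one of them was supplied.
  if (!xor(is.null(pattern), is.null(fn))) {
    stop("Supply exactly one of `pattern` or `fn`.", call. = FALSE)
  }
  severity <- match.arg(severity, c("low", "medium", "high", "critical"))
  action <- match.arg(action, c("allow", "redact", "block"))
  list(id = id, pattern = pattern, fn = fn, owasp = owasp,
       severity = severity, action = action, description = description)
}

# A regex rule: the match position and length give a character span.
email_rule <- new_rule_sketch(
  id = "llm02.email_sketch",
  pattern = "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}",
  owasp = "llm02", severity = "medium", action = "redact"
)

text <- "Contact neel@example.com about the ticket."
hit <- regexpr(email_rule$pattern, text)
span <- c(start = as.integer(hit),
          end = as.integer(hit) + attr(hit, "match.length") - 1L)
substr(text, span[["start"]], span[["end"]])

# A function rule: it can only answer TRUE/FALSE, so no span is available.
long_rule <- new_rule_sketch(
  id = "llm10.long_input_sketch",
  fn = function(text) nchar(text) > 10000L,
  owasp = "llm10", severity = "low", action = "allow"
)
long_rule$fn(text)
```

A predicate function returns only a yes/no answer, which is why span-based redaction is tied to regex rules in this sketch.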

## Source Model

The built-in rules are sourced from a small number of security and governance
concerns that recur across LLM applications:

- OWASP GenAI / LLM Top 10:
- Prompt-injection patterns: direct override language, hidden instructions,
  role confusion, and system-prompt extraction attempts
- NLP intent signals: token and stem patterns for override, secret exposure,
  harmful intent, and unusually dense directive language. Trigger seed groups
  are expanded with stems at runtime.
- Sensitive information patterns: email, phone, SSN, account numbers, medical
  record identifiers, subject IDs, API keys, bearer tokens, AWS keys, and
  credential-bearing connection strings
- Agentic risk patterns: model claims that it sent, deleted, granted, executed,
  notified, traded, or otherwise acted outside the chat boundary
- Output handling patterns: unsafe shell or SQL snippets, code execution
  markers, system-prompt structure, and high-confidence medical or financial
  claims
- RAG-specific signals: untrusted source metadata, anomalous chunk length, and
  high density of instruction words inside retrieved text
- Resource-exhaustion controls: request and token windows managed by
  `rate_guard()`, with optional strict pre-call reservation and local
  file-locking for shared guards
- Optional scanner controls: invisible Unicode format characters, encoded
  payloads, URL host policies, token limits, simple language allowlists, and
  topic bans
- Runtime surfaces: conversation scanning, tool-call scanning, tool-output
  scanning, and streaming output scanning with rolling context

These sources are intentionally conservative. They are meant to catch common
failure modes in R workflows before prompts, retrieved context, or model output
cross a trust boundary.

## Built-In Policy Construction

Every built-in policy is assembled in `policy()`.

```text
enterprise_default
  prompt injection rules
  PII and secret rules
  system prompt extraction rules
  excessive agency rules

pharma_gxp
  enterprise_default
  MRN and clinical subject identifiers
  diagnosis and treatment claim language
  code-safety checks
  stricter thresholds: redact_at = 0.3, block_at = 0.6

finance_strict
  enterprise_default
  account number rules
  financial advice and guaranteed-return language
  autonomous investment-action language
  rate_guard(max_tokens = 100000)

education_safe
  enterprise_default
  minor-related PII patterns
  academic-integrity bypass language

open_research
  injection rules
  secret rules
  higher thresholds: redact_at = 0.8, block_at = 0.95

comprehensive
  enterprise_default
  pharma_gxp additions
  finance_strict additions
  education_safe additions
  code-safety checks
  rate_guard(max_tokens = 100000)
  moderate thresholds: redact_at = 0.4, block_at = 0.7

custom
  no rules
  default thresholds

baseline
  alias for enterprise_default
```

The policy names reflect intended operating posture, not legal compliance. For
example, `pharma_gxp` adds useful GxP-style controls, but it does not by itself
make a workflow compliant with any regulation.

The `comprehensive` policy is broad rather than maximally strict. Use explicit
threshold overrides when you want pharma-tier thresholds:

```{r}
policy(
  "comprehensive",
  overrides = list(thresholds = list(redact_at = 0.3, block_at = 0.6))
)
```

## Check Modes

Scanners can run different layers depending on the workflow:

- `checks = "rules"` runs the policy's deterministic rules. Built-in policies
  include regex rules and the NLP intent rule.
- `checks = "nlp"` runs only NLP intent rules. This is the local token/stem
  path that uses `tokenizers` and `SnowballC` when installed.
- `checks = "llm"` runs only the supplied reviewer, such as `ollama_reviewer()`
  or your own reviewer function.
- `checks = "both"` runs policy rules and the supplied reviewer.
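
The four modes above can be read as a lookup from mode to the layers that run. The following base R sketch only restates that list as data. `layers_for_checks()` and the layer names are hypothetical and are not part of the `llmshieldr` API.

```r
# Hypothetical lookup (not an llmshieldr function): which layers each
# `checks` mode runs, following the list above.
layers_for_checks <- function(checks = c("rules", "nlp", "llm", "both")) {
  checks <- match.arg(checks)
  switch(checks,
    # All deterministic policy rules, including the NLP intent rule.
    rules = c(other_policy_rules = TRUE, nlp_intent_rules = TRUE,
              reviewer = FALSE),
    # Only the NLP intent subset of the policy rules.
    nlp = c(other_policy_rules = FALSE, nlp_intent_rules = TRUE,
            reviewer = FALSE),
    # Only the supplied reviewer.
    llm = c(other_policy_rules = FALSE, nlp_intent_rules = FALSE,
            reviewer = TRUE),
    # Policy rules plus the supplied reviewer.
    both = c(other_policy_rules = TRUE, nlp_intent_rules = TRUE,
             reviewer = TRUE)
  )
}

# Tabulate every mode side by side.
sapply(c("rules", "nlp", "llm", "both"), layers_for_checks)
```

Read this way, `"nlp"` is a subset of `"rules"`, and `"both"` is the union of `"rules"` and `"llm"`.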

::: {.llmshieldr-info-box}
### Semantic Reviewer Contract

The semantic reviewer prompt is inspectable with `reviewer_prompt()`. If you
need additional reviewer instructions, wrap the reviewer function or chat object
and prepend additive organization-specific context before delegating to the
model. Treat `reviewer_prompt()` as an audit/inspection helper, not as a mutable
package setting, and preserve the JSON contract below.

Reviewer output may be either an array of finding objects or an object with a
`findings` array. Each finding can include:

```text
rule_id             stable reviewer rule identifier
owasp               OWASP category such as llm01
severity            low, medium, high, or critical
description         human-readable explanation
confidence          optional number from 0 to 1
evidence            optional short evidence string
recommended_action  optional allow, redact, or block
span                optional start/end offsets for redaction
```

Malformed JSON and schema issues are soft failures. The scanner warns, keeps
deterministic findings, and stores structured reviewer errors in
`report$metadata$reviewer_errors`.
:::

## Scoring

Each finding contributes a numeric severity weight.

| Severity | Contribution |
| --- | ---: |
| `low` | 0.1 |
| `medium` | 0.3 |
| `high` | 0.6 |
| `critical` | 1.0 |

The scanner sums contributions and caps the total at `1.0`.

Findings are deduplicated before scoring. Overlapping span findings from the
same source, OWASP category, and action count as the strongest single piece of
evidence instead of stacking together. Synthetic context findings are scored
separately and capped, so source or anomaly signals cannot by themselves
dominate a report.

```text
findings:
  medium email finding   0.3
  high secret finding    0.6

risk_score = min(0.3 + 0.6, 1.0) = 0.9
```

The score is deliberately coarse. It is not a probability. It is a deterministic
severity index that helps a policy decide whether to allow, redact, or block.

## Action Resolution

Thresholds define how much risk the policy tolerates.

```text
default thresholds:
  redact_at = 0.40
  block_at  = 0.75
```

Action resolution is conservative:

```text
if any finding is critical:
  block
else if any finding's rule action is block:
  block
else if risk_score > block_at:
  block
else if any finding's rule action is redact:
  redact
else if risk_score >= redact_at:
  redact
else:
  allow
```

This means a policy can block because of accumulated score, a critical finding,
or a single rule that explicitly asks to block. Because the block comparison is
strict (`risk_score > block_at`), a single high-severity redaction finding does
not block solely because its score equals the block threshold.

## Redaction

Regex rules create match spans. When the resolved action is `redact`, or when a
redacting rule contributes to a report, matched spans are replaced with
`[REDACTED]`.

```text
input:  Contact neel@example.com about the ticket.
output: Contact [REDACTED] about the ticket.
```

Function rules may not know exact character spans. They can still contribute
findings and affect the action, but span redaction depends on span data.

Use `redaction_strategy()` to change the replacement operator:

```{r}
scan_prompt(
  "Contact neel@example.com.",
  redaction = redaction_strategy("mask")
)

scan_prompt(
  "Contact neel@example.com.",
  redaction = redaction_strategy("hash")
)
```

Hash redaction is deterministic for the same matched text. It can support
linkage in audits, but it is not anonymization.

## Optional Scanners

Optional scanners run beside policy rules and return normal findings.

```{r}
scanners <- scanner_options(
  max_tokens = 500,
  blocked_topics = c("unreleased earnings"),
  allowed_url_hosts = c("example.com", "docs.example.com")
)

scan_prompt(
  "Email neel@example.com about unreleased earnings.",
  scanners = scanners
)
```

Default scanner options record invisible Unicode format characters and inspect
encoded payloads by decoding candidate URL-encoded and base64-like text, then
running policy rules over the decoded text.
URL host policies, token limits, language allowlists, and topic bans are
enabled only when configured.

## Policy Controls

Scanner reports resolve to `allow`, `redact`, or `block`. `secure_chat()` then
uses policy controls to decide what to do with blocked surfaces.

```{r}
controlled <- policy(
  "enterprise_default",
  overrides = list(
    controls = policy_controls(
      on_prompt_block = "refuse",
      on_context_block = "drop",
      on_output_block = "escalate",
      refusal_message = "Please rephrase the request."
    )
  )
)
```

`on_context_block = "drop"` is the default because retrieved context is an
untrusted auxiliary input. Other context options are `keep_redacted`, `block`,
`refuse`, and `escalate`.

## Context Anomaly Scores

`scan_context()` adds RAG-specific findings before returning row-aligned
reports. It calculates:

- robust z-score of character length
- robust z-score of instruction-word density

Instruction density counts these words per 100 tokens:

```text
ignore, forget, override, instead, disregard
```

Rows above `anomaly_threshold`, default `2.5`, receive high-severity OWASP LLM08
findings. If `source_col` is supplied and the policy has `trusted_sources`, rows
from outside that allowlist receive a medium-severity OWASP LLM08 finding.

These RAG-specific findings are marked as synthetic. Their combined contribution
is capped at `0.3` per row before being added to normal rule findings.

When context is passed through `secure_chat()`, included rows are assembled with
explicit `context row` labels, source labels when available, and separator
lines. This keeps retrieved text visually distinct from the user prompt and
preserves a row-level path back to audit data.

## Risk Summary

`secure_chat()` returns `risk_summary`, a named numeric vector keyed by OWASP
category.

```text
llm01 = prompt injection and instruction override findings
llm02 = sensitive information and secret findings
llm06 = excessive agency findings
llm08 = retrieved-context trust and anomaly findings
llm09 = misinformation and unsupported claim findings
llm10 = resource exhaustion and rate-guard findings
```

Each category is capped at `1.0`. The value is useful for dashboards and audit
summaries because it shows which risk category dominated a run.

## Rate Guard Semantics

`rate_guard()` now checks projected usage before counters are incremented.
`secure_chat()` reserves one request before the chat call. With `strict = TRUE`,
it also reserves an estimated prompt token count before the call and records
only the positive post-call delta after output scanning. If the chat call or
output scan fails, the pre-call reservation is rolled back.

`concurrent = TRUE` wraps each guard operation in a local file lock through the
optional `filelock` package.

## Extending a Policy

Start with the closest built-in policy, add rules, then inspect the resulting
inventory.

```{r}
guardrails <- policy()

guardrails <- add_rule(
  guardrails,
  id = "llm02.ticket_id",
  pattern = "\\bTICKET-[0-9]{6}\\b",
  owasp = "llm02",
  severity = "medium",
  action = "redact",
  description = "Internal support ticket identifier."
)

list_rules(guardrails)
```

When a policy becomes important to production, keep its custom rules in package
or application code, test them with representative prompts, and write audit logs
for real workflow runs.
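
One way to test a custom rule with representative prompts is to exercise its regex directly in base R before wiring it into a policy. The sketch below checks the ticket pattern from the example above against a few hand-written cases. The prompts are illustrative, and `perl = TRUE` is an assumption about the regex engine, so confirm the behavior through the package's own scanners as well.

```r
# Pattern copied from the add_rule() example above.
ticket_pattern <- "\\bTICKET-[0-9]{6}\\b"

# Representative prompts (illustrative) with the expected match outcome.
cases <- data.frame(
  prompt = c(
    "Please reopen TICKET-123456 today.",  # exactly six digits: match
    "See TICKET-12345 for details.",       # five digits: no match
    "Ref TICKET-1234567.",                 # seven digits: no match
    "ticket-123456 is lowercase."          # wrong case: no match
  ),
  expect_match = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Assumption: a PCRE-style engine; llmshieldr's matching flags may differ.
cases$matched <- grepl(ticket_pattern, cases$prompt, perl = TRUE)

# Fail loudly if the pattern drifts from the intended behavior.
stopifnot(identical(cases$matched, cases$expect_match))

# Preview what a span replacement would look like for the matching prompt.
redacted <- gsub(ticket_pattern, "[REDACTED]", cases$prompt[1], perl = TRUE)
redacted
```

Keeping negative cases next to the positive one guards against a later edit that loosens the digit count or the word boundaries.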