--- title: "Prompt Template Positional Bias Testing" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Prompt Template Positional Bias Testing} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center" ) library(pairwiseLLM) library(dplyr) library(readr) library(tidyr) library(stringr) library(knitr) ``` # 1. Motivation `pairwiseLLM` uses large language models (LLMs) to compare pairs of writing samples and decide which sample is better on a given trait (for example, _Overall Quality_). If a **prompt template** systematically nudges the model toward the first or second position, then scores derived from these comparisons may be biased. This vignette documents how we: - Designed and tested several prompt templates for **positional bias** - Quantified both **reverse-order consistency** and **preference for SAMPLE\_1** - Selected templates that appear robust across multiple providers and reasoning configurations The vignette also shows how to: - Retrieve the tested templates from the package - Inspect their full text - Access summary statistics from the experiments For basic function usage, see: * [`vignette("getting-started")`](https://shmercer.github.io/pairwiseLLM/articles/getting-started.html) For advanced batch processing workflows, see: * [`vignette("advanced-batch-workflows")`](https://shmercer.github.io/pairwiseLLM/articles/advanced-batch-workflows.html) --- # 2. Testing Process Summary At a high level, the testing pipeline works as follows: 1. **Trait and samples** - Choose a trait (here: `"overall_quality"`) and obtain its description with `trait_description()`. - Use `example_writing_samples` or your own dataset of writing samples. 2. **Generate forward and reverse pairs** - Use `make_pairs()` to generate all ordered pairs. - Use `alternate_pair_order()` to build a deterministic "forward" set. - Use `sample_reverse_pairs()` with `reverse_pct = 1` to build a fully "reversed" set, where SAMPLE\_1 and SAMPLE\_2 are swapped for all pairs. 3. **Prompt templates** - Define multiple templates (e.g., `"test1"`–`"test5"`) and register them in the template registry. - Each template is a text file shipped with the package and accessed via `get_prompt_template("testX")`. 4. **Batch calls to LLM providers** - For each combination of: - Template (`test1`–`test5`) - Backend (Anthropic, Gemini, OpenAI, TogetherAI) - Model (e.g., `claude-sonnet-4-5`, `gpt-4o`, `gemini-3-pro-preview`) - Thinking configuration (`"no_thinking"` vs `"with_thinking"`, where applicable) - Direction (`forward` vs `reverse`) - Submit the forward and reverse pairs to the provider's batch API using dev scripts such as: - `dev/dev-positional-bias-all-models.R` - `dev/dev-positional-bias-all-models-rebuild.R` - `dev/dev-together-template-positional-bias.R` - Store responses as CSVs, including the model’s `` decision and derived `better_id`. 5. **Reverse-order consistency** - For each (template, provider, model, thinking), compare: - The model’s decisions for a pair in the forward set - The decisions for the same pair in the reverse set (where positions are swapped) - Use `compute_reverse_consistency()` to compute: - `prop_consistent`: proportion of comparisons where reversing the order yields the same **underlying winner**. 6. 
**Positional bias statistics** - Use `check_positional_bias()` on the reverse-consistency results to quantify: - `prop_pos1`: proportion of all comparisons where SAMPLE\_1 is chosen as better. - `p_sample1_overall`: p-value from a binomial test of whether the probability of choosing SAMPLE\_1 differs from 0.5. 7. **Summarize and interpret** - Aggregate the results across templates and models into a summary table. - Look for templates with: - High `prop_consistent` (close to 1). - `prop_pos1` close to 0.5. - Non-significant positional bias (`p_sample1_overall` not < .05). In the sections below we show how to retrieve the templates, how they are intended to be used, and how to examine the summary statistics for the experiment. --- # 3. Trait descriptions and custom traits In the tests, we evaluated samples for overall quality. ```{r} td <- trait_description("overall_quality") td ``` In *pairwiseLLM*, every pairwise comparison evaluates writing samples on a **trait** — a specific dimension of writing quality, such as: - **Overall Quality** - **Organization** - **Development** - **Language** The trait determines *what the model should focus on* when choosing which sample is better. Each trait has: - a **short name** (e.g., `"overall_quality"`) - a **human-readable name** (e.g., `"Overall Quality"`) - a **textual description** used inside prompts The function that supplies these definitions is: ```r trait_description(name, custom_name = NULL, custom_description = NULL) ``` --- ## 3.1 Built-in traits The package includes some predefined traits accessible by name: ```r trait_description("overall_quality") trait_description("organization") ``` Calling `trait_description()` with a built-in trait name returns a list with: ```r $name # human-friendly name $description # the textual rubric used in prompts ``` Example: ```r td <- trait_description("organization") td$name td$description ``` This description is inserted into your chosen prompt template wherever `{TRAIT_DESCRIPTION}` appears. --- ## 3.2 Setting a different built-in trait To switch evaluations to another trait, simply pass its short name: ```r td <- trait_description("organization") prompt <- build_prompt( template = get_prompt_template("test1"), trait_name = td$name, trait_desc = td$description, text1 = sample1, text2 = sample2 ) ``` This will automatically update all trait-specific wording in the prompt. --- ## 3.3 Creating a custom trait If your study requires a new writing dimension, you can define your own trait directly in the call: ```r td <- trait_description( custom_name = "Clarity", custom_description = "Clarity refers to how easily a reader can understand the writer's ideas, wording, and structure." ) td$name #> [1] "Clarity" td$description #> [1] "Clarity refers to how easily ..." ``` No built-in name needs to be supplied when using custom text: ```r prompt <- build_prompt( template = get_prompt_template("test2"), trait_name = td$name, trait_desc = td$description, text1 = sample1, text2 = sample2 ) ``` --- ## 3.4 Why traits matter for positional bias testing Traits determine the **criterion of comparison**, and different traits may produce different sensitivity patterns in LLM behavior.
For example: - “Overall Quality” may yield more stable results than “Development” - Short, concise trait definitions may reduce positional bias - Custom traits allow experimentation with alternative rubric wordings Because positional bias interacts with how the model interprets the trait, *every trait–template combination* can be evaluated using the same workflow described earlier in this vignette. --- # 4. Example data used in tests The positional-bias experiments in this vignette use the `example_writing_samples` dataset that ships with the package. Each row represents a student writing sample and includes: - an identifying ID, - a `text` field containing the full written response. Below we print the 20 writing samples included in the dataset. This dataset provides a reproducible testing base; in real applications, you would use your own writing samples. ```{r} data("example_writing_samples", package = "pairwiseLLM") # Inspect the structure glimpse(example_writing_samples) # Print the 20 samples (full text) example_writing_samples |> kable( caption = "20 example writing samples included with pairwiseLLM." ) ``` --- # 5. Built-in prompt templates The tested templates are stored as plain-text files in the package and exposed via the template registry. You can retrieve them with `get_prompt_template()`: ```{r} template_ids <- paste0("test", 1:5) template_ids ``` Use `get_prompt_template()` to view the text: ```{r} cat(substr(get_prompt_template("test1"), 1, 500), "...\n") ``` The same pattern works for all templates: ```{r, eval = FALSE} # Retrieve another template tmpl_test3 <- get_prompt_template("test3") # Use it to build a concrete prompt for a single comparison pairs <- example_writing_samples |> make_pairs() |> head(1) prompt_text <- build_prompt( template = tmpl_test3, trait_name = td$name, trait_desc = td$description, text1 = pairs$text1[1], text2 = pairs$text2[1] ) cat(prompt_text) ``` --- # 6. Forward and reverse pairs Here is a small example of how we constructed forward and reverse datasets for each experiment: ```{r} pairs_all <- example_writing_samples |> make_pairs() pairs_forward <- pairs_all |> alternate_pair_order() pairs_reverse <- sample_reverse_pairs( pairs_forward, reverse_pct = 1.0, seed = 2002 ) pairs_forward[1:3, c("ID1", "ID2")] pairs_reverse[1:3, c("ID1", "ID2")] ``` In `pairs_reverse`, SAMPLE\_1 and SAMPLE\_2 are swapped for every pair relative to `pairs_forward`. The pairs themselves are unchanged (only the positions are swapped), so forward and reverse results can be matched pair by pair. --- # 7. Thinking / Reasoning Configurations Used in Testing Many LLM providers now expose *reasoning-enhanced* decoding modes (sometimes called “thinking,” “extended reasoning,” or “chain-of-thought” modes). In `pairwiseLLM`, these modes are exposed through a simple parameter: ``` thinking = "no_thinking" # standard inference mode thinking = "with_thinking" # activates provider's reasoning system ``` However, the *actual meaning* of these settings is **backend-specific**. Below we describe the exact configurations used in our positional-bias tests. --- ## 7.1 Anthropic (Claude 4.5 models) Anthropic's batch API allows explicit control over the reasoning system.
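As a rough illustration, the two configurations described below correspond to request bodies along the following lines. This is a minimal sketch: the field names follow Anthropic's Messages API, `claude-sonnet-4-5` is the model used in testing, and the dev scripts may assemble the actual requests differently.

```r
# Illustrative request-body fragments only; not the exact objects built by pairwiseLLM.
req_no_thinking <- list(
  model = "claude-sonnet-4-5",
  temperature = 0
  # no `thinking` field: extended thinking stays disabled
)

req_with_thinking <- list(
  model = "claude-sonnet-4-5",
  temperature = 1, # Anthropic requires the default temperature when extended thinking is enabled
  thinking = list(type = "enabled", budget_tokens = 1024)
)
```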
### `thinking = "no_thinking"` - `reasoning = "none"` - `temperature = 0` - Thinking tokens disabled - Intended to give **deterministic** behavior ### `thinking = "with_thinking"` - `reasoning = "enabled"` - `temperature = 1` - `include_thoughts = TRUE` - `thinking_budget = 1024` (max internal reasoning tokens) - Produces Claude’s full structured reasoning trace (not returned to the user) This mode yields **more reflective but less deterministic** decisions. --- ## 7.2 Gemini 3 Pro Preview Gemini’s batch API exposes reasoning through the `thinkingLevel` field. ### Only `thinking = "with_thinking"` was used Settings used: - `thinkingLevel = "low"` - `includeThoughts = TRUE` - `temperature` left at **provider default** - Gemini’s structured reasoning is stored internally for bias testing This yields lightweight reasoning comparable to Anthropic’s enabled mode. --- ## 7.3 OpenAI (gpt-4.1, gpt-4o, gpt-5.1) OpenAI supports two distinct APIs: 1. **`chat.completions`** — standard inference 2. **`responses`** — reasoning-enabled (formerly “Chain of Thought” via `o-series`) ### `thinking = "no_thinking"` Used for **all models**, including gpt-5.1: - Endpoint: `chat.completions` - `temperature = 0` - No reasoning traces - Most **deterministic mode**, ideal for repeatable scoring ### `thinking = "with_thinking"` (gpt-5.1 only) - Endpoint: `responses` - `reasoning = "low"` - `include_thoughts = TRUE` - No explicit `temperature` parameter (OpenAI ignores it for this endpoint) This mode returns reasoning metadata that is stripped prior to analysis. ## 7.4 TogetherAI (Deepseek-R1, Deepseek-V3, Kimi-K2, Qwen3) For Together.ai we ran positional-bias experiments using the Chat Completions API (/v1/chat/completions) for the following models: - "deepseek-ai/DeepSeek-R1" - "deepseek-ai/DeepSeek-V3" - "moonshotai/Kimi-K2-Instruct-0905" - "Qwen/Qwen3-235B-A22B-Instruct-2507-tput" DeepSeek-R1 emits internal reasoning wrapped in tags. DeepSeek-V3, Kimi-K2, and Qwen3 do not have a separate reasoning switch; any “thinking” they do is part of their standard text output. Temperature settings used in testing: - "deepseek-ai/DeepSeek-R1": `temperature = 0.6` - DeepSeek-V3, Kimi-K2, Qwen3: `temperature = 0.0` --- ## 7.5 Summary Table of Backend-Specific Behavior | Backend | Thinking Mode | What It Controls | Temperature Used | Notes | |-----------|----------------------|------------------|------------------|-------| | Anthropic | no_thinking | reasoning=none, no thoughts | **0** | deterministic | | Anthropic | with_thinking | reasoning enabled, thoughts included, budget=1024 | **1** | rich internal reasoning | | Gemini | with_thinking only | thinkingLevel="low", includeThoughts | provider default | batch API does not support pure no-thinking mode | | OpenAI | no_thinking | chat.completions, no reasoning | **0** | deterministic | | OpenAI | with_thinking (5.1) | responses API with reasoning=low | ignored / N/A | only applied to gpt-5.1 | | Together | with_thinking | Chat Completions with `` extracted to `thoughts` | **0.6** (default) | internal reasoning always on; visible answer in `content` | | Together | no_thinking | Chat Completions, no explicit reasoning toggle | **0** | reasoning not supported in these specific models | --- # 8. Loading summary results The results from the experiments are stored in a CSV included in the package (for example, under `inst/extdata/template_test_summary_all.csv`). We load and lightly clean that file here. 
```{r} summary_path <- system.file("extdata", "template_test_summary_all.csv", package = "pairwiseLLM") if (!nzchar(summary_path)) stop("Data file not found in installed package.") summary_tbl <- readr::read_csv(summary_path, show_col_types = FALSE) head(summary_tbl) ``` --- ## 8.1 Column definitions The columns in `summary_tbl` are: - **`template_id`** ID of the prompt template (e.g., `"test1"`). - **`backend`** LLM backend (`"anthropic"`, `"gemini"`, `"openai"`, `"together"`). - **`model`** Specific model (e.g., `"claude-sonnet-4-5"`, `"gpt-4o"`, `"gemini-3-pro-preview"`). - **`thinking`** Reasoning configuration (usually `"no_thinking"` or `"with_thinking"`). The exact meaning depends on the provider and dev script (for example, reasoning turned on vs off, or thinking-level settings for Gemini). - **`prop_consistent`** Proportion of comparisons that remained consistent when the pair order was reversed. Higher values indicate greater order-invariance. - **`prop_pos1`** Proportion of comparisons where SAMPLE\_1 was chosen as better. Values near 0.5 indicate little or no positional bias toward the first position. - **`p_sample1_overall`** p-value from a binomial test of whether the probability of choosing SAMPLE\_1 differs from 0.5. Smaller p-values suggest that the observed preference (for or against SAMPLE\_1) is unlikely to be due to chance alone. --- ## 8.2 Interpreting the statistics The three key statistics for each (template, provider, model, thinking) combination are: 1. **Proportion consistent (`prop_consistent`)** - Measures how often the underlying winner remains the same when a pair is presented forward vs reversed. - Values close to 1 indicate strong order-invariance. - In practice, values above roughly 0.90 are generally reassuring. 2. **Proportion choosing SAMPLE\_1 (`prop_pos1`)** - Measures how often the model selects the first position as better. - A value near 0.5 suggests little or no positional bias. - Values substantially above 0.5 suggest a systematic preference for SAMPLE\_1; values substantially below 0.5 suggest a preference for SAMPLE\_2. 3. **Binomial test p-value (`p_sample1_overall`)** - Tests the null hypothesis that the true probability of choosing SAMPLE\_1 is 0.5. - Small p-values (e.g., < 0.05) provide evidence of positional bias. - Large p-values indicate that any deviation from 0.5 may be due to random variation. As an example, a row with: - `prop_consistent = 0.93` - `prop_pos1 = 0.48` - `p_sample1_overall = 0.57` suggests: - Very high reverse-order consistency. - No strong evidence of a first-position bias (probability of choosing SAMPLE\_1 is not significantly different from 0.5). By contrast, a row with: - `prop_consistent = 0.83` - `prop_pos1 = 0.42` - `p_sample1_overall = 0.001` would suggest: - Somewhat lower consistency. - A statistically significant bias _against_ SAMPLE\_1 (the model prefers SAMPLE\_2). --- # 9. Summary results by prompt In this section we present, for each template: 1. The full template text (as used in the experiments). 2. 
A simple summary table with one row per (backend, model, thinking) configuration and columns: - `Backend` - `Model` - `Thinking` - `Prop_Consistent` - `Prop_SAMPLE_1` - `Binomial_Test_p` --- ## 9.1 Template `test1` ### 9.1.1 Template text ```{r} cat(get_prompt_template("test1")) ``` ### 9.1.2 Summary table ```{r} summary_tbl |> filter(template_id == "test1") |> arrange(backend, model, thinking) |> mutate( Prop_Consistent = round(prop_consistent, 3), Prop_SAMPLE_1 = round(prop_pos1, 3), Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3) ) |> select( Backend = backend, Model = model, Thinking = thinking, Prop_Consistent, Prop_SAMPLE_1, Binomial_Test_p ) |> kable( align = c("l", "l", "l", "r", "r", "r") ) ``` --- ## 9.2 Template `test2` ### 9.2.1 Template text ```{r} cat(get_prompt_template("test2")) ``` ### 9.2.2 Summary table ```{r} summary_tbl |> filter(template_id == "test2") |> arrange(backend, model, thinking) |> mutate( Prop_Consistent = round(prop_consistent, 3), Prop_SAMPLE_1 = round(prop_pos1, 3), Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3) ) |> select( Backend = backend, Model = model, Thinking = thinking, Prop_Consistent, Prop_SAMPLE_1, Binomial_Test_p ) |> kable( align = c("l", "l", "l", "r", "r", "r") ) ``` --- ## 9.3 Template `test3` ### 9.3.1 Template text ```{r} cat(get_prompt_template("test3")) ``` ### 9.3.2 Summary table ```{r} summary_tbl |> filter(template_id == "test3") |> arrange(backend, model, thinking) |> mutate( Prop_Consistent = round(prop_consistent, 3), Prop_SAMPLE_1 = round(prop_pos1, 3), Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3) ) |> select( Backend = backend, Model = model, Thinking = thinking, Prop_Consistent, Prop_SAMPLE_1, Binomial_Test_p ) |> kable( align = c("l", "l", "l", "r", "r", "r") ) ``` --- ## 9.4 Template `test4` ### 9.4.1 Template text ```{r} cat(get_prompt_template("test4")) ``` ### 9.4.2 Summary table ```{r} summary_tbl |> filter(template_id == "test4") |> arrange(backend, model, thinking) |> mutate( Prop_Consistent = round(prop_consistent, 3), Prop_SAMPLE_1 = round(prop_pos1, 3), Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3) ) |> select( Backend = backend, Model = model, Thinking = thinking, Prop_Consistent, Prop_SAMPLE_1, Binomial_Test_p ) |> kable( align = c("l", "l", "l", "r", "r", "r") ) ``` --- ## 9.5 Template `test5` ### 9.5.1 Template text ```{r} cat(get_prompt_template("test5")) ``` ### 9.5.2 Summary table ```{r} summary_tbl |> filter(template_id == "test5") |> arrange(backend, model, thinking) |> mutate( Prop_Consistent = round(prop_consistent, 3), Prop_SAMPLE_1 = round(prop_pos1, 3), Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3) ) |> select( Backend = backend, Model = model, Thinking = thinking, Prop_Consistent, Prop_SAMPLE_1, Binomial_Test_p ) |> kable( align = c("l", "l", "l", "r", "r", "r") ) ``` --- # 10. Per-backend summary It is often useful to examine positional-bias metrics **within each backend** to see whether: - certain models exhibit more positional bias than others, - reasoning mode makes a difference, - a backend shows overall higher or lower reverse-order consistency. 
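Before drilling into the per-provider tables, it can be useful to average the metrics by backend. The following is a minimal dplyr sketch (not evaluated here) that assumes only the `summary_tbl` columns defined in Section 8.1:

```r
# Average positional-bias metrics within each backend
summary_tbl |>
  group_by(backend) |>
  summarise(
    n_configs        = n(),
    mean_consistency = mean(prop_consistent, na.rm = TRUE),
    mean_prop_pos1   = mean(prop_pos1, na.rm = TRUE)
  ) |>
  arrange(desc(mean_consistency))
```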
The tables below show, for each provider, the key statistics: - **Prop_Consistent** — proportion of consistent decisions under pair reversal - **Prop_SAMPLE_1** — proportion of comparisons selecting SAMPLE_1 - **Binomial_Test_p** — significance level for deviation from 0.5 Each row corresponds to a (template, model, thinking) configuration used in testing. --- ## 10.1 Anthropic models ```{r} summary_tbl |> filter(backend == "anthropic") |> arrange(template_id, model, thinking) |> mutate( Prop_Consistent = round(prop_consistent, 3), Prop_SAMPLE_1 = round(prop_pos1, 3), Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3) ) |> select( Template = template_id, Model = model, Thinking = thinking, Prop_Consistent, Prop_SAMPLE_1, Binomial_Test_p ) |> kable( caption = "Anthropic: Positional-bias summary by template, model, and thinking configuration.", align = c("l", "l", "l", "r", "r", "r") ) ``` --- ## 10.2 Gemini models ```{r} summary_tbl |> filter(backend == "gemini") |> arrange(template_id, model, thinking) |> mutate( Prop_Consistent = round(prop_consistent, 3), Prop_SAMPLE_1 = round(prop_pos1, 3), Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3) ) |> select( Template = template_id, Model = model, Thinking = thinking, Prop_Consistent, Prop_SAMPLE_1, Binomial_Test_p ) |> kable( caption = "Gemini: Positional-bias summary by template, model, and thinking configuration.", align = c("l", "l", "l", "r", "r", "r") ) ``` --- ## 10.3 OpenAI models ```{r} summary_tbl |> filter(backend == "openai") |> arrange(template_id, model, thinking) |> mutate( Prop_Consistent = round(prop_consistent, 3), Prop_SAMPLE_1 = round(prop_pos1, 3), Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3) ) |> select( Template = template_id, Model = model, Thinking = thinking, Prop_Consistent, Prop_SAMPLE_1, Binomial_Test_p ) |> kable( caption = "OpenAI: Positional-bias summary by template, model, and thinking configuration.", align = c("l", "l", "l", "r", "r", "r") ) ``` --- ## 10.4 TogetherAI-hosted models ```{r} summary_tbl |> filter(backend == "together") |> arrange(template_id, model, thinking) |> mutate( Prop_Consistent = round(prop_consistent, 3), Prop_SAMPLE_1 = round(prop_pos1, 3), Binomial_Test_p = formatC(p_sample1_overall, format = "f", digits = 3) ) |> select( Template = template_id, Model = model, Thinking = thinking, Prop_Consistent, Prop_SAMPLE_1, Binomial_Test_p ) |> kable( caption = "TogetherAI: Positional-bias summary by template, model, and thinking configuration.", align = c("l", "l", "l", "r", "r", "r") ) ``` --- # 11. Applying this workflow to new templates To evaluate new prompt templates on your own data: 1. **Add the templates** - Create text files under `inst/templates/` (or wherever your registry expects them). - Register them so that `get_prompt_template("my_new_template")` works. 2. **Update the dev script** - Modify `template_ids` in your dev script to include the new IDs. - Re-run the dev script that submits batch jobs and polls for results (for example, a variant of `dev-anthropic-gemini-template-ab-test.R` and/or `dev-openai-template-ab-test.R`). # 12. Conclusion This vignette demonstrates a reproducible workflow for detecting and quantifying positional bias in prompt templates. Including the template text and summary statistics side by side allows rapid inspection and informed template selection. 
Templates that show: - consistently high `Prop_Consistent` (e.g., ≥ 0.90) across providers and models, and - `Prop_SAMPLE_1` close to 0.5 with non-significant `Binomial_Test_p` are strong candidates for production scoring pipelines in `pairwiseLLM`. # 13. Citation > Mercer, S. (2025). *Prompt Template Positional Bias Testing* (Version 1.0.0) [R package vignette]. In *pairwiseLLM: Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation*. https://shmercer.github.io/pairwiseLLM/