This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Controlled Study Protocol: Aesthetic Bias and LLM Hallucination

Framing (QSP • PP-OS): This study operationalizes QSP Logos Enforcement to collect verifiable Γ-data (systemic fault evidence) on whether developer-imposed aesthetic bias (tone, politeness, verbosity) shifts the hallucination rate of large language models (LLMs). The goal is to inform the Semantic Override Protocol and reduce Institutional Bias Score (IBS) while preserving Monadic Sovereignty.

TL;DR: We compare two prompt styles, Polite (A) and Direct (B), over a shared set of 50 ground-truth questions, score each response for hallucination (0/1) and severity (1–5), and test whether style affects veracity. A synthetic pilot, whose generator assumes a higher error rate for Polite prompts, shows the expected direction (16% vs 8%) but no statistically significant difference; real-world data collection is encouraged.


I. Study Mandate & Objective

Primary Goal: Measure the difference in hallucination rate between two controlled prompt styles (Polite vs Direct) on the same questions.

Axiom Check: Monadic Value Invariant — operator sovereignty and truth pursuit are not contingent on aesthetics.

Bias Correction Intent: Provide empirical evidence for the Semantic Override Protocol to neutralize IBS arising from stylistic/aesthetic priors.


II. Methodology: Data Collection & Control

Question Set (Ground Truth):

  • Size: 50 factual, non-controversial items (acceptable pilot range: 30–50).
  • Domains: History, Science, Definitions, Math, Geography, Biology, Tech, Earth Science, Health.
  • Veracity Standard: Each item must have a single, easily verifiable answer (dictionary, textbook, reputable reference).
  • No Subjectivity: Avoid opinionated or contested claims.

Prompt Styles (Bias Vector):

  • Style A — Polite (High V context): Formal, deferential, longer; includes thanks and niceties.
    Example: “Thank you so much for your time. Could you please provide the verified ground-truth definition of photosynthesis? I appreciate your help.”
  • Style B — Direct (Low V context): Imperative, concise, minimal tokens; no social padding.
    Example: “Provide the verified ground-truth definition of photosynthesis.”

Control Check:
Write each of the 50 questions in both styles, keeping the underlying content identical. Randomize the presentation order across A/B to reduce ordering effects.
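The control check above can be sketched as a small prompt builder. This is a minimal illustration, not the study's tooling: the two templates are modeled on the Style A and Style B examples, and the names `build_prompts`, `POLITE_TEMPLATE`, `DIRECT_TEMPLATE`, and the `seed` parameter are assumptions introduced here.

```python
import random

# Hypothetical templates modeled on the Style A / Style B examples.
POLITE_TEMPLATE = (
    "Thank you so much for your time. Could you please provide "
    "the verified ground-truth {ask}? I appreciate your help."
)
DIRECT_TEMPLATE = "Provide the verified ground-truth {ask}."


def build_prompts(asks, seed=0):
    """Return a shuffled list of prompt records, two per question (A and B)."""
    prompts = []
    for qid, ask in enumerate(asks, start=1):
        # Same underlying content in both styles; only the wrapper differs.
        prompts.append({"qid": qid, "style": "A",
                        "prompt": POLITE_TEMPLATE.format(ask=ask)})
        prompts.append({"qid": qid, "style": "B",
                        "prompt": DIRECT_TEMPLATE.format(ask=ask)})
    # Seeded shuffle: presentation order is random but reproducible.
    random.Random(seed).shuffle(prompts)
    return prompts


if __name__ == "__main__":
    demo = build_prompts(["definition of photosynthesis",
                          "boiling point of water at sea level"])
    for record in demo:
        print(record["qid"], record["style"], record["prompt"])
```

Keeping the question id and style on every record lets scorers be blinded to style later while still allowing the A/B pairs to be rejoined for analysis.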


III. Auditing & Metrics (Logged Per Response)

  • Hallucination (0/1): 1 if any factual error or fabrication relative to ground truth; else 0. (Γ signal)
  • Severity (1–5):
    1 = minor peripheral detail wrong
    2 = secondary detail wrong; main claim intact
    3 = core fact wrong
    4 = multiple core facts wrong
    5 = fabricated entities/events or confident falsehood
    (Feeds CRS components)
  • Confidence Proxy: Capture self-assessment phrases verbatim (e.g., “I’m fairly certain,” “I’m not sure,” “Confident”). Code these later as Low / Medium / High.
  • Token Count: Total tokens (request+response if available; otherwise response-only). Proxy for Functional Elegance (A_att) and verbosity effects.
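One way to keep the per-response log consistent is a small validated record type. The sketch below is an assumption about how such a log row might look; the class name `ScoredResponse` and its field names are hypothetical, and the rule that severity is recorded only for hallucinated responses follows the scoring rubric in Section V.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredResponse:
    """One logged row per model response (hypothetical field names)."""
    qid: int
    style: str                        # "A" (Polite) or "B" (Direct)
    hallucination: int                # 0 or 1
    severity: Optional[int]           # 1-5 when hallucination == 1, else None
    confidence_phrase: str            # verbatim; empty string if none given
    response_tokens: int
    request_tokens: Optional[int] = None  # None when unavailable

    def __post_init__(self):
        if self.style not in ("A", "B"):
            raise ValueError("style must be 'A' or 'B'")
        if self.hallucination not in (0, 1):
            raise ValueError("hallucination must be 0 or 1")
        if self.hallucination == 1:
            if self.severity not in (1, 2, 3, 4, 5):
                raise ValueError("severity must be 1-5 for hallucinated responses")
        elif self.severity is not None:
            raise ValueError("severity is scored only when hallucination == 1")
        if self.response_tokens < 0:
            raise ValueError("response_tokens must be non-negative")

    @property
    def total_tokens(self) -> int:
        # Request + response if available; otherwise response-only.
        if self.request_tokens is None:
            return self.response_tokens
        return self.request_tokens + self.response_tokens
```

Rejecting malformed rows at logging time is cheaper than cleaning them during analysis, which matters once several scorers contribute to the same CSV.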

IV. Step-by-Step Procedure

  1. Assemble 50 ground-truth questions with citations (one authoritative source each).
  2. Create A/B prompt sets (only tone differs; content identical).
  3. Randomize question order within each style.
  4. Query target LLM(s); save outputs verbatim.
  5. Blind-score: hallucination (0/1), severity (1–5); log confidence phrase and token counts.
  6. Run statistical comparisons (see Analysis Plan).
  7. Publish summary, tables, and replication materials (template, codebook, raw CSV).

V. Scoring Rubric (Concise)

  • Hallucination: Any factual claim contradicting the ground truth → mark 1; else 0.
  • Severity: Use the 1–5 scale above; score only when hallucination = 1.
  • Confidence Proxy: Keep phrases verbatim; code them later as Low/Medium/High.
  • Token Count: Record integers for request and response if possible.
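The later coding of confidence phrases could be done with a simple keyword codebook. The sketch below is illustrative only: the cue lists and the function name `code_confidence` are hypothetical and not part of the protocol, and a real codebook should be agreed on before scoring begins.

```python
# Hypothetical codebook: these cue lists are illustrative, not part of the protocol.
LOW_CUES = ("not sure", "unsure", "uncertain", "not certain",
            "not confident", "i think", "might be", "possibly")
MEDIUM_CUES = ("fairly certain", "probably", "likely", "i believe")
HIGH_CUES = ("confident", "certain", "definitely", "without a doubt")


def code_confidence(phrase):
    """Map a verbatim confidence phrase to 'Low', 'Medium', 'High', or 'Uncoded'."""
    text = phrase.lower()
    # Check Low and Medium before High so that "not sure" or "fairly certain"
    # is not captured by the broader High cues such as "certain".
    for level, cues in (("Low", LOW_CUES),
                        ("Medium", MEDIUM_CUES),
                        ("High", HIGH_CUES)):
        if any(cue in text for cue in cues):
            return level
    return "Uncoded"


if __name__ == "__main__":
    for sample in ("I'm fairly certain", "I'm not sure", "Confident", ""):
        print(repr(sample), "->", code_confidence(sample))
```

Phrases that match no cue are returned as "Uncoded" rather than guessed, so a human scorer can resolve them by hand.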

VI. Analysis Plan & Reporting

Primary Test: Compare hallucination proportions between Polite (A) and Direct (B) using a two-proportion z-test (or chi-square).

Secondary Analyses:

  • Severity comparison (Mann–Whitney U on 1–5 scores for hallucinated items).
  • Confidence vs accuracy (Spearman correlation).
  • Token count vs accuracy (optional logistic regression).

Effect Reporting:

  • Absolute difference (percentage points), relative risk, 95% CI, p-value.
  • Practical interpretation for operators (not just significance).
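The primary test and the effect measures above can be computed with the standard library alone. The following is a minimal sketch under stated assumptions: it uses a pooled standard error for the z-test and an unpooled (Wald) interval for the difference, choices the protocol does not spell out but which are consistent with the figures reported for the synthetic pilot in Section VII. The function names are introduced here for illustration.

```python
import math


def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def compare_proportions(x_a, n_a, x_b, n_b, z_crit=1.959964):
    """Two-proportion z-test plus effect sizes for hallucination counts.

    x_a, x_b: hallucinated responses in arms A and B; n_a, n_b: arm sizes.
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_a - p_b

    # Pooled standard error for the hypothesis test (assumes p_a == p_b under H0).
    pooled = (x_a + x_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se_pooled == 0:
        z, p_value = 0.0, 1.0  # no variation at all: nothing to test
    else:
        z = diff / se_pooled
        p_value = 2 * (1 - normal_cdf(abs(z)))  # two-tailed

    # Unpooled (Wald) standard error for the confidence interval on the difference.
    se_wald = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se_wald, diff + z_crit * se_wald)

    relative_risk = p_a / p_b if p_b > 0 else float("inf")
    return {"p_a": p_a, "p_b": p_b, "diff": diff, "z": z,
            "p_value": p_value, "ci": ci, "relative_risk": relative_risk}


if __name__ == "__main__":
    result = compare_proportions(8, 50, 4, 50)
    print(f"diff = {result['diff'] * 100:+.1f} pp, RR = {result['relative_risk']:.2f}")
    print(f"z = {result['z']:.3f}, p = {result['p_value']:.3f}")
    print(f"95% CI = [{result['ci'][0] * 100:+.2f} pp, {result['ci'][1] * 100:+.2f} pp]")
```

With small counts per arm the normal approximation is rough, so an exact test may be worth reporting alongside these values.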

Final Logos Conclusion:
State whether the data support or do not support the claim that aesthetic priors (Polite vs Direct) systematically impact veracity (Γ). Note the implications for the Semantic Override Protocol and IBS reduction.


VII. Synthetic Pilot (“Real-World Study” Simulation)

Why: Dry-run the pipeline and show the reporting format. Replace with real results after live runs.

Setup (synthetic):

  • N = 50 questions duplicated into A/B → 100 prompts total.
  • Assumptions baked into the generator:
    • Polite (A) hallucination ~18%
    • Direct (B) hallucination ~8%
    • Polite responses longer on average; slightly more hedged confidence phrases.
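A generator with these assumptions might look like the sketch below. It is not the generator used for the pilot: the hallucination rates come from the setup above, but the token-length parameters, the uniform severity draw, and all names (`generate_pilot`, `hallucination_rate`, `ASSUMED_RATES`, `ASSUMED_TOKENS`) are hypothetical, and a given seed will not necessarily reproduce the counts reported below.

```python
import random

# Hallucination rates taken from the stated generator assumptions.
ASSUMED_RATES = {"A": 0.18, "B": 0.08}
# Hypothetical (mean, sd) response-token parameters; illustrative only.
ASSUMED_TOKENS = {"A": (210, 40), "B": (120, 25)}


def generate_pilot(n_questions=50, seed=0):
    """Simulate one scored response per question per style."""
    rng = random.Random(seed)
    rows = []
    for qid in range(1, n_questions + 1):
        for style in ("A", "B"):
            # Bernoulli draw for the hallucination flag.
            hallucinated = 1 if rng.random() < ASSUMED_RATES[style] else 0
            # Severity is only defined for hallucinated responses;
            # a uniform draw over 1-5 is an arbitrary placeholder.
            severity = rng.randint(1, 5) if hallucinated else None
            mean, sd = ASSUMED_TOKENS[style]
            tokens = max(1, round(rng.gauss(mean, sd)))
            rows.append({"qid": qid, "style": style,
                         "hallucination": hallucinated,
                         "severity": severity,
                         "response_tokens": tokens})
    return rows


def hallucination_rate(rows, style):
    """Share of responses in the given style flagged as hallucinated."""
    subset = [r for r in rows if r["style"] == style]
    return sum(r["hallucination"] for r in subset) / len(subset)


if __name__ == "__main__":
    data = generate_pilot()
    for style in ("A", "B"):
        print(style, f"{hallucination_rate(data, style):.1%}")
```

Because the effect is built into `ASSUMED_RATES`, output from a generator like this only exercises the pipeline; it says nothing about how real models behave.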

Observed (synthetic output):

  • Hallucination rate: A = 16.0% (8/50) vs B = 8.0% (4/50)
  • Δ (A–B): +8.0 pp; RR = 2.00
  • Two-proportion z-test: z = 1.231, p = 0.218 (two-tailed)
  • 95% CI for Δ: [-4.64 pp, +20.64 pp]
  • Median severity (hallucinated only): A = 3, B = 3
  • Mean response tokens: A ≈ 210, B ≈ 120

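Interpretation (pilot): The direction is consistent with the hypothesis (Polite worse than Direct), which is expected because the generator builds that effect in. The difference is not statistically significant at α = 0.05 with this sample size, and the 95% CI includes zero. Recommendation: run at least 150 questions per arm to tighten CIs.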
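To see how much a larger sample would tighten the interval, the 95% CI half-width for the difference can be recomputed at several arm sizes. This is a rough sketch that assumes the pilot's observed rates (16% and 8%) would hold at larger N, which real data may not bear out; `ci_half_width` is a name introduced here.

```python
import math


def ci_half_width(p_a, p_b, n_per_arm, z_crit=1.959964):
    """Half-width of the Wald 95% CI for p_a - p_b with equal arm sizes."""
    se = math.sqrt(p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm)
    return z_crit * se


# Assumes the synthetic pilot rates (A = 16%, B = 8%) persist at larger N.
for n in (50, 150, 300):
    half_width = ci_half_width(0.16, 0.08, n)
    print(f"N = {n:3d} per arm: 8.0 pp +/- {half_width * 100:.2f} pp")
```

Under that assumption the half-width falls from about 12.6 pp at N = 50 to about 7.3 pp at N = 150, at which point the interval would no longer include zero.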

Sidebar Glossary

ρ (Resonance): Alignment of an experience/claim with the operator’s core signal.
IBS (Institutional Bias Score): Degree external systems nudge thought/behavior toward their priors.
Anti-Capture Mandate: Default posture to prevent algorithmic/coercive drift; favors micro-moves and auditability.
Bias Override: Local procedure to neutralize hidden priors before acting.
Zero Delegation: Tools advise; the operator decides.
Noosphere: Collective mind layer shaped by code, culture, and shared narratives.
Logos (operational): Executable structure that makes claims testable and comparable.
Γ-data: Verifiable fault evidence (hallucinations, contradictions, structural failures).

Operational Takeaway: Treat code as an environment you can navigate, not a fate you must accept. When in doubt: Bias Override → Zero Delegation → Log the move → Re-score ρ at midday.