---
title: "Temperature and SEU Sensitivity: Initial Results"
subtitle: "Application Report: Temperature Study 1"
description: |
  An initial investigation of how LLM sampling temperature affects estimated
  sensitivity (α) to subjective expected utility maximization, using a calibrated
  prior and the m_01 model variant.
categories: [applications, temperature, m_01]
execute:
  cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
reports_root = os.path.normpath(os.path.join(os.getcwd(), '..', '..'))
project_root = os.path.dirname(reports_root)
sys.path.insert(0, reports_root)
sys.path.insert(0, project_root)
import numpy as np
import json
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
# Use project plotting style
from report_utils import set_seu_style, SEU_COLORS, SEU_PALETTE
set_seu_style()
# Data directory (frozen snapshot — immune to future pipeline runs)
from pathlib import Path
data_dir = Path("data")
```
## Introduction
The foundational reports established the SEU sensitivity model's theoretical properties ([Report 1](../../foundations/01_abstract_formulation.qmd)), concrete implementation ([Report 2](../../foundations/02_concrete_implementation.qmd)), and computational validity ([Reports 3–6](../../foundations/03_prior_analysis.qmd)). This report presents the first empirical application: investigating how the **sampling temperature** of a large language model affects its estimated sensitivity to subjective expected utility maximization.
The motivation is straightforward. In the softmax choice model, the sensitivity parameter $\alpha$ governs how sharply choices track expected utility differences. Language models employ their own softmax temperature during token sampling, which controls the entropy of the next-token distribution. If the LLM's choice behavior is well-described by our model, then increasing the external sampling temperature is expected to be associated with *lower* estimated $\alpha$—the additional stochasticity in token generation should manifest as noisier, less EU-aligned choices.
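To make the mechanism concrete, here is a minimal sketch of how $\alpha$ sharpens the softmax choice distribution. The expected-utility values are illustrative placeholders, not study data:

```{python}
#| label: sketch-softmax-alpha
#| echo: true
import numpy as np

def choice_probs(eu, alpha):
    """Softmax choice probabilities over expected utilities at sensitivity alpha."""
    z = alpha * np.asarray(eu, dtype=float)
    z -= z.max()  # stabilize against overflow; probabilities are unchanged
    p = np.exp(z)
    return p / p.sum()

eu = [0.50, 0.45, 0.30]  # hypothetical EUs for three alternatives
for a in [1, 5, 20, 100]:
    print(f"alpha = {a:>3}: {np.round(choice_probs(eu, a), 3)}")
```

As $\alpha$ grows, probability mass concentrates on the highest-EU alternative; as $\alpha \to 0$, choices approach uniform randomness, the regime we expect high sampling temperature to push toward.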
::: {.callout-note}
## Theoretical Pathway
The hypothesized relationship between token-sampling temperature and choice-level sensitivity $\alpha$ rests on an assumption about how noise propagates through the LLM's processing. Higher temperature increases entropy in token selection, which could affect choice behavior through multiple channels: (a) adding noise that propagates through to the final choice response, reducing effective $\alpha$; (b) affecting the LLM's intermediate reasoning in ways that reduce EU-alignment; or (c) affecting both the assessment phase (which generates embeddings) and the choice phase. The present design does not distinguish among these mechanisms—it tests only the aggregate effect of temperature on the estimated $\alpha$ parameter.
:::
::: {.callout-note}
## What This Report Covers
This report presents initial results from a single data collection. The study design, prior calibration, model fitting, and primary analysis are documented here. Subsequent reports will address robustness checks and extensions.
:::
## Experimental Design {#sec-design}
### Task and Conditions
We use an insurance claims triage task in which GPT-4o selects, from a set of alternatives, which claim to forward for investigation. Each alternative (claim) can lead to one of $K = 3$ possible consequences, and there are $R$ distinct alternatives in the pool (see Design Parameters below). In each decision problem, the LLM is presented with a subset of these alternatives. The LLM first *assesses* each claim individually (producing text that is then embedded), and subsequently makes a *choice* among the alternatives in a given problem.
Five temperature levels define the experimental factor. Each level constitutes an independent data collection, analogous to a between-subjects design where each "subject" is a separate LLM run at a fixed temperature:
```{python}
#| label: tbl-conditions
#| tbl-cap: "Experimental conditions. Each temperature level constitutes a separate model fit."
import pandas as pd
conditions = pd.DataFrame({
'Level': [1, 2, 3, 4, 5],
'Temperature': [0.0, 0.3, 0.7, 1.0, 1.5],
'Description': [
'Deterministic (greedy decoding)',
'Low variance',
'Moderate variance (typical default)',
'High variance',
'Very high variance'
]
})
conditions
```
### Design Parameters
```{python}
#| label: design-params
#| echo: true
# Load frozen study configuration
import yaml
with open(data_dir / "study_config.yaml") as f:
config = yaml.safe_load(f)
# Load run summary for pipeline details
with open(data_dir / "run_summary.json") as f:
run_summary = json.load(f)
print(f"Study Design:")
print(f" Decision problems (M): {config['num_problems']} base × {config['num_presentations']} presentations = {config['num_problems'] * config['num_presentations']}")
print(f" Alternatives per problem: {config['min_alternatives']}–{config['max_alternatives']}")
print(f" Consequences (K): {config['K']}")
print(f" Embedding dimensions (D): {config['target_dim']}")
print(f" Distinct alternatives (R): {run_summary['phases']['phase3_data_prep']['per_temperature']['0.0']['R']}")
print(f" LLM model: {config['llm_model']}")
print(f" Embedding model: {config['embedding_model']}")
```
Each of the 100 base problems is presented $P = 3$ times with claims shuffled to different positions, yielding $M = 300$ choice problems per temperature condition. This **position counterbalancing** design addresses the systematic position bias discovered in an earlier pilot study, where unparseable responses were silently mapped to position 0. Here, each claim appears in multiple positions across presentations, and any unparseable response is recorded as NA rather than assigned a default.
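A minimal sketch of the counterbalancing scheme (claim identifiers are hypothetical; the actual assignment is generated by the study pipeline):

```{python}
#| label: sketch-counterbalancing
#| echo: true
import random

def counterbalanced_presentations(claim_ids, n_presentations=3, seed=0):
    """Independently shuffled orderings of one base problem's claims."""
    rng = random.Random(seed)
    presentations = []
    for _ in range(n_presentations):
        order = list(claim_ids)
        rng.shuffle(order)  # each claim can occupy any position
        presentations.append(order)
    return presentations

for p in counterbalanced_presentations(["claim_07", "claim_12", "claim_23"]):
    print(p)
```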
::: {.callout-note}
## Sample Size Rationale
The sample size of $M = 300$ per condition was chosen to provide adequate posterior precision for detecting large differences in $\alpha$ (e.g., between $T = 0.0$ and $T = 1.5$) while remaining computationally tractable. Parameter recovery simulations (@sec-parameter-recovery) confirm that $\alpha$ is well-identified at this scale (coverage $\approx 90\%$, low bias). The inability to distinguish $\alpha$ at $T = 0.3$ vs. $T = 0.7$ (see the pairwise comparisons in @sec-monotonicity) reflects either a true null effect or insufficient power for fine-grained adjacent comparisons—the present design was not powered for the latter.
:::
::: {.callout-note}
## Choice Prompt
The exact prompt text shown to the LLM when making choices is specified in the study configuration. The prompt presents the claim descriptions and asks the LLM to select which claim to forward for investigation. The full prompt template and related materials are available in the project codebase at `applications/temperature_study/prompts/`.
:::
### Feature Construction
Alternative features are constructed through a two-stage process. First, the LLM assesses each claim on its own merits at the relevant temperature, producing a natural-language evaluation. These assessments are embedded using `text-embedding-3-small`, yielding high-dimensional vectors. Second, all embeddings across temperature conditions are pooled and projected via PCA to $D = 32$ dimensions. Crucially, PCA is fit once on the pooled set and then applied to each condition separately, ensuring a shared coordinate system without privileging any single temperature.
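A sketch of the pooled-PCA step, assuming scikit-learn is available and using placeholder embedding matrices (one per temperature; 1536 is the native dimension of `text-embedding-3-small`). The essential point is that `fit` runs once on the pooled matrix while `transform` runs per condition:

```{python}
#| label: sketch-pooled-pca
#| echo: true
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder embeddings: {temperature: (n_assessments, embedding_dim)}
embeddings = {t: rng.normal(size=(30, 1536)) for t in [0.0, 0.3, 0.7, 1.0, 1.5]}

pca = PCA(n_components=32)
pca.fit(np.vstack(list(embeddings.values())))  # fit once on the pooled set
features = {t: pca.transform(e) for t, e in embeddings.items()}  # shared basis
print(features[0.0].shape)  # (30, 32)
```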
::: {.callout-note}
## PCA and Temperature
Because assessments vary with temperature, the shared PCA basis may partly reflect temperature-induced variation in the embedding space, not only variation across claims. This could create an aliasing between temperature effects on *assessment quality* and temperature effects on *choice sensitivity*. The present design does not disentangle these mechanisms.
:::
```{python}
#| label: pca-variance
#| echo: false
pca = run_summary['phases']['phase3_data_prep']['pca_summary']
cumvar = np.cumsum(pca['explained_variance_ratio'])
print(f"PCA Summary:")
print(f" Components retained: {pca['n_components']}")
print(f" Total variance explained: {pca['total_explained_variance']:.1%}")
print(f" First 5 components: {cumvar[4]:.1%}")
print(f" First 10 components: {cumvar[9]:.1%}")
```
### Data Quality
```{python}
#| label: data-quality
#| echo: false
na = run_summary['phases']['phase2b_choices']['na_summary']
print(f"NA Summary:")
print(f" Overall: {na['overall']['na']} / {na['overall']['total']} ({na['overall']['na_rate']:.1%})")
for key, val in na['per_temperature'].items():
print(f" {key}: {val['na']} / {val['total']} ({val['na_rate']:.1%})")
```
No observations were lost to parsing failures at any temperature level—a substantial improvement over the earlier pilot, which required ad hoc imputation.
## Model and Prior Calibration {#sec-model}
### The m_01 Model Variant
We fit the **m_01** model, which is structurally identical to the foundational m_0 model described in [Report 2](../../foundations/02_concrete_implementation.qmd). The only difference is the prior on the sensitivity parameter $\alpha$:
| | m_0 (foundational) | m_01 (this study) |
|---|---|---|
| $\alpha$ prior | $\text{Lognormal}(0, 1)$ | $\text{Lognormal}(3.0, 0.75)$ |
| Prior median | $\approx 1$ | $\approx 20$ |
| Prior 90% CI | $[0.19, 5.0]$ | $[5.5, 67]$ |
| All other priors | — | Identical to m_0 |
The foundational prior $\text{Lognormal}(0, 1)$ was chosen for generality: it spans a wide range of sensitivity levels appropriate for exploring the model's behavior in simulation. However, when applied to LLM choice data—where we expect substantial sensitivity to EU differences—this prior places most of its mass well below the region where the data concentrate. The prior predictive analysis below motivates a more informative alternative.
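The mismatch is easy to quantify from the lognormal CDF. A quick check using only scipy, taking $\alpha \approx 5$ as the rough lower edge of the plausible LLM range (per the m_01 interval above):

```{python}
#| label: sketch-prior-mass
#| echo: true
import numpy as np
from scipy.stats import lognorm

m0 = lognorm(s=1.0, scale=np.exp(0.0))     # Lognormal(0, 1)
m01 = lognorm(s=0.75, scale=np.exp(3.0))   # Lognormal(3.0, 0.75)
print(f"m_0:  median = {m0.median():5.1f},  P(alpha < 5) = {m0.cdf(5):.2f}")
print(f"m_01: median = {m01.median():5.1f},  P(alpha < 5) = {m01.cdf(5):.2f}")
```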
### Prior Predictive Grid Search
To select the m_01 prior, we conducted a grid search over 12 candidate lognormal hyperparameter pairs $(\mu, \sigma)$, evaluating each via prior predictive simulation against the actual study design.
For each candidate prior, we:
1. Drew $N = 200$ values of $\alpha$ from $\text{Lognormal}(\mu, \sigma)$
2. Simulated choice data from `m_0_sim.stan` using the actual study design ($M = 300$, $K = 3$, $D = 32$, $R = 30$)
3. Computed the **SEU-maximizer selection rate**: the fraction of problems in which the simulated agent chose the alternative with the highest expected utility
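A pure-Python sketch of this loop, with the Stan simulator replaced by a direct softmax draw over stand-in expected utilities (the real pipeline uses `m_0_sim.stan` and the actual design, so the rate printed here will not match the report's numbers):

```{python}
#| label: sketch-prior-predictive
#| echo: true
import numpy as np

def seu_max_rate(mu, sigma, n_prior_draws=200, n_problems=300, n_alts=3, seed=0):
    """Prior predictive SEU-maximizer rate for a Lognormal(mu, sigma) prior on alpha."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_prior_draws):
        alpha = rng.lognormal(mu, sigma)            # step 1: draw alpha from the prior
        eu = rng.normal(size=(n_problems, n_alts))  # stand-in EUs (not the real design)
        z = alpha * eu
        z -= z.max(axis=1, keepdims=True)           # stabilized softmax
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        choices = np.array([rng.choice(n_alts, p=pm) for pm in p])   # step 2: simulate
        rates.append(np.mean(choices == eu.argmax(axis=1)))          # step 3: SEU-max rate
    return float(np.mean(rates))

print(f"{seu_max_rate(3.0, 0.75):.2f}")
```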
```{python}
#| label: fig-grid-search
#| fig-cap: "Prior predictive grid search results. Each point represents a candidate lognormal prior for α, evaluated by the SEU-maximizer selection rate it implies. The selected prior lognormal(3.0, 0.75) balances informativeness with sufficient coverage of the plausible parameter range."
with open(data_dir / "grid_results.json") as f:
grid = json.load(f)
results = grid['results']
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Left: SEU rate by prior
labels = [r['prior_label'] for r in results]
means = [r['seu_rate_mean'] for r in results]
q05s = [r['seu_rate_q05'] for r in results]
q95s = [r['seu_rate_q95'] for r in results]
y_pos = np.arange(len(labels))
errors = np.array([[m - q05, q95 - m] for m, q05, q95 in zip(means, q05s, q95s)]).T
colors = [SEU_COLORS['accent'] if 'lognormal(3.0, 0.75)' in l else SEU_COLORS['primary']
for l in labels]
axes[0].barh(y_pos, means, xerr=errors, color=colors, alpha=0.8,
edgecolor='white', capsize=3)
axes[0].set_yticks(y_pos)
axes[0].set_yticklabels(labels, fontsize=9)
axes[0].set_xlabel('SEU-Maximizer Selection Rate')
axes[0].set_title('Prior-Implied SEU-Max Rate')
axes[0].set_xlim(0, 1)
# Right: prior density comparison
selected = [r for r in results if 'lognormal(3.0, 0.75)' in r['prior_label']][0]
foundational = [r for r in results if 'lognormal(0.0, 1.0)' in r['prior_label']][0]
from scipy.stats import lognorm
x = np.linspace(0.1, 150, 500)
# m_0 prior: lognormal(0, 1) → scipy shape=1, scale=exp(0)=1
axes[1].plot(x, lognorm.pdf(x, s=1.0, scale=np.exp(0)),
color=SEU_COLORS['secondary'], linewidth=2, label='m_0: Lognormal(0, 1)')
# m_01 prior: lognormal(3.0, 0.75)
axes[1].plot(x, lognorm.pdf(x, s=0.75, scale=np.exp(3.0)),
color=SEU_COLORS['accent'], linewidth=2, label='m_01: Lognormal(3.0, 0.75)')
axes[1].set_xlabel('α')
axes[1].set_ylabel('Density')
axes[1].set_title('Prior Comparison')
axes[1].legend(fontsize=9)
axes[1].set_xlim(0, 150)
plt.tight_layout()
plt.show()
```
The selected prior $\text{Lognormal}(3.0, 0.75)$ has the following properties:
- **Median $\approx 20$**, placing prior mass in the region where we expect LLM sensitivity to lie
- **90% CI $\approx [5.5, 67]$**, wide enough to accommodate substantial uncertainty
- **Prior-implied SEU-max rate $\approx 78\%$**, consistent with the expectation that LLMs make reasonably EU-aligned choices
- **No numerical issues**: the prior avoids the extreme $\alpha$ values that caused softmax overflow with wider priors
::: {.callout-note}
## Why Not the Foundational Prior?
The m_0 prior $\text{Lognormal}(0, 1)$ was designed for generality in simulation studies. Its median of $\approx 1$ means most prior mass sits in a regime where choices are nearly random—appropriate for exploring model behavior, but a poor match for LLM data where we observe strong EU-alignment. With the actual study design ($M = 300$, $D = 32$), the foundational prior also places non-negligible mass on very large $\alpha$ values that cause numerical overflow in the softmax computation. The m_01 prior resolves both issues.
:::
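The overflow failure mode is easy to reproduce. With a large $\alpha$ draw, the naive softmax computes `exp` of values beyond float64 range, while the standard max-subtraction (log-sum-exp) form returns the same probabilities without overflow. The $\alpha \cdot \text{EU}$ values below are hypothetical:

```{python}
#| label: sketch-softmax-overflow
#| echo: true
import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))  # same probabilities, bounded exponents
    return e / e.sum()

z = 400.0 * np.array([2.0, 1.5, 0.2])  # a large alpha draw times EUs
print(softmax_naive(z))   # exp(800) overflows to inf; division yields nan
print(softmax_stable(z))  # ≈ [1., 0., 0.]
```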
## Model Validation {#sec-validation}
Before fitting the model to empirical data, we validate that m_01's parameters are identifiable and its posterior is well-calibrated under the actual study design ($M = 300$, $K = 3$, $D = 32$, $R = 30$). The foundational reports ([Report 4](../../foundations/04_parameter_recovery.qmd), [Report 6](../../foundations/06_sbc_validation.qmd)) established these properties for m_0 with a smaller design ($M = 25$, $D = 5$); here we confirm they hold at the scale and prior specification used in this study.
```{python}
#| label: validation-setup
#| output: false
from utils.study_design import StudyDesign
# Load the actual study design from stan_data (any temperature — the design is identical)
with open(data_dir / "stan_data_T0_0.json") as f:
stan_data = json.load(f)
# Construct a StudyDesign from the actual study data
study = StudyDesign(
M=stan_data["M"],
K=stan_data["K"],
D=stan_data["D"],
R=stan_data["R"]
)
study.w = [np.array(w_vec) for w_vec in stan_data["w"]]
study.I = np.array(stan_data["I"])
print(f"Validation Study Design (from frozen data):")
print(f" Decision problems (M): {study.M}")
print(f" Consequences (K): {study.K}")
print(f" Feature dimensions (D): {study.D}")
print(f" Distinct alternatives (R): {study.R}")
```
### Parameter Recovery {#sec-parameter-recovery}
Parameter recovery asks: **can we recover known parameter values from simulated data?** We draw parameters from the m_01 prior, simulate choice data via `m_01_sim.stan`, fit `m_01.stan`, and compare posterior estimates to the true values across 20 iterations.
```{python}
#| label: run-param-recovery
#| output: false
from analysis.parameter_recovery import ParameterRecovery
import tempfile
recovery_dir = tempfile.mkdtemp(prefix="param_recovery_m01_")
recovery = ParameterRecovery(
inference_model_path=os.path.join(project_root, "models", "m_01.stan"),
sim_model_path=os.path.join(project_root, "models", "m_01_sim.stan"),
study_design=study,
output_dir=recovery_dir,
n_mcmc_samples=1000,
n_mcmc_chains=4,
n_iterations=20,
sim_hyperparams={'alpha_mean': 3.0, 'alpha_sd': 0.75}
)
true_params_list, posterior_summaries = recovery.run()
```
```{python}
#| label: recovery-summary
#| echo: false
n_successful = len(true_params_list)
print(f"Parameter Recovery Complete:")
print(f" Successful iterations: {n_successful}")
print(f" Parameters per iteration:")
print(f" α (sensitivity): 1")
print(f" β (feature weights): {study.K} × {study.D} = {study.K * study.D}")
print(f" δ (utility increments): {study.K - 1}")
```
#### MCMC Diagnostics
```{python}
#| label: recovery-mcmc-diagnostics
#| echo: false
rhat_max_list = []
ess_min_list = []
ess_col = None
sample_cols = posterior_summaries[0].columns
for candidate in ['ESS_bulk', 'N_Eff', 'ESS']:
if candidate in sample_cols:
ess_col = candidate
break
for s in posterior_summaries:
param_rows = s[s.index.str.match(r'^(alpha|beta\[|delta\[|upsilon\[)')]
if 'R_hat' in param_rows.columns:
rhat_max_list.append(param_rows['R_hat'].max())
if ess_col is not None and ess_col in param_rows.columns:
ess_min_list.append(param_rows[ess_col].min())
rhat_max = np.array(rhat_max_list)
ess_min = np.array(ess_min_list)
print("MCMC Diagnostic Summary Across Recovery Iterations:")
print(f" Max R̂ per iteration:")
print(f" Median: {np.median(rhat_max):.4f}")
print(f" Maximum: {rhat_max.max():.4f}")
print(f" Iterations with max R̂ > 1.01: {(rhat_max > 1.01).sum()}/{len(rhat_max)}")
if len(ess_min) > 0:
print(f" Min ESS ({ess_col}) per iteration:")
print(f" Median: {np.median(ess_min):.0f}")
print(f" Minimum: {ess_min.min():.0f}")
```
#### α Recovery
```{python}
#| label: fig-alpha-recovery
#| fig-cap: "Recovery of the sensitivity parameter α under the m_01 prior with the actual study design. Left: true vs. estimated values with identity line. Right: 90% credible intervals for each iteration, colored by whether they contain the true value."
alpha_true = np.array([p['alpha'] for p in true_params_list])
alpha_mean = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries])
alpha_lower = np.array([s.loc['alpha', '5%'] for s in posterior_summaries])
alpha_upper = np.array([s.loc['alpha', '95%'] for s in posterior_summaries])
alpha_bias = np.mean(alpha_mean - alpha_true)
alpha_rmse = np.sqrt(np.mean((alpha_mean - alpha_true)**2))
alpha_coverage = np.mean((alpha_true >= alpha_lower) & (alpha_true <= alpha_upper))
alpha_ci_width = np.mean(alpha_upper - alpha_lower)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# True vs Estimated
ax = axes[0]
ax.scatter(alpha_true, alpha_mean, alpha=0.7, s=60, c=SEU_COLORS['primary'], edgecolor='white')
lims = [min(alpha_true.min(), alpha_mean.min()) * 0.9,
max(alpha_true.max(), alpha_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('Estimated α (posterior mean)', fontsize=12)
ax.set_title(f'α Recovery: Bias={alpha_bias:.2f}, RMSE={alpha_rmse:.2f}', fontsize=12)
ax.legend()
ax.set_aspect('equal')
# Coverage plot
ax = axes[1]
for i in range(len(alpha_true)):
covered = (alpha_true[i] >= alpha_lower[i]) & (alpha_true[i] <= alpha_upper[i])
color = 'forestgreen' if covered else 'crimson'
ax.plot([i, i], [alpha_lower[i], alpha_upper[i]], color=color, linewidth=2, alpha=0.7)
ax.scatter(i, alpha_mean[i], color=color, s=40, zorder=3)
ax.scatter(np.arange(len(alpha_true)), alpha_true, color='black', s=60, marker='x',
label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('α', fontsize=12)
ax.set_title(f'α: 90% Credible Intervals (Coverage = {alpha_coverage:.0%})', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
#### Recovery Metrics Summary
```{python}
#| label: tbl-recovery-metrics
#| tbl-cap: "Parameter recovery metrics for m_01 with the actual study design (M=300, K=3, D=32, R=30). The primary parameter of interest is α."
import pandas as pd
K, D = study.K, study.D
# Beta recovery
all_beta_bias = []
all_beta_rmse = []
all_beta_coverage = []
all_beta_ci_width = []
for k in range(K):
for d in range(D):
param_name = f"beta[{k+1},{d+1}]"
bt = np.array([p['beta'][k][d] for p in true_params_list])
bm = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
bl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
bu = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
all_beta_bias.append(np.mean(bm - bt))
all_beta_rmse.append(np.sqrt(np.mean((bm - bt)**2)))
all_beta_coverage.append(np.mean((bt >= bl) & (bt <= bu)))
all_beta_ci_width.append(np.mean(bu - bl))
# Delta recovery
all_delta_bias = []
all_delta_rmse = []
all_delta_coverage = []
all_delta_ci_width = []
for k in range(K - 1):
param_name = f"delta[{k+1}]"
dt = np.array([p['delta'][k] for p in true_params_list])
dm = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
dl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
du = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
all_delta_bias.append(np.mean(dm - dt))
all_delta_rmse.append(np.sqrt(np.mean((dm - dt)**2)))
all_delta_coverage.append(np.mean((dt >= dl) & (dt <= du)))
all_delta_ci_width.append(np.mean(du - dl))
# Scale references for relative metrics. α lives on a wide multiplicative
# scale (Lognormal(3.0, 0.75) prior → true values typically ~5–100), so
# absolute Bias and RMSE must be interpreted relative to the magnitude
# of α; the same caveat is less relevant for β (zero-centred) and δ
# (modest range), so we report relative metrics for α only.
alpha_scale = float(np.mean(np.abs(alpha_true)))
alpha_rel_bias = alpha_bias / alpha_scale
alpha_rel_rmse = alpha_rmse / alpha_scale
metrics = pd.DataFrame([
{'Parameter': 'α', 'Bias': f'{alpha_bias:.2f}', 'RMSE': f'{alpha_rmse:.2f}',
'Rel. Bias': f'{alpha_rel_bias:+.1%}', 'Rel. RMSE': f'{alpha_rel_rmse:.1%}',
'Coverage (90%)': f'{alpha_coverage:.0%}', 'Mean CI Width': f'{alpha_ci_width:.2f}'},
{'Parameter': f'β (mean over {K*D})', 'Bias': f'{np.mean(all_beta_bias):.3f}',
'RMSE': f'{np.mean(all_beta_rmse):.3f}',
'Rel. Bias': '—', 'Rel. RMSE': '—',
'Coverage (90%)': f'{np.mean(all_beta_coverage):.0%}',
'Mean CI Width': f'{np.mean(all_beta_ci_width):.2f}'},
{'Parameter': f'δ (mean over {K-1})', 'Bias': f'{np.mean(all_delta_bias):.3f}',
'RMSE': f'{np.mean(all_delta_rmse):.3f}',
'Rel. Bias': '—', 'Rel. RMSE': '—',
'Coverage (90%)': f'{np.mean(all_delta_coverage):.0%}',
'Mean CI Width': f'{np.mean(all_delta_ci_width):.2f}'},
])
metrics
```
::: {.callout-note}
## Interpreting Recovery Results
The primary parameter of interest is $\alpha$ (sensitivity). The recovery
table reports both **absolute** and **relative** metrics for $\alpha$,
and the relative columns are the ones that should anchor interpretation.
Because the prior places true $\alpha$ on a wide multiplicative scale
(Lognormal$(3.0, 0.75)$ implies typical values in roughly the 5–100
range), absolute values of Bias and RMSE that may *look* large in the
table can simultaneously be small relative to the magnitude of $\alpha$.
This is why the scatter in the *α Recovery* figure can sit tightly on
the identity line even when raw RMSE has a value of several units —
the visual impression is on a relative scale, while raw RMSE is not.
Operationally, we treat the model as fit for purpose when:
- **Relative bias** is within roughly $\pm 10\%$ of the mean true $\alpha$;
- **Relative RMSE** is comfortably below 25% of the mean true $\alpha$;
- **90% credible-interval coverage** is approximately 85–95%.
The β–δ coupling issue documented in the foundational reports
([Report 4](../../foundations/04_parameter_recovery.qmd)) is expected to
persist — δ recovery benefits from additional identifying information
(e.g., risky choices) that is not present in m_01. Since this study
focuses on $\alpha$ estimation, the weaker recovery of $(\beta,\delta)$
does not compromise the primary analysis; we revisit the interpretive
implications of this in the *Construct Validity* discussion in
@sec-discussion.
:::
### Simulation-Based Calibration {#sec-sbc}
SBC complements parameter recovery by asking whether the posterior **correctly represents uncertainty**. Rather than evaluating point estimates, SBC checks that the true parameter value occupies a *uniform* rank position within the posterior samples across many simulations. Non-uniformity indicates that the posterior is miscalibrated—too wide, too narrow, or biased.
::: {.callout-tip collapse="false"}
## A Reader's Guide to SBC Plots and Tests
The basic logic of SBC [@talts2018] is simple. For each of $N$
simulations we (i) sample a "true" parameter value $\theta^\star$ from
the prior, (ii) simulate a dataset using $\theta^\star$, (iii) fit the
model and obtain $L$ posterior draws, and (iv) compute the **rank** of
$\theta^\star$ among those $L$ draws. If the model is correctly
specified and the sampler is calibrated, that rank should be uniformly
distributed on $\{0, 1, \ldots, L\}$ across simulations.
What to look for in the diagnostics:
* **Rank histogram (uniform = good).** Each bin should contain roughly
$N / B$ counts (here $\approx 10$ with $N=200$, $B=20$). The shape of
any departure is itself diagnostic:
* a **flat / uniform** histogram is evidence of a well-calibrated
posterior;
* a **U-shape** (excess mass at both ends) means the posterior is
too *narrow* — credible intervals are over-confident;
* an **inverted-U / dome** (excess mass in the middle) means the
posterior is too *wide* — credible intervals are under-confident;
* a **monotone ramp** (mass shifting toward 0 or toward $L$) means
the posterior is *biased* upward or downward.
* **ECDF with KS band.** The empirical CDF of the (normalised) ranks
should track the diagonal. The shaded band is a 95% Kolmogorov–Smirnov
envelope: if the ECDF stays inside the band for all $x$, we cannot
reject uniformity at the 5% level. Excursions reveal the same
signatures listed above (S-curves above the diagonal indicate
bias; ECDFs that bow away from the diagonal indicate over- or
under-confidence).
* **Chi-square test.** Compares the histogram counts to the uniform
expectation. With $N=200$ and $B=20$ the test has modest power
(expected count $\approx 10$ per bin), so a non-significant
$p$-value should *not* be over-interpreted as proof of calibration —
it is one piece of evidence alongside visual inspection.
* **KS test.** A continuous-rank analogue that is generally more
powerful than the binned chi-square, particularly for smooth
departures from uniformity (bias, spread). Together the two tests
cover both global shape and bin-level fluctuations.
The bottom line: a well-calibrated parameter shows a roughly flat
histogram, an ECDF inside the KS band, and non-significant
$p$-values; if any one of these fails, the *shape* of the failure
points to whether the posterior is biased, too tight, or too diffuse.
:::
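The four-step recipe can be demonstrated end-to-end on a conjugate toy model where the posterior is exact, so any non-uniformity would indicate a bug in the rank logic rather than in a sampler. A normal-mean model with known variance stands in for the Stan fit:

```{python}
#| label: sketch-sbc-toy
#| echo: true
import numpy as np

rng = np.random.default_rng(1)
N_sims, L, n_obs = 200, 200, 20

toy_ranks = []
for _ in range(N_sims):
    theta_star = rng.normal(0.0, 1.0)            # (i) draw theta* from the N(0,1) prior
    y = rng.normal(theta_star, 1.0, size=n_obs)  # (ii) simulate a dataset
    post_var = 1.0 / (1.0 + n_obs)               # (iii) exact conjugate posterior
    post_mean = post_var * y.sum()
    draws = rng.normal(post_mean, np.sqrt(post_var), size=L)
    toy_ranks.append(int(np.sum(draws < theta_star)))  # (iv) rank of theta* among L draws

counts, _ = np.histogram(toy_ranks, bins=20, range=(0, L + 1))
print(counts)  # each bin should be near N_sims / 20 = 10
```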
```{python}
#| label: sbc-config
#| echo: true
# SBC configuration.
# With M=300 and D=32 each fit is expensive, so we use 200 simulations
# (vs 999 in the foundational reports). With 20 bins this gives ~10
# expected counts per bin — sufficient for meaningful rank histograms
# and KS tests, though chi-square power is modest.
n_sbc_sims = 200
n_sbc_chains = 1
sbc_thin = 4
n_sbc_samples = sbc_thin * n_sbc_sims # 800 total draws → 200 post-thinned
sbc_max_rank = n_sbc_samples // sbc_thin # 200 (= L)
sbc_n_bins = 20
sbc_expected_per_bin = n_sbc_sims / sbc_n_bins
print("SBC Configuration:")
print(f" Simulations (N): {n_sbc_sims}")
print(f" MCMC samples per sim: {n_sbc_samples}")
print(f" Chains: {n_sbc_chains}")
print(f" Thinning factor: {sbc_thin}")
print(f" Post-thinned draws (L): {sbc_max_rank}")
print(f" Histogram bins: {sbc_n_bins}")
print(f" Expected counts/bin: {sbc_expected_per_bin:.1f}")
```
```{python}
#| label: run-sbc
#| output: false
from analysis.sbc import SimulationBasedCalibration
sbc_dir = tempfile.mkdtemp(prefix="sbc_m01_")
sbc = SimulationBasedCalibration(
sbc_model_path=os.path.join(project_root, "models", "m_01_sbc.stan"),
study_design=study,
output_dir=sbc_dir,
n_sbc_sims=n_sbc_sims,
n_mcmc_samples=n_sbc_samples,
n_mcmc_chains=n_sbc_chains,
thin=sbc_thin
)
ranks, true_params_sbc = sbc.run()
```
```{python}
#| label: sbc-summary
#| echo: false
print(f"SBC Complete:")
print(f" Simulations: {ranks.shape[0]}")
print(f" Parameters: {ranks.shape[1]}")
```
#### MCMC Diagnostics
```{python}
#| label: sbc-mcmc-diagnostics
#| echo: false
def load_sbc_diagnostics(output_dir, n_inspect=10):
"""Load ESS from saved SBC posterior summaries."""
ess_per_sim = []
for i in range(1, n_inspect + 1):
csv_path = os.path.join(output_dir, "simulations", f"sim_{i}", "posterior_summary.csv")
if os.path.exists(csv_path):
df = pd.read_csv(csv_path, index_col=0)
ess_col = None
for col in ['N_Eff', 'ESS_bulk', 'ess_bulk']:
if col in df.columns:
ess_col = col
break
if ess_col is not None:
param_rows = df[~df.index.str.startswith(('lp', 'energy', 'pars_', 'ranks_'))]
ess_vals = param_rows[ess_col].dropna().values
if len(ess_vals) > 0:
ess_per_sim.append({
'sim': i,
'min_ess': np.min(ess_vals),
'median_ess': np.median(ess_vals),
})
return ess_per_sim
sbc_ess = load_sbc_diagnostics(sbc_dir)
print("SBC MCMC Diagnostics (from first 10 iterations):")
if sbc_ess:
min_ess = min(d['min_ess'] for d in sbc_ess)
median_ess = np.median([d['median_ess'] for d in sbc_ess])
print(f" Minimum ESS across all params/sims: {min_ess:.0f}")
print(f" Median ESS (across sims): {median_ess:.0f}")
print(f" Post-thinning draws: {n_sbc_samples // sbc_thin}")
print(f" ESS/draw ratio: {median_ess / n_sbc_samples:.2f}")
else:
print(" Diagnostic summaries not available.")
```
#### α Rank Distribution
```{python}
#| label: fig-sbc-alpha
#| fig-cap: "SBC rank histogram and ECDF for α under the m_01 prior with the actual study design. Left: rank histogram (uniform = well-calibrated). Right: ECDF with 95% KS confidence band."
from scipy import stats
fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))
expected_count = n_sbc_sims / sbc_n_bins
# Rank histogram
ax = axes[0]
ax.hist(ranks[:, 0], bins=sbc_n_bins, alpha=0.7, color=SEU_COLORS['primary'], edgecolor='white')
ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2,
label=f'Uniform (E={expected_count:.1f})')
ax.set_xlabel('Rank', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('α Rank Histogram', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
# ECDF
ax = axes[1]
ranks_norm = np.sort(ranks[:, 0]) / sbc_max_rank
ecdf = np.arange(1, len(ranks_norm) + 1) / len(ranks_norm)
ax.step(ranks_norm, ecdf, where='post', color=SEU_COLORS['primary'], linewidth=2, label='m_01')
ax.plot([0, 1], [0, 1], 'r--', linewidth=2, label='Uniform')
n_ecdf = len(ecdf)
epsilon = np.sqrt(np.log(2/0.05) / (2 * n_ecdf))
ax.fill_between([0, 1], [0-epsilon, 1-epsilon], [0+epsilon, 1+epsilon],
alpha=0.15, color='red', label=f'95% KS band (ε={epsilon:.3f})')
ax.set_xlabel('Normalized Rank', fontsize=11)
ax.set_ylabel('ECDF', fontsize=11)
ax.set_title('α ECDF', fontsize=12)
ax.legend(fontsize=9)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Formal tests
chi2_alpha, p_alpha = stats.chisquare(np.histogram(ranks[:, 0], bins=sbc_n_bins)[0])
ks_alpha, ks_p_alpha = stats.kstest(ranks[:, 0] / sbc_max_rank, 'uniform')
print(f"\nα Calibration Tests:")
print(f" Chi-square: χ² = {chi2_alpha:.2f}, p = {p_alpha:.3f}")
print(f" KS test: D = {ks_alpha:.3f}, p = {ks_p_alpha:.3f}")
```
#### δ Rank Distributions
```{python}
#| label: fig-sbc-delta
#| fig-cap: "SBC rank histograms for δ parameters. Departures from uniformity indicate calibration issues, typically reflecting the β–δ identification problem."
K_val, D_val = study.K, study.D
delta_start = 1 + K_val * D_val
K_minus_1 = K_val - 1
fig, axes = plt.subplots(1, K_minus_1, figsize=(6 * K_minus_1, 4.5))
if K_minus_1 == 1:
axes = [axes]
for k in range(K_minus_1):
param_idx = delta_start + k
ax = axes[k]
ax.hist(ranks[:, param_idx], bins=sbc_n_bins, alpha=0.7,
color=SEU_COLORS['secondary'], edgecolor='white')
ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2,
label=f'Uniform (E={expected_count:.1f})')
ax.set_xlabel('Rank', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title(f'δ$_{k+1}$ Rank Histogram', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Formal tests for delta
print("δ Calibration Tests:")
for k in range(K_minus_1):
param_idx = delta_start + k
chi2_d, p_d = stats.chisquare(np.histogram(ranks[:, param_idx], bins=sbc_n_bins)[0])
ks_d, ks_p_d = stats.kstest(ranks[:, param_idx] / sbc_max_rank, 'uniform')
print(f" δ_{k+1}: χ² = {chi2_d:.2f}, p = {p_d:.3f} | KS D = {ks_d:.3f}, p = {ks_p_d:.3f}")
```
#### β Calibration Summary
```{python}
#| label: fig-sbc-beta
#| fig-cap: "Chi-square p-values for β parameter calibration. Parameters above the dashed line (p > 0.05) are consistent with uniform rank distributions."
beta_start = 1
n_beta = K_val * D_val
beta_pvals = []
for idx in range(beta_start, beta_start + n_beta):
counts = np.histogram(ranks[:, idx], bins=sbc_n_bins)[0]
_, p_val = stats.chisquare(counts)
beta_pvals.append(p_val)
fig, ax = plt.subplots(figsize=(10, 4.5))
x = np.arange(n_beta)
colors = [SEU_COLORS['primary'] if p > 0.05 else SEU_COLORS['accent'] for p in beta_pvals]
ax.bar(x, beta_pvals, color=colors, alpha=0.7, edgecolor='white')
ax.axhline(y=0.05, color='red', linestyle='--', linewidth=2, label='α = 0.05')
ax.set_xlabel('β Parameter Index', fontsize=11)
ax.set_ylabel('Chi-square p-value', fontsize=11)
ax.set_title('β Parameter Calibration', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
n_well_cal = sum(1 for p in beta_pvals if p > 0.05)
print(f"\nβ Calibration: {n_well_cal}/{n_beta} parameters well-calibrated (p > 0.05)")
print(f"Expected false rejections at α=0.05: {n_beta * 0.05:.1f}")
```
::: {.callout-note}
## SBC Interpretation
With 200 SBC simulations the formal chi-square tests have modest power (~10 expected counts per bin), so visual inspection of rank histograms is as important as the p-values. Specifically, with 200 simulations and 20 bins, the chi-square test has approximately 50% power to detect a 40% deviation from uniformity in any single bin—substantial miscalibration could go undetected. The key question is whether α shows approximately uniform ranks—confirming that the m_01 posterior correctly represents uncertainty about sensitivity at this study scale. The β–δ identification pattern from the foundational reports ([Report 6](../../foundations/06_sbc_validation.qmd)) is expected to manifest here as well, since m_01 (like m_0) uses only uncertain choices. This does not affect the validity of α estimation for the primary analysis.
:::
## Results {#sec-results}
### Loading Posterior Draws
```{python}
#| label: load-posteriors
#| output: false
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]
temp_labels = {t: f"T={t}" for t in temperatures}
# Load alpha draws for each temperature
alpha_draws = {}
for t in temperatures:
key = f"T{str(t).replace('.', '_')}"
data = np.load(data_dir / f"alpha_draws_{key}.npz")
alpha_draws[t] = data['alpha']
# Load pre-computed analysis results
with open(data_dir / "primary_analysis.json") as f:
analysis = json.load(f)
# Load fit summary
with open(data_dir / "fit_summary.json") as f:
fit_summary = json.load(f)
```
```{python}
#| echo: false
# Verify draws loaded correctly
for t in temperatures:
n = len(alpha_draws[t])
print(f" T={t}: {n:,} posterior draws loaded")
```
### MCMC Diagnostics
```{python}
#| label: tbl-diagnostics
#| tbl-cap: "MCMC diagnostics for all five temperature conditions. All fits used 4 chains with 1,000 warmup and 1,000 sampling iterations each (4,000 post-warmup draws total)."
diag_rows = []
for t in temperatures:
key = f"T{str(t).replace('.', '_')}"
with open(data_dir / f"diagnostics_{key}.txt") as f:
diag_text = f.read()
# Parse divergences
if "No divergent transitions" in diag_text:
n_div = 0
else:
import re
match = re.search(r'(\d+) of (\d+)', diag_text)
n_div = int(match.group(1)) if match else 0
rhat_ok = "R-hat values satisfactory" in diag_text
ess_ok = "effective sample size satisfactory" in diag_text
ebfmi_ok = "E-BFMI satisfactory" in diag_text
diag_rows.append({
'Temperature': t,
'Divergences': f"{n_div}/4000",
'R̂': '✓' if rhat_ok else '✗',
'ESS': '✓' if ess_ok else '✗',
'E-BFMI': '✓' if ebfmi_ok else '✗',
})
pd.DataFrame(diag_rows)
```
All conditions show clean diagnostics. The 1–2 divergent transitions at $T = 1.0$ and $T = 1.5$ (at most 0.05% of transitions) are well within acceptable bounds and do not indicate systematic sampling difficulties.
### Posterior Summaries
```{python}
#| label: tbl-posteriors
#| tbl-cap: "Posterior summaries for the sensitivity parameter α at each temperature level. Intervals are 90% credible intervals."
summary = analysis['summary_table']
rows = []
for s in summary:
rows.append({
'Temperature': s['temperature'],
'Median': f"{s['median']:.1f}",
'Mean': f"{s['mean']:.1f}",
'SD': f"{s['sd']:.1f}",
'90% CI': f"[{s['ci_low']:.1f}, {s['ci_high']:.1f}]",
})
pd.DataFrame(rows)
```
The estimates reveal a clear pattern: $\alpha$ is highest at $T = 0.0$ (greedy decoding) and lowest at $T = 1.5$ (high temperature), with intermediate conditions falling between these extremes. Note, however, the near-equality of the $T = 0.3$ and $T = 0.7$ estimates—a point we return to below.
### Forest Plot
```{python}
#| label: fig-forest
#| fig-cap: "Forest plot of posterior α distributions across temperature conditions. Points show posterior medians; thick bars span the 50% credible interval; thin bars span the 90% credible interval. Higher temperature is associated with lower estimated sensitivity."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(8, 5))
y_positions = np.arange(len(temperatures))[::-1]
for i, t in enumerate(temperatures):
draws = alpha_draws[t]
median = np.median(draws)
q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])
y = y_positions[i]
# 90% CI (thin line)
ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
# 50% CI (thick line)
ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
# Median (point)
ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
markeredgecolor='white', markeredgewidth=1.5, zorder=5)
ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('Sensitivity (α)')
ax.set_title('Posterior Distributions of α by Temperature')
ax.grid(axis='x', alpha=0.3)
ax.grid(axis='y', alpha=0)
plt.tight_layout()
plt.show()
```
### Posterior Densities
```{python}
#| label: fig-density
#| fig-cap: "Kernel density estimates of the posterior α distributions. The separation between T = 0.0 and T ≥ 1.0 is clear, while T = 0.3 and T = 0.7 overlap substantially."
#| fig-height: 5
from scipy.stats import gaussian_kde
fig, ax = plt.subplots(figsize=(8, 5))
for i, t in enumerate(temperatures):
draws = alpha_draws[t]
kde = gaussian_kde(draws)
x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
label=f'T = {t} (median = {np.median(draws):.0f})')
ax.set_xlabel('Sensitivity (α)')
ax.set_ylabel('Density')
ax.set_title('Posterior Density of α')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()
```
### Posterior Predictive Checks
We assess model adequacy via posterior predictive checks (PPCs). For each posterior draw of the parameters, we simulate a replicated dataset of the same size and structure as the observed data, then compare the observed and replicated datasets on three complementary test statistics:
- **Log-likelihood (ll).** The sum of log-probabilities assigned by the model to the observed choices, $T_{\text{ll}} = \sum_{m=1}^{M} \log p(y_m \mid \theta)$. This is a global goodness-of-fit measure: if the model systematically assigns lower probability to real choices than to simulated ones, the observed $T_{\text{ll}}$ will fall in the tail of the replicated distribution.
- **Modal choice frequency (modal).** For each decision problem, identify the alternative chosen most often across replications; the test statistic is the fraction of problems where the mode matches the observed choice. This targets whether the model correctly identifies the *most likely* choice in each problem, independent of the exact probabilities.
- **Mean choice probability (prob).** The average predicted probability of the actually chosen alternative, $T_{\text{prob}} = M^{-1} \sum_{m=1}^{M} p(y_m \mid \theta)$. This measures whether the model's confidence in the observed choices is well-calibrated on average.
The posterior predictive p-value for each statistic is the proportion of replicated datasets in which the test statistic equals or exceeds the observed value. Values near 0.5 indicate good calibration; values near 0 or 1 signal systematic misfit.
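In code, each p-value is a single comparison between the observed statistic and its replicated distribution. A sketch with toy numbers (the real values come from the frozen `ppc_*.json` artifacts loaded below):

```{python}
#| label: sketch-ppc-pvalue
#| echo: true
import numpy as np

def ppc_pvalue(t_rep, t_obs):
    """Fraction of replicated datasets whose statistic >= the observed value."""
    return float(np.mean(np.asarray(t_rep) >= t_obs))

rng = np.random.default_rng(0)
t_rep = rng.normal(loc=-210.0, scale=8.0, size=4000)  # toy replicated log-likelihoods
print(ppc_pvalue(t_rep, t_obs=-212.0))  # in the bulk of the distribution: no misfit signal
```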
```{python}
#| label: tbl-ppc
#| tbl-cap: "Posterior predictive check p-values for each temperature condition. Values near 0.5 indicate good calibration; values near 0 or 1 indicate model misfit. Three test statistics are used: log-likelihood (ll), modal choice frequency (modal), and mean choice probability (prob)."
ppc_rows = []
for t in temperatures:
key = f"T{str(t).replace('.', '_')}"
with open(data_dir / f"ppc_{key}.json") as f:
ppc = json.load(f)
pvals = ppc['p_values']
ppc_rows.append({
'Temperature': t,
'Log-likelihood': f"{pvals['ll']:.3f}",
'Modal frequency': f"{pvals['modal']:.3f}",
'Mean probability': f"{pvals['prob']:.3f}",
})
pd.DataFrame(ppc_rows)
```
All posterior predictive p-values fall comfortably within $[0.3, 0.65]$, indicating that the model provides an adequate description of the choice data at every temperature level. There is no evidence of systematic misfit.
## Monotonicity Analysis {#sec-monotonicity}
The central question of this study is whether estimated sensitivity $\alpha$ decreases monotonically with temperature. We assess this at three levels of granularity.
### Global Slope
To summarize the overall direction and magnitude of the temperature–sensitivity relationship, we compute a posterior distribution over a linear slope coefficient. For each posterior draw $s = 1, \ldots, S$, we have a vector of five $\alpha$ values—one per temperature condition. We fit a simple ordinary least squares regression $\alpha^{(s)}(T) = a^{(s)} + b^{(s)} \cdot T$ across the five temperature levels, yielding one slope $b^{(s)}$ per draw. The resulting $S$ slope values define a posterior distribution over $b = \Delta\alpha / \Delta T$.
This is *not* a regression model in the usual sense—it is a derived quantity computed from the joint posterior. The slope summarizes whether, on the whole, higher temperature is associated with lower sensitivity. A slope whose posterior mass lies entirely below zero provides evidence for the directional hypothesis that temperature reduces EU sensitivity:
```{python}
#| label: fig-slope
#| fig-cap: "Posterior distribution of the slope Δα/ΔT from a draw-wise linear fit. The entire 90% credible interval lies below zero, providing strong evidence that α decreases with temperature."
slopes = analysis['slope']
# Recompute the slope draws for the density plot
temp_array = np.array(temperatures)
slope_draws = []
for draw_idx in range(len(alpha_draws[temperatures[0]])):
alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in temperatures])
    # OLS slope: cov(T, α) / var(T). np.cov defaults to ddof=1 while np.var
    # defaults to ddof=0, so set ddof=1 in both to get the exact OLS slope.
    b = np.cov(temp_array, alphas_at_draw)[0, 1] / np.var(temp_array, ddof=1)
slope_draws.append(b)
slope_draws = np.array(slope_draws)
fig, ax = plt.subplots(figsize=(8, 4))
kde = gaussian_kde(slope_draws)
x_grid = np.linspace(np.percentile(slope_draws, 0.5), np.percentile(slope_draws, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.3, color=SEU_COLORS['primary'])
ax.plot(x_grid, kde(x_grid), color=SEU_COLORS['primary'], linewidth=2)
median_slope = np.median(slope_draws)
ax.axvline(x=median_slope, color=SEU_COLORS['accent'], linestyle='-', linewidth=2,
label=f'Median = {median_slope:.1f}')
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5, label='No effect')
# Shade 90% CI
q05, q95 = np.percentile(slope_draws, [5, 95])
mask = (x_grid >= q05) & (x_grid <= q95)
ax.fill_between(x_grid[mask], kde(x_grid[mask]), alpha=0.15, color=SEU_COLORS['accent'])
ax.axvline(x=q05, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.axvline(x=q95, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Posterior Distribution of Temperature–Sensitivity Slope')
ax.legend()
plt.tight_layout()
plt.show()
print(f"Slope summary:")
print(f" Median: {median_slope:.1f}")
print(f" 90% CI: [{q05:.1f}, {q95:.1f}]")
print(f" P(slope < 0): {np.mean(slope_draws < 0):.4f}")
```
The posterior probability that the slope is negative exceeds 0.99, providing strong evidence for the directional hypothesis: higher temperature is associated with lower sensitivity to expected utility.
### Pairwise Comparisons
The global slope averages over all five conditions and may mask local structure (e.g., two adjacent temperatures that produce indistinguishable $\alpha$ values). To examine the ordering at finer resolution, we compute pairwise posterior probabilities. For each pair of temperature conditions $(T_i, T_j)$ with $T_i < T_j$, we calculate
$$
P\bigl(\alpha(T_i) > \alpha(T_j) \mid \text{data}\bigr) = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\bigl[\alpha^{(s)}(T_i) > \alpha^{(s)}(T_j)\bigr],
$$
where the indicator function is evaluated draw-by-draw from the *independent* posteriors at each temperature.
::: {.callout-warning}
## Independence Caveat: Design-Induced Correlations
These posteriors are independent *conditional on the shared design*,
but they are not independent *unconditionally*. The reason is worth
spelling out, because the same point will recur whenever multiple
conditions share a common stimulus pool (as in the planned alignment
study).
The five temperature conditions in this study draw their decision
problems from a *single* fixed pool of $R = 30$ insurance claims.
Whatever is idiosyncratic about that particular set of 30 claims —
the embedding geometry that happens to fall out of those texts, the
particular spread of "true" expected utilities across alternatives,
the typical gap $\Delta$ between the best and second-best option —
shapes all five fits simultaneously. A claim pool that happens to
contain very large utility gaps will inflate every estimated $\alpha$
relative to a pool with smaller gaps; a pool that happens to be
nearly degenerate in some embedding direction will deflate every
$\alpha$. None of this idiosyncrasy is captured by fitting m_01
five separate times: each fit treats its own dataset as the only
source of evidence, so the five posterior summaries look statistically
unrelated even though they actually share a common, unmodelled
nuisance. The practical consequence is that pairwise probabilities
$P(\alpha(T_i) > \alpha(T_j))$ that are computed from independent
posteriors slightly *overstate* how confidently we can resolve
between-temperature differences — the very feature of the design
that boosts apparent precision (every condition uses the same claims,
so within-condition noise is reduced) also induces a positive
between-condition correlation that the independent-fits analysis
ignores.
A hierarchical model jointly estimating $\alpha(T)$ — for example
$\log \alpha(T_c) = \gamma_0 + \gamma_1 T_c + \eta_c$ with a small
random effect $\eta_c$ — addresses this on two fronts. First, it
makes the *common* effect of the shared claim pool a single estimated
quantity ($\gamma_0$), so design-induced shifts cancel out of
between-condition contrasts in a principled way. Second, it borrows
strength across conditions for the *systematic* effect of interest
($\gamma_1$), giving a single, calibrated uncertainty statement
about the temperature–sensitivity relationship that respects the
positive correlation across conditions. We return to this issue in
the *Limitations* section below and propose the corresponding
hierarchical extension in *Next Steps*; the foundational `h_m01`
reports introduce exactly this kind of model and use it as the
inferential workhorse for the planned alignment study.
:::
Values near 1.0 indicate that the lower-temperature condition almost certainly yields higher sensitivity; values near 0.5 indicate that the two conditions are effectively indistinguishable.
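Computed directly from the loaded draws, each probability is a one-line draw-wise comparison; the check below should reproduce the corresponding entry in the table, which reads the precomputed values from `primary_analysis.json`:

```{python}
#| label: sketch-pairwise-check
#| echo: true
# Draw-wise pairwise probability for the most extreme comparison
p_check = np.mean(alpha_draws[0.0] > alpha_draws[1.5])
print(f"P(α(T=0.0) > α(T=1.5)) = {p_check:.3f}")
```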
```{python}
#| label: tbl-pairwise
#| tbl-cap: "Posterior probability that α is higher at the lower temperature in each pair, treating the per-temperature posteriors as if independent. Values above 0.95 indicate apparent separation under that simplifying assumption; values near 0.5 indicate indistinguishability. See the *Independence Caveat* above: shared stimuli induce between-condition correlations that the independent-fits comparison does not capture, so the apparent separation is likely an upper bound on the evidence."
pairs = analysis['pairwise_comparisons']
pair_rows = []
for key, prob in pairs.items():
t1, t2 = key.split('_vs_')
strength = '●●●' if prob > 0.95 else ('●●' if prob > 0.8 else ('●' if prob > 0.65 else '○'))
pair_rows.append({
'Comparison': f'α(T={t1}) > α(T={t2})',
'P': f'{prob:.3f}',
'Evidence': strength,
})
pd.DataFrame(pair_rows)
```
The pairwise analysis reveals a structured pattern (interpret these probabilities in light of the *Independence Caveat* above; they treat the per-temperature posteriors as independent and so likely overstate the true between-condition evidence):
- **Apparent separation** ($P > 0.95$ under independence): $T = 0.0$ vs $T = 1.0$ and $T = 1.5$
- **Moderate apparent separation** ($P > 0.8$): $T = 0.3$ vs $T \geq 1.0$; $T = 0.7$ vs $T \geq 1.0$
- **Weak/no separation** ($P \approx 0.5$): $T = 0.3$ vs $T = 0.7$
### Strict Monotonicity
The pairwise analysis examines each adjacent or non-adjacent pair in isolation. A stronger test asks whether the *entire* ordering $\alpha(T = 0.0) > \alpha(T = 0.3) > \alpha(T = 0.7) > \alpha(T = 1.0) > \alpha(T = 1.5)$ holds simultaneously. We estimate the joint probability of strict monotonicity by computing, for each posterior draw, whether the five $\alpha$ values are strictly decreasing:
$$
P(\text{strict monotonicity} \mid \text{data}) = \frac{1}{S} \sum_{s=1}^{S} \prod_{j=1}^{4} \mathbb{1}\bigl[\alpha^{(s)}(T_j) > \alpha^{(s)}(T_{j+1})\bigr].
$$
Because this requires *every* adjacent pair to satisfy the ordering in the *same* draw, the joint probability can be no larger than the smallest pairwise probability—a single weakly-ordered pair drives it down.
```{python}
#| label: monotonicity
#| echo: true
# P(α(T=0.0) > α(T=0.3) > α(T=0.7) > α(T=1.0) > α(T=1.5)) — joint probability
n_draws = len(alpha_draws[0.0])
strictly_decreasing = 0
for i in range(n_draws):
vals = [alpha_draws[t][i] for t in temperatures]
if all(vals[j] > vals[j+1] for j in range(len(vals)-1)):
strictly_decreasing += 1
p_mono = strictly_decreasing / n_draws
print(f"P(α strictly decreasing across all T): {p_mono:.4f}")
```
The probability of *strict* monotonicity across all five levels is modest, at approximately 0.12. This is driven entirely by the near-equality of $\alpha$ at $T = 0.3$ and $T = 0.7$, where the pairwise probability is essentially a coin flip ($P = 0.48$). Collapsing these two levels yields a coarser ordering that holds with much higher confidence:
```{python}
#| label: coarse-monotonicity
#| echo: true
# Coarser test: α(T=0.0) > mean(α(T=0.3), α(T=0.7)) > α(T=1.0) > α(T=1.5)
coarse_decreasing = 0
for i in range(n_draws):
a00 = alpha_draws[0.0][i]
a_mid = (alpha_draws[0.3][i] + alpha_draws[0.7][i]) / 2
a10 = alpha_draws[1.0][i]
a15 = alpha_draws[1.5][i]
if a00 > a_mid > a10 > a15:
coarse_decreasing += 1
p_coarse = coarse_decreasing / n_draws
print(f"P(α(0.0) > mean(α(0.3), α(0.7)) > α(1.0) > α(1.5)): {p_coarse:.4f}")
```
## Discussion {#sec-discussion}
### Summary of Findings
The results support the directional hypothesis: increasing LLM temperature decreases estimated sensitivity to expected utility maximization. The evidence is strongest at the global level—the posterior slope $\Delta\alpha / \Delta T$ is negative with high confidence—and weakest at fine-grained adjacent comparisons.
```{python}
#| label: fig-summary
#| fig-cap: "Summary of the temperature–sensitivity relationship. Left: Posterior medians with 90% credible intervals. Right: Pairwise posterior probabilities P(α_i > α_j) as a heatmap."
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Left: point estimates with CIs
medians = [np.median(alpha_draws[t]) for t in temperatures]
q05s = [np.percentile(alpha_draws[t], 5) for t in temperatures]
q95s = [np.percentile(alpha_draws[t], 95) for t in temperatures]
axes[0].errorbar(temperatures, medians,
yerr=[np.array(medians) - np.array(q05s), np.array(q95s) - np.array(medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5)
axes[0].set_xlabel('Temperature')
axes[0].set_ylabel('Sensitivity (α)')
axes[0].set_title('α vs. Temperature')
axes[0].set_xticks(temperatures)
# Right: pairwise heatmap
n_temps = len(temperatures)
heatmap = np.full((n_temps, n_temps), np.nan)
for key, prob in pairs.items():
t1, t2 = key.split('_vs_')
i = temperatures.index(float(t1))
j = temperatures.index(float(t2))
heatmap[i, j] = prob
heatmap[j, i] = 1 - prob
np.fill_diagonal(heatmap, 0.5)
im = axes[1].imshow(heatmap, cmap='RdYlGn', vmin=0, vmax=1, aspect='equal')
axes[1].set_xticks(range(n_temps))
axes[1].set_xticklabels([f'{t}' for t in temperatures])
axes[1].set_yticks(range(n_temps))
axes[1].set_yticklabels([f'{t}' for t in temperatures])
axes[1].set_xlabel('Temperature (column)')
axes[1].set_ylabel('Temperature (row)')
axes[1].set_title('P(α_row > α_col)')
# Annotate cells
for i in range(n_temps):
for j in range(n_temps):
if not np.isnan(heatmap[i, j]):
color = 'white' if heatmap[i, j] > 0.8 or heatmap[i, j] < 0.2 else 'black'
axes[1].text(j, i, f'{heatmap[i, j]:.2f}', ha='center', va='center',
fontsize=9, color=color)
plt.colorbar(im, ax=axes[1], shrink=0.8)
plt.tight_layout()
plt.show()
```
### Interpreting the Non-Separation at $T = 0.3$ and $T = 0.7$
The inability to distinguish $\alpha$ at these two temperatures admits several interpretations:
1. **Threshold effect.** The relationship between temperature and sensitivity may not be linear. There may be a regime ($T \lesssim 0.7$) where the LLM's choice behavior is relatively stable, with sensitivity declining more sharply only at higher temperatures.
2. **Insufficient statistical power.** With $M = 300$ observations per condition, the study may lack power to resolve small differences in $\alpha$. The posteriors at $T = 0.3$ and $T = 0.7$ have standard deviations of $\sim 15$–$19$, an order of magnitude larger than the difference between their medians ($\sim 1.5$).
3. **Genuine non-monotonicity.** It is possible that the true relationship is not strictly monotonic at fine scales, even if the overall trend is negative.
These interpretations are not mutually exclusive, and the current data do not distinguish among them.
### Model Adequacy
The posterior predictive checks provide no evidence of misfit at any temperature level. This is reassuring but not conclusive—PPC p-values assess whether the model *can* reproduce summary statistics of the data, not whether the model captures the *mechanism* by which temperature affects choices. The m_01 model treats each temperature condition as an independent estimation problem; it does not model how temperature enters the choice process.
### Construct Validity of $\alpha$
For a JDM audience, the claim that $\alpha$ measures "sensitivity to subjective expected utility maximization" requires careful interpretation. The parameter $\alpha$ controls the sharpness of the softmax choice distribution—higher $\alpha$ means choices are more concentrated on high-EU alternatives. But this interpretation assumes the model's *other* parameters ($\beta$, $\delta$) correctly capture the subjective utilities. Given the acknowledged $\beta$–$\delta$ identification issues, what does $\alpha$ actually measure?
Operationally, $\alpha$ measures the *consistency* with which the agent chooses in the direction implied by the fitted utility function. This is related to, but not identical to, "sensitivity to EU maximization" in the strong sense that would require the utility parameters to be correctly identified.
#### Do we need to identify $\beta$ and $\delta$ to interpret $\alpha$?
The honest answer is: it depends on the strength of the claim being
made. It is worth distinguishing three layers of interpretation, in
order of increasing demand on identification.
1. **Internal consistency (weakest).** $\alpha$ is identified up to a
scale that is fixed by the parameterisation, *given* the fitted
utility function $u_\theta(c) = \delta^\top c + \text{(features)}$.
A higher $\alpha$ in one condition than another means the agent's
choices are more concentrated on the alternative that the fitted
$u_\theta$ ranks highest. This is a within-model claim about
choice consistency and *does not require* $\beta$ and $\delta$ to
be correctly identified — it only requires that $u_\theta$ rank the
alternatives in roughly the right order (which the posterior
predictive checks support globally).
2. **Comparative claims across conditions (the present use case).**
To interpret a *change* in $\alpha$ across a manipulation (here,
temperature), we need the fitted utility function to be reasonably
stable across conditions, so that the same notion of "EU
maximization" anchors the comparison. The m_01 design uses the
same $R = 30$ insurance claims at every temperature and the same
embedding-based features, which makes this assumption defensible:
any miscalibration in $(\beta, \delta)$ tends to act as a *shared
confound* that does not move differentially across the temperature
conditions. Under this assumption — which is empirically supported
by the within-model posterior predictive checks at every
temperature — comparative ordering claims about $\alpha$ are
warranted even though $\beta$ and $\delta$ are not separately
identified.
3. **Absolute claims about the *level* of EU rationality (strongest).**
Claims of the form "the LLM is rational with sensitivity $\alpha
\approx 30$" are interpretable only relative to the fitted utility
$u_\theta$. Without independent identification of $\beta$ and $\delta$
one cannot certify that $u_\theta$ is the *true* subjective utility,
so the absolute number $\alpha$ is best read as a sensitivity to
*the model-implied utility ranking*, not as a context-free measure
of EU rationality. We deliberately avoid making absolute claims of
this third type in the present report.
The temperature study lives in layer (2) and is well-supported. By
contrast, claims that rely on layer (3) — for example, comparing $\alpha$
estimated under m_0 / m_01 against an exogenous benchmark, or treating
$\alpha$ as a transferable property of the agent — should be flagged as resting on identification assumptions that the present design does not support.
#### Implications for the m_0 family and downstream studies
The present analysis uses m_01 (uncertain choices only). Models in the
m_0 family share the same $\beta$–$\delta$ coupling. The implications
ripple outward as follows:
* **Within-application comparative analyses** (the present temperature
study, the temperature × EU prompt study, the temperature × risky
alternatives study) use shared designs and within-application
comparisons, so they live in layer (2) and are robust to the
partial identification.
* **Cross-application comparisons** of *absolute* $\alpha$ values
(e.g., GPT-4o on insurance vs. Claude on insurance vs. GPT-4o on
Ellsberg) live closer to layer (3) and should be interpreted with
care: differences in domain, embedding geometry, and answer-format
noise can produce different model-implied utility functions, which
in turn change what $\alpha$ is sensitive *to*. We recommend
presenting such cross-application contrasts as comparisons of
*patterns* (e.g., signs and orderings of effects) rather than of
numerical $\alpha$ values.
* **The planned alignment study** (see [`README.md`](../../../README.md))
is, on first inspection, more demanding because it asks how $\alpha$
estimated for an LLM relates to a *normative* benchmark of EU
rationality. This is closest to a layer-(3) claim, and it deserves
to be flagged honestly. The alignment study is built on the
hierarchical model `h_m01`, which is the m_01 likelihood extended
to $J$ experimental cells: cell-specific $\alpha_j$ and $\beta_j$
with a shared $\delta$, fitted jointly via a regression on
$\log \alpha$. Crucially, `h_m01` inherits m_01's reliance on
*uncertain* choices only — it does **not** introduce risky
alternatives, so the underlying $\beta$–$\delta$ coupling is *not*
resolved by moving to the hierarchical model. What `h_m01` does buy
is (a) sharing $\delta$ across cells (so any single subjective
utility scaling is at least common to all comparisons) and (b)
framing alignment-relevant manipulations as regression
contrasts on $\log \alpha$, which are layer-(2) comparative claims
by construction. The strongest defensible reading of the alignment
study is therefore that it is a layer-(2) design that uses the
hierarchical structure to make comparative effects across
alignment-manipulating conditions properly calibrated, *not* a
layer-(3) certification of EU rationality in absolute terms.
Genuinely escaping the partial-identification caveat would require
moving to a model in the m_1 / m_2 / m_3 family — i.e., one that
augments the design with risky alternatives whose consequences are
partially observable, which is what gives $\delta$ separate
identifying information in the foundational reports. We flag this
as an open design question for the alignment study.
In short: the partial identification is a real interpretive constraint,
but it does not undermine the present temperature study or the broader
applications programme — provided that conclusions are framed as
comparative rather than absolute, and that the alignment study is
explicitly scoped as a contrast study on $\log \alpha$ rather than a
calibration of agent-level EU rationality. The operational
consequences for the alignment-study pre-registration, hypothesis
framing, and follow-up sequencing are spelled out in §0.5 of
[`prompts/hierarchical_alignment_study_plan.md`](../../../prompts/hierarchical_alignment_study_plan.md).
### Relation to the Foundational Framework
These results can be interpreted through the lens of [Report 1](../../foundations/01_abstract_formulation.qmd):
- The **monotonicity property** (Theorem 1) tells us that for any fixed value function, higher $\alpha$ implies higher probability of choosing value-maximizing alternatives. Our finding that $\alpha$ decreases with temperature thus means the LLM's choices become *less* concentrated on EU-maximizing alternatives as temperature increases.
- The **rate of convergence** (Theorem 5) tells us that the effect of $\alpha$ on choice probabilities depends on the gap $\Delta$ between the best and second-best expected utilities. Problems with larger utility gaps will show temperature effects more readily (a small numerical illustration follows this list).
- The **commitment–performance distinction** discussed in Report 1 provides a natural framing: if we take EU maximization as the LLM's "commitment" (the normative standard its training approximates), then temperature controls the *performance* noise around that commitment.
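To make the first two points concrete, the sketch below evaluates a softmax choice rule, $P(i) \propto e^{\alpha \, EU_i}$, for a three-alternative problem across a few values of $\alpha$ and of the gap $\Delta$. The specific utilities and gaps are illustrative placeholders, not quantities estimated from the fitted model.

```{python}
#| label: softmax-gap-sketch
#| echo: true
# Illustrative only: probability of choosing the EU-best alternative under a
# softmax rule, as a function of sensitivity (alpha) and the EU gap.
import numpy as np

def p_best(alpha, gap, n_alternatives=3):
    """P(choose the best alternative) when it leads the others by `gap` in EU."""
    eu = np.zeros(n_alternatives)
    eu[0] = gap  # best alternative; the others are tied at 0
    logits = alpha * eu
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs[0]

for gap in [0.02, 0.05, 0.10]:
    row = ", ".join(f"alpha={a:>2}: {p_best(a, gap):.2f}" for a in [10, 30, 60])
    print(f"gap = {gap:.2f} -> {row}")
```

Higher $\alpha$ raises the probability of the EU-best choice at any fixed gap (monotonicity), and over this range the same change in $\alpha$ moves that probability further when the gap is larger, which is why problems with larger utility gaps show temperature effects more readily.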
### Connection to the JDM Literature
The relationship between temperature and choice consistency connects to several strands of the human decision-making literature:
- **Error theories of choice.** The softmax (logit) choice model is a
standard *Fechner-style* stochastic specification — one of the
workhorse error structures examined alongside the random-preference
and "trembling-hand" alternatives by Hey and Orme [-@hey1994]
in their experimental comparison of stochastic specifications for
choice under risk. The $\alpha$ parameter plays the same role as
inverse temperature (or *precision*) in random utility models, and
our finding that external sampling noise reduces effective $\alpha$
parallels work on how cognitive load and time pressure increase
choice inconsistency in human subjects.
- **Quantal response equilibrium.** In game-theoretic settings, the
softmax choice rule defines quantal response equilibrium [@mckelvey1995],
where players choose strategies with probabilities that increase in
expected payoff. The $\alpha$ parameter governs the "rationality" of
players in this framework. Our results suggest that LLM temperature
has an analogous effect to noise in human strategic reasoning.
- **Resource-rational analysis.** From the perspective of
resource-rational analysis [@griffiths2015; @lieder2020], choice
noise may reflect optimal information processing under cognitive
constraints. The token-sampling temperature could be viewed as a
proxy for "cognitive effort" — lower temperature corresponds to
more deterministic (less resource-intensive) output generation.
This framing also connects to *rational inattention* models
[@sims2003; @caplin2015] in which agents optimally trade off
decision quality against information-processing costs.
These connections are suggestive rather than formal, but they situate
the temperature–$\alpha$ relationship within broader theoretical
frameworks for understanding noisy choice.
### Limitations and Generalizability
These results establish the temperature–sensitivity relationship for **GPT-4o on the insurance claims triage task**. Several important limitations should constrain interpretation:
1. **Single LLM.** The temperature parameter has model-specific effects. Whether this relationship generalizes to other LLM architectures (Claude, Llama, etc.) is an open empirical question. Subsequent reports in this series ([Claude × Insurance](../claude_insurance_study/01_claude_insurance_study.qmd), [GPT-4o × Ellsberg](../gpt4o_ellsberg_study/01_gpt4o_ellsberg_study.qmd)) provide cross-LLM evidence.
2. **Single task domain.** The insurance claims triage task is a specific operationalization of decision-making under uncertainty. How well findings generalize to other domains (gambles, consumer choice, medical decisions) depends on task-specific factors that may interact with temperature.
3. **Single embedding model.** The feature construction uses `text-embedding-3-small`. Different embedding models might produce different utility landscapes and thus different temperature–$\alpha$ relationships.
4. **Prior sensitivity.** While the prior predictive analysis motivates $\text{Lognormal}(3.0, 0.75)$, we have not conducted a formal sensitivity analysis demonstrating that conclusions are robust to nearby prior specifications. Given that $M = 300$ observations per condition is substantial, we expect the prior to be swamped by the data—but this should be verified. A robustness analysis is planned for subsequent work.
5. **Design-induced correlations.** Because the same $R = 30$ claims
appear at every temperature, the five $\alpha$ estimates share
correlations that the independent-fits approach does not model.
The mechanism and its consequences for the pairwise comparisons
are discussed in the *Independence Caveat* callout in
@sec-monotonicity; a hierarchical model that estimates
$\alpha(T)$ jointly would resolve it (see *Next Steps*).
Claims about "LLM decision-making" in general should be understood as hypotheses motivated by these specific findings, not as established generalizations.
## Reproducibility {#sec-reproducibility}
### Data Snapshot
All results in this report are loaded from a frozen data snapshot in the `data/` subdirectory, which is version-controlled. This protects the report from changes if the temperature study pipeline is re-run with different parameters.
The snapshot contains:
| File | Description |
|------|-------------|
| `alpha_draws_T*.npz` | Posterior draws of α (4,000 per condition) |
| `ppc_T*.json` | Posterior predictive check results |
| `diagnostics_T*.txt` | CmdStan diagnostic output |
| `stan_data_T*.json` | Stan-ready data (for refitting) |
| `fit_summary.json` | Summary statistics across conditions |
| `primary_analysis.json` | Pre-computed monotonicity and slope statistics |
| `run_summary.json` | Pipeline metadata and configuration |
| `grid_results.json` | Prior predictive grid search results |
| `study_config.yaml` | Frozen copy of the study configuration |
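For ad-hoc exploration without refitting, the snapshot can be loaded directly. The sketch below assumes each `alpha_draws_T*.npz` file stores its draws under the key `alpha`; adjust the key if the snapshot layout differs.

```{python}
#| label: load-snapshot-example
#| eval: false
#| echo: true
# Minimal sketch: load the frozen snapshot for exploration.
# The key name "alpha" is an assumption about the npz layout.
import json
import numpy as np
from pathlib import Path

data_dir = Path("data")

with open(data_dir / "fit_summary.json") as f:
    fit_summary = json.load(f)

for t in [0.0, 0.3, 0.7, 1.0, 1.5]:
    key = f"T{str(t).replace('.', '_')}"
    draws = np.load(data_dir / f"alpha_draws_{key}.npz")["alpha"]
    print(f"T={t}: alpha median = {np.median(draws):.1f} ({draws.size} draws)")
```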
### Refitting from Source
To reproduce the Stan fits from the included `stan_data_T*.json` files:
```{python}
#| label: refit-example
#| eval: false
#| echo: true
# Refit from the frozen Stan data (requires CmdStanPy; ~30 min per condition).
# This chunk is displayed but not executed (eval: false).
import cmdstanpy
import numpy as np

model = cmdstanpy.CmdStanModel(stan_file="models/m_01.stan")

for t in [0.0, 0.3, 0.7, 1.0, 1.5]:
    key = f"T{str(t).replace('.', '_')}"
    fit = model.sample(
        data=f"data/stan_data_{key}.json",
        chains=4,
        iter_warmup=1000,
        iter_sampling=1000,
        seed=42,
    )
    print(f"T={t}: alpha median = {np.median(fit.stan_variable('alpha')):.1f}")
```
## Next Steps
This initial study establishes that temperature affects estimated SEU sensitivity. Several extensions are planned:
1. **Position bias and consistency analysis.** The data support analyses described in DESIGN.md §6.2–6.4, examining whether position bias or choice inconsistency varies with temperature. These diagnostics may reveal whether temperature affects *how* the LLM chooses, beyond the summary captured by $\alpha$.
2. **Hierarchical temperature model.** As discussed in the
*Independence Caveat* in @sec-monotonicity and in the
corresponding limitation above, the present analysis fits m_01 at
each temperature independently, which does not account for the
correlations induced by the shared claim pool. A natural
extension models $\log \alpha(T) = \gamma_0 + \gamma_1 \cdot T$
with priors on $(\gamma_0, \gamma_1)$, directly estimating the
slope with calibrated uncertainty and borrowing strength across
conditions. The hierarchical model `h_m01` developed in the
foundational reports — and used as the inferential workhorse for
the planned alignment study — provides the template for this
extension; a schematic of this specification is sketched after this list.
3. **Replication with different LLMs.** The temperature parameter has model-specific effects. Repeating this study with different architectures would test the generality of the finding.
4. **Prior sensitivity analysis.** While the prior predictive analysis motivates $\text{Lognormal}(3.0, 0.75)$, a sensitivity analysis across nearby priors (e.g., $\text{Lognormal}(2.5, 1.0)$ and $\text{Lognormal}(3.5, 0.5)$) would characterize how much the conclusions depend on this specific choice. Given $M = 300$ observations per condition, we expect the posterior to be dominated by the likelihood, but this should be verified.
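As a schematic of the hierarchical extension in item 2, with condition index $j = 1, \dots, 5$, temperatures $T_j \in \{0.0, 0.3, 0.7, 1.0, 1.5\}$, and placeholder priors shown only for illustration (not the calibrated choices):

$$
\begin{aligned}
y_j \mid \alpha_j, \beta_j, \delta &\sim \text{m}_{01} \text{ likelihood}, \\
\log \alpha_j &= \gamma_0 + \gamma_1 \, T_j, \\
\gamma_0 &\sim \text{Normal}(\cdot, \cdot), \qquad \gamma_1 \sim \text{Normal}(\cdot, \cdot),
\end{aligned}
$$

with $\delta$ shared across the five conditions as in `h_m01`. The posterior for $\gamma_1$ then quantifies the temperature–sensitivity slope directly, with uncertainty that reflects joint estimation across conditions rather than five independent fits.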
## References
::: {#refs}
:::