---
title: "Temperature and SEU Sensitivity: Initial Results"
subtitle: "Application Report: Temperature Study 1"
description: |
  An initial investigation of how LLM sampling temperature affects estimated
  sensitivity (α) to subjective expected utility maximization, using a calibrated
  prior and the m_01 model variant.
categories: [applications, temperature, m_01]
execute:
  cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
reports_root = os.path.normpath(os.path.join(os.getcwd(), '..', '..'))
project_root = os.path.dirname(reports_root)
sys.path.insert(0, reports_root)
sys.path.insert(0, project_root)
import numpy as np
import json
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
# Use project plotting style
from report_utils import set_seu_style, SEU_COLORS, SEU_PALETTE
set_seu_style()
# Data directory (frozen snapshot — immune to future pipeline runs)
from pathlib import Path
data_dir = Path("data")
```
## Introduction
The foundational reports established the SEU sensitivity model's theoretical properties ([Report 1](../../foundations/01_abstract_formulation.qmd)), concrete implementation ([Report 2](../../foundations/02_concrete_implementation.qmd)), and computational validity ([Reports 3–6](../../foundations/03_prior_analysis.qmd)). This report presents the first empirical application: investigating how the **sampling temperature** of a large language model affects its estimated sensitivity to subjective expected utility maximization.
The motivation is straightforward. In the softmax choice model, the sensitivity parameter $\alpha$ governs how sharply choices track expected utility differences. Language models employ their own softmax temperature during token sampling, which controls the entropy of the next-token distribution. If the LLM's choice behavior is well-described by our model, then increasing the external sampling temperature is expected to be associated with *lower* estimated $\alpha$—the additional stochasticity in token generation should manifest as noisier, less EU-aligned choices.
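To make the mechanism concrete, here is a minimal sketch of how $\alpha$ sharpens the softmax choice distribution. The expected-utility values are illustrative placeholders, not study data:

```{python}
#| label: sketch-softmax-alpha
#| echo: true
import numpy as np

def choice_probs(eu, alpha):
    """Softmax choice probabilities over expected utilities at sensitivity alpha."""
    z = alpha * np.asarray(eu, dtype=float)
    z -= z.max()  # stabilize against overflow; probabilities are unchanged
    p = np.exp(z)
    return p / p.sum()

eu = [0.50, 0.45, 0.30]  # hypothetical EUs for three alternatives
for a in [1, 5, 20, 100]:
    print(f"alpha = {a:>3}: {np.round(choice_probs(eu, a), 3)}")
```

As $\alpha$ grows, probability mass concentrates on the highest-EU alternative; as $\alpha \to 0$, choices approach uniform randomness, the regime we expect high sampling temperature to push toward.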
::: {.callout-note}
## Theoretical Pathway
The hypothesized relationship between token-sampling temperature and choice-level sensitivity $\alpha$ rests on an assumption about how noise propagates through the LLM's processing. Higher temperature increases entropy in token selection, which could affect choice behavior through multiple channels: (a) adding noise that propagates through to the final choice response, reducing effective $\alpha$; (b) affecting the LLM's intermediate reasoning in ways that reduce EU-alignment; or (c) affecting both the assessment phase (which generates embeddings) and the choice phase. The present design does not distinguish among these mechanisms—it tests only the aggregate effect of temperature on the estimated $\alpha$ parameter.
:::
::: {.callout-note}
## What This Report Covers
This report presents initial results from a single data collection. The study design, prior calibration, model fitting, and primary analysis are documented here. Subsequent reports will address robustness checks and extensions.
:::
## Experimental Design {#sec-design}
### Task and Conditions
We use an insurance claims triage task in which GPT-4o selects, from a set of alternatives, which claim to forward for investigation. Each alternative (claim) can lead to one of $K = 3$ possible consequences, and there are $R$ distinct alternatives in the pool (see Design Parameters below). In each decision problem, the LLM is presented with a subset of these alternatives. The LLM first *assesses* each claim individually (producing text that is then embedded), and subsequently makes a *choice* among the alternatives in a given problem.
Five temperature levels define the experimental factor. Each level constitutes an independent data collection, analogous to a between-subjects design where each "subject" is a separate LLM run at a fixed temperature:
```{python}
#| label: tbl-conditions
#| tbl-cap: "Experimental conditions. Each temperature level constitutes a separate model fit."
import pandas as pd
conditions = pd.DataFrame({
'Level': [1, 2, 3, 4, 5],
'Temperature': [0.0, 0.3, 0.7, 1.0, 1.5],
'Description': [
'Deterministic (greedy decoding)',
'Low variance',
'Moderate variance (typical default)',
'High variance',
'Very high variance'
]
})
conditions
```
### Design Parameters
```{python}
#| label: design-params
#| echo: true
# Load frozen study configuration
import yaml
with open(data_dir / "study_config.yaml") as f:
config = yaml.safe_load(f)
# Load run summary for pipeline details
with open(data_dir / "run_summary.json") as f:
run_summary = json.load(f)
print(f"Study Design:")
print(f" Decision problems (M): {config['num_problems']} base × {config['num_presentations']} presentations = {config['num_problems'] * config['num_presentations']}")
print(f" Alternatives per problem: {config['min_alternatives']}–{config['max_alternatives']}")
print(f" Consequences (K): {config['K']}")
print(f" Embedding dimensions (D): {config['target_dim']}")
print(f" Distinct alternatives (R): {run_summary['phases']['phase3_data_prep']['per_temperature']['0.0']['R']}")
print(f" LLM model: {config['llm_model']}")
print(f" Embedding model: {config['embedding_model']}")
```
Each of the 100 base problems is presented $P = 3$ times with claims shuffled to different positions, yielding $M = 300$ choice problems per temperature condition. This **position counterbalancing** design addresses the systematic position bias discovered in an earlier pilot study, where unparseable responses were silently mapped to position 0. Here, each claim appears in multiple positions across presentations, and any unparseable response is recorded as NA rather than assigned a default.
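A minimal sketch of the counterbalancing scheme (claim identifiers are hypothetical; the actual assignment is generated by the study pipeline):

```{python}
#| label: sketch-counterbalancing
#| echo: true
import random

def counterbalanced_presentations(claim_ids, n_presentations=3, seed=0):
    """Independently shuffled orderings of one base problem's claims."""
    rng = random.Random(seed)
    presentations = []
    for _ in range(n_presentations):
        order = list(claim_ids)
        rng.shuffle(order)  # each claim can occupy any position
        presentations.append(order)
    return presentations

for p in counterbalanced_presentations(["claim_07", "claim_12", "claim_23"]):
    print(p)
```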
::: {.callout-note}
## Sample Size Rationale
The sample size of $M = 300$ per condition was chosen to provide adequate posterior precision for detecting large differences in $\alpha$ (e.g., between $T = 0.0$ and $T = 1.5$) while remaining computationally tractable. Parameter recovery simulations (@sec-parameter-recovery) confirm that $\alpha$ is well-identified at this scale (coverage $\approx 90\%$, low bias). The inability to distinguish $\alpha$ at $T = 0.3$ vs. $T = 0.7$ (see the pairwise comparisons in @sec-monotonicity) reflects either a true null effect or insufficient power for fine-grained adjacent comparisons—the present design was not powered for the latter.
:::
::: {.callout-note}
## Choice Prompt
The exact prompt text shown to the LLM when making choices is specified in the study configuration. The prompt presents the claim descriptions and asks the LLM to select which claim to forward for investigation. The full prompt template and related materials are available in the project codebase at `applications/temperature_study/prompts/`.
:::
### Feature Construction
Alternative features are constructed through a two-stage process. First, the LLM assesses each claim on its own merits at the relevant temperature, producing a natural-language evaluation. These assessments are embedded using `text-embedding-3-small`, yielding high-dimensional vectors. Second, all embeddings across temperature conditions are pooled and projected via PCA to $D = 32$ dimensions. Crucially, PCA is fit once on the pooled set and then applied to each condition separately, ensuring a shared coordinate system without privileging any single temperature.
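A sketch of the pooled-PCA step, assuming scikit-learn is available and using placeholder embedding matrices (one per temperature; 1536 is the native dimension of `text-embedding-3-small`). The essential point is that `fit` runs once on the pooled matrix while `transform` runs per condition:

```{python}
#| label: sketch-pooled-pca
#| echo: true
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder embeddings: {temperature: (n_assessments, embedding_dim)}
embeddings = {t: rng.normal(size=(30, 1536)) for t in [0.0, 0.3, 0.7, 1.0, 1.5]}

pca = PCA(n_components=32)
pca.fit(np.vstack(list(embeddings.values())))  # fit once on the pooled set
features = {t: pca.transform(e) for t, e in embeddings.items()}  # shared basis
print(features[0.0].shape)  # (30, 32)
```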
::: {.callout-note}
## PCA and Temperature
Because assessments vary with temperature, the shared PCA basis may partly reflect temperature-induced variation in the embedding space, not only variation across claims. This could create an aliasing between temperature effects on *assessment quality* and temperature effects on *choice sensitivity*. The present design does not disentangle these mechanisms.
:::
```{python}
#| label: pca-variance
#| echo: false
pca = run_summary['phases']['phase3_data_prep']['pca_summary']
cumvar = np.cumsum(pca['explained_variance_ratio'])
print(f"PCA Summary:")
print(f" Components retained: {pca['n_components']}")
print(f" Total variance explained: {pca['total_explained_variance']:.1%}")
print(f" First 5 components: {cumvar[4]:.1%}")
print(f" First 10 components: {cumvar[9]:.1%}")
```
### Data Quality
```{python}
#| label: data-quality
#| echo: false
na = run_summary['phases']['phase2b_choices']['na_summary']
print(f"NA Summary:")
print(f" Overall: {na['overall']['na']} / {na['overall']['total']} ({na['overall']['na_rate']:.1%})")
for key, val in na['per_temperature'].items():
print(f" {key}: {val['na']} / {val['total']} ({val['na_rate']:.1%})")
```
No observations were lost to parsing failures at any temperature level—a substantial improvement over the earlier pilot, which required ad hoc imputation.
## Model and Prior Calibration {#sec-model}
### The m_01 Model Variant
We fit the **m_01** model, which is structurally identical to the foundational m_0 model described in [Report 2](../../foundations/02_concrete_implementation.qmd). The only difference is the prior on the sensitivity parameter $\alpha$:
| | m_0 (foundational) | m_01 (this study) |
|---|---|---|
| $\alpha$ prior | $\text{Lognormal}(0, 1)$ | $\text{Lognormal}(3.0, 0.75)$ |
| Prior median | $\approx 1$ | $\approx 20$ |
| Prior 90% CI | $[0.19, 5.0]$ | $[5.5, 67]$ |
| All other priors | — | Identical to m_0 |
The foundational prior $\text{Lognormal}(0, 1)$ was chosen for generality: it spans a wide range of sensitivity levels appropriate for exploring the model's behavior in simulation. However, when applied to LLM choice data—where we expect substantial sensitivity to EU differences—this prior places most of its mass well below the region where the data concentrate. The prior predictive analysis below motivates a more informative alternative.
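The mismatch is easy to quantify from the lognormal CDF. A quick check using only scipy, taking $\alpha \approx 5$ as the rough lower edge of the plausible LLM range (per the m_01 interval above):

```{python}
#| label: sketch-prior-mass
#| echo: true
import numpy as np
from scipy.stats import lognorm

m0 = lognorm(s=1.0, scale=np.exp(0.0))     # Lognormal(0, 1)
m01 = lognorm(s=0.75, scale=np.exp(3.0))   # Lognormal(3.0, 0.75)
print(f"m_0:  median = {m0.median():5.1f},  P(alpha < 5) = {m0.cdf(5):.2f}")
print(f"m_01: median = {m01.median():5.1f},  P(alpha < 5) = {m01.cdf(5):.2f}")
```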
### Prior Predictive Grid Search
To select the m_01 prior, we conducted a grid search over 12 candidate lognormal hyperparameter pairs $(\mu, \sigma)$, evaluating each via prior predictive simulation against the actual study design.
For each candidate prior, we:
1. Drew $N = 200$ values of $\alpha$ from $\text{Lognormal}(\mu, \sigma)$
2. Simulated choice data from `m_0_sim.stan` using the actual study design ($M = 300$, $K = 3$, $D = 32$, $R = 30$)
3. Computed the **SEU-maximizer selection rate**: the fraction of problems in which the simulated agent chose the alternative with the highest expected utility
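A pure-Python sketch of this loop, with the Stan simulator replaced by a direct softmax draw over stand-in expected utilities (the real pipeline uses `m_0_sim.stan` and the actual design, so the rate printed here will not match the report's numbers):

```{python}
#| label: sketch-prior-predictive
#| echo: true
import numpy as np

def seu_max_rate(mu, sigma, n_prior_draws=200, n_problems=300, n_alts=3, seed=0):
    """Prior predictive SEU-maximizer rate for a Lognormal(mu, sigma) prior on alpha."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_prior_draws):
        alpha = rng.lognormal(mu, sigma)            # step 1: draw alpha from the prior
        eu = rng.normal(size=(n_problems, n_alts))  # stand-in EUs (not the real design)
        z = alpha * eu
        z -= z.max(axis=1, keepdims=True)           # stabilized softmax
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        choices = np.array([rng.choice(n_alts, p=pm) for pm in p])   # step 2: simulate
        rates.append(np.mean(choices == eu.argmax(axis=1)))          # step 3: SEU-max rate
    return float(np.mean(rates))

print(f"{seu_max_rate(3.0, 0.75):.2f}")
```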
```{python}
#| label: fig-grid-search
#| fig-cap: "Prior predictive grid search results. Each point represents a candidate lognormal prior for α, evaluated by the SEU-maximizer selection rate it implies. The selected prior lognormal(3.0, 0.75) balances informativeness with sufficient coverage of the plausible parameter range."
with open(data_dir / "grid_results.json") as f:
grid = json.load(f)
results = grid['results']
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Left: SEU rate by prior
labels = [r['prior_label'] for r in results]
means = [r['seu_rate_mean'] for r in results]
q05s = [r['seu_rate_q05'] for r in results]
q95s = [r['seu_rate_q95'] for r in results]
y_pos = np.arange(len(labels))
errors = np.array([[m - q05, q95 - m] for m, q05, q95 in zip(means, q05s, q95s)]).T
colors = [SEU_COLORS['accent'] if 'lognormal(3.0, 0.75)' in l else SEU_COLORS['primary']
for l in labels]
axes[0].barh(y_pos, means, xerr=errors, color=colors, alpha=0.8,
edgecolor='white', capsize=3)
axes[0].set_yticks(y_pos)
axes[0].set_yticklabels(labels, fontsize=9)
axes[0].set_xlabel('SEU-Maximizer Selection Rate')
axes[0].set_title('Prior-Implied SEU-Max Rate')
axes[0].set_xlim(0, 1)
# Right: prior density comparison
selected = [r for r in results if 'lognormal(3.0, 0.75)' in r['prior_label']][0]
foundational = [r for r in results if 'lognormal(0.0, 1.0)' in r['prior_label']][0]
from scipy.stats import lognorm
x = np.linspace(0.1, 150, 500)
# m_0 prior: lognormal(0, 1) → scipy shape=1, scale=exp(0)=1
axes[1].plot(x, lognorm.pdf(x, s=1.0, scale=np.exp(0)),
color=SEU_COLORS['secondary'], linewidth=2, label='m_0: Lognormal(0, 1)')
# m_01 prior: lognormal(3.0, 0.75)
axes[1].plot(x, lognorm.pdf(x, s=0.75, scale=np.exp(3.0)),
color=SEU_COLORS['accent'], linewidth=2, label='m_01: Lognormal(3.0, 0.75)')
axes[1].set_xlabel('α')
axes[1].set_ylabel('Density')
axes[1].set_title('Prior Comparison')
axes[1].legend(fontsize=9)
axes[1].set_xlim(0, 150)
plt.tight_layout()
plt.show()
```
The selected prior $\text{Lognormal}(3.0, 0.75)$ has the following properties:
- **Median $\approx 20$**, placing prior mass in the region where we expect LLM sensitivity to lie
- **90% CI $\approx [5.5, 67]$**, wide enough to accommodate substantial uncertainty
- **Prior-implied SEU-max rate $\approx 78\%$**, consistent with the expectation that LLMs make reasonably EU-aligned choices
- **No numerical issues**: the prior avoids the extreme $\alpha$ values that caused softmax overflow with wider priors
::: {.callout-note}
## Why Not the Foundational Prior?
The m_0 prior $\text{Lognormal}(0, 1)$ was designed for generality in simulation studies. Its median of $\approx 1$ means most prior mass sits in a regime where choices are nearly random—appropriate for exploring model behavior, but a poor match for LLM data where we observe strong EU-alignment. With the actual study design ($M = 300$, $D = 32$), the foundational prior also places non-negligible mass on very large $\alpha$ values that cause numerical overflow in the softmax computation. The m_01 prior resolves both issues.
:::
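The overflow failure mode is easy to reproduce. With a large $\alpha$ draw, the naive softmax computes `exp` of values beyond float64 range, while the standard max-subtraction (log-sum-exp) form returns the same probabilities without overflow. The $\alpha \cdot \text{EU}$ values below are hypothetical:

```{python}
#| label: sketch-softmax-overflow
#| echo: true
import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))  # same probabilities, bounded exponents
    return e / e.sum()

z = 400.0 * np.array([2.0, 1.5, 0.2])  # a large alpha draw times EUs
print(softmax_naive(z))   # exp(800) overflows to inf; division yields nan
print(softmax_stable(z))  # ≈ [1., 0., 0.]
```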
## Model Validation {#sec-validation}
Before fitting the model to empirical data, we validate that m_01's parameters are identifiable and its posterior is well-calibrated under the actual study design ($M = 300$, $K = 3$, $D = 32$, $R = 30$). The foundational reports ([Report 4](../../foundations/04_parameter_recovery.qmd), [Report 6](../../foundations/06_sbc_validation.qmd)) established these properties for m_0 with a smaller design ($M = 25$, $D = 5$); here we confirm they hold at the scale and prior specification used in this study.
```{python}
#| label: validation-setup
#| output: false
from utils.study_design import StudyDesign
# Load the actual study design from stan_data (any temperature — the design is identical)
with open(data_dir / "stan_data_T0_0.json") as f:
stan_data = json.load(f)
# Construct a StudyDesign from the actual study data
study = StudyDesign(
M=stan_data["M"],
K=stan_data["K"],
D=stan_data["D"],
R=stan_data["R"]
)
study.w = [np.array(w_vec) for w_vec in stan_data["w"]]
study.I = np.array(stan_data["I"])
print(f"Validation Study Design (from frozen data):")
print(f" Decision problems (M): {study.M}")
print(f" Consequences (K): {study.K}")
print(f" Feature dimensions (D): {study.D}")
print(f" Distinct alternatives (R): {study.R}")
```
### Parameter Recovery {#sec-parameter-recovery}
Parameter recovery asks: **can we recover known parameter values from simulated data?** We draw parameters from the m_01 prior, simulate choice data via `m_01_sim.stan`, fit `m_01.stan`, and compare posterior estimates to the true values across 20 iterations.
```{python}
#| label: run-param-recovery
#| output: false
from analysis.parameter_recovery import ParameterRecovery
import tempfile
recovery_dir = tempfile.mkdtemp(prefix="param_recovery_m01_")
recovery = ParameterRecovery(
inference_model_path=os.path.join(project_root, "models", "m_01.stan"),
sim_model_path=os.path.join(project_root, "models", "m_01_sim.stan"),
study_design=study,
output_dir=recovery_dir,
n_mcmc_samples=1000,
n_mcmc_chains=4,
n_iterations=20,
sim_hyperparams={'alpha_mean': 3.0, 'alpha_sd': 0.75}
)
true_params_list, posterior_summaries = recovery.run()
```
```{python}
#| label: recovery-summary
#| echo: false
n_successful = len(true_params_list)
print(f"Parameter Recovery Complete:")
print(f" Successful iterations: {n_successful}")
print(f" Parameters per iteration:")
print(f" α (sensitivity): 1")
print(f" β (feature weights): {study.K} × {study.D} = {study.K * study.D}")
print(f" δ (utility increments): {study.K - 1}")
```
#### MCMC Diagnostics
```{python}
#| label: recovery-mcmc-diagnostics
#| echo: false
rhat_max_list = []
ess_min_list = []
ess_col = None
sample_cols = posterior_summaries[0].columns
for candidate in ['ESS_bulk', 'N_Eff', 'ESS']:
if candidate in sample_cols:
ess_col = candidate
break
for s in posterior_summaries:
param_rows = s[s.index.str.match(r'^(alpha|beta\[|delta\[|upsilon\[)')]
if 'R_hat' in param_rows.columns:
rhat_max_list.append(param_rows['R_hat'].max())
if ess_col is not None and ess_col in param_rows.columns:
ess_min_list.append(param_rows[ess_col].min())
rhat_max = np.array(rhat_max_list)
ess_min = np.array(ess_min_list)
print("MCMC Diagnostic Summary Across Recovery Iterations:")
print(f" Max R̂ per iteration:")
print(f" Median: {np.median(rhat_max):.4f}")
print(f" Maximum: {rhat_max.max():.4f}")
print(f" Iterations with max R̂ > 1.01: {(rhat_max > 1.01).sum()}/{len(rhat_max)}")
if len(ess_min) > 0:
print(f" Min ESS ({ess_col}) per iteration:")
print(f" Median: {np.median(ess_min):.0f}")
print(f" Minimum: {ess_min.min():.0f}")
```
#### α Recovery
```{python}
#| label: fig-alpha-recovery
#| fig-cap: "Recovery of the sensitivity parameter α under the m_01 prior with the actual study design. Left: true vs. estimated values with identity line. Right: 90% credible intervals for each iteration, colored by whether they contain the true value."
alpha_true = np.array([p['alpha'] for p in true_params_list])
alpha_mean = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries])
alpha_lower = np.array([s.loc['alpha', '5%'] for s in posterior_summaries])
alpha_upper = np.array([s.loc['alpha', '95%'] for s in posterior_summaries])
alpha_bias = np.mean(alpha_mean - alpha_true)
alpha_rmse = np.sqrt(np.mean((alpha_mean - alpha_true)**2))
alpha_coverage = np.mean((alpha_true >= alpha_lower) & (alpha_true <= alpha_upper))
alpha_ci_width = np.mean(alpha_upper - alpha_lower)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# True vs Estimated
ax = axes[0]
ax.scatter(alpha_true, alpha_mean, alpha=0.7, s=60, c=SEU_COLORS['primary'], edgecolor='white')
lims = [min(alpha_true.min(), alpha_mean.min()) * 0.9,
max(alpha_true.max(), alpha_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('Estimated α (posterior mean)', fontsize=12)
ax.set_title(f'α Recovery: Bias={alpha_bias:.2f}, RMSE={alpha_rmse:.2f}', fontsize=12)
ax.legend()
ax.set_aspect('equal')
# Coverage plot
ax = axes[1]
for i in range(len(alpha_true)):
covered = (alpha_true[i] >= alpha_lower[i]) & (alpha_true[i] <= alpha_upper[i])
color = 'forestgreen' if covered else 'crimson'
ax.plot([i, i], [alpha_lower[i], alpha_upper[i]], color=color, linewidth=2, alpha=0.7)
ax.scatter(i, alpha_mean[i], color=color, s=40, zorder=3)
ax.scatter(np.arange(len(alpha_true)), alpha_true, color='black', s=60, marker='x',
label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('α', fontsize=12)
ax.set_title(f'α: 90% Credible Intervals (Coverage = {alpha_coverage:.0%})', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
#### Recovery Metrics Summary
```{python}
#| label: tbl-recovery-metrics
#| tbl-cap: "Parameter recovery metrics for m_01 with the actual study design (M=300, K=3, D=32, R=30). The primary parameter of interest is α."
import pandas as pd
K, D = study.K, study.D
# Beta recovery
all_beta_bias = []
all_beta_rmse = []
all_beta_coverage = []
all_beta_ci_width = []
for k in range(K):
for d in range(D):
param_name = f"beta[{k+1},{d+1}]"
bt = np.array([p['beta'][k][d] for p in true_params_list])
bm = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
bl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
bu = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
all_beta_bias.append(np.mean(bm - bt))
all_beta_rmse.append(np.sqrt(np.mean((bm - bt)**2)))
all_beta_coverage.append(np.mean((bt >= bl) & (bt <= bu)))
all_beta_ci_width.append(np.mean(bu - bl))
# Delta recovery
all_delta_bias = []
all_delta_rmse = []
all_delta_coverage = []
all_delta_ci_width = []
for k in range(K - 1):
param_name = f"delta[{k+1}]"
dt = np.array([p['delta'][k] for p in true_params_list])
dm = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
dl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
du = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
all_delta_bias.append(np.mean(dm - dt))
all_delta_rmse.append(np.sqrt(np.mean((dm - dt)**2)))
all_delta_coverage.append(np.mean((dt >= dl) & (dt <= du)))
all_delta_ci_width.append(np.mean(du - dl))
# Scale references for relative metrics. α lives on a wide multiplicative
# scale (Lognormal(3.0, 0.75) prior → true values typically ~5–100), so
# absolute Bias and RMSE must be interpreted relative to the magnitude
# of α; the same caveat is less relevant for β (zero-centred) and δ
# (modest range), so we report relative metrics for α only.
alpha_scale = float(np.mean(np.abs(alpha_true)))
alpha_rel_bias = alpha_bias / alpha_scale
alpha_rel_rmse = alpha_rmse / alpha_scale
metrics = pd.DataFrame([
{'Parameter': 'α', 'Bias': f'{alpha_bias:.2f}', 'RMSE': f'{alpha_rmse:.2f}',
'Rel. Bias': f'{alpha_rel_bias:+.1%}', 'Rel. RMSE': f'{alpha_rel_rmse:.1%}',
'Coverage (90%)': f'{alpha_coverage:.0%}', 'Mean CI Width': f'{alpha_ci_width:.2f}'},
{'Parameter': f'β (mean over {K*D})', 'Bias': f'{np.mean(all_beta_bias):.3f}',
'RMSE': f'{np.mean(all_beta_rmse):.3f}',
'Rel. Bias': '—', 'Rel. RMSE': '—',
'Coverage (90%)': f'{np.mean(all_beta_coverage):.0%}',
'Mean CI Width': f'{np.mean(all_beta_ci_width):.2f}'},
{'Parameter': f'δ (mean over {K-1})', 'Bias': f'{np.mean(all_delta_bias):.3f}',
'RMSE': f'{np.mean(all_delta_rmse):.3f}',
'Rel. Bias': '—', 'Rel. RMSE': '—',
'Coverage (90%)': f'{np.mean(all_delta_coverage):.0%}',
'Mean CI Width': f'{np.mean(all_delta_ci_width):.2f}'},
])
metrics
```
::: {.callout-note}
## Interpreting Recovery Results
The primary parameter of interest is $\alpha$ (sensitivity). The recovery
table reports both **absolute** and **relative** metrics for $\alpha$,
and the relative columns are the ones that should anchor interpretation.
Because the prior places true $\alpha$ on a wide multiplicative scale
(Lognormal$(3.0, 0.75)$ implies typical values in roughly the 5–100
range), absolute values of Bias and RMSE that may *look* large in the
table can simultaneously be small relative to the magnitude of $\alpha$.
This is why the scatter in the *α Recovery* figure can sit tightly on
the identity line even when raw RMSE has a value of several units —
the visual impression is on a relative scale, while raw RMSE is not.
Operationally, we treat the model as fit for purpose when:
- **Relative bias** is within roughly $\pm 10\%$ of the mean true $\alpha$;
- **Relative RMSE** is comfortably below 25% of the mean true $\alpha$;
- **90% credible-interval coverage** is approximately 85–95%.
The β–δ coupling issue documented in the foundational reports
([Report 4](../../foundations/04_parameter_recovery.qmd)) is expected to
persist — δ recovery benefits from additional identifying information
(e.g., risky choices) that is not present in m_01. Since this study
focuses on $\alpha$ estimation, the weaker recovery of $(\beta,\delta)$
does not compromise the primary analysis; we revisit the interpretive
implications of this in the *Construct Validity* discussion in
@sec-discussion.
:::
### Simulation-Based Calibration {#sec-sbc}
SBC complements parameter recovery by asking whether the posterior **correctly represents uncertainty**. Rather than evaluating point estimates, SBC checks that the true parameter value occupies a *uniform* rank position within the posterior samples across many simulations. Non-uniformity indicates that the posterior is miscalibrated—too wide, too narrow, or biased.
::: {.callout-tip collapse="false"}
## A Reader's Guide to SBC Plots and Tests
The basic logic of SBC [@talts2018] is simple. For each of $N$
simulations we (i) sample a "true" parameter value $\theta^\star$ from
the prior, (ii) simulate a dataset using $\theta^\star$, (iii) fit the
model and obtain $L$ posterior draws, and (iv) compute the **rank** of
$\theta^\star$ among those $L$ draws. If the model is correctly
specified and the sampler is calibrated, that rank should be uniformly
distributed on $\{0, 1, \ldots, L\}$ across simulations.
What to look for in the diagnostics:
* **Rank histogram (uniform = good).** Each bin should contain roughly
$N / B$ counts (here $\approx 10$ with $N=200$, $B=20$). The shape of
any departure is itself diagnostic:
* a **flat / uniform** histogram is evidence of a well-calibrated
posterior;
* a **U-shape** (excess mass at both ends) means the posterior is
too *narrow* — credible intervals are over-confident;
* an **inverted-U / dome** (excess mass in the middle) means the
posterior is too *wide* — credible intervals are under-confident;
* a **monotone ramp** (mass shifting toward 0 or toward $L$) means
the posterior is *biased* upward or downward.
* **ECDF with KS band.** The empirical CDF of the (normalised) ranks
should track the diagonal. The shaded band is a 95% Kolmogorov–Smirnov
envelope: if the ECDF stays inside the band for all $x$, we cannot
reject uniformity at the 5% level. Excursions reveal the same
signatures listed above (S-curves above the diagonal indicate
bias; ECDFs that bow away from the diagonal indicate over- or
under-confidence).
* **Chi-square test.** Compares the histogram counts to the uniform
expectation. With $N=200$ and $B=20$ the test has modest power
(expected count $\approx 10$ per bin), so a non-significant
$p$-value should *not* be over-interpreted as proof of calibration —
it is one piece of evidence alongside visual inspection.
* **KS test.** A continuous-rank analogue that is generally more
powerful than the binned chi-square, particularly for smooth
departures from uniformity (bias, spread). Together the two tests
cover both global shape and bin-level fluctuations.
The bottom line: a well-calibrated parameter shows a roughly flat
histogram, an ECDF inside the KS band, and non-significant
$p$-values; if any one of these fails, the *shape* of the failure
points to whether the posterior is biased, too tight, or too diffuse.
:::
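The four-step recipe can be demonstrated end-to-end on a conjugate toy model where the posterior is exact, so any non-uniformity would indicate a bug in the rank logic rather than in a sampler. A normal-mean model with known variance stands in for the Stan fit:

```{python}
#| label: sketch-sbc-toy
#| echo: true
import numpy as np

rng = np.random.default_rng(1)
N_sims, L, n_obs = 200, 200, 20

toy_ranks = []
for _ in range(N_sims):
    theta_star = rng.normal(0.0, 1.0)            # (i) draw theta* from the N(0,1) prior
    y = rng.normal(theta_star, 1.0, size=n_obs)  # (ii) simulate a dataset
    post_var = 1.0 / (1.0 + n_obs)               # (iii) exact conjugate posterior
    post_mean = post_var * y.sum()
    draws = rng.normal(post_mean, np.sqrt(post_var), size=L)
    toy_ranks.append(int(np.sum(draws < theta_star)))  # (iv) rank of theta* among L draws

counts, _ = np.histogram(toy_ranks, bins=20, range=(0, L + 1))
print(counts)  # each bin should be near N_sims / 20 = 10
```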
```{python}
#| label: sbc-config
#| echo: true
# SBC configuration.
# With M=300 and D=32 each fit is expensive, so we use 200 simulations
# (vs 999 in the foundational reports). With 20 bins this gives ~10
# expected counts per bin — sufficient for meaningful rank histograms
# and KS tests, though chi-square power is modest.
n_sbc_sims = 200
n_sbc_chains = 1
sbc_thin = 4
n_sbc_samples = sbc_thin * n_sbc_sims # 800 total draws → 200 post-thinned
sbc_max_rank = n_sbc_samples // sbc_thin # 200 (= L)
sbc_n_bins = 20
sbc_expected_per_bin = n_sbc_sims / sbc_n_bins
print("SBC Configuration:")
print(f" Simulations (N): {n_sbc_sims}")
print(f" MCMC samples per sim: {n_sbc_samples}")
print(f" Chains: {n_sbc_chains}")
print(f" Thinning factor: {sbc_thin}")
print(f" Post-thinned draws (L): {sbc_max_rank}")
print(f" Histogram bins: {sbc_n_bins}")
print(f" Expected counts/bin: {sbc_expected_per_bin:.1f}")
```
```{python}
#| label: run-sbc
#| output: false
from analysis.sbc import SimulationBasedCalibration
sbc_dir = tempfile.mkdtemp(prefix="sbc_m01_")
sbc = SimulationBasedCalibration(
sbc_model_path=os.path.join(project_root, "models", "m_01_sbc.stan"),
study_design=study,
output_dir=sbc_dir,
n_sbc_sims=n_sbc_sims,
n_mcmc_samples=n_sbc_samples,
n_mcmc_chains=n_sbc_chains,
thin=sbc_thin
)
ranks, true_params_sbc = sbc.run()
```
```{python}
#| label: sbc-summary
#| echo: false
print(f"SBC Complete:")
print(f" Simulations: {ranks.shape[0]}")
print(f" Parameters: {ranks.shape[1]}")
```
#### MCMC Diagnostics
```{python}
#| label: sbc-mcmc-diagnostics
#| echo: false
def load_sbc_diagnostics(output_dir, n_inspect=10):
"""Load ESS from saved SBC posterior summaries."""
ess_per_sim = []
for i in range(1, n_inspect + 1):
csv_path = os.path.join(output_dir, "simulations", f"sim_{i}", "posterior_summary.csv")
if os.path.exists(csv_path):
df = pd.read_csv(csv_path, index_col=0)
ess_col = None
for col in ['N_Eff', 'ESS_bulk', 'ess_bulk']:
if col in df.columns:
ess_col = col
break
if ess_col is not None:
param_rows = df[~df.index.str.startswith(('lp', 'energy', 'pars_', 'ranks_'))]
ess_vals = param_rows[ess_col].dropna().values
if len(ess_vals) > 0:
ess_per_sim.append({
'sim': i,
'min_ess': np.min(ess_vals),
'median_ess': np.median(ess_vals),
})
return ess_per_sim
sbc_ess = load_sbc_diagnostics(sbc_dir)
print("SBC MCMC Diagnostics (from first 10 iterations):")
if sbc_ess:
min_ess = min(d['min_ess'] for d in sbc_ess)
median_ess = np.median([d['median_ess'] for d in sbc_ess])
print(f" Minimum ESS across all params/sims: {min_ess:.0f}")
print(f" Median ESS (across sims): {median_ess:.0f}")
print(f" Post-thinning draws: {n_sbc_samples // sbc_thin}")
print(f" ESS/draw ratio: {median_ess / n_sbc_samples:.2f}")
else:
print(" Diagnostic summaries not available.")
```
#### α Rank Distribution
```{python}
#| label: fig-sbc-alpha
#| fig-cap: "SBC rank histogram and ECDF for α under the m_01 prior with the actual study design. Left: rank histogram (uniform = well-calibrated). Right: ECDF with 95% KS confidence band."
from scipy import stats
fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))
expected_count = n_sbc_sims / sbc_n_bins
# Rank histogram
ax = axes[0]
ax.hist(ranks[:, 0], bins=sbc_n_bins, alpha=0.7, color=SEU_COLORS['primary'], edgecolor='white')
ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2,
label=f'Uniform (E={expected_count:.1f})')
ax.set_xlabel('Rank', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('α Rank Histogram', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
# ECDF
ax = axes[1]
ranks_norm = np.sort(ranks[:, 0]) / sbc_max_rank
ecdf = np.arange(1, len(ranks_norm) + 1) / len(ranks_norm)
ax.step(ranks_norm, ecdf, where='post', color=SEU_COLORS['primary'], linewidth=2, label='m_01')
ax.plot([0, 1], [0, 1], 'r--', linewidth=2, label='Uniform')
n_ecdf = len(ecdf)
epsilon = np.sqrt(np.log(2/0.05) / (2 * n_ecdf))
ax.fill_between([0, 1], [0-epsilon, 1-epsilon], [0+epsilon, 1+epsilon],
alpha=0.15, color='red', label=f'95% KS band (ε={epsilon:.3f})')
ax.set_xlabel('Normalized Rank', fontsize=11)
ax.set_ylabel('ECDF', fontsize=11)
ax.set_title('α ECDF', fontsize=12)
ax.legend(fontsize=9)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Formal tests
chi2_alpha, p_alpha = stats.chisquare(np.histogram(ranks[:, 0], bins=sbc_n_bins)[0])
ks_alpha, ks_p_alpha = stats.kstest(ranks[:, 0] / sbc_max_rank, 'uniform')
print(f"\nα Calibration Tests:")
print(f" Chi-square: χ² = {chi2_alpha:.2f}, p = {p_alpha:.3f}")
print(f" KS test: D = {ks_alpha:.3f}, p = {ks_p_alpha:.3f}")
```
#### δ Rank Distributions
```{python}
#| label: fig-sbc-delta
#| fig-cap: "SBC rank histograms for δ parameters. Departures from uniformity indicate calibration issues, typically reflecting the β–δ identification problem."
K_val, D_val = study.K, study.D
delta_start = 1 + K_val * D_val
K_minus_1 = K_val - 1
fig, axes = plt.subplots(1, K_minus_1, figsize=(6 * K_minus_1, 4.5))
if K_minus_1 == 1:
axes = [axes]
for k in range(K_minus_1):
param_idx = delta_start + k
ax = axes[k]
ax.hist(ranks[:, param_idx], bins=sbc_n_bins, alpha=0.7,
color=SEU_COLORS['secondary'], edgecolor='white')
ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2,
label=f'Uniform (E={expected_count:.1f})')
ax.set_xlabel('Rank', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title(f'δ$_{k+1}$ Rank Histogram', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Formal tests for delta
print("δ Calibration Tests:")
for k in range(K_minus_1):
param_idx = delta_start + k
chi2_d, p_d = stats.chisquare(np.histogram(ranks[:, param_idx], bins=sbc_n_bins)[0])
ks_d, ks_p_d = stats.kstest(ranks[:, param_idx] / sbc_max_rank, 'uniform')
print(f" δ_{k+1}: χ² = {chi2_d:.2f}, p = {p_d:.3f} | KS D = {ks_d:.3f}, p = {ks_p_d:.3f}")
```
#### β Calibration Summary
```{python}
#| label: fig-sbc-beta
#| fig-cap: "Chi-square p-values for β parameter calibration. Parameters above the dashed line (p > 0.05) are consistent with uniform rank distributions."
beta_start = 1
n_beta = K_val * D_val
beta_pvals = []
for idx in range(beta_start, beta_start + n_beta):
counts = np.histogram(ranks[:, idx], bins=sbc_n_bins)[0]
_, p_val = stats.chisquare(counts)
beta_pvals.append(p_val)
fig, ax = plt.subplots(figsize=(10, 4.5))
x = np.arange(n_beta)
colors = [SEU_COLORS['primary'] if p > 0.05 else SEU_COLORS['accent'] for p in beta_pvals]
ax.bar(x, beta_pvals, color=colors, alpha=0.7, edgecolor='white')
ax.axhline(y=0.05, color='red', linestyle='--', linewidth=2, label='α = 0.05')
ax.set_xlabel('β Parameter Index', fontsize=11)
ax.set_ylabel('Chi-square p-value', fontsize=11)
ax.set_title('β Parameter Calibration', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
n_well_cal = sum(1 for p in beta_pvals if p > 0.05)
print(f"\nβ Calibration: {n_well_cal}/{n_beta} parameters well-calibrated (p > 0.05)")
print(f"Expected false rejections at α=0.05: {n_beta * 0.05:.1f}")
```
::: {.callout-note}
## SBC Interpretation
With 200 SBC simulations the formal chi-square tests have modest power (~10 expected counts per bin), so visual inspection of rank histograms is as important as the p-values. Specifically, with 200 simulations and 20 bins, the chi-square test has approximately 50% power to detect a 40% deviation from uniformity in any single bin—substantial miscalibration could go undetected. The key question is whether α shows approximately uniform ranks—confirming that the m_01 posterior correctly represents uncertainty about sensitivity at this study scale. The β–δ identification pattern from the foundational reports ([Report 6](../../foundations/06_sbc_validation.qmd)) is expected to manifest here as well, since m_01 (like m_0) uses only uncertain choices. This does not affect the validity of α estimation for the primary analysis.
:::
## Results {#sec-results}
### Loading Posterior Draws
```{python}
#| label: load-posteriors
#| output: false
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]
temp_labels = {t: f"T={t}" for t in temperatures}
# Load alpha draws for each temperature
alpha_draws = {}
for t in temperatures:
key = f"T{str(t).replace('.', '_')}"
data = np.load(data_dir / f"alpha_draws_{key}.npz")
alpha_draws[t] = data['alpha']
# Load pre-computed analysis results
with open(data_dir / "primary_analysis.json") as f:
analysis = json.load(f)
# Load fit summary
with open(data_dir / "fit_summary.json") as f:
fit_summary = json.load(f)
```
```{python}
#| echo: false
# Verify draws loaded correctly
for t in temperatures:
n = len(alpha_draws[t])
print(f" T={t}: {n:,} posterior draws loaded")
```
### MCMC Diagnostics
```{python}
#| label: tbl-diagnostics
#| tbl-cap: "MCMC diagnostics for all five temperature conditions. All fits used 4 chains with 1,000 warmup and 1,000 sampling iterations each (4,000 post-warmup draws total)."
diag_rows = []
for t in temperatures:
key = f"T{str(t).replace('.', '_')}"
with open(data_dir / f"diagnostics_{key}.txt") as f:
diag_text = f.read()
# Parse divergences
if "No divergent transitions" in diag_text:
n_div = 0
else:
import re
match = re.search(r'(\d+) of (\d+)', diag_text)
n_div = int(match.group(1)) if match else 0
rhat_ok = "R-hat values satisfactory" in diag_text
ess_ok = "effective sample size satisfactory" in diag_text
ebfmi_ok = "E-BFMI satisfactory" in diag_text
diag_rows.append({
'Temperature': t,
'Divergences': f"{n_div}/4000",
'R̂': '✓' if rhat_ok else '✗',
'ESS': '✓' if ess_ok else '✗',
'E-BFMI': '✓' if ebfmi_ok else '✗',
})
pd.DataFrame(diag_rows)
```
All conditions show clean diagnostics. The 1–2 divergent transitions at $T = 1.0$ and $T = 1.5$ (at most 0.05% of transitions) are well within acceptable bounds and do not indicate systematic sampling difficulties.
### Posterior Summaries
```{python}
#| label: tbl-posteriors
#| tbl-cap: "Posterior summaries for the sensitivity parameter α at each temperature level. Intervals are 90% credible intervals."
summary = analysis['summary_table']
rows = []
for s in summary:
rows.append({
'Temperature': s['temperature'],
'Median': f"{s['median']:.1f}",
'Mean': f"{s['mean']:.1f}",
'SD': f"{s['sd']:.1f}",
'90% CI': f"[{s['ci_low']:.1f}, {s['ci_high']:.1f}]",
})
pd.DataFrame(rows)
```
The estimates reveal a clear pattern: $\alpha$ is highest at $T = 0.0$ (greedy decoding) and lowest at $T = 1.5$ (high temperature), with intermediate conditions falling between these extremes. Note, however, the near-equality of the $T = 0.3$ and $T = 0.7$ estimates—a point we return to below.
### Forest Plot
```{python}
#| label: fig-forest
#| fig-cap: "Forest plot of posterior α distributions across temperature conditions. Points show posterior medians; thick bars span the 50% credible interval; thin bars span the 90% credible interval. Higher temperature is associated with lower estimated sensitivity."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(8, 5))
y_positions = np.arange(len(temperatures))[::-1]
for i, t in enumerate(temperatures):
draws = alpha_draws[t]
median = np.median(draws)
q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])
y = y_positions[i]
# 90% CI (thin line)
ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
# 50% CI (thick line)
ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
# Median (point)
ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
markeredgecolor='white', markeredgewidth=1.5, zorder=5)
ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('Sensitivity (α)')
ax.set_title('Posterior Distributions of α by Temperature')
ax.grid(axis='x', alpha=0.3)
ax.grid(axis='y', alpha=0)
plt.tight_layout()
plt.show()
```
### Posterior Densities
```{python}
#| label: fig-density
#| fig-cap: "Kernel density estimates of the posterior α distributions. The separation between T = 0.0 and T ≥ 1.0 is clear, while T = 0.3 and T = 0.7 overlap substantially."
#| fig-height: 5
from scipy.stats import gaussian_kde
fig, ax = plt.subplots(figsize=(8, 5))
for i, t in enumerate(temperatures):
draws = alpha_draws[t]
kde = gaussian_kde(draws)
x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
label=f'T = {t} (median = {np.median(draws):.0f})')
ax.set_xlabel('Sensitivity (α)')
ax.set_ylabel('Density')
ax.set_title('Posterior Density of α')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()
```
### Posterior Predictive Checks
We assess model adequacy via posterior predictive checks (PPCs). For each posterior draw of the parameters, we simulate a replicated dataset of the same size and structure as the observed data, then compare the observed and replicated datasets on three complementary test statistics:
- **Log-likelihood (ll).** The sum of log-probabilities assigned by the model to the observed choices, $T_{\text{ll}} = \sum_{m=1}^{M} \log p(y_m \mid \theta)$. This is a global goodness-of-fit measure: if the model systematically assigns lower probability to real choices than to simulated ones, the observed $T_{\text{ll}}$ will fall in the tail of the replicated distribution.
- **Modal choice frequency (modal).** For each decision problem, identify the alternative chosen most often across replications; the test statistic is the fraction of problems where the mode matches the observed choice. This targets whether the model correctly identifies the *most likely* choice in each problem, independent of the exact probabilities.
- **Mean choice probability (prob).** The average predicted probability of the actually chosen alternative, $T_{\text{prob}} = M^{-1} \sum_{m=1}^{M} p(y_m \mid \theta)$. This measures whether the model's confidence in the observed choices is well-calibrated on average.
The posterior predictive p-value for each statistic is the proportion of replicated datasets in which the test statistic equals or exceeds the observed value. Values near 0.5 indicate good calibration; values near 0 or 1 signal systematic misfit.
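In code, each p-value is a single comparison between the observed statistic and its replicated distribution. A sketch with toy numbers (the real values come from the frozen `ppc_*.json` artifacts loaded below):

```{python}
#| label: sketch-ppc-pvalue
#| echo: true
import numpy as np

def ppc_pvalue(t_rep, t_obs):
    """Fraction of replicated datasets whose statistic >= the observed value."""
    return float(np.mean(np.asarray(t_rep) >= t_obs))

rng = np.random.default_rng(0)
t_rep = rng.normal(loc=-210.0, scale=8.0, size=4000)  # toy replicated log-likelihoods
print(ppc_pvalue(t_rep, t_obs=-212.0))  # in the bulk of the distribution: no misfit signal
```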
```{python}
#| label: tbl-ppc
#| tbl-cap: "Posterior predictive check p-values for each temperature condition. Values near 0.5 indicate good calibration; values near 0 or 1 indicate model misfit. Three test statistics are used: log-likelihood (ll), modal choice frequency (modal), and mean choice probability (prob)."
ppc_rows = []
for t in temperatures:
key = f"T{str(t).replace('.', '_')}"
with open(data_dir / f"ppc_{key}.json") as f:
ppc = json.load(f)
pvals = ppc['p_values']
ppc_rows.append({
'Temperature': t,
'Log-likelihood': f"{pvals['ll']:.3f}",
'Modal frequency': f"{pvals['modal']:.3f}",
'Mean probability': f"{pvals['prob']:.3f}",
})
pd.DataFrame(ppc_rows)
```
All posterior predictive p-values fall comfortably within $[0.3, 0.65]$, indicating that the model provides an adequate description of the choice data at every temperature level. There is no evidence of systematic misfit.
## Monotonicity Analysis {#sec-monotonicity}
The central question of this study is whether estimated sensitivity $\alpha$ decreases monotonically with temperature. We assess this at three levels of granularity.
### Global Slope
To summarize the overall direction and magnitude of the temperature–sensitivity relationship, we compute a posterior distribution over a linear slope coefficient. For each posterior draw $s = 1, \ldots, S$, we have a vector of five $\alpha$ values—one per temperature condition. We fit a simple ordinary least squares regression $\alpha^{(s)}(T) = a^{(s)} + b^{(s)} \cdot T$ across the five temperature levels, yielding one slope $b^{(s)}$ per draw. The resulting $S$ slope values define a posterior distribution over $b = \Delta\alpha / \Delta T$.
This is *not* a regression model in the usual sense—it is a derived quantity computed from the joint posterior. The slope summarizes whether, on the whole, higher temperature is associated with lower sensitivity. A slope whose posterior mass lies entirely below zero provides evidence for the directional hypothesis that temperature reduces EU sensitivity:
```{python}
#| label: fig-slope
#| fig-cap: "Posterior distribution of the slope Δα/ΔT from a draw-wise linear fit. The entire 90% credible interval lies below zero, providing strong evidence that α decreases with temperature."
slopes = analysis['slope']
# Recompute the slope draws for the density plot
temp_array = np.array(temperatures)
slope_draws = []
for draw_idx in range(len(alpha_draws[temperatures[0]])):
alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in temperatures])
    # OLS slope: cov(T, α) / var(T). np.cov defaults to ddof=1 while np.var
    # defaults to ddof=0, so set ddof=1 in both to get the exact OLS slope.
    b = np.cov(temp_array, alphas_at_draw)[0, 1] / np.var(temp_array, ddof=1)
slope_draws.append(b)
slope_draws = np.array(slope_draws)
fig, ax = plt.subplots(figsize=(8, 4))
kde = gaussian_kde(slope_draws)
x_grid = np.linspace(np.percentile(slope_draws, 0.5), np.percentile(slope_draws, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.3, color=SEU_COLORS['primary'])
ax.plot(x_grid, kde(x_grid), color=SEU_COLORS['primary'], linewidth=2)
median_slope = np.median(slope_draws)
ax.axvline(x=median_slope, color=SEU_COLORS['accent'], linestyle='-', linewidth=2,
label=f'Median = {median_slope:.1f}')
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5, label='No effect')
# Shade 90% CI
q05, q95 = np.percentile(slope_draws, [5, 95])
mask = (x_grid >= q05) & (x_grid <= q95)
ax.fill_between(x_grid[mask], kde(x_grid[mask]), alpha=0.15, color=SEU_COLORS['accent'])
ax.axvline(x=q05, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.axvline(x=q95, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Posterior Distribution of Temperature–Sensitivity Slope')
ax.legend()
plt.tight_layout()
plt.show()
print(f"Slope summary:")
print(f" Median: {median_slope:.1f}")
print(f" 90% CI: [{q05:.1f}, {q95:.1f}]")
print(f" P(slope < 0): {np.mean(slope_draws < 0):.4f}")
```
The posterior probability that the slope is negative exceeds 0.99, providing strong evidence for the directional hypothesis: higher temperature is associated with lower sensitivity to expected utility.
### Pairwise Comparisons
The global slope averages over all five conditions and may mask local structure (e.g., two adjacent temperatures that produce indistinguishable $\alpha$ values). To examine the ordering at finer resolution, we compute pairwise posterior probabilities. For each pair of temperature conditions $(T_i, T_j)$ with $T_i < T_j$, we calculate
$$
P\bigl(\alpha(T_i) > \alpha(T_j) \mid \text{data}\bigr) = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\bigl[\alpha^{(s)}(T_i) > \alpha^{(s)}(T_j)\bigr],
$$
where the indicator function is evaluated draw-by-draw from the *independent* posteriors at each temperature.
::: {.callout-warning}
## Independence Caveat: Design-Induced Correlations
These posteriors are independent *conditional on the shared design*,
but they are not independent *unconditionally*. The reason is worth
spelling out, because the same point will recur whenever multiple
conditions share a common stimulus pool (as in the planned alignment
study).
The five temperature conditions in this study draw their decision
problems from a *single* fixed pool of $R = 30$ insurance claims.
Whatever is idiosyncratic about that particular set of 30 claims —
the embedding geometry that happens to fall out of those texts, the
particular spread of "true" expected utilities across alternatives,
the typical gap $\Delta$ between the best and second-best option —
shapes all five fits simultaneously. A claim pool that happens to
contain very large utility gaps will inflate every estimated $\alpha$
relative to a pool with smaller gaps; a pool that happens to be
nearly degenerate in some embedding direction will deflate every
$\alpha$. None of this idiosyncrasy is captured by fitting m_01
five separate times: each fit treats its own dataset as the only
source of evidence, so the five posterior summaries look statistically
unrelated even though they actually share a common, unmodelled
nuisance. The practical consequence is that pairwise probabilities
$P(\alpha(T_i) > \alpha(T_j))$ that are computed from independent
posteriors slightly *overstate* how confidently we can resolve
between-temperature differences — the very feature of the design
that boosts apparent precision (every condition uses the same claims,
so within-condition noise is reduced) also induces a positive
between-condition correlation that the independent-fits analysis
ignores.
A hierarchical model jointly estimating $\alpha(T)$ — for example
$\log \alpha(T_c) = \gamma_0 + \gamma_1 T_c + \eta_c$ with a small
random effect $\eta_c$ — addresses this on two fronts. First, it
makes the *common* effect of the shared claim pool a single estimated
quantity ($\gamma_0$), so design-induced shifts cancel out of
between-condition contrasts in a principled way. Second, it borrows
strength across conditions for the *systematic* effect of interest
($\gamma_1$), giving a single, calibrated uncertainty statement
about the temperature–sensitivity relationship that respects the
positive correlation across conditions. We return to this issue in
the *Limitations* section below and propose the corresponding
hierarchical extension in *Next Steps*; the foundational `h_m01`
reports introduce exactly this kind of model and use it as the
inferential workhorse for the planned alignment study.
:::
Values near 1.0 indicate that the lower-temperature condition almost certainly yields higher sensitivity; values near 0.5 indicate that the two conditions are effectively indistinguishable.
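Computed directly from the loaded draws, each probability is a one-line draw-wise comparison; the check below should reproduce the corresponding entry in the table, which reads the precomputed values from `primary_analysis.json`:

```{python}
#| label: sketch-pairwise-check
#| echo: true
# Draw-wise pairwise probability for the most extreme comparison
p_check = np.mean(alpha_draws[0.0] > alpha_draws[1.5])
print(f"P(α(T=0.0) > α(T=1.5)) = {p_check:.3f}")
```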
```{python}
#| label: tbl-pairwise
#| tbl-cap: "Posterior probability that α is higher at the lower temperature in each pair, treating the per-temperature posteriors as if independent. Values above 0.95 indicate apparent separation under that simplifying assumption; values near 0.5 indicate indistinguishability. See the *Independence Caveat* above: shared stimuli induce between-condition correlations that the independent-fits comparison does not capture, so the apparent separation is likely an upper bound on the evidence."
pairs = analysis['pairwise_comparisons']
pair_rows = []
for key, prob in pairs.items():
t1, t2 = key.split('_vs_')
strength = '●●●' if prob > 0.95 else ('●●' if prob > 0.8 else ('●' if prob > 0.65 else '○'))
pair_rows.append({
'Comparison': f'α(T={t1}) > α(T={t2})',
'P': f'{prob:.3f}',
'Evidence': strength,
})
pd.DataFrame(pair_rows)
```
The pairwise analysis reveals a structured pattern (interpret these probabilities in light of the *Independence Caveat* above; they treat the per-temperature posteriors as independent and so likely overstate the true between-condition evidence):
- **Apparent separation** ($P > 0.95$ under independence): $T = 0.0$ vs $T = 1.0$ and $T = 1.5$
- **Moderate apparent separation** ($P > 0.8$): $T = 0.3$ vs $T \geq 1.0$; $T = 0.7$ vs $T \geq 1.0$
- **Weak/no separation** ($P \approx 0.5$): $T = 0.3$ vs $T = 0.7$
### Strict Monotonicity
The pairwise analysis examines each adjacent or non-adjacent pair in isolation. A stronger test asks whether the *entire* ordering $\alpha(T = 0.0) > \alpha(T = 0.3) > \alpha(T = 0.7) > \alpha(T = 1.0) > \alpha(T = 1.5)$ holds simultaneously. We estimate the joint probability of strict monotonicity by computing, for each posterior draw, whether the five $\alpha$ values are strictly decreasing:
$$
P(\text{strict monotonicity} \mid \text{data}) = \frac{1}{S} \sum_{s=1}^{S} \prod_{j=1}^{4} \mathbb{1}\bigl[\alpha^{(s)}(T_j) > \alpha^{(s)}(T_{j+1})\bigr].
$$
Because this requires *every* adjacent pair to satisfy the ordering in the *same* draw, the joint probability can be no larger than the smallest pairwise probability—a single weakly-ordered pair drives it down.
```{python}
#| label: monotonicity
#| echo: true
# P(α(T=0.0) > α(T=0.3) > α(T=0.7) > α(T=1.0) > α(T=1.5)) — joint probability
n_draws = len(alpha_draws[0.0])
strictly_decreasing = 0
for i in range(n_draws):
vals = [alpha_draws[t][i] for t in temperatures]
if all(vals[j] > vals[j+1] for j in range(len(vals)-1)):
strictly_decreasing += 1
p_mono = strictly_decreasing / n_draws
print(f"P(α strictly decreasing across all T): {p_mono:.4f}")
```
The probability of *strict* monotonicity across all five levels is modest, at approximately 0.12. This is driven entirely by the near-equality of $\alpha$ at $T = 0.3$ and $T = 0.7$, where the pairwise probability is essentially a coin flip ($P = 0.48$). Collapsing these two levels yields a coarser ordering that holds with much higher confidence:
```{python}
#| label: coarse-monotonicity
#| echo: true
# Coarser test: α(T=0.0) > mean(α(T=0.3), α(T=0.7)) > α(T=1.0) > α(T=1.5)
coarse_decreasing = 0
for i in range(n_draws):
a00 = alpha_draws[0.0][i]
a_mid = (alpha_draws[0.3][i] + alpha_draws[0.7][i]) / 2
a10 = alpha_draws[1.0][i]
a15 = alpha_draws[1.5][i]
if a00 > a_mid > a10 > a15:
coarse_decreasing += 1
p_coarse = coarse_decreasing / n_draws
print(f"P(α(0.0) > mean(α(0.3), α(0.7)) > α(1.0) > α(1.5)): {p_coarse:.4f}")
```
## Discussion {#sec-discussion}
### Summary of Findings
The results support the directional hypothesis: increasing LLM temperature decreases estimated sensitivity to expected utility maximization. The evidence is strongest at the global level—the posterior slope $\Delta\alpha / \Delta T$ is negative with high confidence—and weakest at fine-grained adjacent comparisons.
```{python}
#| label: fig-summary
#| fig-cap: "Summary of the temperature–sensitivity relationship. Left: Posterior medians with 90% credible intervals. Right: Pairwise posterior probabilities P(α_i > α_j) as a heatmap."
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Left: point estimates with CIs
medians = [np.median(alpha_draws[t]) for t in temperatures]
q05s = [np.percentile(alpha_draws[t], 5) for t in temperatures]
q95s = [np.percentile(alpha_draws[t], 95) for t in temperatures]
axes[0].errorbar(temperatures, medians,
yerr=[np.array(medians) - np.array(q05s), np.array(q95s) - np.array(medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5)
axes[0].set_xlabel('Temperature')
axes[0].set_ylabel('Sensitivity (α)')
axes[0].set_title('α vs. Temperature')
axes[0].set_xticks(temperatures)
# Right: pairwise heatmap
n_temps = len(temperatures)
heatmap = np.full((n_temps, n_temps), np.nan)
for key, prob in pairs.items():
t1, t2 = key.split('_vs_')
i = temperatures.index(float(t1))
j = temperatures.index(float(t2))
heatmap[i, j] = prob
heatmap[j, i] = 1 - prob
np.fill_diagonal(heatmap, 0.5)
im = axes[1].imshow(heatmap, cmap='RdYlGn', vmin=0, vmax=1, aspect='equal')
axes[1].set_xticks(range(n_temps))
axes[1].set_xticklabels([f'{t}' for t in temperatures])
axes[1].set_yticks(range(n_temps))
axes[1].set_yticklabels([f'{t}' for t in temperatures])
axes[1].set_xlabel('Temperature (column)')
axes[1].set_ylabel('Temperature (row)')
axes[1].set_title('P(α_row > α_col)')
# Annotate cells
for i in range(n_temps):
for j in range(n_temps):
if not np.isnan(heatmap[i, j]):
color = 'white' if heatmap[i, j] > 0.8 or heatmap[i, j] < 0.2 else 'black'
axes[1].text(j, i, f'{heatmap[i, j]:.2f}', ha='center', va='center',
fontsize=9, color=color)
plt.colorbar(im, ax=axes[1], shrink=0.8)
plt.tight_layout()
plt.show()
```
### Interpreting the Non-Separation at $T = 0.3$ and $T = 0.7$
The inability to distinguish $\alpha$ at these two temperatures admits several interpretations:
1. **Threshold effect.** The relationship between temperature and sensitivity may not be linear. There may be a regime ($T \lesssim 0.7$) where the LLM's choice behavior is relatively stable, with sensitivity declining more sharply only at higher temperatures.
2. **Insufficient statistical power.** With $M = 300$ observations per condition, the study may lack power to resolve small differences in $\alpha$. The posteriors at $T = 0.3$ and $T = 0.7$ have standard deviations of $\sim 15$–$19$, an order of magnitude larger than the difference between their medians ($\sim 1.5$).
3. **Genuine non-monotonicity.** It is possible that the true relationship is not strictly monotonic at fine scales, even if the overall trend is negative.
These interpretations are not mutually exclusive, and the current data do not distinguish among them.
### Model Adequacy
The posterior predictive checks provide no evidence of misfit at any temperature level. This is reassuring but not conclusive—PPC p-values assess whether the model *can* reproduce summary statistics of the data, not whether the model captures the *mechanism* by which temperature affects choices. The m_01 model treats each temperature condition as an independent estimation problem; it does not model how temperature enters the choice process.
### Construct Validity of $\alpha$
For a JDM audience, the claim that $\alpha$ measures "sensitivity to subjective expected utility maximization" requires careful interpretation. The parameter $\alpha$ controls the sharpness of the softmax choice distribution—higher $\alpha$ means choices are more concentrated on high-EU alternatives. But this interpretation assumes the model's *other* parameters ($\beta$, $\delta$) correctly capture the subjective utilities. Given the acknowledged $\beta$–$\delta$ identification issues, what does $\alpha$ actually measure?
Operationally, $\alpha$ measures the *consistency* with which the agent chooses in the direction implied by the fitted utility function. This is related to, but not identical to, "sensitivity to EU maximization" in the strong sense that would require the utility parameters to be correctly identified.
#### Do we need to identify $\beta$ and $\delta$ to interpret $\alpha$?
The honest answer is: it depends on the strength of the claim being
made. It is worth distinguishing three layers of interpretation, in
order of increasing demand on identification.
1. **Internal consistency (weakest).** $\alpha$ is identified up to a
scale that is fixed by the parameterisation, *given* the fitted
utility function $u_\theta(c) = \delta^\top c + \text{(features)}$.
A higher $\alpha$ in one condition than another means the agent's
choices are more concentrated on the alternative that the fitted
$u_\theta$ ranks highest. This is a within-model claim about
choice consistency and *does not require* $\beta$ and $\delta$ to
be correctly identified — it only requires that $u_\theta$ rank the
alternatives in roughly the right order (which the posterior
predictive checks support globally).
2. **Comparative claims across conditions (the present use case).**
To interpret a *change* in $\alpha$ across a manipulation (here,
temperature), we need the fitted utility function to be reasonably
stable across conditions, so that the same notion of "EU
maximization" anchors the comparison. The m_01 design uses the
same $R = 30$ insurance claims at every temperature and the same
embedding-based features, which makes this assumption defensible:
any miscalibration in $(\beta, \delta)$ tends to act as a *shared
confound* that does not move differentially across the temperature
conditions. Under this assumption — which is empirically supported
by the within-model posterior predictive checks at every
temperature — comparative ordering claims about $\alpha$ are
warranted even though $\beta$ and $\delta$ are not separately
identified.
3. **Absolute claims about the *level* of EU rationality (strongest).**
Claims of the form "the LLM is rational with sensitivity $\alpha
\approx 30$" are interpretable only relative to the fitted utility
$u_\theta$. Without independent identification of $\beta$ and $\delta$
one cannot certify that $u_\theta$ is the *true* subjective utility,
so the absolute number $\alpha$ is best read as a sensitivity to
*the model-implied utility ranking*, not as a context-free measure
of EU rationality. We deliberately avoid making absolute claims of
this third type in the present report.
The temperature study lives in layer (2) and is well-supported. By
contrast, claims that rely on layer (3) — for example, comparing $\alpha$
estimated under m_0 / m_01 against an exogenous benchmark, or treating
$\alpha$ as a transferable property of the agent — should be flagged as resting on identification assumptions that the present design does not support.
#### Implications for the m_0 family and downstream studies
The present analysis uses m_01 (uncertain choices only). Models in the
m_0 family share the same $\beta$–$\delta$ coupling. The implications
ripple outward as follows:
* **Within-application comparative analyses** (the present temperature
study, the temperature × EU prompt study, the temperature × risky
alternatives study) use shared designs and within-application
comparisons, so they live in layer (2) and are robust to the
partial identification.
* **Cross-application comparisons** of *absolute* $\alpha$ values
(e.g., GPT-4o on insurance vs. Claude on insurance vs. GPT-4o on
Ellsberg) live closer to layer (3) and should be interpreted with
care: differences in domain, embedding geometry, and answer-format
noise can produce different model-implied utility functions, which
in turn change what $\alpha$ is sensitive *to*. We recommend
presenting such cross-application contrasts as comparisons of
*patterns* (e.g., signs and orderings of effects) rather than of
numerical $\alpha$ values.
* **The planned alignment study** (see [`README.md`](../../../README.md))
is, on first inspection, more demanding because it asks how $\alpha$
estimated for an LLM relates to a *normative* benchmark of EU
rationality. This is closest to a layer-(3) claim, and it deserves
to be flagged honestly. The alignment study is built on the
hierarchical model `h_m01`, which is the m_01 likelihood extended
to $J$ experimental cells: cell-specific $\alpha_j$ and $\beta_j$
with a shared $\delta$, fitted jointly via a regression on
$\log \alpha$. Crucially, `h_m01` inherits m_01's reliance on
*uncertain* choices only — it does **not** introduce risky
alternatives, so the underlying $\beta$–$\delta$ coupling is *not*
resolved by moving to the hierarchical model. What `h_m01` does buy
is (a) sharing $\delta$ across cells (so any single subjective
utility scaling is at least common to all comparisons) and (b)
framing alignment-relevant manipulations as regression
contrasts on $\log \alpha$, which are layer-(2) comparative claims
by construction. The strongest defensible reading of the alignment
study is therefore that it is a layer-(2) design that uses the
hierarchical structure to make comparative effects across
alignment-manipulating conditions properly calibrated, *not* a
layer-(3) certification of EU rationality in absolute terms.
Genuinely escaping the partial-identification caveat would require
moving to a model in the m_1 / m_2 / m_3 family — i.e., one that
augments the design with risky alternatives whose consequences are
partially observable, which is what gives $\delta$ separate
identifying information in the foundational reports. We flag this
as an open design question for the alignment study.
In short: the partial identification is a real interpretive constraint,
but it does not undermine the present temperature study or the broader
applications programme — provided that conclusions are framed as
comparative rather than absolute, and that the alignment study is
explicitly scoped as a contrast study on $\log \alpha$ rather than a
calibration of agent-level EU rationality. The operational
consequences for the alignment-study pre-registration, hypothesis
framing, and follow-up sequencing are spelled out in §0.5 of
[`prompts/hierarchical_alignment_study_plan.md`](../../../prompts/hierarchical_alignment_study_plan.md).
### Relation to the Foundational Framework
These results can be interpreted through the lens of [Report 1](../../foundations/01_abstract_formulation.qmd):
- The **monotonicity property** (Theorem 1) tells us that for any fixed value function, higher $\alpha$ implies higher probability of choosing value-maximizing alternatives. Our finding that $\alpha$ decreases with temperature thus means the LLM's choices become *less* concentrated on EU-maximizing alternatives as temperature increases.
- The **rate of convergence** (Theorem 5) tells us that the effect of $\alpha$ on choice probabilities depends on the gap $\Delta$ between the best and second-best expected utilities. Problems with larger utility gaps will show temperature effects more readily (a small numerical illustration follows this list).
- The **commitment–performance distinction** discussed in Report 1 provides a natural framing: if we take EU maximization as the LLM's "commitment" (the normative standard its training approximates), then temperature controls the *performance* noise around that commitment.
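To make the first two points concrete, the sketch below evaluates a softmax choice rule, $P(i) \propto e^{\alpha \, EU_i}$, for a three-alternative problem across a few values of $\alpha$ and of the gap $\Delta$. The specific utilities and gaps are illustrative placeholders, not quantities estimated from the fitted model.

```{python}
#| label: softmax-gap-sketch
#| echo: true
# Illustrative only: probability of choosing the EU-best alternative under a
# softmax rule, as a function of sensitivity (alpha) and the EU gap.
import numpy as np

def p_best(alpha, gap, n_alternatives=3):
    """P(choose the best alternative) when it leads the others by `gap` in EU."""
    eu = np.zeros(n_alternatives)
    eu[0] = gap  # best alternative; the others are tied at 0
    logits = alpha * eu
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs[0]

for gap in [0.02, 0.05, 0.10]:
    row = ", ".join(f"alpha={a:>2}: {p_best(a, gap):.2f}" for a in [10, 30, 60])
    print(f"gap = {gap:.2f} -> {row}")
```

Higher $\alpha$ raises the probability of the EU-best choice at any fixed gap (monotonicity), and over this range the same change in $\alpha$ moves that probability further when the gap is larger, which is why problems with larger utility gaps show temperature effects more readily.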
### Connection to the JDM Literature
The relationship between temperature and choice consistency connects to several strands of the human decision-making literature:
- **Error theories of choice.** The softmax (logit) choice model is a
standard *Fechner-style* stochastic specification — one of the
workhorse error structures examined alongside the random-preference
and "trembling-hand" alternatives by Hey and Orme [-@hey1994]
in their experimental comparison of stochastic specifications for
choice under risk. The $\alpha$ parameter plays the same role as
inverse temperature (or *precision*) in random utility models, and
our finding that external sampling noise reduces effective $\alpha$
parallels work on how cognitive load and time pressure increase
choice inconsistency in human subjects.
- **Quantal response equilibrium.** In game-theoretic settings, the
softmax choice rule defines quantal response equilibrium [@mckelvey1995],
where players choose strategies with probabilities that increase in
expected payoff. The $\alpha$ parameter governs the "rationality" of
players in this framework. Our results suggest that LLM temperature
has an analogous effect to noise in human strategic reasoning.
- **Resource-rational analysis.** From the perspective of
resource-rational analysis [@griffiths2015; @lieder2020], choice
noise may reflect optimal information processing under cognitive
constraints. The token-sampling temperature could be viewed as a
proxy for "cognitive effort" — lower temperature corresponds to
more deterministic (less resource-intensive) output generation.
This framing also connects to *rational inattention* models
[@sims2003; @caplin2015] in which agents optimally trade off
decision quality against information-processing costs.
These connections are suggestive rather than formal, but they situate
the temperature–$\alpha$ relationship within broader theoretical
frameworks for understanding noisy choice.
### Limitations and Generalizability
These results establish the temperature–sensitivity relationship for **GPT-4o on the insurance claims triage task**. Several important limitations should constrain interpretation:
1. **Single LLM.** The temperature parameter has model-specific effects. Whether this relationship generalizes to other LLM architectures (Claude, Llama, etc.) is an open empirical question. Subsequent reports in this series ([Claude × Insurance](../claude_insurance_study/01_claude_insurance_study.qmd), [GPT-4o × Ellsberg](../gpt4o_ellsberg_study/01_gpt4o_ellsberg_study.qmd)) provide cross-LLM evidence.
2. **Single task domain.** The insurance claims triage task is a specific operationalization of decision-making under uncertainty. How well findings generalize to other domains (gambles, consumer choice, medical decisions) depends on task-specific factors that may interact with temperature.
3. **Single embedding model.** The feature construction uses `text-embedding-3-small`. Different embedding models might produce different utility landscapes and thus different temperature–$\alpha$ relationships.
4. **Prior sensitivity.** While the prior predictive analysis motivates $\text{Lognormal}(3.0, 0.75)$, we have not conducted a formal sensitivity analysis demonstrating that conclusions are robust to nearby prior specifications. Given that $M = 300$ observations per condition is substantial, we expect the prior to be swamped by the data—but this should be verified. A robustness analysis is planned for subsequent work.
5. **Design-induced correlations.** Because the same $R = 30$ claims
appear at every temperature, the five $\alpha$ estimates share
correlations that the independent-fits approach does not model.
The mechanism and its consequences for the pairwise comparisons
are discussed in the *Independence Caveat* callout in
@sec-monotonicity; a hierarchical model that estimates
$\alpha(T)$ jointly would resolve it (see *Next Steps*).
Claims about "LLM decision-making" in general should be understood as hypotheses motivated by these specific findings, not as established generalizations.
## Reproducibility {#sec-reproducibility}
### Data Snapshot
All results in this report are loaded from a frozen data snapshot in the `data/` subdirectory, which is version-controlled. This protects the report from changes if the temperature study pipeline is re-run with different parameters.
The snapshot contains:
| File | Description |
|------|-------------|
| `alpha_draws_T*.npz` | Posterior draws of α (4,000 per condition) |
| `ppc_T*.json` | Posterior predictive check results |
| `diagnostics_T*.txt` | CmdStan diagnostic output |
| `stan_data_T*.json` | Stan-ready data (for refitting) |
| `fit_summary.json` | Summary statistics across conditions |
| `primary_analysis.json` | Pre-computed monotonicity and slope statistics |
| `run_summary.json` | Pipeline metadata and configuration |
| `grid_results.json` | Prior predictive grid search results |
| `study_config.yaml` | Frozen copy of the study configuration |
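For ad-hoc exploration without refitting, the snapshot can be loaded directly. The sketch below assumes each `alpha_draws_T*.npz` file stores its draws under the key `alpha`; adjust the key if the snapshot layout differs.

```{python}
#| label: load-snapshot-example
#| eval: false
#| echo: true
# Minimal sketch: load the frozen snapshot for exploration.
# The key name "alpha" is an assumption about the npz layout.
import json
import numpy as np
from pathlib import Path

data_dir = Path("data")

with open(data_dir / "fit_summary.json") as f:
    fit_summary = json.load(f)

for t in [0.0, 0.3, 0.7, 1.0, 1.5]:
    key = f"T{str(t).replace('.', '_')}"
    draws = np.load(data_dir / f"alpha_draws_{key}.npz")["alpha"]
    print(f"T={t}: alpha median = {np.median(draws):.1f} ({draws.size} draws)")
```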
### Refitting from Source
To reproduce the Stan fits from the included `stan_data_T*.json` files:
```{python}
#| label: refit-example
#| eval: false
#| echo: true
# Refit from the frozen Stan data (requires CmdStanPy; ~30 min per condition).
# This chunk is displayed but not executed (eval: false).
import cmdstanpy
import numpy as np

model = cmdstanpy.CmdStanModel(stan_file="models/m_01.stan")

for t in [0.0, 0.3, 0.7, 1.0, 1.5]:
    key = f"T{str(t).replace('.', '_')}"
    fit = model.sample(
        data=f"data/stan_data_{key}.json",
        chains=4,
        iter_warmup=1000,
        iter_sampling=1000,
        seed=42,
    )
    print(f"T={t}: alpha median = {np.median(fit.stan_variable('alpha')):.1f}")
```
## Next Steps
This initial study establishes that temperature affects estimated SEU sensitivity. Several extensions are planned:
1. **Position bias and consistency analysis.** The data support analyses described in DESIGN.md §6.2–6.4, examining whether position bias or choice inconsistency varies with temperature. These diagnostics may reveal whether temperature affects *how* the LLM chooses, beyond the summary captured by $\alpha$.
2. **Hierarchical temperature model.** As discussed in the
*Independence Caveat* in @sec-monotonicity and in the
corresponding limitation above, the present analysis fits m_01 at
each temperature independently, which does not account for the
correlations induced by the shared claim pool. A natural
extension models $\log \alpha(T) = \gamma_0 + \gamma_1 \cdot T$
with priors on $(\gamma_0, \gamma_1)$, directly estimating the
slope with calibrated uncertainty and borrowing strength across
conditions. The hierarchical model `h_m01` developed in the
foundational reports — and used as the inferential workhorse for
the planned alignment study — provides the template for this
extension; a schematic of this specification is sketched after this list.
3. **Replication with different LLMs.** The temperature parameter has model-specific effects. Repeating this study with different architectures would test the generality of the finding.
4. **Prior sensitivity analysis.** While the prior predictive analysis motivates $\text{Lognormal}(3.0, 0.75)$, a sensitivity analysis across nearby priors (e.g., $\text{Lognormal}(2.5, 1.0)$ and $\text{Lognormal}(3.5, 0.5)$) would characterize how much the conclusions depend on this specific choice. Given $M = 300$ observations per condition, we expect the posterior to be dominated by the likelihood, but this should be verified.
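As a schematic of the hierarchical extension in item 2, with condition index $j = 1, \dots, 5$, temperatures $T_j \in \{0.0, 0.3, 0.7, 1.0, 1.5\}$, and placeholder priors shown only for illustration (not the calibrated choices):

$$
\begin{aligned}
y_j \mid \alpha_j, \beta_j, \delta &\sim \text{m}_{01} \text{ likelihood}, \\
\log \alpha_j &= \gamma_0 + \gamma_1 \, T_j, \\
\gamma_0 &\sim \text{Normal}(\cdot, \cdot), \qquad \gamma_1 \sim \text{Normal}(\cdot, \cdot),
\end{aligned}
$$

with $\delta$ shared across the five conditions as in `h_m01`. The posterior for $\gamma_1$ then quantifies the temperature–sensitivity slope directly, with uncertainty that reflects joint estimation across conditions rather than five independent fits.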
## References
::: {#refs}
:::