Temperature and SEU Sensitivity: Claude × Insurance Study

Application Report: Claude × Insurance (Cell 2,1)


An investigation of how LLM sampling temperature affects estimated sensitivity (α) to subjective expected utility maximisation, using insurance claims triage (K=3) and Claude 3.5 Sonnet (Anthropic). This study isolates the LLM effect by pairing Claude with the same task domain used in the initial temperature study.

Published: May 12, 2026

0.1 Introduction

The initial temperature study (Report 1) found a clear negative relationship between LLM sampling temperature and estimated sensitivity \(\alpha\), using GPT-4o on an insurance claims triage task (\(K = 3\)). When both the task and LLM were changed simultaneously in the Ellsberg study (Report 2), the relationship was not replicated — but the confounded design could not attribute the non-replication to either factor alone.

This study is one of two new cells in a \(2 \times 2\) factorial design (LLM × Task) that disentangles the contributions of each factor. Specifically, this study pairs Claude 3.5 Sonnet (Anthropic) with the insurance claims triage task — holding the task constant relative to the initial study while varying the LLM, and holding the LLM constant relative to the Ellsberg study while varying the task.

          Insurance (K=3)    Ellsberg (K=4)
GPT-4o    Initial study ✓    Cell (1,2) — new
Claude    This study         Ellsberg study ✓
Important: Summary of Findings

The monotonic temperature–α relationship observed in the initial temperature study (GPT-4o, insurance) was not replicated with Claude 3.5 Sonnet on the same insurance task. The posterior slope is near zero (\(\Delta\alpha / \Delta T \approx -3\), \(P(\text{slope} < 0) \approx 0.56\)), and the α estimates show a non-monotonic pattern. By contrast, the initial GPT-4o study found slope \(\approx -25\) with \(P(\text{slope} < 0) > 0.99\). Combined with the Ellsberg study results, this suggests the LLM is the dominant factor: Claude 3.5 Sonnet does not exhibit the temperature–sensitivity relationship that GPT-4o does, regardless of task domain.

0.2 Experimental Design

0.2.1 Task and Conditions

We use the same insurance claims triage task as the initial temperature study: Claude 3.5 Sonnet selects which insurance claim to prioritise from a set of alternatives. Each claim has \(K = 3\) possible consequences (denial, partial approval, full approval). The LLM first assesses each claim individually (producing text that is then embedded), and subsequently makes a choice among the claims in a given problem.

Five temperature levels define the between-condition factor:

Show code
conditions = pd.DataFrame({
    'Level': [1, 2, 3, 4, 5],
    'Temperature': [0.0, 0.2, 0.5, 0.8, 1.0],
    'Description': [
        'Deterministic (greedy decoding)',
        'Low variance',
        'Moderate variance',
        'High variance',
        'Maximum (Anthropic API limit)'
    ]
})
conditions
Table 1: Experimental conditions. Each temperature level constitutes a separate model fit.
Level Temperature Description
0 1 0.0 Deterministic (greedy decoding)
1 2 0.2 Low variance
2 3 0.5 Moderate variance
3 4 0.8 High variance
4 5 1.0 Maximum (Anthropic API limit)
Note: Temperature Range and Grid Choice

The Anthropic API supports temperature values in \([0.0, 1.0]\), compared to OpenAI’s wider \([0.0, 2.0]\) range. We adopt the same temperature grid as the Ellsberg study (\(T \in \{0.0, 0.2, 0.5, 0.8, 1.0\}\)) to enable direct comparison within the Claude row of the factorial design. The initial temperature study (GPT-4o) used \(T \in \{0.0, 0.3, 0.7, 1.0, 1.5\}\). Because the grid points and range differ across providers, quantitative slope comparisons should be interpreted with care: the narrower Anthropic range (absolute span = 1.0 vs. 1.5) may reduce statistical power to detect a temperature effect, and the absolute temperature values may not correspond to equivalent levels of next-token entropy across providers.
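The power concern can be made concrete: for a least-squares slope, the sampling variance is inversely proportional to the variance of the design points. A quick check (not part of the original analysis) compares the two grids:

```python
import numpy as np

# Compare the spread of the two temperature grids. Var(T) enters the
# denominator of the OLS slope variance, so a smaller spread means a
# noisier slope estimate, all else equal.
gpt4o_grid = np.array([0.0, 0.3, 0.7, 1.0, 1.5])   # initial study (OpenAI)
claude_grid = np.array([0.0, 0.2, 0.5, 0.8, 1.0])  # this study (Anthropic)

var_gpt4o = np.var(gpt4o_grid)
var_claude = np.var(claude_grid)
se_inflation = np.sqrt(var_gpt4o / var_claude)

print(f"Design variance: GPT-4o grid = {var_gpt4o:.3f}, Claude grid = {var_claude:.3f}")
print(f"Slope SE inflation on the Claude grid: ~{se_inflation:.2f}x")
```

The OpenAI grid has roughly twice the design variance, so slope standard errors on the Claude grid are about 1.4× larger for the same per-condition posterior precision.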

0.2.2 Design Parameters

Show code
import yaml

with open(data_dir / "study_config.yaml") as f:
    config = yaml.safe_load(f)

with open(data_dir / "run_summary.json") as f:
    run_summary = json.load(f)

print(f"Study Design:")
print(f"  Decision problems (M):      {config['num_problems']} base × {config['num_presentations']} presentations = {config['num_problems'] * config['num_presentations']}")
print(f"  Alternatives per problem:    {config['min_alternatives']}–{config['max_alternatives']}")
print(f"  Consequences (K):            {config['K']}")
print(f"  Embedding dimensions (D):    {config['target_dim']}")
print(f"  Distinct alternatives (R):   {run_summary['phases']['phase3_data_prep']['per_temperature']['0.0']['R']}")
print(f"  LLM model:                   {config['llm_model']}")
print(f"  Embedding model:             {config['embedding_model']}")
print(f"  Provider:                    {config['provider']}")
Study Design:
  Decision problems (M):      100 base × 3 presentations = 300
  Alternatives per problem:    2–4
  Consequences (K):            3
  Embedding dimensions (D):    32
  Distinct alternatives (R):   30
  LLM model:                   claude-sonnet-4-20250514
  Embedding model:             text-embedding-3-small
  Provider:                    anthropic

Each of the 100 base problems is presented \(P = 3\) times with alternatives shuffled to different positions, yielding approximately \(M = 300\) observations per temperature condition. This position counterbalancing design addresses systematic position bias.
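The counterbalancing scheme can be sketched as follows (alternative names here are illustrative, not the study's actual claim identifiers):

```python
import random

# Sketch of position counterbalancing: each base problem is presented
# P = 3 times with its alternatives shuffled, so no alternative is
# systematically advantaged by its position in the prompt.
random.seed(0)

base_alternatives = ["claim_A", "claim_B", "claim_C"]
P = 3  # presentations per base problem

presentations = []
for _ in range(P):
    order = base_alternatives.copy()
    random.shuffle(order)
    presentations.append(order)

for p, order in enumerate(presentations, start=1):
    print(f"Presentation {p}: {order}")
```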

0.2.3 Feature Construction

Alternative features are constructed through the same two-stage process used in the initial temperature study. First, Claude 3.5 Sonnet assesses each insurance claim at the relevant temperature, producing a natural-language evaluation. These assessments are embedded using text-embedding-3-small (OpenAI), yielding high-dimensional vectors. Second, all embeddings across temperature conditions are pooled and projected via PCA to \(D = 32\) dimensions.

PCA Summary:
  Components retained: 32
  Total variance explained: 86.3%
  First 5 components: 36.1%
  First 10 components: 54.3%

The PCA variance profile is comparable to the initial temperature study’s embeddings, confirming that the feature construction captures a similar proportion of the embedding space for Claude as it does for GPT-4o. This rules out gross differences in feature-space geometry as an explanation for any divergence in results.
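The pooled-PCA step can be sketched in a few lines of NumPy. This is a stand-in, not the study's pipeline: random vectors replace the actual text-embedding-3-small embeddings (which are 1536-dimensional), and the projection uses a plain SVD.

```python
import numpy as np

# Minimal sketch of pooling embeddings and projecting to D dimensions.
rng = np.random.default_rng(42)
n_texts, embed_dim, D = 150, 1536, 32

pooled = rng.normal(size=(n_texts, embed_dim))   # stand-in for pooled embeddings

# Centre, then project onto the top-D principal axes via SVD.
centred = pooled - pooled.mean(axis=0)
_, s, vt = np.linalg.svd(centred, full_matrices=False)
features = centred @ vt[:D].T                    # (n_texts, D) feature matrix

explained = (s[:D] ** 2).sum() / (s ** 2).sum()
print(f"Feature shape: {features.shape}; variance explained by {D} PCs: {explained:.1%}")
```

With real embeddings the leading components carry far more variance than in this random-data stand-in, which is what the 86.3% figure above reflects.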

0.2.4 Data Quality

NA Summary:
  Overall: 17 / 1500 (1.1%)
  T=0.0: 3 / 300 (1.0%)
  T=0.2: 5 / 300 (1.7%)
  T=0.5: 3 / 300 (1.0%)
  T=0.8: 4 / 300 (1.3%)
  T=1.0: 2 / 300 (0.7%)

NA rates are uniformly low across all temperature conditions, with no systematic trend suggesting temperature-dependent data quality issues. The rates are comparable to those in the initial temperature study, confirming that Claude’s response completeness is not a confound.
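The "no systematic trend" claim can be checked formally with a quick homogeneity test (not part of the original report), using the NA counts from the summary above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Do NA counts differ across temperature conditions more than chance predicts?
na_counts = np.array([3, 5, 3, 4, 2])       # NA responses per condition
totals = np.array([300] * 5)
table = np.stack([na_counts, totals - na_counts])

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f} (dof = {dof}), p = {p:.2f}")
```

The large p-value is consistent with a common NA rate across conditions, though with counts this small the test has little power against modest differences.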

0.2.5 Comparison with Other Factorial Cells

Show code
comparison = pd.DataFrame({
    'Parameter': ['LLM', 'Task domain', 'Consequences (K)',
                  'Alternatives (R)', 'Observations per T',
                  'Temperature range', 'Stan model'],
    'Cell (1,1) Initial': ['GPT-4o', 'Insurance triage', '3',
                           '30', '~300', '[0.0, 0.3, 0.7, 1.0, 1.5]', 'm_01'],
    'Cell (2,1) This study': ['Claude 3.5 Sonnet', 'Insurance triage', '3',
                              '30', '~300', '[0.0, 0.2, 0.5, 0.8, 1.0]', 'm_01'],
    'Cell (2,2) Ellsberg': ['Claude 3.5 Sonnet', 'Ellsberg gambles', '4',
                            '30', '~300', '[0.0, 0.2, 0.5, 0.8, 1.0]', 'm_02'],
})
comparison
Table 2: Design comparison across the 2×2 factorial. This study (Cell 2,1) shares the insurance task with Cell (1,1) and shares Claude with Cell (2,2).
Parameter Cell (1,1) Initial Cell (2,1) This study Cell (2,2) Ellsberg
0 LLM GPT-4o Claude 3.5 Sonnet Claude 3.5 Sonnet
1 Task domain Insurance triage Insurance triage Ellsberg gambles
2 Consequences (K) 3 3 4
3 Alternatives (R) 30 30 30
4 Observations per T ~300 ~300 ~300
5 Temperature range [0.0, 0.3, 0.7, 1.0, 1.5] [0.0, 0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]
6 Stan model m_01 m_01 m_02

0.3 Model and Prior Calibration

0.3.1 The m_01 Model Variant

We fit the m_01 model — the same variant used in the initial temperature study. The prior on \(\alpha\) is calibrated for the insurance triage task’s \(K = 3\) consequence space:

m_0 (foundational) m_01 (this study & initial) m_02 (Ellsberg)
\(\alpha\) prior \(\text{Lognormal}(0, 1)\) \(\text{Lognormal}(3.0, 0.75)\) \(\text{Lognormal}(3.5, 0.75)\)
Prior median \(\approx 1\) \(\approx 20\) \(\approx 33\)
Prior 90% CI \([0.19, 5.0]\) \([5.5, 67]\) \([10, 124]\)
\(K\) generic 3 4

Using the same model and prior as the initial study ensures that any difference in results is attributable to the LLM change (Claude vs. GPT-4o) rather than to modelling differences.
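The prior medians and intervals in the table can be reproduced from the lognormal quantile function; exact quantiles, which differ somewhat from the table's rounded figures, are printed here:

```python
import numpy as np
from scipy.stats import lognorm

# Quantiles of Lognormal(mu, sigma): scipy parameterises with s = sigma
# and scale = exp(mu).
priors = {
    "m_0":  (0.0, 1.0),
    "m_01": (3.0, 0.75),
    "m_02": (3.5, 0.75),
}

for name, (mu, sigma) in priors.items():
    dist = lognorm(s=sigma, scale=np.exp(mu))
    lo, med, hi = dist.ppf([0.05, 0.5, 0.95])
    print(f"{name}: median = {med:.2f}, 90% CI = [{lo:.2f}, {hi:.2f}]")
```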

0.4 Model Validation

0.4.1 Parameter Recovery

We validate that m_01’s parameters are identifiable under this study’s design (\(M \approx 300\), \(K = 3\), \(D = 32\), \(R = 30\)) via 20 iterations of parameter recovery.

Show code
recovery_dir = os.path.join(project_root, "results", "parameter_recovery", "claude_insurance_recovery")
recovery_summary_dir = os.path.join(recovery_dir, "recovery_summary")

with open(os.path.join(recovery_summary_dir, "recovery_statistics.json")) as f:
    recovery_stats = json.load(f)

true_params_path = os.path.join(recovery_dir, "all_true_parameters.json")
with open(true_params_path) as f:
    all_true_params = json.load(f)

posterior_summaries = []
true_params_list = []
for i in range(1, 21):
    iter_dir = os.path.join(recovery_dir, f"iteration_{i}")
    summary_path = os.path.join(iter_dir, "posterior_summary.csv")
    if os.path.exists(summary_path):
        df = pd.read_csv(summary_path, index_col=0)
        posterior_summaries.append(df)
        true_params_list.append(all_true_params[i - 1])

n_successful = len(posterior_summaries)
print(f"Loaded {n_successful} recovery iterations")
Show code
alpha_true = np.array([p['alpha'] for p in true_params_list])
alpha_mean = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries])
alpha_lower = np.array([s.loc['alpha', '5%'] for s in posterior_summaries])
alpha_upper = np.array([s.loc['alpha', '95%'] for s in posterior_summaries])

alpha_bias = np.mean(alpha_mean - alpha_true)
alpha_rmse = np.sqrt(np.mean((alpha_mean - alpha_true)**2))
alpha_coverage = np.mean((alpha_true >= alpha_lower) & (alpha_true <= alpha_upper))
alpha_ci_width = np.mean(alpha_upper - alpha_lower)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax = axes[0]
ax.scatter(alpha_true, alpha_mean, alpha=0.7, s=60, c=SEU_COLORS['primary'], edgecolor='white')
lims = [min(alpha_true.min(), alpha_mean.min()) * 0.9,
        max(alpha_true.max(), alpha_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('Estimated α (posterior mean)', fontsize=12)
ax.set_title(f'α Recovery: Bias={alpha_bias:.2f}, RMSE={alpha_rmse:.2f}', fontsize=12)
ax.legend()
ax.set_aspect('equal')

ax = axes[1]
for i in range(len(alpha_true)):
    covered = (alpha_true[i] >= alpha_lower[i]) & (alpha_true[i] <= alpha_upper[i])
    color = 'forestgreen' if covered else 'crimson'
    ax.plot([i, i], [alpha_lower[i], alpha_upper[i]], color=color, linewidth=2, alpha=0.7)
    ax.scatter(i, alpha_mean[i], color=color, s=40, zorder=3)

ax.scatter(np.arange(len(alpha_true)), alpha_true, color='black', s=60, marker='x',
           label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('α', fontsize=12)
ax.set_title(f'α: 90% Credible Intervals (Coverage = {alpha_coverage:.0%})', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Figure 1: Recovery of the sensitivity parameter α under the m_01 prior with the Claude × Insurance design (K=3). Left: true vs. estimated values with identity line. Right: 90% credible intervals for each iteration, coloured by whether they contain the true value.
Show code
K_val = 3
D_val = 32

# Beta recovery
all_beta_bias = []
all_beta_rmse = []
all_beta_coverage = []
all_beta_ci_width = []

for k in range(K_val):
    for d in range(D_val):
        param_name = f"beta[{k+1},{d+1}]"
        try:
            bt = np.array([p['beta'][k][d] for p in true_params_list])
            bm = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
            bl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
            bu = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])

            all_beta_bias.append(np.mean(bm - bt))
            all_beta_rmse.append(np.sqrt(np.mean((bm - bt)**2)))
            all_beta_coverage.append(np.mean((bt >= bl) & (bt <= bu)))
            all_beta_ci_width.append(np.mean(bu - bl))
        except (KeyError, IndexError):
            pass

# Delta recovery
all_delta_bias = []
all_delta_rmse = []
all_delta_coverage = []
all_delta_ci_width = []

for k in range(K_val - 1):
    param_name = f"delta[{k+1}]"
    try:
        dt = np.array([p['delta'][k] for p in true_params_list])
        dm = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
        dl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
        du = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])

        all_delta_bias.append(np.mean(dm - dt))
        all_delta_rmse.append(np.sqrt(np.mean((dm - dt)**2)))
        all_delta_coverage.append(np.mean((dt >= dl) & (dt <= du)))
        all_delta_ci_width.append(np.mean(du - dl))
    except (KeyError, IndexError):
        pass

# Scale references for relative metrics. α lives on a wide multiplicative
# scale (Lognormal(3.0, 0.75) prior → true values typically ~5–100), so
# absolute Bias and RMSE must be interpreted relative to the magnitude
# of α; the same caveat is less relevant for β (zero-centred) and δ
# (modest range), so we report relative metrics for α only.
alpha_scale = float(np.mean(np.abs(alpha_true)))
alpha_rel_bias = alpha_bias / alpha_scale
alpha_rel_rmse = alpha_rmse / alpha_scale

metrics = pd.DataFrame([
    {'Parameter': 'α', 'Bias': f'{alpha_bias:.2f}', 'RMSE': f'{alpha_rmse:.2f}',
     'Rel. Bias': f'{alpha_rel_bias:+.1%}', 'Rel. RMSE': f'{alpha_rel_rmse:.1%}',
     'Coverage (90%)': f'{alpha_coverage:.0%}', 'Mean CI Width': f'{alpha_ci_width:.2f}'},
    {'Parameter': f'β (mean over {K_val*D_val})',
     'Bias': f'{np.mean(all_beta_bias):.3f}' if all_beta_bias else '—',
     'RMSE': f'{np.mean(all_beta_rmse):.3f}' if all_beta_rmse else '—',
     'Rel. Bias': '—', 'Rel. RMSE': '—',
     'Coverage (90%)': f'{np.mean(all_beta_coverage):.0%}' if all_beta_coverage else '—',
     'Mean CI Width': f'{np.mean(all_beta_ci_width):.2f}' if all_beta_ci_width else '—'},
    {'Parameter': f'δ (mean over {K_val-1})',
     'Bias': f'{np.mean(all_delta_bias):.3f}' if all_delta_bias else '—',
     'RMSE': f'{np.mean(all_delta_rmse):.3f}' if all_delta_rmse else '—',
     'Rel. Bias': '—', 'Rel. RMSE': '—',
     'Coverage (90%)': f'{np.mean(all_delta_coverage):.0%}' if all_delta_coverage else '—',
     'Mean CI Width': f'{np.mean(all_delta_ci_width):.2f}' if all_delta_ci_width else '—'},
])
metrics
Table 3: Parameter recovery metrics for m_01 with the Claude × Insurance design (M≈300, K=3, D=32, R=30). Column structure mirrors the corresponding table in the initial temperature study to enable direct comparison; the primary parameter of interest is α.
Parameter Bias RMSE Rel. Bias Rel. RMSE Coverage (90%) Mean CI Width
0 α 4.83 9.72 +26.0% 52.3% 90% 27.97
1 β (mean over 96) -0.003 0.873 — — 89% 2.85
2 δ (mean over 2) -0.000 0.110 — — 95% 0.33
Note: Comparing Recovery Metrics with the Initial Temperature Study

A natural question is why the absolute α metrics in this table (Bias, RMSE, Mean CI Width) differ from those in the corresponding table of the initial temperature study. The difference is not attributable to LLM-specific calibration: parameter recovery in both studies uses purely synthetic data drawn from the same prior and the same study design (M=300, K=3, D=32, R=30, i.i.d. Normal features). True α values are sampled from the same Lognormal\((3.0, 0.75)\) prior, and no Claude or GPT-4o response data enters the simulation. The two recoveries differ only in the random seeds of their 20 iterations.

Because true α lies on a wide multiplicative scale (≈ 5–100 under the prior), absolute Bias, RMSE, and CI Width scale roughly with the magnitude of the α realisations drawn in any given recovery run. Anchoring interpretation on the relative columns (Rel. Bias, Rel. RMSE) and on Coverage removes this source of run-to-run variability and is the appropriate basis for cross-study comparison. Coverage sits at the nominal 90% level, indicating well-calibrated intervals. The relative bias (+26.0%) and relative RMSE (52.3%) do exceed the initial study's operational targets (roughly ±10% and below 25%, respectively); with only 20 iterations and a heavy-tailed Lognormal prior, point-estimate metrics of this kind are themselves noisy, and given the nominal coverage we judge α recovery fit for purpose for this study's primary analysis.

The β–δ coupling discussed in the foundational reports is expected to persist here as it does in the initial study; since this study targets α, the weaker recovery of \((\beta,\delta)\) does not compromise the primary analysis.

Note: SBC Not Performed

Simulation-based calibration (SBC) was not performed for this cell. The m_01 model passed SBC in the initial temperature study report. SBC validates the model’s computational faithfulness (i.e., whether the posterior computation recovers the prior under simulated data), which depends on the model structure and prior specification — both of which are identical here. The study-specific parameter recovery analysis (above) provides validation that the model is identifiable under this study’s actual data characteristics, including any differences in Claude’s embedding distributions relative to GPT-4o.

0.5 Results

0.5.1 Loading Posterior Draws

Show code
temperatures = [0.0, 0.2, 0.5, 0.8, 1.0]
temp_labels = {t: f"T={t}" for t in temperatures}

alpha_draws = {}
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    data = np.load(data_dir / f"alpha_draws_{key}.npz")
    alpha_draws[t] = data['alpha']
    print(f"  T={t}: {len(alpha_draws[t]):,} posterior draws loaded")

with open(data_dir / "primary_analysis.json") as f:
    analysis = json.load(f)

with open(data_dir / "fit_summary.json") as f:
    fit_summary = json.load(f)
  T=0.0: 4,000 posterior draws loaded
  T=0.2: 4,000 posterior draws loaded
  T=0.5: 4,000 posterior draws loaded
  T=0.8: 4,000 posterior draws loaded
  T=1.0: 4,000 posterior draws loaded

0.5.2 MCMC Diagnostics

Show code
diag_rows = []
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    with open(data_dir / f"diagnostics_{key}.txt") as f:
        diag_text = f.read()

    if "No divergent transitions" in diag_text:
        n_div = 0
    else:
        match = re.search(r'(\d+) of (\d+)', diag_text)
        n_div = int(match.group(1)) if match else 0

    rhat_ok = "R-hat values satisfactory" in diag_text or "R_hat" not in diag_text.replace("R-hat values satisfactory", "")
    ess_ok = "effective sample size satisfactory" in diag_text
    ebfmi_ok = "E-BFMI satisfactory" in diag_text

    diag_rows.append({
        'Temperature': t,
        'Divergences': f"{n_div}/4000",
        'R̂': '✓' if rhat_ok else '✗',
        'ESS': '✓' if ess_ok else '✗',
        'E-BFMI': '✓' if ebfmi_ok else '✗',
    })

pd.DataFrame(diag_rows)
Table 4: MCMC diagnostics for all five temperature conditions. All fits used 4 chains with 1,000 warmup and 1,000 sampling iterations each (4,000 post-warmup draws total).
Temperature Divergences R̂ ESS E-BFMI
0 0.0 1/4000 ✓ ✓ ✓
1 0.2 4/4000 ✓ ✓ ✓
2 0.5 1/4000 ✓ ✓ ✓
3 0.8 2/4000 ✓ ✓ ✓
4 1.0 2/4000 ✓ ✓ ✓

0.5.3 Posterior Summaries

Show code
summary = analysis['summary_table']

rows = []
for s in summary:
    rows.append({
        'Temperature': s['temperature'],
        'Median': f"{s['median']:.1f}",
        'Mean': f"{s['mean']:.1f}",
        'SD': f"{s['sd']:.1f}",
        '90% CI': f"[{s['ci_low']:.1f}, {s['ci_high']:.1f}]",
    })

pd.DataFrame(rows)
Table 5: Posterior summaries for the sensitivity parameter α at each temperature level. Intervals are 90% credible intervals.
Temperature Median Mean SD 90% CI
0 0.0 70.5 74.1 21.6 [44.6, 113.4]
1 0.2 52.6 54.9 14.9 [35.8, 81.4]
2 0.5 73.2 77.2 23.1 [47.0, 121.5]
3 0.8 70.6 74.0 22.0 [45.1, 115.3]
4 1.0 55.5 57.4 14.2 [37.0, 83.7]

The α estimates show a non-monotonic pattern: a dip at \(T = 0.2\), a rise at \(T = 0.5\) and \(T = 0.8\), then another dip at \(T = 1.0\). This echoes the oscillating pattern observed in the Ellsberg study with Claude.

0.5.4 Forest Plot

Show code
fig, ax = plt.subplots(figsize=(8, 5))

y_positions = np.arange(len(temperatures))[::-1]

for i, t in enumerate(temperatures):
    draws = alpha_draws[t]
    median = np.median(draws)
    q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])

    y = y_positions[i]
    ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
    ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
    ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
            markeredgecolor='white', markeredgewidth=1.5, zorder=5)

ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('Sensitivity (α)')
ax.set_title('Posterior Distributions of α by Temperature')
ax.grid(axis='x', alpha=0.3)
ax.grid(axis='y', alpha=0)

plt.tight_layout()
plt.show()
Figure 2: Forest plot of posterior α distributions across temperature conditions. Points show posterior medians; thick bars span the 50% credible interval; thin bars span the 90% credible interval.

0.5.5 Posterior Densities

Show code
from scipy.stats import gaussian_kde

fig, ax = plt.subplots(figsize=(8, 5))

for i, t in enumerate(temperatures):
    draws = alpha_draws[t]
    kde = gaussian_kde(draws)
    x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
    ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
    ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
            label=f'T = {t} (median = {np.median(draws):.0f})')

ax.set_xlabel('Sensitivity (α)')
ax.set_ylabel('Density')
ax.set_title('Posterior Density of α')
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()
Figure 3: Kernel density estimates of the posterior α distributions. The posteriors overlap heavily, with no clear ordering by temperature.

0.5.6 Posterior Predictive Checks

Show code
ppc_rows = []
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    with open(data_dir / f"ppc_{key}.json") as f:
        ppc = json.load(f)

    pvals = ppc['p_values']
    ppc_rows.append({
        'Temperature': t,
        'Log-likelihood': f"{pvals['ll']:.3f}",
        'Modal frequency': f"{pvals['modal']:.3f}",
        'Mean probability': f"{pvals['prob']:.3f}",
    })

pd.DataFrame(ppc_rows)
Table 6: Posterior predictive check p-values for each temperature condition. Values near 0.5 indicate good calibration.
Temperature Log-likelihood Modal frequency Mean probability
0 0.0 0.393 0.596 0.451
1 0.2 0.438 0.656 0.541
2 0.5 0.397 0.567 0.443
3 0.8 0.407 0.633 0.485
4 1.0 0.439 0.572 0.503

All posterior predictive p-values fall within \([0.3, 0.7]\), indicating adequate model fit at every temperature level.
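For reference, the mechanics behind these p-values can be sketched generically. The report's actual statistics (log-likelihood, modal choice frequency, mean choice probability) are read from the `ppc_*.json` files; below, a normal distribution stands in for the replicated statistic, so the numbers are purely illustrative.

```python
import numpy as np

# Sketch of a posterior predictive check: for each posterior draw, simulate
# a replicated dataset, compute a test statistic on it, and report the
# fraction of replicates at or above the observed value.
rng = np.random.default_rng(1)

n_draws = 4000
observed_stat = 0.62                                          # hypothetical observed value
replicated = rng.normal(loc=0.60, scale=0.03, size=n_draws)   # stand-in replicate statistics

p_value = np.mean(replicated >= observed_stat)
print(f"Posterior predictive p-value: {p_value:.3f}")
```

Values near 0.5 mean the observed statistic sits in the middle of the replicated distribution; values near 0 or 1 flag a systematic discrepancy between model and data.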

0.6 Monotonicity Analysis

0.6.1 Global Slope

Show code
slopes = analysis['slope']

temp_array = np.array(temperatures)
slope_draws = []
for draw_idx in range(len(alpha_draws[temperatures[0]])):
    alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in temperatures])
    b = np.cov(temp_array, alphas_at_draw)[0, 1] / np.var(temp_array)
    slope_draws.append(b)
slope_draws = np.array(slope_draws)

fig, ax = plt.subplots(figsize=(8, 4))

kde = gaussian_kde(slope_draws)
x_grid = np.linspace(np.percentile(slope_draws, 0.5), np.percentile(slope_draws, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.3, color=SEU_COLORS['primary'])
ax.plot(x_grid, kde(x_grid), color=SEU_COLORS['primary'], linewidth=2)

median_slope = np.median(slope_draws)
ax.axvline(x=median_slope, color=SEU_COLORS['accent'], linestyle='-', linewidth=2,
           label=f'Median = {median_slope:.1f}')
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5, label='No effect')

q05, q95 = np.percentile(slope_draws, [5, 95])
mask = (x_grid >= q05) & (x_grid <= q95)
ax.fill_between(x_grid[mask], kde(x_grid[mask]), alpha=0.15, color=SEU_COLORS['accent'])
ax.axvline(x=q05, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.axvline(x=q95, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)

ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Posterior Distribution of Temperature–Sensitivity Slope')
ax.legend()

plt.tight_layout()
plt.show()

print(f"Slope summary:")
print(f"  Median:  {median_slope:.1f}")
print(f"  90% CI:  [{q05:.1f}, {q95:.1f}]")
print(f"  P(slope < 0): {np.mean(slope_draws < 0):.3f}")
Figure 4: Posterior distribution of the slope Δα/ΔT. The distribution is centred near zero, providing essentially no evidence for a negative relationship.
Slope summary:
  Median:  -3.6
  90% CI:  [-53.6, 38.5]
  P(slope < 0): 0.560

0.6.2 Pairwise Comparisons

Show code
pairs = analysis['pairwise_comparisons']

pair_rows = []
for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    if prob > 0.95:
        strength = '●●● (strong)'
    elif prob > 0.8:
        strength = '●● (moderate)'
    elif prob > 0.65:
        strength = '● (weak)'
    elif prob < 0.35:
        strength = '○ (reversed)'
    else:
        strength = '— (indistinguishable)'
    pair_rows.append({
        'Comparison': f'α(T={t1}) > α(T={t2})',
        'P': f'{prob:.3f}',
        'Evidence': strength,
    })

pd.DataFrame(pair_rows)
Table 7: Posterior probability that α is higher at the lower temperature in each pair.
Comparison P Evidence
0 α(T=0.0) > α(T=0.2) 0.782 ● (weak)
1 α(T=0.0) > α(T=0.5) 0.463 — (indistinguishable)
2 α(T=0.0) > α(T=0.8) 0.501 — (indistinguishable)
3 α(T=0.0) > α(T=1.0) 0.743 ● (weak)
4 α(T=0.2) > α(T=0.5) 0.191 ○ (reversed)
5 α(T=0.2) > α(T=0.8) 0.225 ○ (reversed)
6 α(T=0.2) > α(T=1.0) 0.435 — (indistinguishable)
7 α(T=0.5) > α(T=0.8) 0.539 — (indistinguishable)
8 α(T=0.5) > α(T=1.0) 0.768 ● (weak)
9 α(T=0.8) > α(T=1.0) 0.732 ● (weak)

0.6.3 Strict Monotonicity

Show code
n_draws = len(alpha_draws[0.0])
strictly_decreasing = 0

for i in range(n_draws):
    vals = [alpha_draws[t][i] for t in temperatures]
    if all(vals[j] > vals[j+1] for j in range(len(vals)-1)):
        strictly_decreasing += 1

p_mono = strictly_decreasing / n_draws
print(f"P(α strictly decreasing across all T): {p_mono:.4f}")
P(α strictly decreasing across all T): 0.0077

0.7 Comparison with Initial Temperature Study

This comparison isolates the LLM effect: both studies use the same insurance triage task and m_01 model, but different LLMs.

Show code
initial_data_dir = Path("..") / "temperature_study" / "data"

with open(initial_data_dir / "primary_analysis.json") as f:
    initial_analysis = json.load(f)

initial_temps = [s['temperature'] for s in initial_analysis['summary_table']]
initial_medians = [s['median'] for s in initial_analysis['summary_table']]
initial_lows = [s['ci_low'] for s in initial_analysis['summary_table']]
initial_highs = [s['ci_high'] for s in initial_analysis['summary_table']]

this_medians = [s['median'] for s in analysis['summary_table']]
this_lows = [s['ci_low'] for s in analysis['summary_table']]
this_highs = [s['ci_high'] for s in analysis['summary_table']]

fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

ax = axes[0]
ax.errorbar(initial_temps, initial_medians,
            yerr=[np.array(initial_medians) - np.array(initial_lows),
                  np.array(initial_highs) - np.array(initial_medians)],
            fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
            capsize=5, capthick=1.5)
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('GPT-4o × Insurance (Initial Study)\nP(slope < 0) > 0.99')
ax.set_xticks(initial_temps)

ax = axes[1]
ax.errorbar(temperatures, this_medians,
            yerr=[np.array(this_medians) - np.array(this_lows),
                  np.array(this_highs) - np.array(this_medians)],
            fmt='o-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
            capsize=5, capthick=1.5)
ax.set_xlabel('Temperature')
ax.set_title(f'Claude × Insurance (This Study)\nP(slope < 0) ≈ {analysis["slope"]["p_negative"]:.2f}')
ax.set_xticks(temperatures)

plt.tight_layout()
plt.show()
Figure 5: Cross-study comparison isolating the LLM effect. Left: GPT-4o (initial study) shows clear monotonic decline. Right: Claude (this study) shows no systematic trend. Both use the insurance triage task with K=3.
Show code
# Handle different field names in primary_analysis.json
initial_slope_val = initial_analysis['slope'].get('median', initial_analysis['slope'].get('slope', 'N/A'))
initial_p_neg = initial_analysis['slope'].get('p_negative', None)

cross = pd.DataFrame([
    {'Study': 'GPT-4o × Insurance (initial)',
     'Slope median': f"{initial_slope_val:.1f}" if isinstance(initial_slope_val, (int, float)) else initial_slope_val,
     'Slope 90% CI': f"[{initial_analysis['slope']['ci_low']:.1f}, {initial_analysis['slope']['ci_high']:.1f}]",
     'P(slope < 0)': f"{initial_p_neg:.3f}" if initial_p_neg is not None else '> 0.99',
     'P(strict mono)': f"{initial_analysis['monotonicity_prob']:.3f}"},
    {'Study': 'Claude × Insurance (this)',
     'Slope median': f"{analysis['slope']['median']:.1f}",
     'Slope 90% CI': f"[{analysis['slope']['ci_low']:.1f}, {analysis['slope']['ci_high']:.1f}]",
     'P(slope < 0)': f"{analysis['slope']['p_negative']:.3f}",
     'P(strict mono)': f"{analysis['monotonicity_prob']:.3f}"},
])
cross
Table 8: Cross-study comparison isolating the LLM effect (same task, different LLMs).
Study Slope median Slope 90% CI P(slope < 0) P(strict mono)
0 GPT-4o × Insurance (initial) -24.6 [-52.4, -6.7] > 0.99 0.125
1 Claude × Insurance (this) -2.9 [-42.9, 30.8] 0.560 0.008

0.7.1 Formal Cross-Study Slope Comparison

To move beyond visual comparison, we compute the posterior probability that the GPT-4o slope is more negative than the Claude slope. Because the two studies were fit independently, their slope posteriors are independent, so we can pair draws from each and estimate this probability by Monte Carlo.

Show code
# Load GPT-4o alpha draws from the initial study
initial_temps_grid = [0.0, 0.3, 0.7, 1.0, 1.5]
initial_alpha_draws = {}
for t in initial_temps_grid:
    key = f"T{str(t).replace('.', '_')}"
    data = np.load(initial_data_dir / f"alpha_draws_{key}.npz")
    initial_alpha_draws[t] = data['alpha']

# Compute GPT-4o slope posterior
initial_temp_array = np.array(initial_temps_grid)
n_draws_initial = len(initial_alpha_draws[initial_temps_grid[0]])
initial_slope_draws = np.empty(n_draws_initial)
for i in range(n_draws_initial):
    alphas_i = np.array([initial_alpha_draws[t][i] for t in initial_temps_grid])
    # Least-squares slope; ddof=1 matches np.cov's default normalisation
    # (mixing np.cov with np.var's default ddof=0 inflates the slope by n/(n-1))
    initial_slope_draws[i] = (np.cov(initial_temp_array, alphas_i)[0, 1]
                              / np.var(initial_temp_array, ddof=1))

# Use minimum draw count for paired comparison
n_compare = min(len(slope_draws), len(initial_slope_draws))
p_gpt4o_more_negative = np.mean(initial_slope_draws[:n_compare] < slope_draws[:n_compare])

print("Cross-study slope comparison:")
print(f"  GPT-4o slope: median = {np.median(initial_slope_draws):.1f}, "
      f"90% CI = [{np.percentile(initial_slope_draws, 5):.1f}, {np.percentile(initial_slope_draws, 95):.1f}]")
print(f"  Claude slope:  median = {np.median(slope_draws):.1f}, "
      f"90% CI = [{np.percentile(slope_draws, 5):.1f}, {np.percentile(slope_draws, 95):.1f}]")
print(f"  P(GPT-4o slope < Claude slope) = {p_gpt4o_more_negative:.3f}")
print(f"  (i.e., probability that GPT-4o has a more negative temperature effect)")
Cross-study slope comparison:
  GPT-4o slope: median = -24.6, 90% CI = [-52.4, -6.7]
  Claude slope:  median = -2.9, 90% CI = [-42.9, 30.8]
  P(GPT-4o slope < Claude slope) = 0.817
  (i.e., probability that GPT-4o has a more negative temperature effect)
  P(GPT-4o slope < Claude slope) = 0.817
  (i.e., probability that GPT-4o has a more negative temperature effect)

This provides a formal posterior quantity for the core claim that the temperature–sensitivity relationship differs between LLMs.

Note: Temperature Range Caveat for Slope Comparison

The GPT-4o slope is estimated over a wider temperature range (\(\Delta T = 1.5\)) than the Claude slope (\(\Delta T = 1.0\)). If the GPT-4o relationship is nonlinear (with steeper decline at higher temperatures), the wider range inflates the GPT-4o slope magnitude. However, even restricting attention to the qualitative contrast — clear monotonic decline vs. flat oscillation — the LLM effect is unambiguous.
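This caveat can be checked directly by recomputing the GPT-4o slope using only the temperatures shared with the Claude grid. The sketch below shows the computation on a single synthetic draw; the grids are the studies' actual grids, but the `ls_slope` helper and the α values are illustrative stand-ins, not study results.

```python
# Least-squares slope of alpha on temperature, recomputed on a restricted grid.
# The alpha values below are a single hypothetical draw, not study results.

def ls_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

full_grid = [0.0, 0.3, 0.7, 1.0, 1.5]             # GPT-4o study grid
shared_grid = [t for t in full_grid if t <= 1.0]  # overlap with the Claude grid

# Hypothetical alphas whose decline steepens at high temperature
alphas = {0.0: 80.0, 0.3: 75.0, 0.7: 65.0, 1.0: 50.0, 1.5: 20.0}

slope_full = ls_slope(full_grid, [alphas[t] for t in full_grid])
slope_shared = ls_slope(shared_grid, [alphas[t] for t in shared_grid])
print(f"full range: {slope_full:.1f}, shared range: {slope_shared:.1f}")
```

Applied per posterior draw, this yields a restricted-range slope posterior that is directly comparable to the Claude slope; with a convex decline, the shared-range slope is the shallower of the two.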

0.7.2 Characterising the Oscillatory Pattern

The non-monotonic pattern in Claude’s α estimates — a dip at \(T = 0.2\), rise at \(T = 0.5\) and \(T = 0.8\), dip at \(T = 1.0\) — echoes the oscillation seen in the Ellsberg study. Is this a genuine feature of Claude’s behaviour, or merely posterior noise around a flat line?

Show code
# Compute the range (max - min) of alpha across temperatures for each posterior draw
alpha_range_draws = np.empty(n_draws)
for i in range(n_draws):
    vals = np.array([alpha_draws[t][i] for t in temperatures])
    alpha_range_draws[i] = np.max(vals) - np.min(vals)

# Summary statistics
median_range = np.median(alpha_range_draws)
q05_range, q95_range = np.percentile(alpha_range_draws, [5, 95])

# Flag pairwise comparisons that notably reverse the monotonic prediction
notable_reversals = []
for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    if prob < 0.35:  # well below 0.5, i.e. reversed relative to a monotone decline
        notable_reversals.append((t1, t2, prob))

print(f"Oscillation analysis:")
print(f"  Posterior range of α across temperatures:")
print(f"    Median: {median_range:.1f}")
print(f"    90% CI: [{q05_range:.1f}, {q95_range:.1f}]")
print()
print(f"  Pairwise probabilities near or below 0.50 (potential reversals):")
for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    if prob < 0.50:
        print(f"    P(α(T={t1}) > α(T={t2})) = {prob:.3f}  [reversed from monotonic prediction]")
print()
print(f"  Notable reversals (P < 0.35): {len(notable_reversals)}")
for t1, t2, p in notable_reversals:
    print(f"    P(α(T={t1}) > α(T={t2})) = {p:.3f}")
Oscillation analysis:
  Posterior range of α across temperatures:
    Median: 45.7
    90% CI: [20.6, 91.4]

  Pairwise probabilities near or below 0.50 (potential reversals):
    P(α(T=0.0) > α(T=0.5)) = 0.463  [reversed from monotonic prediction]
    P(α(T=0.2) > α(T=0.5)) = 0.191  [reversed from monotonic prediction]
    P(α(T=0.2) > α(T=0.8)) = 0.225  [reversed from monotonic prediction]
    P(α(T=0.2) > α(T=1.0)) = 0.435  [reversed from monotonic prediction]

  Notable reversals (P < 0.35): 2
    P(α(T=0.2) > α(T=0.5)) = 0.191
    P(α(T=0.2) > α(T=0.8)) = 0.225
Assessment:
  Pairwise comparisons near chance (0.35-0.65): 4/10
  Any pairwise comparison reaching notable level (>0.80 or <0.20): True
  → Some pairwise comparisons suggest the oscillation may have a genuine component,
    warranting further investigation.
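A complementary check is a Monte Carlo null for the α range: if all five temperatures shared one true α, how large a max-min spread would posterior uncertainty alone produce? The sketch below assumes a common posterior standard deviation of 15 for α in each condition, a hypothetical value; the real check would use each condition's actual posterior sd.

```python
# Monte Carlo null for the max-min range of K posterior alpha estimates under
# a flat truth. POST_SD is a hypothetical common posterior sd, not a study value.
import random

random.seed(0)
K = 5            # number of temperature conditions
POST_SD = 15.0   # assumed posterior sd of alpha per condition (hypothetical)

null_ranges = []
for _ in range(10_000):
    draws = [random.gauss(0.0, POST_SD) for _ in range(K)]
    null_ranges.append(max(draws) - min(draws))

null_ranges.sort()
median_null = null_ranges[len(null_ranges) // 2]
q95_null = null_ranges[int(0.95 * len(null_ranges))]
print(f"null range: median = {median_null:.1f}, 95th percentile = {q95_null:.1f}")
```

Comparing the observed range posterior (median 45.7) against such a null quantifies how surprising the oscillation is under flatness; the verdict depends entirely on the actual per-condition posterior sds, so the numbers here are illustrative only.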

0.7.3 Summary Heatmap

Show code
pairs = analysis['pairwise_comparisons']

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

medians = [np.median(alpha_draws[t]) for t in temperatures]
q05s_plot = [np.percentile(alpha_draws[t], 5) for t in temperatures]
q95s_plot = [np.percentile(alpha_draws[t], 95) for t in temperatures]

axes[0].errorbar(temperatures, medians,
                 yerr=[np.array(medians) - np.array(q05s_plot),
                       np.array(q95s_plot) - np.array(medians)],
                 fmt='o-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
                 capsize=5, capthick=1.5)
axes[0].set_xlabel('Temperature')
axes[0].set_ylabel('Sensitivity (α)')
axes[0].set_title('α vs. Temperature')
axes[0].set_xticks(temperatures)

n_temps = len(temperatures)
heatmap = np.full((n_temps, n_temps), np.nan)

for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    i = temperatures.index(float(t1))
    j = temperatures.index(float(t2))
    heatmap[i, j] = prob
    heatmap[j, i] = 1 - prob

np.fill_diagonal(heatmap, 0.5)

im = axes[1].imshow(heatmap, cmap='RdYlGn', vmin=0, vmax=1, aspect='equal')
axes[1].set_xticks(range(n_temps))
axes[1].set_xticklabels([f'{t}' for t in temperatures])
axes[1].set_yticks(range(n_temps))
axes[1].set_yticklabels([f'{t}' for t in temperatures])
axes[1].set_xlabel('Temperature (column)')
axes[1].set_ylabel('Temperature (row)')
axes[1].set_title('P(α_row > α_col)')

for i in range(n_temps):
    for j in range(n_temps):
        if not np.isnan(heatmap[i, j]):
            color = 'white' if heatmap[i, j] > 0.8 or heatmap[i, j] < 0.2 else 'black'
            axes[1].text(j, i, f'{heatmap[i, j]:.2f}', ha='center', va='center',
                        fontsize=9, color=color)

plt.colorbar(im, ax=axes[1], shrink=0.8)

plt.tight_layout()
plt.show()
Figure 6: Left: α vs. temperature showing the flat, non-monotonic pattern. Right: pairwise posterior probabilities P(α_row > α_col).

0.8 Discussion

0.8.1 Summary of Findings

Claude 3.5 Sonnet shows essentially no temperature–sensitivity relationship on the insurance triage task:

  1. No slope. The posterior slope \(\Delta\alpha / \Delta T\) has a median of \(\approx -3\), with the 90% CI spanning both positive and negative values. \(P(\text{slope} < 0) \approx 0.56\) — barely above chance.

  2. Non-monotonic pattern. The α estimates oscillate: a dip at \(T = 0.2\), a rise at \(T = 0.5\) and \(T = 0.8\), then a dip again at \(T = 1.0\). This mirrors the oscillatory pattern seen in the Ellsberg study with Claude. However, the oscillation analysis in Section 0.7.2 finds only two pairwise reversals below the 0.35 level, and only one of them (\(P(\alpha(T=0.2) > \alpha(T=0.5)) \approx 0.19\)) marginally crosses the 0.20 threshold, so the pattern is largely consistent with posterior noise around a flat function, with at most a weak genuine oscillatory component.

  3. Near-zero strict monotonicity. \(P(\alpha \text{ strictly decreasing across all five temperatures}) = 0.008\).

  4. Good model fit. Posterior predictive checks show no evidence of misfit, confirming the pattern reflects the data, not a modelling artefact.
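The flavour of such a check can be sketched with the softmax choice model: simulate replicated choice frequencies from posterior draws of α and locate the observed frequency within the replicated distribution. Every number below (expected utilities, observed statistic, posterior draws) is a hypothetical stand-in; the study's actual PPC results live in the ppc_T*.json files.

```python
# Toy posterior predictive check for a softmax choice model. All numbers are
# hypothetical stand-ins, not values from the study.
import math
import random

random.seed(4)

def choice_probs(eus, alpha):
    z = [alpha * u for u in eus]
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

eus = [0.50, 0.45, 0.30]     # hypothetical expected utilities, K = 3
observed_top_freq = 0.72     # hypothetical observed top-choice frequency
M = 300                      # trials per condition
alpha_post = [math.exp(random.gauss(3.0, 0.2)) for _ in range(2000)]  # stand-in

rep_freqs = []
for a in alpha_post:
    p_top = choice_probs(eus, a)[0]
    rep = sum(1 for _ in range(M) if random.random() < p_top) / M
    rep_freqs.append(rep)

p_value = sum(1 for r in rep_freqs if r >= observed_top_freq) / len(rep_freqs)
print(f"posterior predictive p = {p_value:.3f}")
```

A posterior predictive p-value near 0 or 1 would flag misfit; mid-range values indicate the replicated statistic comfortably covers the observed one.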

0.8.2 Implications for the Factorial Design

This cell, combined with the initial study (GPT-4o × Insurance), provides the clearest isolation of the LLM effect: same task, same statistical model (m_01), different LLM. The contrast is stark:

  • GPT-4o: slope \(\approx -25\), \(P(\text{slope} < 0) > 0.99\)
  • Claude: slope \(\approx -3\), \(P(\text{slope} < 0) \approx 0.56\)

The formal cross-study comparison (Section 0.7.1) provides a direct posterior quantity for this contrast. This strongly suggests the temperature–sensitivity effect is LLM-specific rather than a universal property of temperature scaling. The full \(2 \times 2\) factorial analysis (factorial synthesis report) formalises this comparison by computing the interaction between LLM and task factors.
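The interaction the synthesis report computes can be sketched as a difference of differences over paired slope draws. All four slope posteriors below are hypothetical placeholders (only the two insurance cells are reported in this document); the real computation substitutes each cell's actual draws.

```python
# Sketch of the 2x2 LLM-by-task interaction contrast on temperature slopes.
# The four slope posteriors are hypothetical placeholders, not study results.
import random

random.seed(3)

def slope_draws(mu, sd, n=4000):
    return [random.gauss(mu, sd) for _ in range(n)]

gap_insurance = [g - c for g, c in zip(slope_draws(-25.0, 12.0),   # GPT-4o x Ins
                                       slope_draws(-3.0, 20.0))]   # Claude x Ins
gap_ellsberg = [g - c for g, c in zip(slope_draws(-5.0, 15.0),     # GPT-4o x Ell
                                      slope_draws(-2.0, 18.0))]    # Claude x Ell

# Interaction: is the GPT-4o vs Claude slope gap larger on insurance?
interaction = [gi - ge for gi, ge in zip(gap_insurance, gap_ellsberg)]
p_negative = sum(1 for x in interaction if x < 0) / len(interaction)
print(f"P(interaction < 0) = {p_negative:.3f}")
```

A posterior concentrated away from zero on this contrast would indicate that the LLM effect itself depends on the task; the placeholder numbers here only illustrate the mechanics.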

0.8.3 Why Does the Effect Differ Across LLMs?

The central finding of this study — that Claude does not exhibit the temperature–sensitivity relationship observed with GPT-4o — invites the question of why the two LLMs diverge. Several candidate explanations merit consideration, though the current data cannot definitively adjudicate among them. We label these as post-hoc hypotheses to be tested in future work.

Baseline decision noise. If Claude 3.5 Sonnet already operates with relatively high internal decision noise at \(T = 0.0\) (relative to GPT-4o), increasing temperature may have less room to degrade choice consistency. In the quantal response framework, this would correspond to Claude starting from a lower “effective α” even at its most deterministic setting, with temperature adding noise to a process already near a noise floor. The absolute α levels observed here — which cluster in a relatively narrow range — are consistent with this hypothesis, though they are not conclusive.

Assessment-stage vs. choice-stage effects. Temperature affects both the assessment text (which determines the embeddings and hence the features entering the model) and potentially the choice itself. If Claude produces more stereotyped or less temperature-sensitive assessments than GPT-4o across the temperature range, the features may not vary enough to drive α differences. The comparable PCA variance profiles across studies (noted in Section 0.2) suggest that gross differences in embedding geometry are not the explanation, but more fine-grained measures — such as the mean pairwise distance between embeddings at each temperature — could reveal subtler differences in assessment diversity.
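One such measure is easy to sketch: the mean pairwise cosine distance among the assessment embeddings generated at each temperature. The toy two-dimensional vectors below are stand-ins for text-embedding-3-small vectors; only the computation is meant literally.

```python
# Mean pairwise cosine distance as a simple assessment-diversity measure.
# The toy embeddings are hypothetical stand-ins for real embedding vectors.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def mean_pairwise_distance(embeddings):
    n = len(embeddings)
    dists = [cosine_distance(embeddings[i], embeddings[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)

# Toy vectors: tightly clustered (low diversity) vs spread out (high diversity)
emb_low_T = [[1.0, 0.0], [0.99, 0.14], [0.98, 0.20]]
emb_high_T = [[1.0, 0.0], [0.5, 0.87], [0.0, 1.0]]
print(mean_pairwise_distance(emb_low_T), mean_pairwise_distance(emb_high_T))
```

Computed per temperature on the stored embeddings, a flat diversity profile for Claude against a rising one for GPT-4o would support this hypothesis.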

Training and alignment differences. Claude’s post-training process (Constitutional AI, RLHF) may produce decision-making behaviour that is more robust to temperature perturbation than GPT-4o’s. If Claude’s alignment training effectively regularises its outputs toward consistent, policy-compliant responses, temperature increases might modulate surface-level linguistic variety without substantially affecting the underlying decision process. This would be consistent with the observed pattern in which Claude produces varied text (different assessment wordings across temperatures) without changing which alternative it selects.

Non-equivalent temperature parameterisation. The external temperature parameter may not correspond to equivalent levels of next-token entropy across providers. Anthropic and OpenAI implement temperature scaling in architectures that differ in layer count, attention mechanism, and post-training procedures. A temperature of \(T = 0.5\) in one model may correspond to a very different effective sampling entropy than \(T = 0.5\) in the other. While this complicates quantitative slope comparisons, it cannot fully explain the qualitative contrast: GPT-4o shows a clear, strong monotonic decline whereas Claude shows no discernible trend whatsoever.
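The point about non-equivalent parameterisations follows directly from the softmax sampling rule: dividing logits by \(T\) means the entropy at a given nominal temperature depends on the model's logit scale. A minimal illustration, with hypothetical logits:

```python
# Same nominal temperature, different sampling entropy when logit scales differ.
# The logits are hypothetical; this is an illustration, not a claim about either API.
import math

def softmax(logits, T):
    z = [l / T for l in logits]
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

logits_a = [2.0, 1.0, 0.0]   # hypothetical "model A" logit scale
logits_b = [6.0, 3.0, 0.0]   # same ranking, three times sharper logits

for T in (0.5, 1.0):
    print(f"T={T}: entropy A = {entropy(softmax(logits_a, T)):.3f}, "
          f"B = {entropy(softmax(logits_b, T)):.3f}")
```

Because logits_b is exactly three times logits_a, model B at \(T = 3.0\) reproduces model A at \(T = 1.0\); identical nominal temperatures therefore need not correspond to identical sampling entropy across providers.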

These hypotheses are not mutually exclusive and may interact. Testing them would require access to model internals (e.g., next-token entropy distributions at each temperature level) or controlled experiments manipulating assessment diversity independently of temperature. The key empirical contribution of this study is establishing that the LLM-specificity is robust: it holds when the task is held constant.

0.8.4 Connection to JDM Literature

The finding that different LLMs respond differently to temperature manipulation resonates with a substantial literature on individual differences in stochastic choice. In human decision-making, individuals vary in the degree to which their choices approximate expected utility maximisation, and this heterogeneity is a persistent finding across paradigms (Hey & Orme, 1994; Wilcox, 2011). The sensitivity parameter α in our softmax framework is analogous to the “trembling hand” or noise parameter in stochastic choice models: some decision-makers exhibit consistently high α (near-deterministic EU maximisation) while others show low α (near-random choice).

From this perspective, LLMs may be viewed as exhibiting model-specific “decision styles” — with GPT-4o showing a clear temperature-dependent gradient in decision noise while Claude maintains relatively stable sensitivity. This parallels human findings in which some individuals’ choice consistency is highly malleable (e.g., through time pressure or cognitive load) while others’ is stable (Busemeyer & Townsend, 1993). The LLM-specificity finding suggests that temperature is not a universal dial for decision quality, but rather interacts with model-specific properties in ways that require empirical characterisation for each LLM family.
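For concreteness, the quantal response rule underlying α can be written as a two-line softmax: high α approaches deterministic EU maximisation, while α near zero approaches uniform choice. The expected utilities below are hypothetical values for a K = 3 triage decision.

```python
# Quantal-response (softmax) choice rule with sensitivity alpha. The expected
# utilities are hypothetical values for a K = 3 decision, not study inputs.
import math

def choice_probs(eus, alpha):
    z = [alpha * u for u in eus]
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

eus = [0.50, 0.45, 0.30]
print(choice_probs(eus, 0.1))    # low sensitivity: near-uniform choice
print(choice_probs(eus, 100.0))  # high sensitivity: near-deterministic choice
```

In this parameterisation, a temperature-dependent α gradient (as with GPT-4o) corresponds to choices drifting from the second pattern toward the first as temperature rises.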

0.8.5 Implications for Applied LLM Deployment

The LLM-specific nature of the temperature–sensitivity relationship has practical consequences for users who tune temperature to control decision quality across LLM providers. A practitioner who observes that lowering temperature improves GPT-4o’s choice consistency cannot assume the same calibration will transfer to Claude — or to any other model. Temperature tuning strategies must be validated empirically for each provider, particularly in high-stakes decision-support applications.

0.8.6 Limitations

Several limitations should be noted:

  1. Temperature range and grid comparability. The Anthropic API constrains temperature to \([0.0, 1.0]\), whereas the initial GPT-4o study used \([0.0, 1.5]\). The narrower range reduces statistical power to detect a temperature effect and complicates quantitative slope comparisons. Additionally, the absolute temperature values may not correspond to equivalent next-token entropy levels across providers. While the qualitative contrast (clear monotonic decline vs. flat pattern) is unambiguous, precise numerical comparisons of slope magnitudes should be interpreted cautiously.

  2. Prior sensitivity. The m_01 prior on α is Lognormal(3.0, 0.75), inherited from the initial temperature study. We did not conduct a formal prior sensitivity analysis for this cell. However, with \(M \approx 300\) observations per condition and α posteriors that are substantially narrower than the prior, the likelihood dominates the prior contribution. A formal check under alternative priors (e.g., Lognormal(3.0, 1.0) or Lognormal(2.5, 0.75)) would provide additional reassurance that the null finding is prior-robust, though we expect the conclusion to be unchanged given the data volume.

  3. SBC inherited, not re-run. Simulation-Based Calibration was not performed on this dataset. We rely on the SBC validation from the initial GPT-4o temperature study (Initial Study), which used the same m_01.stan model and the same α ~ LogNormal(3.0, 0.75) prior; SBC tests the sampler under the prior-and-likelihood and is therefore inherited validly across applications that share both. The phrase “SBC by proxy” used in earlier drafts refers to this inheritance and should not be read as evidence that an SBC was conducted on Claude data. Direct validation under this study’s conditions comes from the parameter recovery analysis above.

  4. Post-hoc design. The Claude insurance cell was added after the initial GPT-4o temperature finding, as part of the factorial extension; the design and analysis plan were finalised reactively rather than pre-registered before data collection. Within this study we describe the m_01 fits and slope test as confirmatory in the narrow sense that the model, prior, and slope contrast were fixed before fitting; the broader factorial framing is exploratory and is treated as such in the factorial synthesis report.

  5. Single embedding model. All studies in the factorial design use text-embedding-3-small (OpenAI) for embedding construction. Different embedding models might yield different feature spaces and hence different α estimates, though the consistency of the PCA variance profile across studies suggests robustness to this choice.

  6. Position bias. While the \(3\times\) position counterbalancing design addresses systematic position effects, we did not conduct a separate analysis of whether Claude’s position bias patterns differ from GPT-4o’s. Differential position sensitivity could in principle contribute to the observed differences, though it seems unlikely to explain the qualitative absence of a temperature trend.

  7. Formal interaction test deferred. The full 2×2 interaction test (LLM × Task) is computed in the factorial synthesis report rather than here, to avoid redundancy and ensure consistency across cells.
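Limitation 2 can be probed without refitting, by importance-reweighting the stored α draws with the ratio of an alternative prior to the original Lognormal(3.0, 0.75). The sketch below uses synthetic stand-in draws; the real check would load the alpha_draws_T*.npz snapshots.

```python
# Prior-sensitivity check by importance reweighting: weight each posterior draw
# by new_prior(alpha) / old_prior(alpha). Draws here are synthetic stand-ins.
import math
import random

def lognorm_logpdf(x, mu, sigma):
    return (-math.log(x * sigma * math.sqrt(2.0 * math.pi))
            - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2))

random.seed(2)
alpha_draws = [math.exp(random.gauss(3.2, 0.2)) for _ in range(4000)]  # stand-in

# Reweight from Lognormal(3.0, 0.75) to a wider Lognormal(3.0, 1.0) prior
log_w = [lognorm_logpdf(a, 3.0, 1.0) - lognorm_logpdf(a, 3.0, 0.75)
         for a in alpha_draws]
m = max(log_w)
w = [math.exp(lw - m) for lw in log_w]  # stabilise before exponentiating

plain_mean = sum(alpha_draws) / len(alpha_draws)
reweighted_mean = sum(wi * a for wi, a in zip(w, alpha_draws)) / sum(w)
print(f"plain mean = {plain_mean:.1f}, reweighted mean = {reweighted_mean:.1f}")
```

When the likelihood dominates, the reweighted summaries barely move; large shifts or degenerate weights would signal genuine prior sensitivity and call for a proper refit.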

0.9 Reproducibility

Codebase version: 21d6883 (git commit hash at time of analysis)

0.9.1 Data Snapshot

File                    Description
alpha_draws_T*.npz      Posterior draws of α (4,000 per condition)
ppc_T*.json             Posterior predictive check results
diagnostics_T*.txt      CmdStan diagnostic output
stan_data_T*.json       Stan-ready data (for refitting)
fit_summary.json        Summary statistics across conditions
primary_analysis.json   Pre-computed monotonicity and slope statistics
run_summary.json        Pipeline metadata and configuration
study_config.yaml       Frozen copy of the study configuration

0.9.2 Refitting from Source

Show code
# Uncomment to refit from source data (requires CmdStanPy)
#
# import numpy as np
# import cmdstanpy
# model = cmdstanpy.CmdStanModel(stan_file="models/m_01.stan")
#
# for t in [0.0, 0.2, 0.5, 0.8, 1.0]:
#     key = f"T{str(t).replace('.', '_')}"
#     fit = model.sample(
#         data=f"data/stan_data_{key}.json",
#         chains=4,
#         iter_warmup=1000,
#         iter_sampling=1000,
#         seed=42,
#     )
#     print(f"T={t}: alpha median = {np.median(fit.stan_variable('alpha')):.1f}")

0.10 References


Citation

BibTeX citation:
@online{helzner2026,
  author = {Helzner, Jeff},
  title = {Temperature and {SEU} {Sensitivity:} {Claude} × {Insurance}
    {Study}},
  date = {2026-05-12},
  url = {https://jeffhelzner.github.io/seu-sensitivity/applications/claude_insurance_study/01_claude_insurance_study.html},
  langid = {en}
}
For attribution, please cite this work as:
Helzner, Jeff. 2026. “Temperature and SEU Sensitivity: Claude × Insurance Study.” SEU Sensitivity Project, May 12. https://jeffhelzner.github.io/seu-sensitivity/applications/claude_insurance_study/01_claude_insurance_study.html.