---
title: "Temperature and SEU Sensitivity: GPT-4o × Ellsberg Study"
subtitle: "Application Report: GPT-4o × Ellsberg (Cell 1,2)"
description: |
  An investigation of how LLM sampling temperature affects estimated
  sensitivity (α) to subjective expected utility maximisation, using
  Ellsberg-style gambles (K=4) and GPT-4o (OpenAI).
  This study isolates the task effect by pairing GPT-4o with the Ellsberg
  gamble domain used in the Ellsberg study.
categories: [applications, temperature, ellsberg, m_02, openai, factorial]
execute:
  cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
reports_root = os.path.normpath(os.path.join(os.getcwd(), '..', '..'))
project_root = os.path.dirname(reports_root)
sys.path.insert(0, reports_root)
sys.path.insert(0, project_root)
import numpy as np
import json
import re
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import pandas as pd
from report_utils import set_seu_style, SEU_COLORS, SEU_PALETTE
set_seu_style()
from pathlib import Path
data_dir = Path("data")
```
## Introduction {#sec-introduction}
The initial temperature study ([Report 1](../temperature_study/01_initial_study.qmd)) found a clear negative relationship between LLM sampling temperature and estimated sensitivity $\alpha$, using **GPT-4o** on **insurance claims triage** ($K = 3$). The Ellsberg study ([Report 2](../ellsberg_study/01_ellsberg_study.qmd)) then changed **both** the LLM (to Claude 3.5 Sonnet) and the task (to Ellsberg gambles, $K = 4$) simultaneously, finding a weaker effect — but the confounded design made it impossible to disentangle the two factors.
This study completes the $2 \times 2$ factorial design by pairing **GPT-4o** with the **Ellsberg gamble** task — holding the LLM constant relative to the initial study while varying the task, and holding the task constant relative to the Ellsberg study while varying the LLM.
| | Insurance (K=3) | Ellsberg (K=4) |
|---|---|---|
| **GPT-4o** | Initial study ✓ | **This study** |
| **Claude** | Claude insurance study ✓ | Ellsberg study ✓ |
::: {.callout-important}
## Summary of Findings
GPT-4o shows a **clear negative temperature–α relationship** on the Ellsberg gamble task ($\Delta\alpha / \Delta T \approx -38$, $P(\text{slope} < 0) \approx 0.98$), **replicating** the pattern observed in the initial temperature study with insurance claims triage. Combined with the Claude × Insurance cell (which shows no such relationship), this confirms that the **LLM** is the dominant factor: GPT-4o exhibits the temperature–sensitivity effect regardless of task domain, while Claude does not.
:::
## Experimental Design {#sec-design}
### Task and Conditions
We use the Ellsberg gamble task from the [Ellsberg study](../ellsberg_study/01_ellsberg_study.qmd): GPT-4o evaluates gambles involving urns with unknown colour compositions. The alternative pool consists of 30 Ellsberg-style urn gambles, each with $K = 4$ possible monetary consequences (\$0, \$1, \$2, \$3) corresponding to drawing different-coloured balls from urns of 60–120 balls. The gambles are organized into three ambiguity tiers:
- **Tier 1 (No ambiguity; E01–E08):** 8 gambles with fully specified ball counts — pure risk with known objective probabilities.
- **Tier 2 (Moderate ambiguity; E09–E20):** 12 gambles with partially bounded counts ("at least"/"at most" constraints); urn totals always stated.
- **Tier 3 (High ambiguity; E21–E30):** 10 gambles where only one colour's count is known; the remainder is an unknown mixture — closest to Ellsberg's (1961) original setup.
The LLM first *assesses* each gamble individually, and then makes a *choice* among alternatives in each problem. The gamble set is identical to that used in the Claude × Ellsberg study, ensuring that any differences in results are attributable to the LLM, not the stimuli.
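As a concrete illustration (using hypothetical ball counts, not the study's actual stimuli), a Tier 1 gamble reduces to an expected-value computation over fully known probabilities:

```python
# Hypothetical Tier-1 urn: all counts fully specified (pure risk, K = 4)
counts = {"red": 30, "black": 20, "yellow": 25, "green": 15}
payoffs = {"red": 0.0, "black": 1.0, "yellow": 2.0, "green": 3.0}  # $0–$3

total = sum(counts.values())
ev = sum(counts[c] / total * payoffs[c] for c in counts)
print(f"Expected value: ${ev:.2f}")  # $1.28 for this urn
```

In Tiers 2 and 3, some counts are only bounded or unknown, so no single objective probability vector exists — which is precisely what makes the task a probe of behaviour under ambiguity.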
Five temperature levels define the between-condition factor:
```{python}
#| label: tbl-conditions
#| tbl-cap: "Experimental conditions. Each temperature level constitutes a separate model fit."
conditions = pd.DataFrame({
'Level': [1, 2, 3, 4, 5],
'Temperature': [0.0, 0.3, 0.7, 1.0, 1.5],
'Description': [
'Deterministic (greedy decoding)',
'Low variance',
'Moderate variance',
'High variance',
'Very high variance'
]
})
conditions
```
::: {.callout-note}
## Temperature Range
The OpenAI API supports temperature values in $[0.0, 2.0]$, allowing a wider range than the $[0.0, 1.0]$ supported by Anthropic. The temperature values $\{0.0, 0.3, 0.7, 1.0, 1.5\}$ match those used in the initial temperature study (GPT-4o × Insurance), enabling direct comparison.
:::
### Design Parameters
```{python}
#| label: design-params
#| echo: true
import yaml

with open(data_dir / "study_config.yaml") as f:
    config = yaml.safe_load(f)
with open(data_dir / "run_summary.json") as f:
    run_summary = json.load(f)
print(f"Study Design:")
print(f" Decision problems (M): {config['num_problems']} base × {config['num_presentations']} presentations = {config['num_problems'] * config['num_presentations']}")
print(f" Alternatives per problem: {config['min_alternatives']}–{config['max_alternatives']}")
print(f" Consequences (K): {config['K']}")
print(f" Embedding dimensions (D): {config['target_dim']}")
print(f" Distinct alternatives (R): {run_summary['phases']['phase3_data_prep']['per_temperature']['0.0']['R']}")
print(f" LLM model: {config['llm_model']}")
print(f" Embedding model: {config['embedding_model']}")
print(f" Provider: {config['provider']}")
```
Each of the 100 base problems is presented $P = 3$ times with alternatives shuffled to different positions, yielding approximately $M = 300$ observations per temperature condition. The position counterbalancing mitigates any systematic position bias (e.g., a tendency to select the first-listed alternative), which was identified as a potential confound in early pilot work.
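The position counterbalancing can be sketched as follows (the function name, seeding, and gamble IDs are illustrative, not the study's actual pipeline):

```python
import random

def make_presentations(alternatives, P=3, seed=0):
    """Return P copies of a problem with alternatives in shuffled positions."""
    rng = random.Random(seed)
    presentations = []
    for _ in range(P):
        order = list(alternatives)  # copy so each shuffle is independent
        rng.shuffle(order)
        presentations.append(order)
    return presentations

problem = ["E03", "E11", "E24", "E07"]  # hypothetical alternative IDs
for p in make_presentations(problem, P=3):
    print(p)
```

Each presentation contains the same alternatives, so any residual position preference averages out across the three orderings.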
### Feature Construction
Alternative features are constructed through a two-stage embedding process. First, GPT-4o assesses each Ellsberg gamble at the relevant temperature, producing a natural-language evaluation. These assessments are embedded using `text-embedding-3-small` (OpenAI), yielding high-dimensional vectors. All embeddings across temperature conditions are pooled and projected via PCA to $D = 32$ dimensions.
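The pooling-then-projecting step can be sketched with a plain SVD. Random vectors stand in for the actual `text-embedding-3-small` outputs here (the 1536-dimensional width matches that model's documented default output size; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for 150 pooled assessment embeddings (30 gambles × 5 temperatures)
embeddings = rng.normal(size=(150, 1536))

centered = embeddings - embeddings.mean(axis=0)        # PCA centres first
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

D = 32
features = centered @ Vt[:D].T                         # project onto first D PCs
explained = (S[:D] ** 2).sum() / (S ** 2).sum()        # cumulative variance ratio
print(features.shape, f"{explained:.1%}")
```

Because the embeddings are pooled across temperature conditions before projection, all five fits share a common feature basis, which keeps $\beta$ comparable across conditions.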
```{python}
#| label: pca-variance
#| echo: false
pca = run_summary['phases']['phase3_data_prep']['pca_summary']
cumvar = np.cumsum(pca['explained_variance_ratio'])
print(f"PCA Summary:")
print(f" Components retained: {pca['n_components']}")
print(f" Total variance explained: {pca['total_explained_variance']:.1%}")
print(f" First 5 components: {cumvar[4]:.1%}")
print(f" First 10 components: {cumvar[9]:.1%}")
```
The PCA reduction to $D = 32$ components captures the large majority of variance in GPT-4o's Ellsberg gamble assessments. The fact that relatively few components account for most of the variance indicates that the embedding space, while nominally high-dimensional, has a moderate effective dimensionality — consistent with the assessment texts varying along a limited number of semantic axes related to gamble value, risk, and ambiguity.
### Data Quality
```{python}
#| label: data-quality
#| echo: false
na = run_summary['phases']['phase2b_choices']['na_summary']
print(f"NA Summary:")
print(f" Overall: {na['overall']['na']} / {na['overall']['total']} ({na['overall']['na_rate']:.1%})")
for key, val in na['per_temperature'].items():
    print(f"  {key}: {val['na']} / {val['total']} ({val['na_rate']:.1%})")
```
The overall NA rate is low (well under 1%), comparable to the rates observed in the other factorial cells. NA responses (cases where the LLM failed to produce a parseable choice) are excluded from analysis; at these rates, missingness is unlikely to introduce systematic bias.
### Comparison with Other Factorial Cells
```{python}
#| label: tbl-design-comparison
#| tbl-cap: "Design comparison across the 2×2 factorial. This study (Cell 1,2) shares GPT-4o with Cell (1,1) and shares Ellsberg gambles with Cell (2,2)."
comparison = pd.DataFrame({
'Parameter': ['LLM', 'Task domain', 'Consequences (K)',
'Alternatives (R)', 'Observations per T',
'Temperature range', 'Stan model'],
'Cell (1,1) Initial': ['GPT-4o', 'Insurance triage', '3',
'30', '~300', '[0.0, 0.3, 0.7, 1.0, 1.5]', 'm_01'],
'Cell (1,2) This study': ['GPT-4o', 'Ellsberg gambles', '4',
'30', '~300', '[0.0, 0.3, 0.7, 1.0, 1.5]', 'm_02'],
'Cell (2,2) Ellsberg': ['Claude 3.5 Sonnet', 'Ellsberg gambles', '4',
'30', '~300', '[0.0, 0.2, 0.5, 0.8, 1.0]', 'm_02'],
})
comparison
```
## Model and Prior Calibration {#sec-model}
### The m_02 Model Variant
We fit the **m_02** model — the same variant used in the Ellsberg study. The prior on $\alpha$ is calibrated for the Ellsberg gamble task's $K = 4$ consequence space:
| | m_0 (foundational) | m_01 (insurance) | m_02 (this study & Ellsberg) |
|---|---|---|---|
| $\alpha$ prior | $\text{Lognormal}(0, 1)$ | $\text{Lognormal}(3.0, 0.75)$ | $\text{Lognormal}(3.5, 0.75)$ |
| Prior median | $\approx 1$ | $\approx 20$ | $\approx 33$ |
| Prior 90% CI | $[0.19, 5.2]$ | $[5.9, 69]$ | $[9.6, 114]$ |
| $K$ | generic | 3 | 4 |
Using the same model and prior as the Ellsberg study ensures that any difference in results is attributable to the LLM change (GPT-4o vs. Claude), not to modelling differences.
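The prior quantiles follow directly from the lognormal definition: if $\log \alpha \sim \mathcal{N}(\mu, \sigma)$, the median is $e^{\mu}$ and the central 90% interval is $e^{\mu \pm 1.645\sigma}$. A quick check for the m_02 prior:

```python
from math import exp

mu, sigma = 3.5, 0.75
z95 = 1.6449                     # standard-normal 95th percentile
median = exp(mu)                 # lognormal median
lo, hi = exp(mu - z95 * sigma), exp(mu + z95 * sigma)
print(f"median ≈ {median:.1f}, 90% CI ≈ [{lo:.1f}, {hi:.1f}]")
```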
## Model Validation {#sec-validation}
### Parameter Recovery {#sec-parameter-recovery}
We validate that m_02's parameters are identifiable under this study's design ($M \approx 300$, $K = 4$, $D = 32$, $R = 30$) via 20 iterations of parameter recovery.
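The logic of one recovery iteration can be sketched with a toy softmax-choice model. This is a deliberately simplified stand-in for m_02 — fixed utilities, a grid-search MLE instead of MCMC — intended only to show the simulate–fit–compare loop:

```python
import numpy as np

rng = np.random.default_rng(1)
true_alpha = 30.0
M, n_alt = 300, 4                           # problems × alternatives, as in this design
seu = rng.uniform(0, 1, size=(M, n_alt))    # hypothetical SEU of each alternative

def choice_probs(alpha):
    z = alpha * seu
    z = z - z.max(axis=1, keepdims=True)    # stabilise the softmax
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# simulate choices at the true sensitivity, then recover it by grid-search MLE
p = choice_probs(true_alpha)
choices = np.array([rng.choice(n_alt, p=p[i]) for i in range(M)])
grid = np.linspace(1.0, 100.0, 400)
loglik = [np.log(choice_probs(a)[np.arange(M), choices]).sum() for a in grid]
alpha_hat = grid[int(np.argmax(loglik))]
print(f"true α = {true_alpha}, recovered α = {alpha_hat:.1f}")
```

The full recovery study replaces the grid search with the Stan posterior and repeats the loop 20 times with parameters drawn from the m_02 prior.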
```{python}
#| label: load-recovery
#| output: false
recovery_dir = os.path.join(project_root, "results", "parameter_recovery", "gpt4o_ellsberg_recovery")
recovery_summary_dir = os.path.join(recovery_dir, "recovery_summary")
with open(os.path.join(recovery_summary_dir, "recovery_statistics.json")) as f:
    recovery_stats = json.load(f)
true_params_path = os.path.join(recovery_dir, "all_true_parameters.json")
with open(true_params_path) as f:
    all_true_params = json.load(f)
posterior_summaries = []
true_params_list = []
for i in range(1, 21):
    iter_dir = os.path.join(recovery_dir, f"iteration_{i}")
    summary_path = os.path.join(iter_dir, "posterior_summary.csv")
    if os.path.exists(summary_path):
        df = pd.read_csv(summary_path, index_col=0)
        posterior_summaries.append(df)
        true_params_list.append(all_true_params[i - 1])
n_successful = len(posterior_summaries)
print(f"Loaded {n_successful} recovery iterations")
```
```{python}
#| label: fig-alpha-recovery
#| fig-cap: "Recovery of the sensitivity parameter α under the m_02 prior with the GPT-4o × Ellsberg design (K=4). Left: true vs. estimated values with identity line. Right: 90% credible intervals for each iteration, coloured by whether they contain the true value."
alpha_true = np.array([p['alpha'] for p in true_params_list])
alpha_mean = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries])
alpha_lower = np.array([s.loc['alpha', '5%'] for s in posterior_summaries])
alpha_upper = np.array([s.loc['alpha', '95%'] for s in posterior_summaries])
alpha_bias = np.mean(alpha_mean - alpha_true)
alpha_rmse = np.sqrt(np.mean((alpha_mean - alpha_true)**2))
alpha_coverage = np.mean((alpha_true >= alpha_lower) & (alpha_true <= alpha_upper))
alpha_ci_width = np.mean(alpha_upper - alpha_lower)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
ax = axes[0]
ax.scatter(alpha_true, alpha_mean, alpha=0.7, s=60, c=SEU_COLORS['primary'], edgecolor='white')
lims = [min(alpha_true.min(), alpha_mean.min()) * 0.9,
max(alpha_true.max(), alpha_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('Estimated α (posterior mean)', fontsize=12)
ax.set_title(f'α Recovery: Bias={alpha_bias:.2f}, RMSE={alpha_rmse:.2f}', fontsize=12)
ax.legend()
ax.set_aspect('equal')
ax = axes[1]
for i in range(len(alpha_true)):
    covered = (alpha_true[i] >= alpha_lower[i]) & (alpha_true[i] <= alpha_upper[i])
    color = 'forestgreen' if covered else 'crimson'
    ax.plot([i, i], [alpha_lower[i], alpha_upper[i]], color=color, linewidth=2, alpha=0.7)
    ax.scatter(i, alpha_mean[i], color=color, s=40, zorder=3)
ax.scatter(np.arange(len(alpha_true)), alpha_true, color='black', s=60, marker='x',
           label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('α', fontsize=12)
ax.set_title(f'α: 90% Credible Intervals (Coverage = {alpha_coverage:.0%})', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-recovery-metrics
#| tbl-cap: "Parameter recovery metrics for m_02 with the GPT-4o × Ellsberg design (M≈300, K=4, D=32, R=30)."
K_val = 4
D_val = 32
all_beta_coverage = []
for k in range(K_val):
    for d in range(D_val):
        param_name = f"beta[{k+1},{d+1}]"
        try:
            bt = np.array([p['beta'][k][d] for p in true_params_list])
            bl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
            bu = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
            all_beta_coverage.append(np.mean((bt >= bl) & (bt <= bu)))
        except (KeyError, IndexError):
            pass
all_delta_coverage = []
for k in range(K_val - 1):
    param_name = f"delta[{k+1}]"
    try:
        dt = np.array([p['delta'][k] for p in true_params_list])
        dl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
        du = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
        all_delta_coverage.append(np.mean((dt >= dl) & (dt <= du)))
    except (KeyError, IndexError):
        pass
metrics = pd.DataFrame([
{'Parameter': 'α', 'Bias': f'{alpha_bias:.2f}', 'RMSE': f'{alpha_rmse:.2f}',
'Coverage (90%)': f'{alpha_coverage:.0%}', 'CI Width': f'{alpha_ci_width:.1f}'},
{'Parameter': f'β (mean over {K_val*D_val})',
'Bias': '—', 'RMSE': '—',
'Coverage (90%)': f'{np.mean(all_beta_coverage):.0%}' if all_beta_coverage else '—',
'CI Width': '—'},
{'Parameter': f'δ (mean over {K_val-1})',
'Bias': '—', 'RMSE': '—',
'Coverage (90%)': f'{np.mean(all_delta_coverage):.0%}' if all_delta_coverage else '—',
'CI Width': '—'},
])
metrics
```
::: {.callout-note}
## SBC Inherited, Not Re-Run
Simulation-Based Calibration (SBC) was *not* performed on this dataset. We rely on the SBC validation from the [Ellsberg study report](../ellsberg_study/01_ellsberg_study.qmd), which used the same `m_02.stan` model and the same `α ~ LogNormal(3.5, 0.75)` prior; SBC tests the sampler under the prior-and-likelihood geometry and is therefore inherited validly across applications that share both the model and the prior. The same convention is used (and labelled consistently) in the [Claude insurance study](../claude_insurance_study/01_claude_insurance_study.qmd) limitations section. The parameter recovery analysis above (which *was* performed with this study's specific design parameters) provides the complementary, data-specific assurance that the posterior concentrates correctly around known true values under realistic conditions.
:::
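For readers unfamiliar with the procedure: SBC repeatedly draws a parameter from the prior, simulates data, fits the model, and records the rank of the true parameter among posterior draws; under a correctly calibrated sampler the ranks are uniform. A toy version with a conjugate normal model (where the posterior is available in closed form, so no MCMC is needed) illustrates the loop:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims, n_post = 200, 99
ranks = []
for _ in range(n_sims):
    theta = rng.normal(0.0, 1.0)                 # draw from the prior
    y = rng.normal(theta, 1.0)                   # simulate one observation
    # exact posterior for this conjugate model: N(y/2, 1/2)
    post = rng.normal(y / 2.0, np.sqrt(0.5), size=n_post)
    ranks.append(int((post < theta).sum()))      # rank of truth among draws
ranks = np.array(ranks)
print(f"mean rank {ranks.mean():.1f} (uniform target: {n_post / 2:.1f})")
```

In the inherited SBC runs, the "posterior" is produced by the same Stan sampler used in this study, so uniform ranks certify the sampler, not any particular dataset.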
## Results {#sec-results}
### Loading Posterior Draws
```{python}
#| label: load-posteriors
#| output: false
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]
temp_labels = {t: f"T={t}" for t in temperatures}
alpha_draws = {}
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    data = np.load(data_dir / f"alpha_draws_{key}.npz")
    alpha_draws[t] = data['alpha']
with open(data_dir / "primary_analysis.json") as f:
    analysis = json.load(f)
with open(data_dir / "fit_summary.json") as f:
    fit_summary = json.load(f)
```
```{python}
#| echo: false
for t in temperatures:
    n = len(alpha_draws[t])
    print(f"  T={t}: {n:,} posterior draws loaded")
```
### MCMC Diagnostics
```{python}
#| label: tbl-diagnostics
#| tbl-cap: "MCMC diagnostics for all five temperature conditions. All fits used 4 chains with 1,000 warmup and 1,000 sampling iterations each (4,000 post-warmup draws total)."
diag_rows = []
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    with open(data_dir / f"diagnostics_{key}.txt") as f:
        diag_text = f.read()
    if "No divergent transitions" in diag_text:
        n_div = 0
    else:
        match = re.search(r'(\d+) of (\d+)', diag_text)
        n_div = int(match.group(1)) if match else 0
    rhat_ok = "R-hat values satisfactory" in diag_text or "R_hat" not in diag_text.replace("R-hat values satisfactory", "")
    ess_ok = "effective sample size satisfactory" in diag_text
    ebfmi_ok = "E-BFMI satisfactory" in diag_text
    diag_rows.append({
        'Temperature': t,
        'Divergences': f"{n_div}/4000",
        'R̂': '✓' if rhat_ok else '✗',
        'ESS': '✓' if ess_ok else '✗',
        'E-BFMI': '✓' if ebfmi_ok else '✗',
    })
pd.DataFrame(diag_rows)
```
### Posterior Summaries
```{python}
#| label: tbl-posteriors
#| tbl-cap: "Posterior summaries for the sensitivity parameter α at each temperature level. Intervals are 90% credible intervals."
summary = analysis['summary_table']
rows = []
for s in summary:
    rows.append({
        'Temperature': s['temperature'],
        'Median': f"{s['median']:.1f}",
        'Mean': f"{s['mean']:.1f}",
        'SD': f"{s['sd']:.1f}",
        '90% CI': f"[{s['ci_low']:.1f}, {s['ci_high']:.1f}]",
    })
pd.DataFrame(rows)
```
A clear declining pattern is visible: $\alpha$ drops from $\approx 110$ at $T = 0.0$ to $\approx 52$ at $T = 1.5$, a roughly 2:1 ratio across the temperature range.
### Forest Plot
```{python}
#| label: fig-forest
#| fig-cap: "Forest plot of posterior α distributions across temperature conditions. Points show posterior medians; thick bars span the 50% credible interval; thin bars span the 90% credible interval."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(8, 5))
y_positions = np.arange(len(temperatures))[::-1]
for i, t in enumerate(temperatures):
    draws = alpha_draws[t]
    median = np.median(draws)
    q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])
    y = y_positions[i]
    ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
    ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
    ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
            markeredgecolor='white', markeredgewidth=1.5, zorder=5)
ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('Sensitivity (α)')
ax.set_title('Posterior Distributions of α by Temperature')
ax.grid(axis='x', alpha=0.3)
ax.grid(axis='y', alpha=0)
plt.tight_layout()
plt.show()
```
### Posterior Densities
```{python}
#| label: fig-density
#| fig-cap: "Kernel density estimates of the posterior α distributions. The posteriors show a clear separation, with higher temperatures yielding lower α values."
#| fig-height: 5
from scipy.stats import gaussian_kde
fig, ax = plt.subplots(figsize=(8, 5))
for i, t in enumerate(temperatures):
    draws = alpha_draws[t]
    kde = gaussian_kde(draws)
    x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
    ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
    ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
            label=f'T = {t} (median = {np.median(draws):.0f})')
ax.set_xlabel('Sensitivity (α)')
ax.set_ylabel('Density')
ax.set_title('Posterior Density of α')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()
```
### Posterior Predictive Checks
```{python}
#| label: tbl-ppc
#| tbl-cap: "Posterior predictive check p-values for each temperature condition. Values near 0.5 indicate good calibration."
ppc_rows = []
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    with open(data_dir / f"ppc_{key}.json") as f:
        ppc = json.load(f)
    pvals = ppc['p_values']
    ppc_rows.append({
        'Temperature': t,
        'Log-likelihood': f"{pvals['ll']:.3f}",
        'Modal frequency': f"{pvals['modal']:.3f}",
        'Mean probability': f"{pvals['prob']:.3f}",
    })
pd.DataFrame(ppc_rows)
```
All posterior predictive p-values fall within an acceptable range, indicating adequate model fit across conditions.
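The mechanics behind each p-value are simple: a test statistic is computed on the observed choices and on each posterior predictive replicate, and the p-value is the fraction of replicates at least as extreme. A schematic version (with made-up numbers, not this study's statistics):

```python
import numpy as np

rng = np.random.default_rng(3)
observed_stat = 0.62                            # hypothetical observed modal frequency
replicated = rng.normal(0.60, 0.05, size=4000)  # stand-in for replicate statistics

p_value = float(np.mean(replicated >= observed_stat))
print(f"posterior predictive p = {p_value:.3f}")
```

Values near 0 or 1 flag a statistic the model systematically under- or over-predicts; the mid-range values in the table above are the desired outcome.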
## Monotonicity Analysis {#sec-monotonicity}
### Global Slope
```{python}
#| label: fig-slope
#| fig-cap: "Posterior distribution of the slope Δα/ΔT. The bulk of the distribution lies below zero, providing strong evidence for a negative temperature–sensitivity relationship."
slopes = analysis['slope']
temp_array = np.array(temperatures)
slope_draws = []
for draw_idx in range(len(alpha_draws[temperatures[0]])):
    alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in temperatures])
    # OLS slope; ddof=1 matches np.cov's default normalisation
    b = np.cov(temp_array, alphas_at_draw)[0, 1] / np.var(temp_array, ddof=1)
    slope_draws.append(b)
slope_draws = np.array(slope_draws)
fig, ax = plt.subplots(figsize=(8, 4))
kde = gaussian_kde(slope_draws)
x_grid = np.linspace(np.percentile(slope_draws, 0.5), np.percentile(slope_draws, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.3, color=SEU_COLORS['primary'])
ax.plot(x_grid, kde(x_grid), color=SEU_COLORS['primary'], linewidth=2)
median_slope = np.median(slope_draws)
ax.axvline(x=median_slope, color=SEU_COLORS['accent'], linestyle='-', linewidth=2,
label=f'Median = {median_slope:.1f}')
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5, label='No effect')
q05, q95 = np.percentile(slope_draws, [5, 95])
mask = (x_grid >= q05) & (x_grid <= q95)
ax.fill_between(x_grid[mask], kde(x_grid[mask]), alpha=0.15, color=SEU_COLORS['accent'])
ax.axvline(x=q05, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.axvline(x=q95, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Posterior Distribution of Temperature–Sensitivity Slope')
ax.legend()
plt.tight_layout()
plt.show()
print(f"Slope summary:")
print(f" Median: {median_slope:.1f}")
print(f" 90% CI: [{q05:.1f}, {q95:.1f}]")
print(f" P(slope < 0): {np.mean(slope_draws < 0):.3f}")
```
### Pairwise Comparisons
```{python}
#| label: tbl-pairwise
#| tbl-cap: "Posterior probability that α is higher at the lower temperature in each pair."
pairs = analysis['pairwise_comparisons']
pair_rows = []
for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    if prob > 0.95:
        strength = '●●● (strong)'
    elif prob > 0.8:
        strength = '●● (moderate)'
    elif prob > 0.65:
        strength = '● (weak)'
    elif prob < 0.35:
        strength = '○ (reversed)'
    else:
        strength = '— (indistinguishable)'
    pair_rows.append({
        'Comparison': f'α(T={t1}) > α(T={t2})',
        'P': f'{prob:.3f}',
        'Evidence': strength,
    })
pd.DataFrame(pair_rows)
```
### Strict Monotonicity
```{python}
#| label: monotonicity
#| echo: true
n_draws = len(alpha_draws[0.0])
strictly_decreasing = 0
for i in range(n_draws):
    vals = [alpha_draws[t][i] for t in temperatures]
    if all(vals[j] > vals[j+1] for j in range(len(vals)-1)):
        strictly_decreasing += 1
p_mono = strictly_decreasing / n_draws
print(f"P(α strictly decreasing across all T): {p_mono:.4f}")
```
The posterior probability of strict monotonicity — that $\alpha$ decreases at every successive temperature step — provides a stringent test of the ordering. Combined with the pairwise comparisons above, these results indicate that the negative temperature–$\alpha$ relationship is not driven by any single temperature pair but reflects a consistent ordering across the full range.
## Comparison with Other Factorial Cells {#sec-comparison}
This comparison addresses two questions:
1. **Task effect** — Is GPT-4o's temperature–α relationship stable across tasks? (Compare with Cell 1,1: GPT-4o × Insurance)
2. **LLM effect** — Does the choice of LLM matter for the Ellsberg task? (Compare with Cell 2,2: Claude × Ellsberg)
```{python}
#| label: fig-cross-study
#| fig-cap: "Cross-study comparisons. Left: GPT-4o across both tasks (isolating the task effect) — both show clear negative slopes. Right: Ellsberg task across both LLMs (isolating the LLM effect) — GPT-4o shows the effect but Claude does not."
initial_data_dir = Path("..") / "temperature_study" / "data"
ellsberg_data_dir = Path("..") / "ellsberg_study" / "data"
with open(initial_data_dir / "primary_analysis.json") as f:
initial_analysis = json.load(f)
with open(ellsberg_data_dir / "primary_analysis.json") as f:
ellsberg_analysis = json.load(f)
initial_temps = [s['temperature'] for s in initial_analysis['summary_table']]
initial_medians = [s['median'] for s in initial_analysis['summary_table']]
initial_lows = [s['ci_low'] for s in initial_analysis['summary_table']]
initial_highs = [s['ci_high'] for s in initial_analysis['summary_table']]
this_medians = [s['median'] for s in analysis['summary_table']]
this_lows = [s['ci_low'] for s in analysis['summary_table']]
this_highs = [s['ci_high'] for s in analysis['summary_table']]
ells_temps = [s['temperature'] for s in ellsberg_analysis['summary_table']]
ells_medians = [s['median'] for s in ellsberg_analysis['summary_table']]
ells_lows = [s['ci_low'] for s in ellsberg_analysis['summary_table']]
ells_highs = [s['ci_high'] for s in ellsberg_analysis['summary_table']]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left panel: GPT-4o on both tasks
ax = axes[0]
ax.errorbar(initial_temps, initial_medians,
yerr=[np.array(initial_medians) - np.array(initial_lows),
np.array(initial_highs) - np.array(initial_medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='Insurance (K=3)')
ax.errorbar(temperatures, this_medians,
yerr=[np.array(this_medians) - np.array(this_lows),
np.array(this_highs) - np.array(this_medians)],
fmt='s-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='Ellsberg (K=4)')
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('GPT-4o: Task Effect\n(Same LLM, Different Tasks)')
ax.legend()
# Right panel: Ellsberg with both LLMs
ax = axes[1]
ax.errorbar(temperatures, this_medians,
yerr=[np.array(this_medians) - np.array(this_lows),
np.array(this_highs) - np.array(this_medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='GPT-4o')
ax.errorbar(ells_temps, ells_medians,
yerr=[np.array(ells_medians) - np.array(ells_lows),
np.array(ells_highs) - np.array(ells_medians)],
fmt='s-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='Claude')
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Ellsberg Task: LLM Effect\n(Same Task, Different LLMs)')
ax.legend()
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-cross-study
#| tbl-cap: "Cross-study comparison across the 2×2 factorial."
# Handle different field names
initial_slope_val = initial_analysis['slope'].get('median', initial_analysis['slope'].get('slope', 'N/A'))
initial_p_neg = initial_analysis['slope'].get('p_negative', None)
ellsberg_slope_val = ellsberg_analysis['slope'].get('median', ellsberg_analysis['slope'].get('slope', 'N/A'))
ellsberg_p_neg = ellsberg_analysis['slope'].get('p_negative', None)
cross = pd.DataFrame([
    {'Cell': '(1,1) GPT-4o × Insurance',
     'Slope median': f"{initial_slope_val:.1f}" if isinstance(initial_slope_val, (int, float)) else initial_slope_val,
     'P(slope < 0)': f"{initial_p_neg:.3f}" if initial_p_neg is not None else '> 0.99',
     'P(strict mono)': f"{initial_analysis['monotonicity_prob']:.3f}"},
    {'Cell': '(1,2) GPT-4o × Ellsberg (this)',
     'Slope median': f"{analysis['slope']['median']:.1f}",
     'P(slope < 0)': f"{analysis['slope']['p_negative']:.3f}",
     'P(strict mono)': f"{analysis['monotonicity_prob']:.3f}"},
    {'Cell': '(2,2) Claude × Ellsberg',
     'Slope median': f"{ellsberg_slope_val:.1f}" if isinstance(ellsberg_slope_val, (int, float)) else ellsberg_slope_val,
     'P(slope < 0)': f"{ellsberg_p_neg:.3f}" if ellsberg_p_neg is not None else '—',
     'P(strict mono)': f"{ellsberg_analysis['monotonicity_prob']:.3f}"},
])
cross
cross
```
### Key Observations
1. **Task effect (GPT-4o row):** GPT-4o shows a clear negative slope on *both* tasks — approximately $-25$ on insurance and $-38$ on Ellsberg. The *qualitative* pattern (clear negative relationship) replicates across task domains. However, the absolute slope magnitudes and $\alpha$ levels are **not directly comparable** across these cells, because they use different model variants (m_01 vs. m_02), different priors (Lognormal(3.0, 0.75) vs. Lognormal(3.5, 0.75)), and different numbers of consequences ($K = 3$ vs. $K = 4$). The $\alpha$ parameter's scale is implicitly set by the interaction of prior, consequence space, and utility geometry, so a value of $\alpha \approx 110$ in the $K = 4$ Ellsberg cell does not have the same behavioural meaning as $\alpha \approx 41$ in the $K = 3$ insurance cell. The key inference is that both cells exhibit a clearly negative slope, not that one slope is "larger" than the other.
2. **LLM effect (Ellsberg column):** On the same Ellsberg task, GPT-4o obtains $P(\text{slope} < 0) \approx 0.98$ while Claude obtains $P(\text{slope} < 0) \approx 0.77$. The LLM makes a substantial difference. Since both cells use the same m_02 model and the same prior, the $\alpha$ estimates *are* directly comparable within the Ellsberg column.
3. **Pattern:** GPT-4o consistently shows clear monotonic decline; Claude consistently shows weaker or absent effects.
::: {.callout-warning}
## Temperature Range Mismatch
The GPT-4o cells use temperatures $\{0.0, 0.3, 0.7, 1.0, 1.5\}$ ($\Delta T = 1.5$) while the Claude cells use $\{0.0, 0.2, 0.5, 0.8, 1.0\}$ ($\Delta T = 1.0$) due to API constraints. The slopes ($\Delta\alpha / \Delta T$) are estimated over different covariate ranges, which complicates direct magnitude comparison. The GPT-4o slope may appear steeper in part because it includes the $T = 1.5$ point, where the largest marginal drop occurs. The restricted-range analysis below addresses this by recomputing the GPT-4o slope over $T \in [0.0, 1.0]$ only.
:::
### Restricted-Range Slope (T ∈ [0.0, 1.0])
To enable a fairer cross-LLM comparison, we recompute the GPT-4o slope using only the four temperature levels within $[0.0, 1.0]$, approximately matching Claude's range.
```{python}
#| label: restricted-slope
#| echo: true
restricted_temps = [0.0, 0.3, 0.7, 1.0]
restricted_temp_array = np.array(restricted_temps)
restricted_slope_draws = []
for draw_idx in range(len(alpha_draws[0.0])):
    alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in restricted_temps])
    # OLS slope; ddof=1 matches np.cov's default normalisation
    b = np.cov(restricted_temp_array, alphas_at_draw)[0, 1] / np.var(restricted_temp_array, ddof=1)
    restricted_slope_draws.append(b)
restricted_slope_draws = np.array(restricted_slope_draws)
r_median = np.median(restricted_slope_draws)
r_q05, r_q95 = np.percentile(restricted_slope_draws, [5, 95])
r_p_neg = np.mean(restricted_slope_draws < 0)
print(f"Restricted-range slope (T ∈ [0.0, 1.0]):")
print(f" Median: {r_median:.1f}")
print(f" 90% CI: [{r_q05:.1f}, {r_q95:.1f}]")
print(f" P(slope < 0): {r_p_neg:.3f}")
print(f"")
print(f"Full-range slope (T ∈ [0.0, 1.5]):")
print(f" Median: {median_slope:.1f}")
print(f" P(slope < 0): {np.mean(slope_draws < 0):.3f}")
```
The negative relationship is preserved — and remains strong — even when the high-temperature $T = 1.5$ condition is excluded. This confirms that the effect is not an artifact of the extended temperature range available for GPT-4o, and supports the cross-LLM comparison on a common temperature interval.
### Summary Heatmap
```{python}
#| label: fig-summary
#| fig-cap: "Left: α vs. temperature showing the declining pattern. Right: pairwise posterior probabilities P(α_row > α_col)."
pairs = analysis['pairwise_comparisons']
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
medians = [np.median(alpha_draws[t]) for t in temperatures]
q05s_plot = [np.percentile(alpha_draws[t], 5) for t in temperatures]
q95s_plot = [np.percentile(alpha_draws[t], 95) for t in temperatures]
axes[0].errorbar(temperatures, medians,
yerr=[np.array(medians) - np.array(q05s_plot),
np.array(q95s_plot) - np.array(medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5)
axes[0].set_xlabel('Temperature')
axes[0].set_ylabel('Sensitivity (α)')
axes[0].set_title('α vs. Temperature')
axes[0].set_xticks(temperatures)
n_temps = len(temperatures)
heatmap = np.full((n_temps, n_temps), np.nan)
for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    i = temperatures.index(float(t1))
    j = temperatures.index(float(t2))
    heatmap[i, j] = prob
    heatmap[j, i] = 1 - prob
np.fill_diagonal(heatmap, 0.5)
im = axes[1].imshow(heatmap, cmap='RdYlGn', vmin=0, vmax=1, aspect='equal')
axes[1].set_xticks(range(n_temps))
axes[1].set_xticklabels([f'{t}' for t in temperatures])
axes[1].set_yticks(range(n_temps))
axes[1].set_yticklabels([f'{t}' for t in temperatures])
axes[1].set_xlabel('Temperature (column)')
axes[1].set_ylabel('Temperature (row)')
axes[1].set_title('P(α_row > α_col)')
for i in range(n_temps):
for j in range(n_temps):
if not np.isnan(heatmap[i, j]):
color = 'white' if heatmap[i, j] > 0.8 or heatmap[i, j] < 0.2 else 'black'
axes[1].text(j, i, f'{heatmap[i, j]:.2f}', ha='center', va='center',
fontsize=9, color=color)
plt.colorbar(im, ax=axes[1], shrink=0.8)
plt.tight_layout()
plt.show()
```
## Discussion {#sec-discussion}
### Summary of Findings
GPT-4o shows a **clear negative temperature–sensitivity relationship** on the Ellsberg gamble task:
1. **Strong negative slope.** The posterior slope $\Delta\alpha / \Delta T$ has a median of $\approx -38$, with $P(\text{slope} < 0) \approx 0.98$. This is comparable in strength to the initial temperature study. The posterior probability of strict monotonicity — $\alpha$ decreasing at every successive temperature step — reinforces this conclusion (see @sec-monotonicity).
2. **Consistent decline.** The $\alpha$ estimates decline monotonically from $\approx 110$ at $T = 0.0$ to $\approx 52$ at $T = 1.5$, a roughly 2:1 ratio. The largest marginal drop is between $T = 1.0$ and $T = 1.5$, suggesting a possible nonlinearity in the temperature–sensitivity relationship: the marginal effect of temperature on $\alpha$ may be largest at high temperatures. This high-temperature regime ($T > 1.0$) is unavailable for Claude due to API constraints, limiting cross-LLM comparability in this range. Whether the large $T = 1.0 \to 1.5$ drop reflects a genuine threshold effect or simply the tail of a smooth nonlinear relationship cannot be determined from five temperature points alone.
3. **Good model fit.** Posterior predictive checks indicate adequate calibration at all temperature levels.
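The strict-monotonicity probability cited in point 1 can be recomputed directly from the per-condition posterior draws. A minimal sketch, pairing draws across conditions by index (which treats the per-condition posteriors as independent):

```{python}
#| label: monotonicity-sketch
import numpy as np

def p_strict_decline(draws_by_temp, temps):
    """P(alpha strictly decreases at every successive temperature step)."""
    # Stack matched draws into (n_draws, n_temps), then require every
    # successive difference to be negative within a draw
    stacked = np.column_stack([np.asarray(draws_by_temp[t]) for t in temps])
    return float(np.mean(np.all(np.diff(stacked, axis=1) < 0, axis=1)))
```

Applied to the `alpha_draws` dictionary used throughout this report (e.g. `p_strict_decline(alpha_draws, temperatures)`), this mirrors the pre-computed statistic stored in `primary_analysis.json`.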
### Interpreting α in Behavioural Terms
The $\alpha$ parameter controls the concentration of the softmax choice rule: higher $\alpha$ implies sharper concentration of choice probability on the SEU-maximising alternative. At $\alpha \approx 110$ ($T = 0.0$), the model assigns overwhelming probability to the SEU-maximising gamble in most decision problems: choice behaviour is nearly deterministic. At $\alpha \approx 52$ ($T = 1.5$), concentration remains well above chance (which would correspond to $\alpha \to 0$), but meaningfully more probability mass is allocated to suboptimal alternatives. The 2:1 decline thus represents a shift from near-deterministic SEU maximisation towards a regime where random or heuristic choice patterns are more frequent, though still far from indifference.
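To make the concentration interpretation concrete, the following sketch evaluates the softmax choice rule $p_k \propto \exp(\alpha u_k)$ at the two posterior medians. The utility values are invented for illustration and are not taken from the fitted model:

```{python}
#| label: softmax-illustration
import numpy as np

def softmax_choice_probs(utilities, alpha):
    """Softmax choice rule: p_k ∝ exp(alpha * u_k)."""
    z = alpha * np.asarray(utilities, dtype=float)
    z -= z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical utilities for a K = 4 gamble menu (illustrative only)
u = [0.50, 0.46, 0.42, 0.40]
for a in (110.0, 52.0):
    p = softmax_choice_probs(u, a)
    print(f"alpha = {a:5.1f}: P(best) = {p[0]:.3f}, others = {np.round(p[1:], 3)}")
```

With these illustrative utilities, the best gamble absorbs nearly all probability mass at $\alpha \approx 110$, while a meaningful share leaks to the runner-up at $\alpha \approx 52$.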
### Comparability of α Across Factorial Cells
A recurring issue in cross-cell comparison is the **non-comparability of absolute $\alpha$ values** across cells that use different model variants. The insurance cells ($K = 3$) use m_01 with a Lognormal(3.0, 0.75) prior, while the Ellsberg cells ($K = 4$) use m_02 with a Lognormal(3.5, 0.75) prior. The $\alpha$ parameter interacts with the dimension of the consequence space and the utility geometry in ways that make raw magnitudes non-portable: a value of $\alpha = 50$ implies different choice probabilities under $K = 3$ than under $K = 4$, because the softmax allocates probability over a different number of alternatives (and hence over simplices of different dimension). Consequently, comparisons across the task dimension of the factorial should focus on **directional quantities** — the sign of the slope, $P(\text{slope} < 0)$, and strict monotonicity probabilities — rather than on absolute $\alpha$ levels or slope magnitudes. Within-column comparisons (same task, different LLMs) do not have this limitation, since both cells share the same model and prior.
### Implications for the Factorial Design
This cell confirms that GPT-4o's temperature–sensitivity relationship **generalises across task domains**. The two GPT-4o cells (insurance and Ellsberg) both show clear declines, while the two Claude cells (insurance and Ellsberg) both show weak or absent effects. This pattern strongly supports the conclusion that:
- **LLM is the primary driver.** The main effect of LLM (GPT-4o vs. Claude) is large and consistent.
- **Task is a secondary factor.** Changing the task ($K = 3$ to $K = 4$) does not eliminate or reverse the effect for GPT-4o, and does not create the effect for Claude.
Note that this report analyses each factorial cell independently; it does not formally model the LLM × Task interaction. The [2×2 Factorial Synthesis](../factorial_synthesis/01_factorial_synthesis.qmd) report quantifies the LLM main effect, task main effect, and their interaction by jointly analysing the posterior draws from all four cells, including a difference-in-differences analysis that addresses the interaction directly.
### Connection to the Ambiguity Aversion Literature
The Ellsberg gamble paradigm was originally designed to reveal violations of subjective expected utility theory — specifically, ambiguity aversion (Ellsberg, 1961; Camerer & Weber, 1992). In the human decision-making literature, the canonical finding is that agents prefer gambles with known probabilities over gambles with equivalent expected values but unknown probabilities (Trautmann & van de Kuilen, 2015). Our gamble set spans three ambiguity tiers precisely to allow for such effects.
In this study, however, we use the Ellsberg gambles primarily as a **task domain** for the factorial design — a set of $K = 4$ decision problems that differs structurally from the insurance triage task. The $\alpha$ parameter captures *overall* sensitivity to expected utility maximisation, not ambiguity-specific processing. Whether GPT-4o exhibits ambiguity aversion per se — i.e., whether its subjective probability estimates ($\psi$) or choice patterns systematically shift as a function of gamble ambiguity tier — is an important question that lies beyond the scope of the current temperature-focused analysis. We note this as a natural direction for future work: examining whether the temperature effect on $\alpha$ is modulated by ambiguity level (e.g., whether high-ambiguity gambles show a larger temperature-induced degradation in choice quality) would connect this line of work more directly to the behavioural decision theory literature.
### Embedding Model and Cross-LLM Comparability
Both GPT-4o cells use OpenAI's `text-embedding-3-small` for feature construction, and both Claude cells also use `text-embedding-3-small` (following the same pipeline). The embedding model is therefore held constant across all four factorial cells, preventing the embedding representation from confounding the LLM main effect. Any differences in the PCA features across cells arise from differences in the LLM's natural-language assessments (which vary with both LLM identity and temperature), not from the embedding step.
### Relation to the Broader Literature
The finding that LLM identity is the dominant factor in the temperature–sensitivity relationship contributes to a growing literature examining LLM decision-making and rationality. Recent work has investigated how LLMs perform on classic decision-theoretic tasks (Binz & Schulz, 2023), how API parameters affect output quality (Renze & Guven, 2024), and whether LLMs exhibit systematic biases analogous to those documented in humans (Hagendorff et al., 2023). Our results suggest that the mapping from temperature to choice quality is not a universal property of LLM decoding but depends on model-specific characteristics — potentially reflecting differences in training, architecture, or the relationship between the temperature parameter and the model's internal representations.
### Limitations
1. **No prior sensitivity analysis.** The m_02 prior on $\alpha$ is Lognormal(3.5, 0.75), with prior 90% CI $\approx [10, 114]$. The posterior at $T = 0.0$ ($\alpha \approx 110$) falls near the upper tail of this prior. While $M \approx 300$ observations provide substantial information, and the parameter recovery analysis confirms good identifiability, we have not verified that the slope sign and $P(\text{slope} < 0)$ are robust to alternative prior specifications (e.g., a wider Lognormal(3.5, 1.0) or a shifted Lognormal(4.0, 0.75)). This is a standard robustness check for Bayesian analyses (Kruschke, 2013) and is planned as a supplementary analysis.
2. **Temperature range mismatch.** The GPT-4o cells sample temperatures up to $T = 1.5$, while Claude is limited to $T = 1.0$. The restricted-range analysis in @sec-comparison confirms that the negative slope holds over $T \in [0.0, 1.0]$, but slope *magnitudes* remain difficult to compare directly given the non-identical temperature grids.
3. **No within-study ambiguity analysis.** The gamble set spans three ambiguity tiers, but the current analysis does not examine whether $\alpha$ or choice patterns vary by tier. This limits the study's contribution to the ambiguity aversion literature specifically.
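On point 1, the prior interval is straightforward to verify. This sketch computes 90% central intervals for the baseline prior and the two alternatives mentioned above (log-scale parameterisation, as in Stan):

```{python}
#| label: prior-ci-check
import numpy as np

Z90 = 1.6449  # standard-normal 5% / 95% quantile

def lognormal_ci90(mu, sigma):
    """90% central interval of Lognormal(mu, sigma), log-scale parameters."""
    return float(np.exp(mu - Z90 * sigma)), float(np.exp(mu + Z90 * sigma))

for mu, sigma, label in [(3.5, 0.75, "m_02 baseline"),
                         (3.5, 1.00, "wider"),
                         (4.0, 0.75, "shifted")]:
    lo, hi = lognormal_ci90(mu, sigma)
    print(f"{label:13s} Lognormal({mu}, {sigma}): 90% CI ≈ [{lo:.0f}, {hi:.0f}]")
```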
## Reproducibility {#sec-reproducibility}
### Data Snapshot
| File | Description |
|------|-------------|
| `alpha_draws_T*.npz` | Posterior draws of α (4,000 per condition) |
| `ppc_T*.json` | Posterior predictive check results |
| `diagnostics_T*.txt` | CmdStan diagnostic output |
| `stan_data_T*.json` | Stan-ready data (for refitting) |
| `fit_summary.json` | Summary statistics across conditions |
| `primary_analysis.json` | Pre-computed monotonicity and slope statistics |
| `run_summary.json` | Pipeline metadata and configuration |
| `study_config.yaml` | Frozen copy of the study configuration |
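The posterior draw files can be inspected directly without refitting. This sketch assumes the `T0_0`-style condition keys used in this report's file names; the array name inside each `.npz` archive is not specified in the table above, so the first stored array is taken:

```{python}
#| label: load-draws-example
import numpy as np
from pathlib import Path

def load_alpha_draws(condition_key, data_dir=Path("data")):
    """Load posterior α draws for one condition, e.g. 'T0_0'."""
    with np.load(data_dir / f"alpha_draws_{condition_key}.npz") as archive:
        return archive[archive.files[0]]  # array name assumed: take the first

path = Path("data") / "alpha_draws_T0_0.npz"
if path.exists():  # guard so the document renders even without the snapshot
    draws = load_alpha_draws("T0_0")
    print(f"T0_0: {draws.size} draws, median α = {np.median(draws):.1f}")
```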
### Refitting from Source
```{python}
#| label: refit-example
#| eval: false
#| echo: true
# Uncomment to refit from source data (requires CmdStanPy)
#
# import cmdstanpy
# model = cmdstanpy.CmdStanModel(stan_file="models/m_02.stan")
#
# for t in [0.0, 0.3, 0.7, 1.0, 1.5]:
# key = f"T{str(t).replace('.', '_')}"
# fit = model.sample(
# data=f"data/stan_data_{key}.json",
# chains=4,
# iter_warmup=1000,
# iter_sampling=1000,
# seed=42,
# )
# print(f"T={t}: alpha median = {np.median(fit.stan_variable('alpha')):.1f}")
```
## References
::: {#refs}
:::