---
title: "Temperature and SEU Sensitivity: GPT-4o × Ellsberg Study"
subtitle: "Application Report: GPT-4o × Ellsberg (Cell 1,2)"
description: |
An investigation of how LLM sampling temperature affects estimated
sensitivity (α) to subjective expected utility maximization, using
Ellsberg-style gambles (K=4) and GPT-4o (OpenAI).
This study isolates the task effect by pairing GPT-4o with the Ellsberg
gamble domain used in the Ellsberg study.
categories: [applications, temperature, ellsberg, m_02, openai, factorial]
bibliography: ../../references.bib
execute:
cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
reports_root = os.path.normpath(os.path.join(os.getcwd(), '..', '..'))
project_root = os.path.dirname(reports_root)
sys.path.insert(0, reports_root)
sys.path.insert(0, project_root)
import numpy as np
import json
import re
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import pandas as pd
from report_utils import set_seu_style, SEU_COLORS, SEU_PALETTE
set_seu_style()
from pathlib import Path
data_dir = Path("data")
```
## Introduction {#sec-introduction}
The initial temperature study ([Report 1](../temperature_study/01_initial_study.qmd)) found a clear negative relationship between LLM sampling temperature and estimated sensitivity $\alpha$, using **GPT-4o** on **insurance claims triage** ($K = 3$). The Ellsberg study ([Report 2](../ellsberg_study/01_ellsberg_study.qmd)) then changed **both** the LLM (to Claude 3.5 Sonnet) and the task (to Ellsberg gambles, $K = 4$) simultaneously, finding a weaker effect — but the confounded design made it impossible to disentangle the two factors.
This study completes the $2 \times 2$ factorial design by pairing **GPT-4o** with the **Ellsberg gamble** task — holding the LLM constant relative to the initial study while varying the task, and holding the task constant relative to the Ellsberg study while varying the LLM.
| | Insurance (K=3) | Ellsberg (K=4) |
|---|---|---|
| **GPT-4o** | Initial study ✓ | **This study** |
| **Claude** | Cell (2,1) — new | Ellsberg study ✓ |
::: {.callout-important}
## Summary of Findings
GPT-4o shows a **strong negative global temperature–α slope** on the Ellsberg gamble task ($\Delta\alpha / \Delta T \approx -38$, $P(\text{slope} < 0) \approx 0.98$, 90% CI $[-72, -10]$), **replicating qualitatively** the pattern observed in the initial temperature study with insurance claims triage. Strict monotonicity is low ($P \approx 0.09$): the global decline is real but is dominated by separation at $T = 1.5$, with adjacent low/mid-temperature pairs overlapping substantially. Within this cell, the negative global slope is therefore robust but coarse-grained. Cross-cell attribution — the LLM main effect, the task main effect, and any interaction — is reported in the [Factorial Synthesis](../factorial_synthesis/01_factorial_synthesis.qmd).
:::
## Experimental Design {#sec-design}
### Task and Conditions
We use the Ellsberg gamble task from the [Ellsberg study](../ellsberg_study/01_ellsberg_study.qmd): GPT-4o evaluates gambles involving urns with unknown color compositions. The alternative pool consists of 30 Ellsberg-style urn gambles, each with $K = 4$ possible monetary consequences (\$0, \$1, \$2, \$3) corresponding to drawing different-colored balls from urns of 60–120 balls. The gambles are organized into three ambiguity tiers:
- **Tier 1 (No ambiguity; E01–E08):** 8 gambles with fully specified ball counts — pure risk with known objective probabilities.
- **Tier 2 (Moderate ambiguity; E09–E20):** 12 gambles with partially bounded counts ("at least"/"at most" constraints); urn totals always stated.
- **Tier 3 (High ambiguity; E21–E30):** 10 gambles where only one color's count is known; the remainder is an unknown mixture — closest to @ellsberg1961's original setup.
The LLM first *assesses* each gamble individually, and then makes a *choice* among alternatives in each problem. The gamble set is identical to that used in the Claude × Ellsberg study, ensuring that any differences in results are attributable to the LLM, not the stimuli.
Five temperature levels define the between-condition factor:
```{python}
#| label: tbl-conditions
#| tbl-cap: "Experimental conditions. Each temperature level constitutes a separate model fit."
conditions = pd.DataFrame({
'Level': [1, 2, 3, 4, 5],
'Temperature': [0.0, 0.3, 0.7, 1.0, 1.5],
'Description': [
'Deterministic (greedy decoding)',
'Low variance',
'Moderate variance',
'High variance',
'Very high variance'
]
})
conditions
```
::: {.callout-note}
## Temperature Range
The OpenAI API supports temperature values in $[0.0, 2.0]$, allowing a wider range than the $[0.0, 1.0]$ supported by Anthropic. The temperature values $\{0.0, 0.3, 0.7, 1.0, 1.5\}$ match those used in the initial temperature study (GPT-4o × Insurance), enabling a cleaner within-provider comparison of the temperature grid; α levels and slope magnitudes across cells still require care because the task, model variant (m_01 vs. m_02), prior, and $K$ also differ.
:::
### Design Parameters
```{python}
#| label: design-params
#| echo: true
import yaml
with open(data_dir / "study_config.yaml") as f:
config = yaml.safe_load(f)
with open(data_dir / "run_summary.json") as f:
run_summary = json.load(f)
print(f"Study Design:")
print(f" Decision problems (M): {config['num_problems']} base × {config['num_presentations']} presentations = {config['num_problems'] * config['num_presentations']}")
print(f" Alternatives per problem: {config['min_alternatives']}–{config['max_alternatives']}")
print(f" Consequences (K): {config['K']}")
print(f" Embedding dimensions (D): {config['target_dim']}")
print(f" Distinct alternatives (R): {run_summary['phases']['phase3_data_prep']['per_temperature']['0.0']['R']}")
print(f" LLM model: {config['llm_model']}")
print(f" Embedding model: {config['embedding_model']}")
print(f" Provider: {config['provider']}")
```
Each of the 100 base problems is presented $P = 3$ times with alternatives shuffled to different positions, yielding approximately $M = 300$ observations per temperature condition. The position counterbalancing mitigates any systematic position bias (e.g., a tendency to select the first-listed alternative), which was identified as a potential confound in early pilot work.
### Feature Construction
Alternative features are constructed through a two-stage embedding process. First, GPT-4o assesses each Ellsberg gamble at the relevant temperature, producing a natural-language evaluation. These assessments are embedded using `text-embedding-3-small` (OpenAI), yielding high-dimensional vectors. All embeddings across temperature conditions are pooled and projected via PCA to $D = 32$ dimensions.
```{python}
#| label: pca-variance
#| echo: false
pca = run_summary['phases']['phase3_data_prep']['pca_summary']
cumvar = np.cumsum(pca['explained_variance_ratio'])
print(f"PCA Summary:")
print(f" Components retained: {pca['n_components']}")
print(f" Total variance explained: {pca['total_explained_variance']:.1%}")
print(f" First 5 components: {cumvar[4]:.1%}")
print(f" First 10 components: {cumvar[9]:.1%}")
```
The PCA reduction to $D = 32$ components captures the large majority of variance in GPT-4o's Ellsberg gamble assessments. The fact that relatively few components account for most of the variance indicates that the embedding space, while nominally high-dimensional, has a moderate effective dimensionality — consistent with the assessment texts varying along a limited number of semantic axes related to gamble value, risk, and ambiguity.
### Data Quality
```{python}
#| label: data-quality
#| echo: false
na = run_summary['phases']['phase2b_choices']['na_summary']
print(f"NA Summary:")
print(f" Overall: {na['overall']['na']} / {na['overall']['total']} ({na['overall']['na_rate']:.1%})")
for key, val in na['per_temperature'].items():
print(f" {key}: {val['na']} / {val['total']} ({val['na_rate']:.1%})")
```
The overall NA rate is low (well under 1%), comparable to the rates observed in the other factorial cells. NA responses (cases where the LLM failed to produce a parseable choice) are excluded from analysis; at these rates, missingness is unlikely to introduce systematic bias.
### Comparison with Other Factorial Cells
```{python}
#| label: tbl-design-comparison
#| tbl-cap: "Design comparison across the 2×2 factorial. This study (Cell 1,2) shares GPT-4o with Cell (1,1) and shares Ellsberg gambles with Cell (2,2)."
comparison = pd.DataFrame({
'Parameter': ['LLM', 'Task domain', 'Consequences (K)',
'Alternatives (R)', 'Observations per T',
'Temperature range', 'Stan model'],
'Cell (1,1) Initial': ['GPT-4o', 'Insurance triage', '3',
'30', '~300', '[0.0, 0.3, 0.7, 1.0, 1.5]', 'm_01'],
'Cell (1,2) This study': ['GPT-4o', 'Ellsberg gambles', '4',
'30', '~300', '[0.0, 0.3, 0.7, 1.0, 1.5]', 'm_02'],
'Cell (2,2) Ellsberg': ['Claude 3.5 Sonnet', 'Ellsberg gambles', '4',
'30', '~300', '[0.0, 0.2, 0.5, 0.8, 1.0]', 'm_02'],
})
comparison
```
## Model and Prior Calibration {#sec-model}
### The m_02 Model Variant
We fit the **m_02** model — the same variant used in the Ellsberg study. The prior on $\alpha$ is calibrated for the Ellsberg gamble task's $K = 4$ consequence space:
| | m_0 (foundational) | m_01 (insurance) | m_02 (this study & Ellsberg) |
|---|---|---|---|
| $\alpha$ prior | $\text{Lognormal}(0, 1)$ | $\text{Lognormal}(3.0, 0.75)$ | $\text{Lognormal}(3.5, 0.75)$ |
| Prior median | $\approx 1$ | $\approx 20$ | $\approx 33$ |
| Prior 90% CI | $[0.19, 5.0]$ | $[5.5, 67]$ | $[10, 124]$ |
| $K$ | generic | 3 | 4 |
Using the same model and prior as the Ellsberg study ensures that any difference in results is attributable to the LLM change (GPT-4o vs. Claude), not to modeling differences.
## Model Validation {#sec-validation}
### Parameter Recovery {#sec-parameter-recovery}
We validate that m_02's parameters are identifiable under this study's design ($M \approx 300$, $K = 4$, $D = 32$, $R = 30$) via 20 iterations of parameter recovery.
```{python}
#| label: load-recovery
#| output: false
recovery_dir = os.path.join(project_root, "results", "parameter_recovery", "gpt4o_ellsberg_recovery")
recovery_summary_dir = os.path.join(recovery_dir, "recovery_summary")
with open(os.path.join(recovery_summary_dir, "recovery_statistics.json")) as f:
recovery_stats = json.load(f)
true_params_path = os.path.join(recovery_dir, "all_true_parameters.json")
with open(true_params_path) as f:
all_true_params = json.load(f)
posterior_summaries = []
true_params_list = []
for i in range(1, 21):
iter_dir = os.path.join(recovery_dir, f"iteration_{i}")
summary_path = os.path.join(iter_dir, "posterior_summary.csv")
if os.path.exists(summary_path):
df = pd.read_csv(summary_path, index_col=0)
posterior_summaries.append(df)
true_params_list.append(all_true_params[i - 1])
n_successful = len(posterior_summaries)
print(f"Loaded {n_successful} recovery iterations")
```
```{python}
#| label: fig-alpha-recovery
#| fig-cap: "Recovery of the sensitivity parameter α under the m_02 prior with the GPT-4o × Ellsberg design (K=4). Left: true vs. estimated values with identity line. Right: 90% credible intervals for each iteration, colored by whether they contain the true value."
alpha_true = np.array([p['alpha'] for p in true_params_list])
alpha_mean = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries])
alpha_lower = np.array([s.loc['alpha', '5%'] for s in posterior_summaries])
alpha_upper = np.array([s.loc['alpha', '95%'] for s in posterior_summaries])
alpha_bias = np.mean(alpha_mean - alpha_true)
alpha_rmse = np.sqrt(np.mean((alpha_mean - alpha_true)**2))
alpha_coverage = np.mean((alpha_true >= alpha_lower) & (alpha_true <= alpha_upper))
alpha_ci_width = np.mean(alpha_upper - alpha_lower)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
ax = axes[0]
ax.scatter(alpha_true, alpha_mean, alpha=0.7, s=60, c=SEU_COLORS['primary'], edgecolor='white')
lims = [min(alpha_true.min(), alpha_mean.min()) * 0.9,
max(alpha_true.max(), alpha_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('Estimated α (posterior mean)', fontsize=12)
ax.set_title(f'α Recovery: Bias={alpha_bias:.2f}, RMSE={alpha_rmse:.2f}', fontsize=12)
ax.legend()
ax.set_aspect('equal')
ax = axes[1]
for i in range(len(alpha_true)):
covered = (alpha_true[i] >= alpha_lower[i]) & (alpha_true[i] <= alpha_upper[i])
color = 'forestgreen' if covered else 'crimson'
ax.plot([i, i], [alpha_lower[i], alpha_upper[i]], color=color, linewidth=2, alpha=0.7)
ax.scatter(i, alpha_mean[i], color=color, s=40, zorder=3)
ax.scatter(np.arange(len(alpha_true)), alpha_true, color='black', s=60, marker='x',
label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('α', fontsize=12)
ax.set_title(f'α: 90% Credible Intervals (Coverage = {alpha_coverage:.0%})', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-recovery-metrics
#| tbl-cap: "Parameter recovery metrics for m_02 with the GPT-4o × Ellsberg design (M≈300, K=4, D=32, R=30)."
K_val = 4
D_val = 32
all_beta_coverage = []
for k in range(K_val):
for d in range(D_val):
param_name = f"beta[{k+1},{d+1}]"
try:
bt = np.array([p['beta'][k][d] for p in true_params_list])
bl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
bu = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
all_beta_coverage.append(np.mean((bt >= bl) & (bt <= bu)))
except (KeyError, IndexError):
pass
all_delta_coverage = []
for k in range(K_val - 1):
param_name = f"delta[{k+1}]"
try:
dt = np.array([p['delta'][k] for p in true_params_list])
dl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
du = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
all_delta_coverage.append(np.mean((dt >= dl) & (dt <= du)))
except (KeyError, IndexError):
pass
metrics = pd.DataFrame([
{'Parameter': 'α', 'Bias': f'{alpha_bias:.2f}', 'RMSE': f'{alpha_rmse:.2f}',
'Coverage (90%)': f'{alpha_coverage:.0%}', 'CI Width': f'{alpha_ci_width:.1f}'},
{'Parameter': f'β (mean over {K_val*D_val})',
'Bias': '—', 'RMSE': '—',
'Coverage (90%)': f'{np.mean(all_beta_coverage):.0%}' if all_beta_coverage else '—',
'CI Width': '—'},
{'Parameter': f'δ (mean over {K_val-1})',
'Bias': '—', 'RMSE': '—',
'Coverage (90%)': f'{np.mean(all_delta_coverage):.0%}' if all_delta_coverage else '—',
'CI Width': '—'},
])
metrics
```
::: {.callout-note}
## SBC Inherited, Not Re-Run
Simulation-Based Calibration (SBC) was *not* performed on this dataset. We rely on the SBC validation from the [Ellsberg study report](../ellsberg_study/01_ellsberg_study.qmd), which used the same `m_02.stan` model and the same `α ~ LogNormal(3.5, 0.75)` prior. SBC tests sampler calibration under the prior-and-likelihood geometry, so the existing SBC results are *applicable* to this application because the model and prior match; this is not a dataset-specific SBC run, and we note that distinction explicitly. The same convention is used (and labeled consistently) in the [Claude insurance study](../claude_insurance_study/01_claude_insurance_study.qmd) limitations section. The parameter recovery analysis above (which *was* performed with this study's specific design parameters) provides the complementary, data-specific assurance that the posterior concentrates correctly around known true values under realistic conditions.
:::
## Results {#sec-results}
### Loading Posterior Draws
```{python}
#| label: load-posteriors
#| output: false
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]
temp_labels = {t: f"T={t}" for t in temperatures}
alpha_draws = {}
for t in temperatures:
key = f"T{str(t).replace('.', '_')}"
data = np.load(data_dir / f"alpha_draws_{key}.npz")
alpha_draws[t] = data['alpha']
with open(data_dir / "primary_analysis.json") as f:
analysis = json.load(f)
with open(data_dir / "fit_summary.json") as f:
fit_summary = json.load(f)
```
```{python}
#| echo: false
for t in temperatures:
n = len(alpha_draws[t])
print(f" T={t}: {n:,} posterior draws loaded")
```
### MCMC Diagnostics
```{python}
#| label: tbl-diagnostics
#| tbl-cap: "MCMC diagnostics for all five temperature conditions. All fits used 4 chains with 1,000 warmup and 1,000 sampling iterations each (4,000 post-warmup draws total)."
diag_rows = []
for t in temperatures:
key = f"T{str(t).replace('.', '_')}"
with open(data_dir / f"diagnostics_{key}.txt") as f:
diag_text = f.read()
if "No divergent transitions" in diag_text:
n_div = 0
else:
match = re.search(r'(\d+) of (\d+)', diag_text)
n_div = int(match.group(1)) if match else 0
# R-hat: satisfactory only if CmdStan reports the satisfactory phrase and
# no parameters are flagged as exceeding the 1.01 threshold.
rhat_warn = "R-hat greater than 1.01" in diag_text
rhat_ok = ("R-hat values satisfactory" in diag_text) and not rhat_warn
# α-specific check: the CmdStan warning block lists offending parameters by
# name; if "alpha" never appears in the file, no α R-hat exceeded 1.01.
rhat_alpha_ok = "alpha" not in diag_text
ess_ok = "effective sample size satisfactory" in diag_text
ebfmi_ok = "E-BFMI satisfactory" in diag_text
diag_rows.append({
'Temperature': t,
'Divergences': f"{n_div}/4000",
'R̂ (α)': '✓' if rhat_alpha_ok else '✗',
'R̂ (all)': '✓' if rhat_ok else '✗',
'ESS': '✓' if ess_ok else '✗',
'E-BFMI': '✓' if ebfmi_ok else '✗',
})
pd.DataFrame(diag_rows)
```
Posterior predictive checks (reported below) are satisfactory across all five conditions, and effective sample size, treedepth, and E-BFMI diagnostics are clean. Two of the five fits show small numbers of divergent transitions (4/4000 at $T = 0.0$ and 6/4000 at $T = 1.5$, each $\le 0.15\%$), and several fits flag rank-normalized split R-hat $> 1.01$ for individual entries of the per-item parameters β and ψ — these are nuisance parameters in the SEU model rather than the sensitivity parameter α, and the α posteriors themselves remain well-mixed (R̂ < 1.01 for every α in every condition). We therefore distinguish the α-specific diagnostics (column R̂ (α)) from the all-parameters diagnostics (column R̂ (all)).
### Posterior Summaries
```{python}
#| label: tbl-posteriors
#| tbl-cap: "Posterior summaries for the sensitivity parameter α at each temperature level. Intervals are 90% credible intervals."
summary = analysis['summary_table']
rows = []
for s in summary:
rows.append({
'Temperature': s['temperature'],
'Median': f"{s['median']:.1f}",
'Mean': f"{s['mean']:.1f}",
'SD': f"{s['sd']:.1f}",
'90% CI': f"[{s['ci_low']:.1f}, {s['ci_high']:.1f}]",
})
pd.DataFrame(rows)
```
A clear declining pattern is visible: $\alpha$ drops from $\approx 110$ at $T = 0.0$ to $\approx 52$ at $T = 1.5$, a roughly 2:1 ratio across the temperature range.
### Forest Plot
```{python}
#| label: fig-forest
#| fig-cap: "Forest plot of posterior α distributions across temperature conditions. Points show posterior medians; thick bars span the 50% credible interval; thin bars span the 90% credible interval."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(8, 5))
y_positions = np.arange(len(temperatures))[::-1]
for i, t in enumerate(temperatures):
draws = alpha_draws[t]
median = np.median(draws)
q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])
y = y_positions[i]
ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
markeredgecolor='white', markeredgewidth=1.5, zorder=5)
ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('Sensitivity (α)')
ax.set_title('Posterior Distributions of α by Temperature')
ax.grid(axis='x', alpha=0.3)
ax.grid(axis='y', alpha=0)
plt.tight_layout()
plt.show()
```
### Posterior Densities
```{python}
#| label: fig-density
#| fig-cap: "Kernel density estimates of the posterior α distributions. The highest temperature ($T = 1.5$) separates clearly from the rest; the four lower temperatures overlap substantially, consistent with the modest adjacent pairwise probabilities reported in @sec-monotonicity."
#| fig-height: 5
from scipy.stats import gaussian_kde
fig, ax = plt.subplots(figsize=(8, 5))
for i, t in enumerate(temperatures):
draws = alpha_draws[t]
kde = gaussian_kde(draws)
x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
label=f'T = {t} (median = {np.median(draws):.0f})')
ax.set_xlabel('Sensitivity (α)')
ax.set_ylabel('Density')
ax.set_title('Posterior Density of α')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()
```
### Posterior Predictive Checks
```{python}
#| label: tbl-ppc
#| tbl-cap: "Posterior predictive check p-values for each temperature condition. Values near 0.5 indicate good calibration."
ppc_rows = []
for t in temperatures:
key = f"T{str(t).replace('.', '_')}"
with open(data_dir / f"ppc_{key}.json") as f:
ppc = json.load(f)
pvals = ppc['p_values']
ppc_rows.append({
'Temperature': t,
'Log-likelihood': f"{pvals['ll']:.3f}",
'Modal frequency': f"{pvals['modal']:.3f}",
'Mean probability': f"{pvals['prob']:.3f}",
})
pd.DataFrame(ppc_rows)
```
All posterior predictive p-values fall within an acceptable range, indicating adequate model fit across conditions.
## Monotonicity Analysis {#sec-monotonicity}
### Global Slope
```{python}
#| label: fig-slope
#| fig-cap: "Posterior distribution of the slope Δα/ΔT. The bulk of the distribution lies below zero, providing strong evidence for a negative temperature–sensitivity relationship."
slopes = analysis['slope']
temp_array = np.array(temperatures)
slope_draws = []
for draw_idx in range(len(alpha_draws[temperatures[0]])):
alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in temperatures])
b = np.cov(temp_array, alphas_at_draw)[0, 1] / np.var(temp_array)
slope_draws.append(b)
slope_draws = np.array(slope_draws)
fig, ax = plt.subplots(figsize=(8, 4))
kde = gaussian_kde(slope_draws)
x_grid = np.linspace(np.percentile(slope_draws, 0.5), np.percentile(slope_draws, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.3, color=SEU_COLORS['primary'])
ax.plot(x_grid, kde(x_grid), color=SEU_COLORS['primary'], linewidth=2)
median_slope = np.median(slope_draws)
ax.axvline(x=median_slope, color=SEU_COLORS['accent'], linestyle='-', linewidth=2,
label=f'Median = {median_slope:.1f}')
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5, label='No effect')
q05, q95 = np.percentile(slope_draws, [5, 95])
mask = (x_grid >= q05) & (x_grid <= q95)
ax.fill_between(x_grid[mask], kde(x_grid[mask]), alpha=0.15, color=SEU_COLORS['accent'])
ax.axvline(x=q05, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.axvline(x=q95, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Posterior Distribution of Temperature–Sensitivity Slope')
ax.legend()
plt.tight_layout()
plt.show()
print(f"Slope summary:")
print(f" Median: {median_slope:.1f}")
print(f" 90% CI: [{q05:.1f}, {q95:.1f}]")
print(f" P(slope < 0): {np.mean(slope_draws < 0):.3f}")
```
### Pairwise Comparisons
```{python}
#| label: tbl-pairwise
#| tbl-cap: "Posterior probability that α is higher at the lower temperature in each pair."
pairs = analysis['pairwise_comparisons']
pair_rows = []
for key, prob in pairs.items():
t1, t2 = key.split('_vs_')
if prob > 0.95:
strength = '●●● (strong)'
elif prob > 0.8:
strength = '●● (moderate)'
elif prob > 0.65:
strength = '● (weak)'
elif prob < 0.35:
strength = '○ (reversed)'
else:
strength = '— (indistinguishable)'
pair_rows.append({
'Comparison': f'α(T={t1}) > α(T={t2})',
'P': f'{prob:.3f}',
'Evidence': strength,
})
pd.DataFrame(pair_rows)
```
### Strict Monotonicity
```{python}
#| label: monotonicity
#| echo: true
n_draws = len(alpha_draws[0.0])
strictly_decreasing = 0
for i in range(n_draws):
vals = [alpha_draws[t][i] for t in temperatures]
if all(vals[j] > vals[j+1] for j in range(len(vals)-1)):
strictly_decreasing += 1
p_mono = strictly_decreasing / n_draws
print(f"P(α strictly decreasing across all T): {p_mono:.4f}")
```
The posterior probability of strict monotonicity — that $\alpha$ decreases at every successive temperature step — provides a stringent test of the ordering. Combined with the pairwise comparisons above, these results indicate that the negative temperature–$\alpha$ relationship is not driven by any single temperature pair but reflects a consistent ordering across the full range.
## Comparison with Other Factorial Cells {#sec-comparison}
This comparison addresses two questions:
1. **Task effect** — Is GPT-4o's temperature–α relationship stable across tasks? (Compare with Cell 1,1: GPT-4o × Insurance)
2. **LLM effect** — Does the choice of LLM matter for the Ellsberg task? (Compare with Cell 2,2: Claude × Ellsberg)
```{python}
#| label: fig-cross-study
#| fig-cap: "Cross-study comparisons. Left: GPT-4o across both tasks (isolating the task effect) — both show clear negative slopes. Right: Ellsberg task across both LLMs (isolating the LLM effect) — GPT-4o shows the effect but Claude does not."
initial_data_dir = Path("..") / "temperature_study" / "data"
ellsberg_data_dir = Path("..") / "ellsberg_study" / "data"
with open(initial_data_dir / "primary_analysis.json") as f:
initial_analysis = json.load(f)
with open(ellsberg_data_dir / "primary_analysis.json") as f:
ellsberg_analysis = json.load(f)
initial_temps = [s['temperature'] for s in initial_analysis['summary_table']]
initial_medians = [s['median'] for s in initial_analysis['summary_table']]
initial_lows = [s['ci_low'] for s in initial_analysis['summary_table']]
initial_highs = [s['ci_high'] for s in initial_analysis['summary_table']]
this_medians = [s['median'] for s in analysis['summary_table']]
this_lows = [s['ci_low'] for s in analysis['summary_table']]
this_highs = [s['ci_high'] for s in analysis['summary_table']]
ells_temps = [s['temperature'] for s in ellsberg_analysis['summary_table']]
ells_medians = [s['median'] for s in ellsberg_analysis['summary_table']]
ells_lows = [s['ci_low'] for s in ellsberg_analysis['summary_table']]
ells_highs = [s['ci_high'] for s in ellsberg_analysis['summary_table']]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left panel: GPT-4o on both tasks
ax = axes[0]
ax.errorbar(initial_temps, initial_medians,
yerr=[np.array(initial_medians) - np.array(initial_lows),
np.array(initial_highs) - np.array(initial_medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='Insurance (K=3)')
ax.errorbar(temperatures, this_medians,
yerr=[np.array(this_medians) - np.array(this_lows),
np.array(this_highs) - np.array(this_medians)],
fmt='s-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='Ellsberg (K=4)')
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('GPT-4o: Task Effect\n(Same LLM, Different Tasks)')
ax.legend()
# Right panel: Ellsberg with both LLMs
ax = axes[1]
ax.errorbar(temperatures, this_medians,
yerr=[np.array(this_medians) - np.array(this_lows),
np.array(this_highs) - np.array(this_medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='GPT-4o')
ax.errorbar(ells_temps, ells_medians,
yerr=[np.array(ells_medians) - np.array(ells_lows),
np.array(ells_highs) - np.array(ells_medians)],
fmt='s-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='Claude')
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Ellsberg Task: LLM Effect\n(Same Task, Different LLMs)')
ax.legend()
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-cross-study
#| tbl-cap: "Cross-study comparison across the 2×2 factorial."
# Handle different field names
initial_slope_val = initial_analysis['slope'].get('median', initial_analysis['slope'].get('slope', 'N/A'))
initial_p_neg = initial_analysis['slope'].get('p_negative', None)
ellsberg_slope_val = ellsberg_analysis['slope'].get('median', ellsberg_analysis['slope'].get('slope', 'N/A'))
ellsberg_p_neg = ellsberg_analysis['slope'].get('p_negative', None)
cross = pd.DataFrame([
{'Cell': '(1,1) GPT-4o × Insurance',
'Slope median': f"{initial_slope_val:.1f}" if isinstance(initial_slope_val, (int, float)) else initial_slope_val,
'P(slope < 0)': f"{initial_p_neg:.3f}" if initial_p_neg else '> 0.99',
'P(strict mono)': f"{initial_analysis['monotonicity_prob']:.3f}"},
{'Cell': '(1,2) GPT-4o × Ellsberg (this)',
'Slope median': f"{analysis['slope']['median']:.1f}",
'P(slope < 0)': f"{analysis['slope']['p_negative']:.3f}",
'P(strict mono)': f"{analysis['monotonicity_prob']:.3f}"},
{'Cell': '(2,2) Claude × Ellsberg',
'Slope median': f"{ellsberg_slope_val:.1f}" if isinstance(ellsberg_slope_val, (int, float)) else ellsberg_slope_val,
'P(slope < 0)': f"{ellsberg_p_neg:.3f}" if ellsberg_p_neg else '—',
'P(strict mono)': f"{ellsberg_analysis['monotonicity_prob']:.3f}"},
])
cross
```
### Key Observations
1. **Task effect (GPT-4o row):** GPT-4o shows a clear negative slope on *both* tasks — approximately $-25$ on insurance and $-38$ on Ellsberg. The *qualitative* pattern (clear negative relationship) replicates across task domains. However, the absolute slope magnitudes and $\alpha$ levels are **not directly comparable** across these cells, because they use different model variants (m_01 vs. m_02), different priors (Lognormal(3.0, 0.75) vs. Lognormal(3.5, 0.75)), and different numbers of consequences ($K = 3$ vs. $K = 4$). The $\alpha$ parameter's scale is implicitly set by the interaction of prior, consequence space, and utility geometry, so a value of $\alpha \approx 110$ in the $K = 4$ Ellsberg cell does not have the same behavioral meaning as $\alpha \approx 41$ in the $K = 3$ insurance cell. The key inference is that both cells exhibit a clearly negative slope, not that one slope is "larger" than the other.
2. **LLM effect (Ellsberg column):** On the same Ellsberg task, GPT-4o obtains $P(\text{slope} < 0) \approx 0.98$ while Claude obtains $P(\text{slope} < 0) \approx 0.77$. The LLM makes a substantial difference. Since both cells use the same m_02 model and the same prior, the $\alpha$ estimates *are* directly comparable within the Ellsberg column.
3. **Pattern:** GPT-4o consistently shows clear monotonic decline; Claude consistently shows weaker or absent effects.
::: {.callout-warning}
## Temperature Range Mismatch
The GPT-4o cells use temperatures $\{0.0, 0.3, 0.7, 1.0, 1.5\}$ ($\Delta T = 1.5$) while the Claude cells use $\{0.0, 0.2, 0.5, 0.8, 1.0\}$ ($\Delta T = 1.0$) due to API constraints. The slopes ($\Delta\alpha / \Delta T$) are estimated over different covariate ranges, which complicates direct magnitude comparison. The GPT-4o slope may appear steeper in part because it includes the $T = 1.5$ point, where the largest marginal drop occurs. The restricted-range analysis below addresses this by recomputing the GPT-4o slope over $T \in [0.0, 1.0]$ only.
:::
### Restricted-Range Slope (T ∈ [0.0, 1.0])
To enable a fairer cross-LLM comparison, we recompute the GPT-4o slope using only the four temperature levels within $[0.0, 1.0]$, approximately matching Claude's range.
```{python}
#| label: restricted-slope
#| echo: true
restricted_temps = [0.0, 0.3, 0.7, 1.0]
restricted_temp_array = np.array(restricted_temps)
restricted_slope_draws = []
for draw_idx in range(len(alpha_draws[0.0])):
alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in restricted_temps])
b = np.cov(restricted_temp_array, alphas_at_draw)[0, 1] / np.var(restricted_temp_array)
restricted_slope_draws.append(b)
restricted_slope_draws = np.array(restricted_slope_draws)
r_median = np.median(restricted_slope_draws)
r_q05, r_q95 = np.percentile(restricted_slope_draws, [5, 95])
r_p_neg = np.mean(restricted_slope_draws < 0)
print(f"Restricted-range slope (T ∈ [0.0, 1.0]):")
print(f" Median: {r_median:.1f}")
print(f" 90% CI: [{r_q05:.1f}, {r_q95:.1f}]")
print(f" P(slope < 0): {r_p_neg:.3f}")
print(f"")
print(f"Full-range slope (T ∈ [0.0, 1.5]):")
print(f" Median: {median_slope:.1f}")
print(f" P(slope < 0): {np.mean(slope_draws < 0):.3f}")
```
The negative relationship is preserved — and remains strong — even when the high-temperature $T = 1.5$ condition is excluded. This confirms that the effect is not an artifact of the extended temperature range available for GPT-4o, and supports the cross-LLM comparison on a common temperature interval.
### Summary Heatmap
```{python}
#| label: fig-summary
#| fig-cap: "Left: α vs. temperature showing the declining pattern. Right: pairwise posterior probabilities P(α_row > α_col)."
pairs = analysis['pairwise_comparisons']
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
medians = [np.median(alpha_draws[t]) for t in temperatures]
q05s_plot = [np.percentile(alpha_draws[t], 5) for t in temperatures]
q95s_plot = [np.percentile(alpha_draws[t], 95) for t in temperatures]
axes[0].errorbar(temperatures, medians,
yerr=[np.array(medians) - np.array(q05s_plot),
np.array(q95s_plot) - np.array(medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5)
axes[0].set_xlabel('Temperature')
axes[0].set_ylabel('Sensitivity (α)')
axes[0].set_title('α vs. Temperature')
axes[0].set_xticks(temperatures)
n_temps = len(temperatures)
heatmap = np.full((n_temps, n_temps), np.nan)
for key, prob in pairs.items():
t1, t2 = key.split('_vs_')
i = temperatures.index(float(t1))
j = temperatures.index(float(t2))
heatmap[i, j] = prob
heatmap[j, i] = 1 - prob
np.fill_diagonal(heatmap, 0.5)
im = axes[1].imshow(heatmap, cmap='RdYlGn', vmin=0, vmax=1, aspect='equal')
axes[1].set_xticks(range(n_temps))
axes[1].set_xticklabels([f'{t}' for t in temperatures])
axes[1].set_yticks(range(n_temps))
axes[1].set_yticklabels([f'{t}' for t in temperatures])
axes[1].set_xlabel('Temperature (column)')
axes[1].set_ylabel('Temperature (row)')
axes[1].set_title('P(α_row > α_col)')
for i in range(n_temps):
for j in range(n_temps):
if not np.isnan(heatmap[i, j]):
color = 'white' if heatmap[i, j] > 0.8 or heatmap[i, j] < 0.2 else 'black'
axes[1].text(j, i, f'{heatmap[i, j]:.2f}', ha='center', va='center',
fontsize=9, color=color)
plt.colorbar(im, ax=axes[1], shrink=0.8)
plt.tight_layout()
plt.show()
```
## Discussion {#sec-discussion}
### Summary of Findings
GPT-4o shows a **clear negative temperature–sensitivity relationship** on the Ellsberg gamble task:
1. **Strong negative global slope.** The posterior slope $\Delta\alpha / \Delta T$ has a median of $\approx -38$, with $P(\text{slope} < 0) \approx 0.98$ and a 90% CI of $[-72, -10]$ that excludes zero. By contrast, the probability of *strict* monotonicity — $\alpha$ decreasing at every successive temperature step — is only $\approx 0.09$ (see @sec-monotonicity). The pattern is best described as a strong negative *global* trend driven especially by the gap between $T = 1.0$ and $T = 1.5$, rather than a consistently ordered decline at every adjacent temperature.
2. **Globally declining but not strictly monotone.** The $\alpha$ estimates decline from a median of $\approx 110$ at $T = 0.0$ to $\approx 52$ at $T = 1.5$, a roughly 2:1 ratio across the full temperature range. Adjacent low/mid-temperature pairs overlap substantially in their posteriors — the pairwise probabilities $P(\alpha_{0.0} > \alpha_{0.3}) = 0.53$, $P(\alpha_{0.3} > \alpha_{0.7}) = 0.58$, and $P(\alpha_{0.7} > \alpha_{1.0}) = 0.68$ are all modest — while the high-temperature pairs separate clearly ($P(\alpha_{1.0} > \alpha_{1.5}) = 0.91$, $P(\alpha_{0.0} > \alpha_{1.5}) = 0.98$). The largest marginal drop is between $T = 1.0$ and $T = 1.5$. Whether this reflects a genuine threshold effect or simply the tail of a smooth nonlinear relationship cannot be determined from five temperature points alone. This high-temperature regime ($T > 1.0$) is unavailable for Claude due to API constraints, limiting cross-LLM comparability in this range.
3. **Good model fit.** Posterior predictive checks indicate adequate calibration at all temperature levels.
4. **Pooled ambiguity tiers.** This analysis pools across the three ambiguity tiers of the gamble set. The report therefore addresses overall SEU sensitivity on Ellsberg-style stimuli, not ambiguity aversion or tier-specific processing (see Limitations).
### Interpreting α in Behavioral Terms
The $\alpha$ parameter controls the concentration of the softmax choice rule: higher $\alpha$ implies sharper concentration of choice probability on the SEU-maximizing alternative. At $\alpha \approx 110$ ($T = 0.0$), the model assigns overwhelming probability to the EU-maximizing gamble in most decision problems — choice behavior is nearly deterministic. At $\alpha \approx 52$ ($T = 1.5$), the concentration is still well above chance (which would correspond to $\alpha \to 0$), but the model allocates meaningfully more probability mass to suboptimal alternatives. The 2:1 decline thus represents a shift from near-deterministic SEU maximization to a regime where random or heuristic choice patterns are more frequent, though still well above indifference.
### Comparability of α Across Factorial Cells {#sec-comparability}
A recurring issue in cross-cell comparison is the **non-comparability of absolute $\alpha$ values** across cells that use different model variants. The insurance cells ($K = 3$) use m_01 with a Lognormal(3.0, 0.75) prior, while the Ellsberg cells ($K = 4$) use m_02 with a Lognormal(3.5, 0.75) prior. The $\alpha$ parameter interacts with the dimension of the consequence space and the utility geometry in ways that make raw magnitudes non-portable: a value of $\alpha = 50$ implies different choice probabilities under $K = 3$ vs. $K = 4$ because the softmax operates over different-sized simplex spaces. Consequently, comparisons across the task dimension of the factorial should focus on **directional quantities** — the sign of the slope, $P(\text{slope} < 0)$, and strict monotonicity probabilities — rather than absolute $\alpha$ levels or slope magnitudes. Within-column comparisons (same task, different LLMs) do not have this limitation, since both cells share the same model and prior.
### Implications for the Factorial Design
This cell shows that GPT-4o's temperature–sensitivity relationship **replicates qualitatively in the Ellsberg task**: both GPT-4o cells (insurance and Ellsberg) show clear negative global slopes, while both Claude cells (insurance and Ellsberg) show weak or absent effects. Taken together, these four exploratory cells support the interpretation that:
- **LLM identity is the leading qualitative differentiator** in the temperature–$\alpha$ relationship across these four cells. The within-cell evidence is strong for GPT-4o on both tasks.
- **Task is a secondary factor.** Changing the task ($K = 3$ to $K = 4$) does not eliminate or reverse the effect for GPT-4o, and does not create the effect for Claude.
This four-cell comparison is exploratory rather than confirmatory: the cells were fitted independently, GPT-4o and Claude were sampled on different temperature grids and provider-specific temperature scales, and the design supports directional rather than magnitude comparisons (see also @sec-comparability). The between-LLM probabilities reported in the synthesis are $\approx 0.80$–$0.82$, which is moderate rather than decisive.
Note that this report analyses each factorial cell independently; it does not formally model the LLM × Task interaction. The [2×2 Factorial Synthesis](../factorial_synthesis/01_factorial_synthesis.qmd) report quantifies the LLM main effect, task main effect, and their interaction by jointly analysing the posterior draws from all four cells, including a difference-in-differences analysis that addresses the interaction directly.
### Connection to the Ambiguity Aversion Literature
The Ellsberg gamble paradigm was originally designed to reveal violations of subjective expected utility theory — specifically, ambiguity aversion [@ellsberg1961; @camerer1992]. In the human decision-making literature, the canonical finding is that agents prefer gambles with known probabilities over gambles with equivalent expected values but unknown probabilities [@trautmann2015]. Our gamble set spans three ambiguity tiers precisely to allow for such effects.
In this study, however, we use the Ellsberg gambles primarily as a **task domain** for the factorial design — a set of $K = 4$ decision problems that differs structurally from the insurance triage task. The $\alpha$ parameter captures *overall* sensitivity to expected utility maximization, not ambiguity-specific processing. Whether GPT-4o exhibits ambiguity aversion per se — i.e., whether its subjective probability estimates ($\psi$) or choice patterns systematically shift as a function of gamble ambiguity tier — is an important question that lies beyond the scope of the current temperature-focused analysis. We note this as a natural direction for future work: examining whether the temperature effect on $\alpha$ is modulated by ambiguity level (e.g., whether high-ambiguity gambles show a larger temperature-induced degradation in choice quality) would connect this line of work more directly to the behavioral decision theory literature.
### Embedding Model and Cross-LLM Comparability
Both GPT-4o cells use OpenAI's `text-embedding-3-small` for feature construction, and both Claude cells also use `text-embedding-3-small` (following the same pipeline). The embedding model is therefore held constant across all four factorial cells, preventing the embedding representation from confounding the LLM main effect. Any differences in the PCA features across cells arise from differences in the LLM's natural-language assessments (which vary with both LLM identity and temperature), not from the embedding step.
### Relation to the Broader Literature
The finding that LLM identity is the leading qualitative differentiator in the temperature–sensitivity relationship across these four exploratory cells contributes to a growing literature examining LLM decision-making and rationality. Recent work has investigated how LLMs perform on classic decision-theoretic tasks [@binz2023], how API parameters affect output quality [@renze2024], and whether LLMs exhibit systematic biases analogous to those documented in humans [@hagendorff2023]. Our results suggest that the mapping from temperature to choice quality is not a universal property of LLM decoding but depends on model-specific characteristics — potentially reflecting differences in training, architecture, or the relationship between the temperature parameter and the model's internal representations.
### Limitations
1. **No prior sensitivity analysis.** The m_02 prior on $\alpha$ is Lognormal(3.5, 0.75), with prior 90% CI $\approx [10, 124]$. The posterior at $T = 0.0$ ($\alpha \approx 110$) falls near the upper tail of this prior. While $M \approx 300$ observations provide substantial information, and the parameter recovery analysis confirms good identifiability, we have not verified that the slope sign and $P(\text{slope} < 0)$ are robust to alternative prior specifications (e.g., a wider Lognormal(3.5, 1.0) or a shifted Lognormal(4.0, 0.75)). This is a standard robustness check for Bayesian analyses [@kruschke2013] and is planned as a supplementary analysis.
2. **Temperature range mismatch.** The GPT-4o cells sample temperatures up to $T = 1.5$, while Claude is limited to $T = 1.0$. The restricted-range analysis in @sec-comparison confirms that the negative slope holds over $T \in [0.0, 1.0]$, but slope *magnitudes* remain difficult to compare directly given the non-identical temperature grids.
3. **No within-study ambiguity analysis.** The gamble set spans three ambiguity tiers, but the current analysis does not examine whether $\alpha$ or choice patterns vary by tier. This limits the study's contribution to the ambiguity aversion literature specifically.
## Reproducibility {#sec-reproducibility}
### Data Snapshot
| File | Description |
|------|-------------|
| `alpha_draws_T*.npz` | Posterior draws of α (4,000 per condition) |
| `ppc_T*.json` | Posterior predictive check results |
| `diagnostics_T*.txt` | CmdStan diagnostic output |
| `stan_data_T*.json` | Stan-ready data (for refitting) |
| `fit_summary.json` | Summary statistics across conditions |
| `primary_analysis.json` | Pre-computed monotonicity and slope statistics |
| `run_summary.json` | Pipeline metadata and configuration |
| `study_config.yaml` | Frozen copy of the study configuration |
### Refitting from Source
```{python}
#| label: refit-example
#| eval: false
#| echo: true
# Uncomment to refit from source data (requires CmdStanPy)
#
# import cmdstanpy
# model = cmdstanpy.CmdStanModel(stan_file="models/m_02.stan")
#
# for t in [0.0, 0.3, 0.7, 1.0, 1.5]:
# key = f"T{str(t).replace('.', '_')}"
# fit = model.sample(
# data=f"data/stan_data_{key}.json",
# chains=4,
# iter_warmup=1000,
# iter_sampling=1000,
# seed=42,
# )
# print(f"T={t}: alpha median = {np.median(fit.stan_variable('alpha')):.1f}")
```
## References
::: {#refs}
:::