---
title: "Temperature and SEU Sensitivity: GPT-4o × Ellsberg Study"
subtitle: "Application Report: GPT-4o × Ellsberg (Cell 1,2)"
description: |
  An investigation of how LLM sampling temperature affects estimated
  sensitivity (α) to subjective expected utility maximisation, using
  Ellsberg-style gambles (K=4) and GPT-4o (OpenAI).
  This study isolates the task effect by pairing GPT-4o with the Ellsberg
  gamble domain used in the Ellsberg study.
categories: [applications, temperature, ellsberg, m_02, openai, factorial]
execute:
  cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
reports_root = os.path.normpath(os.path.join(os.getcwd(), '..', '..'))
project_root = os.path.dirname(reports_root)
sys.path.insert(0, reports_root)
sys.path.insert(0, project_root)
import numpy as np
import json
import re
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import pandas as pd
from report_utils import set_seu_style, SEU_COLORS, SEU_PALETTE
set_seu_style()
from pathlib import Path
data_dir = Path("data")
```
## Introduction {#sec-introduction}
The initial temperature study ([Report 1](../temperature_study/01_initial_study.qmd)) found a clear negative relationship between LLM sampling temperature and estimated sensitivity $\alpha$, using **GPT-4o** on **insurance claims triage** ($K = 3$). The Ellsberg study ([Report 2](../ellsberg_study/01_ellsberg_study.qmd)) then changed **both** the LLM (to Claude 3.5 Sonnet) and the task (to Ellsberg gambles, $K = 4$) simultaneously, finding a weaker effect — but the confounded design made it impossible to disentangle the two factors.
This study completes the $2 \times 2$ factorial design by pairing **GPT-4o** with the **Ellsberg gamble** task — holding the LLM constant relative to the initial study while varying the task, and holding the task constant relative to the Ellsberg study while varying the LLM.
| | Insurance (K=3) | Ellsberg (K=4) |
|---|---|---|
| **GPT-4o** | Initial study ✓ | **This study** |
| **Claude** | Claude insurance study ✓ | Ellsberg study ✓ |
::: {.callout-important}
## Summary of Findings
GPT-4o shows a **clear negative temperature–α relationship** on the Ellsberg gamble task ($\Delta\alpha / \Delta T \approx -38$, $P(\text{slope} < 0) \approx 0.98$), **replicating** the pattern observed in the initial temperature study with insurance claims triage. Combined with the Claude × Insurance cell (which shows no such relationship), this confirms that the **LLM** is the dominant factor: GPT-4o exhibits the temperature–sensitivity effect regardless of task domain, while Claude does not.
:::
## Experimental Design {#sec-design}
### Task and Conditions
We use the Ellsberg gamble task from the [Ellsberg study](../ellsberg_study/01_ellsberg_study.qmd): GPT-4o evaluates gambles involving urns with unknown colour compositions. The alternative pool consists of 30 Ellsberg-style urn gambles, each with $K = 4$ possible monetary consequences (\$0, \$1, \$2, \$3) corresponding to drawing different-coloured balls from urns of 60–120 balls. The gambles are organized into three ambiguity tiers:
- **Tier 1 (No ambiguity; E01–E08):** 8 gambles with fully specified ball counts — pure risk with known objective probabilities.
- **Tier 2 (Moderate ambiguity; E09–E20):** 12 gambles with partially bounded counts ("at least"/"at most" constraints); urn totals always stated.
- **Tier 3 (High ambiguity; E21–E30):** 10 gambles where only one colour's count is known; the remainder is an unknown mixture — closest to Ellsberg's (1961) original setup.
The LLM first *assesses* each gamble individually, and then makes a *choice* among alternatives in each problem. The gamble set is identical to that used in the Claude × Ellsberg study, ensuring that any differences in results are attributable to the LLM, not the stimuli.
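As a concrete illustration (using hypothetical ball counts, not the study's actual stimuli), a Tier 1 gamble reduces to an expected-value computation over fully known probabilities:

```python
# Hypothetical Tier-1 urn: all counts fully specified (pure risk, K = 4)
counts = {"red": 30, "black": 20, "yellow": 25, "green": 15}
payoffs = {"red": 0.0, "black": 1.0, "yellow": 2.0, "green": 3.0}  # $0–$3

total = sum(counts.values())
ev = sum(counts[c] / total * payoffs[c] for c in counts)
print(f"Expected value: ${ev:.2f}")  # $1.28 for this urn
```

In Tiers 2 and 3, some counts are only bounded or unknown, so no single objective probability vector exists — which is precisely what makes the task a probe of behaviour under ambiguity.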
Five temperature levels define the between-condition factor:
```{python}
#| label: tbl-conditions
#| tbl-cap: "Experimental conditions. Each temperature level constitutes a separate model fit."
conditions = pd.DataFrame({
'Level': [1, 2, 3, 4, 5],
'Temperature': [0.0, 0.3, 0.7, 1.0, 1.5],
'Description': [
'Deterministic (greedy decoding)',
'Low variance',
'Moderate variance',
'High variance',
'Very high variance'
]
})
conditions
```
::: {.callout-note}
## Temperature Range
The OpenAI API supports temperature values in $[0.0, 2.0]$, allowing a wider range than the $[0.0, 1.0]$ supported by Anthropic. The temperature values $\{0.0, 0.3, 0.7, 1.0, 1.5\}$ match those used in the initial temperature study (GPT-4o × Insurance), enabling direct comparison.
:::
### Design Parameters
```{python}
#| label: design-params
#| echo: true
import yaml

with open(data_dir / "study_config.yaml") as f:
    config = yaml.safe_load(f)
with open(data_dir / "run_summary.json") as f:
    run_summary = json.load(f)
print(f"Study Design:")
print(f" Decision problems (M): {config['num_problems']} base × {config['num_presentations']} presentations = {config['num_problems'] * config['num_presentations']}")
print(f" Alternatives per problem: {config['min_alternatives']}–{config['max_alternatives']}")
print(f" Consequences (K): {config['K']}")
print(f" Embedding dimensions (D): {config['target_dim']}")
print(f" Distinct alternatives (R): {run_summary['phases']['phase3_data_prep']['per_temperature']['0.0']['R']}")
print(f" LLM model: {config['llm_model']}")
print(f" Embedding model: {config['embedding_model']}")
print(f" Provider: {config['provider']}")
```
Each of the 100 base problems is presented $P = 3$ times with alternatives shuffled to different positions, yielding approximately $M = 300$ observations per temperature condition. The position counterbalancing mitigates any systematic position bias (e.g., a tendency to select the first-listed alternative), which was identified as a potential confound in early pilot work.
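The position counterbalancing can be sketched as follows (the function name, seeding, and gamble IDs are illustrative, not the study's actual pipeline):

```python
import random

def make_presentations(alternatives, P=3, seed=0):
    """Return P copies of a problem with alternatives in shuffled positions."""
    rng = random.Random(seed)
    presentations = []
    for _ in range(P):
        order = list(alternatives)  # copy so each shuffle is independent
        rng.shuffle(order)
        presentations.append(order)
    return presentations

problem = ["E03", "E11", "E24", "E07"]  # hypothetical alternative IDs
for p in make_presentations(problem, P=3):
    print(p)
```

Each presentation contains the same alternatives, so any residual position preference averages out across the three orderings.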
### Feature Construction
Alternative features are constructed through a two-stage embedding process. First, GPT-4o assesses each Ellsberg gamble at the relevant temperature, producing a natural-language evaluation. These assessments are embedded using `text-embedding-3-small` (OpenAI), yielding high-dimensional vectors. All embeddings across temperature conditions are pooled and projected via PCA to $D = 32$ dimensions.
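The pooling-then-projecting step can be sketched with a plain SVD. Random vectors stand in for the actual `text-embedding-3-small` outputs here (the 1536-dimensional width matches that model's documented default output size; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for 150 pooled assessment embeddings (30 gambles × 5 temperatures)
embeddings = rng.normal(size=(150, 1536))

centered = embeddings - embeddings.mean(axis=0)        # PCA centres first
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

D = 32
features = centered @ Vt[:D].T                         # project onto first D PCs
explained = (S[:D] ** 2).sum() / (S ** 2).sum()        # cumulative variance ratio
print(features.shape, f"{explained:.1%}")
```

Because the embeddings are pooled across temperature conditions before projection, all five fits share a common feature basis, which keeps $\beta$ comparable across conditions.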
```{python}
#| label: pca-variance
#| echo: false
pca = run_summary['phases']['phase3_data_prep']['pca_summary']
cumvar = np.cumsum(pca['explained_variance_ratio'])
print(f"PCA Summary:")
print(f" Components retained: {pca['n_components']}")
print(f" Total variance explained: {pca['total_explained_variance']:.1%}")
print(f" First 5 components: {cumvar[4]:.1%}")
print(f" First 10 components: {cumvar[9]:.1%}")
```
The PCA reduction to $D = 32$ components captures the large majority of variance in GPT-4o's Ellsberg gamble assessments. The fact that relatively few components account for most of the variance indicates that the embedding space, while nominally high-dimensional, has a moderate effective dimensionality — consistent with the assessment texts varying along a limited number of semantic axes related to gamble value, risk, and ambiguity.
### Data Quality
```{python}
#| label: data-quality
#| echo: false
na = run_summary['phases']['phase2b_choices']['na_summary']
print(f"NA Summary:")
print(f" Overall: {na['overall']['na']} / {na['overall']['total']} ({na['overall']['na_rate']:.1%})")
for key, val in na['per_temperature'].items():
    print(f"  {key}: {val['na']} / {val['total']} ({val['na_rate']:.1%})")
```
The overall NA rate is low (well under 1%), comparable to the rates observed in the other factorial cells. NA responses (cases where the LLM failed to produce a parseable choice) are excluded from analysis; at these rates, missingness is unlikely to introduce systematic bias.
### Comparison with Other Factorial Cells
```{python}
#| label: tbl-design-comparison
#| tbl-cap: "Design comparison across the 2×2 factorial. This study (Cell 1,2) shares GPT-4o with Cell (1,1) and shares Ellsberg gambles with Cell (2,2)."
comparison = pd.DataFrame({
'Parameter': ['LLM', 'Task domain', 'Consequences (K)',
'Alternatives (R)', 'Observations per T',
'Temperature range', 'Stan model'],
'Cell (1,1) Initial': ['GPT-4o', 'Insurance triage', '3',
'30', '~300', '[0.0, 0.3, 0.7, 1.0, 1.5]', 'm_01'],
'Cell (1,2) This study': ['GPT-4o', 'Ellsberg gambles', '4',
'30', '~300', '[0.0, 0.3, 0.7, 1.0, 1.5]', 'm_02'],
'Cell (2,2) Ellsberg': ['Claude 3.5 Sonnet', 'Ellsberg gambles', '4',
'30', '~300', '[0.0, 0.2, 0.5, 0.8, 1.0]', 'm_02'],
})
comparison
```
## Model and Prior Calibration {#sec-model}
### The m_02 Model Variant
We fit the **m_02** model — the same variant used in the Ellsberg study. The prior on $\alpha$ is calibrated for the Ellsberg gamble task's $K = 4$ consequence space:
| | m_0 (foundational) | m_01 (insurance) | m_02 (this study & Ellsberg) |
|---|---|---|---|
| $\alpha$ prior | $\text{Lognormal}(0, 1)$ | $\text{Lognormal}(3.0, 0.75)$ | $\text{Lognormal}(3.5, 0.75)$ |
| Prior median | $\approx 1$ | $\approx 20$ | $\approx 33$ |
| Prior 90% CI | $[0.19, 5.2]$ | $[5.9, 69]$ | $[9.6, 114]$ |
| $K$ | generic | 3 | 4 |
Using the same model and prior as the Ellsberg study ensures that any difference in results is attributable to the LLM change (GPT-4o vs. Claude), not to modelling differences.
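The prior quantiles follow directly from the lognormal definition: if $\log \alpha \sim \mathcal{N}(\mu, \sigma)$, the median is $e^{\mu}$ and the central 90% interval is $e^{\mu \pm 1.645\sigma}$. A quick check for the m_02 prior:

```python
from math import exp

mu, sigma = 3.5, 0.75
z95 = 1.6449                     # standard-normal 95th percentile
median = exp(mu)                 # lognormal median
lo, hi = exp(mu - z95 * sigma), exp(mu + z95 * sigma)
print(f"median ≈ {median:.1f}, 90% CI ≈ [{lo:.1f}, {hi:.1f}]")
```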
## Model Validation {#sec-validation}
### Parameter Recovery {#sec-parameter-recovery}
We validate that m_02's parameters are identifiable under this study's design ($M \approx 300$, $K = 4$, $D = 32$, $R = 30$) via 20 iterations of parameter recovery.
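The logic of one recovery iteration can be sketched with a toy softmax-choice model. This is a deliberately simplified stand-in for m_02 — fixed utilities, a grid-search MLE instead of MCMC — intended only to show the simulate–fit–compare loop:

```python
import numpy as np

rng = np.random.default_rng(1)
true_alpha = 30.0
M, n_alt = 300, 4                           # problems × alternatives, as in this design
seu = rng.uniform(0, 1, size=(M, n_alt))    # hypothetical SEU of each alternative

def choice_probs(alpha):
    z = alpha * seu
    z = z - z.max(axis=1, keepdims=True)    # stabilise the softmax
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# simulate choices at the true sensitivity, then recover it by grid-search MLE
p = choice_probs(true_alpha)
choices = np.array([rng.choice(n_alt, p=p[i]) for i in range(M)])
grid = np.linspace(1.0, 100.0, 400)
loglik = [np.log(choice_probs(a)[np.arange(M), choices]).sum() for a in grid]
alpha_hat = grid[int(np.argmax(loglik))]
print(f"true α = {true_alpha}, recovered α = {alpha_hat:.1f}")
```

The full recovery study replaces the grid search with the Stan posterior and repeats the loop 20 times with parameters drawn from the m_02 prior.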
```{python}
#| label: load-recovery
#| output: false
recovery_dir = os.path.join(project_root, "results", "parameter_recovery", "gpt4o_ellsberg_recovery")
recovery_summary_dir = os.path.join(recovery_dir, "recovery_summary")
with open(os.path.join(recovery_summary_dir, "recovery_statistics.json")) as f:
    recovery_stats = json.load(f)
true_params_path = os.path.join(recovery_dir, "all_true_parameters.json")
with open(true_params_path) as f:
    all_true_params = json.load(f)
posterior_summaries = []
true_params_list = []
for i in range(1, 21):
    iter_dir = os.path.join(recovery_dir, f"iteration_{i}")
    summary_path = os.path.join(iter_dir, "posterior_summary.csv")
    if os.path.exists(summary_path):
        df = pd.read_csv(summary_path, index_col=0)
        posterior_summaries.append(df)
        true_params_list.append(all_true_params[i - 1])
n_successful = len(posterior_summaries)
print(f"Loaded {n_successful} recovery iterations")
```
```{python}
#| label: fig-alpha-recovery
#| fig-cap: "Recovery of the sensitivity parameter α under the m_02 prior with the GPT-4o × Ellsberg design (K=4). Left: true vs. estimated values with identity line. Right: 90% credible intervals for each iteration, coloured by whether they contain the true value."
alpha_true = np.array([p['alpha'] for p in true_params_list])
alpha_mean = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries])
alpha_lower = np.array([s.loc['alpha', '5%'] for s in posterior_summaries])
alpha_upper = np.array([s.loc['alpha', '95%'] for s in posterior_summaries])
alpha_bias = np.mean(alpha_mean - alpha_true)
alpha_rmse = np.sqrt(np.mean((alpha_mean - alpha_true)**2))
alpha_coverage = np.mean((alpha_true >= alpha_lower) & (alpha_true <= alpha_upper))
alpha_ci_width = np.mean(alpha_upper - alpha_lower)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
ax = axes[0]
ax.scatter(alpha_true, alpha_mean, alpha=0.7, s=60, c=SEU_COLORS['primary'], edgecolor='white')
lims = [min(alpha_true.min(), alpha_mean.min()) * 0.9,
max(alpha_true.max(), alpha_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('Estimated α (posterior mean)', fontsize=12)
ax.set_title(f'α Recovery: Bias={alpha_bias:.2f}, RMSE={alpha_rmse:.2f}', fontsize=12)
ax.legend()
ax.set_aspect('equal')
ax = axes[1]
for i in range(len(alpha_true)):
    covered = (alpha_true[i] >= alpha_lower[i]) & (alpha_true[i] <= alpha_upper[i])
    color = 'forestgreen' if covered else 'crimson'
    ax.plot([i, i], [alpha_lower[i], alpha_upper[i]], color=color, linewidth=2, alpha=0.7)
    ax.scatter(i, alpha_mean[i], color=color, s=40, zorder=3)
ax.scatter(np.arange(len(alpha_true)), alpha_true, color='black', s=60, marker='x',
           label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('α', fontsize=12)
ax.set_title(f'α: 90% Credible Intervals (Coverage = {alpha_coverage:.0%})', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-recovery-metrics
#| tbl-cap: "Parameter recovery metrics for m_02 with the GPT-4o × Ellsberg design (M≈300, K=4, D=32, R=30)."
K_val = 4
D_val = 32
all_beta_coverage = []
for k in range(K_val):
    for d in range(D_val):
        param_name = f"beta[{k+1},{d+1}]"
        try:
            bt = np.array([p['beta'][k][d] for p in true_params_list])
            bl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
            bu = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
            all_beta_coverage.append(np.mean((bt >= bl) & (bt <= bu)))
        except (KeyError, IndexError):
            pass
all_delta_coverage = []
for k in range(K_val - 1):
    param_name = f"delta[{k+1}]"
    try:
        dt = np.array([p['delta'][k] for p in true_params_list])
        dl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
        du = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
        all_delta_coverage.append(np.mean((dt >= dl) & (dt <= du)))
    except (KeyError, IndexError):
        pass
metrics = pd.DataFrame([
{'Parameter': 'α', 'Bias': f'{alpha_bias:.2f}', 'RMSE': f'{alpha_rmse:.2f}',
'Coverage (90%)': f'{alpha_coverage:.0%}', 'CI Width': f'{alpha_ci_width:.1f}'},
{'Parameter': f'β (mean over {K_val*D_val})',
'Bias': '—', 'RMSE': '—',
'Coverage (90%)': f'{np.mean(all_beta_coverage):.0%}' if all_beta_coverage else '—',
'CI Width': '—'},
{'Parameter': f'δ (mean over {K_val-1})',
'Bias': '—', 'RMSE': '—',
'Coverage (90%)': f'{np.mean(all_delta_coverage):.0%}' if all_delta_coverage else '—',
'CI Width': '—'},
])
metrics
```
::: {.callout-note}
## SBC Inherited, Not Re-Run
Simulation-Based Calibration (SBC) was *not* performed on this dataset. We rely on the SBC validation from the [Ellsberg study report](../ellsberg_study/01_ellsberg_study.qmd), which used the same `m_02.stan` model and the same `α ~ LogNormal(3.5, 0.75)` prior; SBC tests the sampler under the prior-and-likelihood geometry and is therefore inherited validly across applications that share both the model and the prior. The same convention is used (and labelled consistently) in the [Claude insurance study](../claude_insurance_study/01_claude_insurance_study.qmd) limitations section. The parameter recovery analysis above (which *was* performed with this study's specific design parameters) provides the complementary, data-specific assurance that the posterior concentrates correctly around known true values under realistic conditions.
:::
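For readers unfamiliar with the procedure: SBC repeatedly draws a parameter from the prior, simulates data, fits the model, and records the rank of the true parameter among posterior draws; under a correctly calibrated sampler the ranks are uniform. A toy version with a conjugate normal model (where the posterior is available in closed form, so no MCMC is needed) illustrates the loop:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims, n_post = 200, 99
ranks = []
for _ in range(n_sims):
    theta = rng.normal(0.0, 1.0)                 # draw from the prior
    y = rng.normal(theta, 1.0)                   # simulate one observation
    # exact posterior for this conjugate model: N(y/2, 1/2)
    post = rng.normal(y / 2.0, np.sqrt(0.5), size=n_post)
    ranks.append(int((post < theta).sum()))      # rank of truth among draws
ranks = np.array(ranks)
print(f"mean rank {ranks.mean():.1f} (uniform target: {n_post / 2:.1f})")
```

In the inherited SBC runs, the "posterior" is produced by the same Stan sampler used in this study, so uniform ranks certify the sampler, not any particular dataset.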
## Results {#sec-results}
### Loading Posterior Draws
```{python}
#| label: load-posteriors
#| output: false
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]
temp_labels = {t: f"T={t}" for t in temperatures}
alpha_draws = {}
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    data = np.load(data_dir / f"alpha_draws_{key}.npz")
    alpha_draws[t] = data['alpha']
with open(data_dir / "primary_analysis.json") as f:
    analysis = json.load(f)
with open(data_dir / "fit_summary.json") as f:
    fit_summary = json.load(f)
```
```{python}
#| echo: false
for t in temperatures:
    n = len(alpha_draws[t])
    print(f"  T={t}: {n:,} posterior draws loaded")
```
### MCMC Diagnostics
```{python}
#| label: tbl-diagnostics
#| tbl-cap: "MCMC diagnostics for all five temperature conditions. All fits used 4 chains with 1,000 warmup and 1,000 sampling iterations each (4,000 post-warmup draws total)."
diag_rows = []
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    with open(data_dir / f"diagnostics_{key}.txt") as f:
        diag_text = f.read()
    if "No divergent transitions" in diag_text:
        n_div = 0
    else:
        match = re.search(r'(\d+) of (\d+)', diag_text)
        n_div = int(match.group(1)) if match else 0
    rhat_ok = "R-hat values satisfactory" in diag_text or "R_hat" not in diag_text.replace("R-hat values satisfactory", "")
    ess_ok = "effective sample size satisfactory" in diag_text
    ebfmi_ok = "E-BFMI satisfactory" in diag_text
    diag_rows.append({
        'Temperature': t,
        'Divergences': f"{n_div}/4000",
        'R̂': '✓' if rhat_ok else '✗',
        'ESS': '✓' if ess_ok else '✗',
        'E-BFMI': '✓' if ebfmi_ok else '✗',
    })
pd.DataFrame(diag_rows)
```
### Posterior Summaries
```{python}
#| label: tbl-posteriors
#| tbl-cap: "Posterior summaries for the sensitivity parameter α at each temperature level. Intervals are 90% credible intervals."
summary = analysis['summary_table']
rows = []
for s in summary:
    rows.append({
        'Temperature': s['temperature'],
        'Median': f"{s['median']:.1f}",
        'Mean': f"{s['mean']:.1f}",
        'SD': f"{s['sd']:.1f}",
        '90% CI': f"[{s['ci_low']:.1f}, {s['ci_high']:.1f}]",
    })
pd.DataFrame(rows)
```
A clear declining pattern is visible: $\alpha$ drops from $\approx 110$ at $T = 0.0$ to $\approx 52$ at $T = 1.5$, a roughly 2:1 ratio across the temperature range.
### Forest Plot
```{python}
#| label: fig-forest
#| fig-cap: "Forest plot of posterior α distributions across temperature conditions. Points show posterior medians; thick bars span the 50% credible interval; thin bars span the 90% credible interval."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(8, 5))
y_positions = np.arange(len(temperatures))[::-1]
for i, t in enumerate(temperatures):
    draws = alpha_draws[t]
    median = np.median(draws)
    q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])
    y = y_positions[i]
    ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
    ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
    ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
            markeredgecolor='white', markeredgewidth=1.5, zorder=5)
ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('Sensitivity (α)')
ax.set_title('Posterior Distributions of α by Temperature')
ax.grid(axis='x', alpha=0.3)
ax.grid(axis='y', alpha=0)
plt.tight_layout()
plt.show()
```
### Posterior Densities
```{python}
#| label: fig-density
#| fig-cap: "Kernel density estimates of the posterior α distributions. The posteriors show a clear separation, with higher temperatures yielding lower α values."
#| fig-height: 5
from scipy.stats import gaussian_kde
fig, ax = plt.subplots(figsize=(8, 5))
for i, t in enumerate(temperatures):
    draws = alpha_draws[t]
    kde = gaussian_kde(draws)
    x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
    ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
    ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
            label=f'T = {t} (median = {np.median(draws):.0f})')
ax.set_xlabel('Sensitivity (α)')
ax.set_ylabel('Density')
ax.set_title('Posterior Density of α')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()
```
### Posterior Predictive Checks
```{python}
#| label: tbl-ppc
#| tbl-cap: "Posterior predictive check p-values for each temperature condition. Values near 0.5 indicate good calibration."
ppc_rows = []
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    with open(data_dir / f"ppc_{key}.json") as f:
        ppc = json.load(f)
    pvals = ppc['p_values']
    ppc_rows.append({
        'Temperature': t,
        'Log-likelihood': f"{pvals['ll']:.3f}",
        'Modal frequency': f"{pvals['modal']:.3f}",
        'Mean probability': f"{pvals['prob']:.3f}",
    })
pd.DataFrame(ppc_rows)
```
All posterior predictive p-values fall within an acceptable range, indicating adequate model fit across conditions.
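The mechanics behind each p-value are simple: a test statistic is computed on the observed choices and on each posterior predictive replicate, and the p-value is the fraction of replicates at least as extreme. A schematic version (with made-up numbers, not this study's statistics):

```python
import numpy as np

rng = np.random.default_rng(3)
observed_stat = 0.62                            # hypothetical observed modal frequency
replicated = rng.normal(0.60, 0.05, size=4000)  # stand-in for replicate statistics

p_value = float(np.mean(replicated >= observed_stat))
print(f"posterior predictive p = {p_value:.3f}")
```

Values near 0 or 1 flag a statistic the model systematically under- or over-predicts; the mid-range values in the table above are the desired outcome.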
## Monotonicity Analysis {#sec-monotonicity}
### Global Slope
```{python}
#| label: fig-slope
#| fig-cap: "Posterior distribution of the slope Δα/ΔT. The bulk of the distribution lies below zero, providing strong evidence for a negative temperature–sensitivity relationship."
slopes = analysis['slope']
temp_array = np.array(temperatures)
slope_draws = []
for draw_idx in range(len(alpha_draws[temperatures[0]])):
    alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in temperatures])
    # OLS slope; ddof=1 matches np.cov's default normalisation
    b = np.cov(temp_array, alphas_at_draw)[0, 1] / np.var(temp_array, ddof=1)
    slope_draws.append(b)
slope_draws = np.array(slope_draws)
fig, ax = plt.subplots(figsize=(8, 4))
kde = gaussian_kde(slope_draws)
x_grid = np.linspace(np.percentile(slope_draws, 0.5), np.percentile(slope_draws, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.3, color=SEU_COLORS['primary'])
ax.plot(x_grid, kde(x_grid), color=SEU_COLORS['primary'], linewidth=2)
median_slope = np.median(slope_draws)
ax.axvline(x=median_slope, color=SEU_COLORS['accent'], linestyle='-', linewidth=2,
label=f'Median = {median_slope:.1f}')
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5, label='No effect')
q05, q95 = np.percentile(slope_draws, [5, 95])
mask = (x_grid >= q05) & (x_grid <= q95)
ax.fill_between(x_grid[mask], kde(x_grid[mask]), alpha=0.15, color=SEU_COLORS['accent'])
ax.axvline(x=q05, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.axvline(x=q95, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Posterior Distribution of Temperature–Sensitivity Slope')
ax.legend()
plt.tight_layout()
plt.show()
print(f"Slope summary:")
print(f" Median: {median_slope:.1f}")
print(f" 90% CI: [{q05:.1f}, {q95:.1f}]")
print(f" P(slope < 0): {np.mean(slope_draws < 0):.3f}")
```
### Pairwise Comparisons
```{python}
#| label: tbl-pairwise
#| tbl-cap: "Posterior probability that α is higher at the lower temperature in each pair."
pairs = analysis['pairwise_comparisons']
pair_rows = []
for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    if prob > 0.95:
        strength = '●●● (strong)'
    elif prob > 0.8:
        strength = '●● (moderate)'
    elif prob > 0.65:
        strength = '● (weak)'
    elif prob < 0.35:
        strength = '○ (reversed)'
    else:
        strength = '— (indistinguishable)'
    pair_rows.append({
        'Comparison': f'α(T={t1}) > α(T={t2})',
        'P': f'{prob:.3f}',
        'Evidence': strength,
    })
pd.DataFrame(pair_rows)
```
### Strict Monotonicity
```{python}
#| label: monotonicity
#| echo: true
n_draws = len(alpha_draws[0.0])
strictly_decreasing = 0
for i in range(n_draws):
    vals = [alpha_draws[t][i] for t in temperatures]
    if all(vals[j] > vals[j+1] for j in range(len(vals)-1)):
        strictly_decreasing += 1
p_mono = strictly_decreasing / n_draws
print(f"P(α strictly decreasing across all T): {p_mono:.4f}")
```
The posterior probability of strict monotonicity — that $\alpha$ decreases at every successive temperature step — provides a stringent test of the ordering. Combined with the pairwise comparisons above, these results indicate that the negative temperature–$\alpha$ relationship is not driven by any single temperature pair but reflects a consistent ordering across the full range.
## Comparison with Other Factorial Cells {#sec-comparison}
This comparison addresses two questions:
1. **Task effect** — Is GPT-4o's temperature–α relationship stable across tasks? (Compare with Cell 1,1: GPT-4o × Insurance)
2. **LLM effect** — Does the choice of LLM matter for the Ellsberg task? (Compare with Cell 2,2: Claude × Ellsberg)
```{python}
#| label: fig-cross-study
#| fig-cap: "Cross-study comparisons. Left: GPT-4o across both tasks (isolating the task effect) — both show clear negative slopes. Right: Ellsberg task across both LLMs (isolating the LLM effect) — GPT-4o shows the effect but Claude does not."
initial_data_dir = Path("..") / "temperature_study" / "data"
ellsberg_data_dir = Path("..") / "ellsberg_study" / "data"
with open(initial_data_dir / "primary_analysis.json") as f:
initial_analysis = json.load(f)
with open(ellsberg_data_dir / "primary_analysis.json") as f:
ellsberg_analysis = json.load(f)
initial_temps = [s['temperature'] for s in initial_analysis['summary_table']]
initial_medians = [s['median'] for s in initial_analysis['summary_table']]
initial_lows = [s['ci_low'] for s in initial_analysis['summary_table']]
initial_highs = [s['ci_high'] for s in initial_analysis['summary_table']]
this_medians = [s['median'] for s in analysis['summary_table']]
this_lows = [s['ci_low'] for s in analysis['summary_table']]
this_highs = [s['ci_high'] for s in analysis['summary_table']]
ells_temps = [s['temperature'] for s in ellsberg_analysis['summary_table']]
ells_medians = [s['median'] for s in ellsberg_analysis['summary_table']]
ells_lows = [s['ci_low'] for s in ellsberg_analysis['summary_table']]
ells_highs = [s['ci_high'] for s in ellsberg_analysis['summary_table']]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left panel: GPT-4o on both tasks
ax = axes[0]
ax.errorbar(initial_temps, initial_medians,
yerr=[np.array(initial_medians) - np.array(initial_lows),
np.array(initial_highs) - np.array(initial_medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='Insurance (K=3)')
ax.errorbar(temperatures, this_medians,
yerr=[np.array(this_medians) - np.array(this_lows),
np.array(this_highs) - np.array(this_medians)],
fmt='s-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='Ellsberg (K=4)')
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('GPT-4o: Task Effect\n(Same LLM, Different Tasks)')
ax.legend()
# Right panel: Ellsberg with both LLMs
ax = axes[1]
ax.errorbar(temperatures, this_medians,
yerr=[np.array(this_medians) - np.array(this_lows),
np.array(this_highs) - np.array(this_medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='GPT-4o')
ax.errorbar(ells_temps, ells_medians,
yerr=[np.array(ells_medians) - np.array(ells_lows),
np.array(ells_highs) - np.array(ells_medians)],
fmt='s-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
capsize=5, capthick=1.5, label='Claude')
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Ellsberg Task: LLM Effect\n(Same Task, Different LLMs)')
ax.legend()
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-cross-study
#| tbl-cap: "Cross-study comparison across the 2×2 factorial."
# Handle different field names
initial_slope_val = initial_analysis['slope'].get('median', initial_analysis['slope'].get('slope', 'N/A'))
initial_p_neg = initial_analysis['slope'].get('p_negative', None)
ellsberg_slope_val = ellsberg_analysis['slope'].get('median', ellsberg_analysis['slope'].get('slope', 'N/A'))
ellsberg_p_neg = ellsberg_analysis['slope'].get('p_negative', None)
cross = pd.DataFrame([
    {'Cell': '(1,1) GPT-4o × Insurance',
     'Slope median': f"{initial_slope_val:.1f}" if isinstance(initial_slope_val, (int, float)) else initial_slope_val,
     'P(slope < 0)': f"{initial_p_neg:.3f}" if initial_p_neg is not None else '> 0.99',
     'P(strict mono)': f"{initial_analysis['monotonicity_prob']:.3f}"},
    {'Cell': '(1,2) GPT-4o × Ellsberg (this)',
     'Slope median': f"{analysis['slope']['median']:.1f}",
     'P(slope < 0)': f"{analysis['slope']['p_negative']:.3f}",
     'P(strict mono)': f"{analysis['monotonicity_prob']:.3f}"},
    {'Cell': '(2,2) Claude × Ellsberg',
     'Slope median': f"{ellsberg_slope_val:.1f}" if isinstance(ellsberg_slope_val, (int, float)) else ellsberg_slope_val,
     'P(slope < 0)': f"{ellsberg_p_neg:.3f}" if ellsberg_p_neg is not None else '—',
     'P(strict mono)': f"{ellsberg_analysis['monotonicity_prob']:.3f}"},
])
cross
cross
```
### Key Observations
1. **Task effect (GPT-4o row):** GPT-4o shows a clear negative slope on *both* tasks — approximately $-25$ on insurance and $-38$ on Ellsberg. The *qualitative* pattern (clear negative relationship) replicates across task domains. However, the absolute slope magnitudes and $\alpha$ levels are **not directly comparable** across these cells, because they use different model variants (m_01 vs. m_02), different priors (Lognormal(3.0, 0.75) vs. Lognormal(3.5, 0.75)), and different numbers of consequences ($K = 3$ vs. $K = 4$). The $\alpha$ parameter's scale is implicitly set by the interaction of prior, consequence space, and utility geometry, so a value of $\alpha \approx 110$ in the $K = 4$ Ellsberg cell does not have the same behavioural meaning as $\alpha \approx 41$ in the $K = 3$ insurance cell. The key inference is that both cells exhibit a clearly negative slope, not that one slope is "larger" than the other.
2. **LLM effect (Ellsberg column):** On the same Ellsberg task, GPT-4o obtains $P(\text{slope} < 0) \approx 0.98$ while Claude obtains $P(\text{slope} < 0) \approx 0.77$. The LLM makes a substantial difference. Since both cells use the same m_02 model and the same prior, the $\alpha$ estimates *are* directly comparable within the Ellsberg column.
3. **Pattern:** GPT-4o consistently shows clear monotonic decline; Claude consistently shows weaker or absent effects.
::: {.callout-warning}
## Temperature Range Mismatch
The GPT-4o cells use temperatures $\{0.0, 0.3, 0.7, 1.0, 1.5\}$ ($\Delta T = 1.5$) while the Claude cells use $\{0.0, 0.2, 0.5, 0.8, 1.0\}$ ($\Delta T = 1.0$) due to API constraints. The slopes ($\Delta\alpha / \Delta T$) are estimated over different covariate ranges, which complicates direct magnitude comparison. The GPT-4o slope may appear steeper in part because it includes the $T = 1.5$ point, where the largest marginal drop occurs. The restricted-range analysis below addresses this by recomputing the GPT-4o slope over $T \in [0.0, 1.0]$ only.
:::
### Restricted-Range Slope (T ∈ [0.0, 1.0])
To enable a fairer cross-LLM comparison, we recompute the GPT-4o slope using only the four temperature levels within $[0.0, 1.0]$, approximately matching Claude's range.
```{python}
#| label: restricted-slope
#| echo: true
restricted_temps = [0.0, 0.3, 0.7, 1.0]
restricted_temp_array = np.array(restricted_temps)
restricted_slope_draws = []
for draw_idx in range(len(alpha_draws[0.0])):
    alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in restricted_temps])
    # OLS slope; ddof=1 matches np.cov's default normalisation
    b = np.cov(restricted_temp_array, alphas_at_draw)[0, 1] / np.var(restricted_temp_array, ddof=1)
    restricted_slope_draws.append(b)
restricted_slope_draws = np.array(restricted_slope_draws)
r_median = np.median(restricted_slope_draws)
r_q05, r_q95 = np.percentile(restricted_slope_draws, [5, 95])
r_p_neg = np.mean(restricted_slope_draws < 0)
print(f"Restricted-range slope (T ∈ [0.0, 1.0]):")
print(f" Median: {r_median:.1f}")
print(f" 90% CI: [{r_q05:.1f}, {r_q95:.1f}]")
print(f" P(slope < 0): {r_p_neg:.3f}")
print(f"")
print(f"Full-range slope (T ∈ [0.0, 1.5]):")
print(f" Median: {median_slope:.1f}")
print(f" P(slope < 0): {np.mean(slope_draws < 0):.3f}")
```
The negative relationship is preserved — and remains strong — even when the high-temperature $T = 1.5$ condition is excluded. This confirms that the effect is not an artifact of the extended temperature range available for GPT-4o, and supports the cross-LLM comparison on a common temperature interval.
### Summary Heatmap
```{python}
#| label: fig-summary
#| fig-cap: "Left: α vs. temperature showing the declining pattern. Right: pairwise posterior probabilities P(α_row > α_col)."
pairs = analysis['pairwise_comparisons']
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
medians = [np.median(alpha_draws[t]) for t in temperatures]
q05s_plot = [np.percentile(alpha_draws[t], 5) for t in temperatures]
q95s_plot = [np.percentile(alpha_draws[t], 95) for t in temperatures]
axes[0].errorbar(temperatures, medians,
yerr=[np.array(medians) - np.array(q05s_plot),
np.array(q95s_plot) - np.array(medians)],
fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
capsize=5, capthick=1.5)
axes[0].set_xlabel('Temperature')
axes[0].set_ylabel('Sensitivity (α)')
axes[0].set_title('α vs. Temperature')
axes[0].set_xticks(temperatures)
n_temps = len(temperatures)
heatmap = np.full((n_temps, n_temps), np.nan)
for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    i = temperatures.index(float(t1))
    j = temperatures.index(float(t2))
    heatmap[i, j] = prob
    heatmap[j, i] = 1 - prob
np.fill_diagonal(heatmap, 0.5)
im = axes[1].imshow(heatmap, cmap='RdYlGn', vmin=0, vmax=1, aspect='equal')
axes[1].set_xticks(range(n_temps))
axes[1].set_xticklabels([f'{t}' for t in temperatures])
axes[1].set_yticks(range(n_temps))
axes[1].set_yticklabels([f'{t}' for t in temperatures])
axes[1].set_xlabel('Temperature (column)')
axes[1].set_ylabel('Temperature (row)')
axes[1].set_title('P(α_row > α_col)')
for i in range(n_temps):
for j in range(n_temps):
if not np.isnan(heatmap[i, j]):
color = 'white' if heatmap[i, j] > 0.8 or heatmap[i, j] < 0.2 else 'black'
axes[1].text(j, i, f'{heatmap[i, j]:.2f}', ha='center', va='center',
fontsize=9, color=color)
plt.colorbar(im, ax=axes[1], shrink=0.8)
plt.tight_layout()
plt.show()
```
## Discussion {#sec-discussion}
### Summary of Findings
GPT-4o shows a **clear negative temperature–sensitivity relationship** on the Ellsberg gamble task:
1. **Strong negative slope.** The posterior slope $\Delta\alpha / \Delta T$ has a median of $\approx -38$, with $P(\text{slope} < 0) \approx 0.98$. This is comparable in strength to the initial temperature study. The posterior probability of strict monotonicity — $\alpha$ decreasing at every successive temperature step — reinforces this conclusion (see @sec-monotonicity).
2. **Consistent decline.** The $\alpha$ estimates decline monotonically from $\approx 110$ at $T = 0.0$ to $\approx 52$ at $T = 1.5$, a roughly 2:1 ratio. The largest marginal drop is between $T = 1.0$ and $T = 1.5$, suggesting a possible nonlinearity in the temperature–sensitivity relationship: the marginal effect of temperature on $\alpha$ may be largest at high temperatures. This high-temperature regime ($T > 1.0$) is unavailable for Claude due to API constraints, limiting cross-LLM comparability in this range. Whether the large $T = 1.0 \to 1.5$ drop reflects a genuine threshold effect or simply the tail of a smooth nonlinear relationship cannot be determined from five temperature points alone.
3. **Good model fit.** Posterior predictive checks indicate adequate calibration at all temperature levels.
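The strict-monotonicity probability cited in point 1 can be recomputed directly from the per-condition posterior draws. A minimal sketch, pairing draws across conditions by index (which treats the per-condition posteriors as independent):

```{python}
#| label: monotonicity-sketch
import numpy as np

def p_strict_decline(draws_by_temp, temps):
    """P(alpha strictly decreases at every successive temperature step)."""
    # Stack matched draws into (n_draws, n_temps), then require every
    # successive difference to be negative within a draw
    stacked = np.column_stack([np.asarray(draws_by_temp[t]) for t in temps])
    return float(np.mean(np.all(np.diff(stacked, axis=1) < 0, axis=1)))
```

Applied to the `alpha_draws` dictionary used throughout this report (e.g. `p_strict_decline(alpha_draws, temperatures)`), this mirrors the pre-computed statistic stored in `primary_analysis.json`.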
### Interpreting α in Behavioural Terms
The $\alpha$ parameter controls the concentration of the softmax choice rule: higher $\alpha$ implies sharper concentration of choice probability on the SEU-maximising alternative. At $\alpha \approx 110$ ($T = 0.0$), the model assigns overwhelming probability to the SEU-maximising gamble in most decision problems: choice behaviour is nearly deterministic. At $\alpha \approx 52$ ($T = 1.5$), concentration remains well above chance (which would correspond to $\alpha \to 0$), but meaningfully more probability mass is allocated to suboptimal alternatives. The 2:1 decline thus represents a shift from near-deterministic SEU maximisation towards a regime where random or heuristic choice patterns are more frequent, though still far from indifference.
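To make the concentration interpretation concrete, the following sketch evaluates the softmax choice rule $p_k \propto \exp(\alpha u_k)$ at the two posterior medians. The utility values are invented for illustration and are not taken from the fitted model:

```{python}
#| label: softmax-illustration
import numpy as np

def softmax_choice_probs(utilities, alpha):
    """Softmax choice rule: p_k ∝ exp(alpha * u_k)."""
    z = alpha * np.asarray(utilities, dtype=float)
    z -= z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical utilities for a K = 4 gamble menu (illustrative only)
u = [0.50, 0.46, 0.42, 0.40]
for a in (110.0, 52.0):
    p = softmax_choice_probs(u, a)
    print(f"alpha = {a:5.1f}: P(best) = {p[0]:.3f}, others = {np.round(p[1:], 3)}")
```

With these illustrative utilities, the best gamble absorbs nearly all probability mass at $\alpha \approx 110$, while a meaningful share leaks to the runner-up at $\alpha \approx 52$.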
### Comparability of α Across Factorial Cells
A recurring issue in cross-cell comparison is the **non-comparability of absolute $\alpha$ values** across cells that use different model variants. The insurance cells ($K = 3$) use m_01 with a Lognormal(3.0, 0.75) prior, while the Ellsberg cells ($K = 4$) use m_02 with a Lognormal(3.5, 0.75) prior. The $\alpha$ parameter interacts with the dimension of the consequence space and the utility geometry in ways that make raw magnitudes non-portable: a value of $\alpha = 50$ implies different choice probabilities under $K = 3$ than under $K = 4$, because the softmax allocates probability over a different number of alternatives (and hence over simplices of different dimension). Consequently, comparisons across the task dimension of the factorial should focus on **directional quantities** — the sign of the slope, $P(\text{slope} < 0)$, and strict monotonicity probabilities — rather than on absolute $\alpha$ levels or slope magnitudes. Within-column comparisons (same task, different LLMs) do not have this limitation, since both cells share the same model and prior.
### Implications for the Factorial Design
This cell confirms that GPT-4o's temperature–sensitivity relationship **generalises across task domains**. The two GPT-4o cells (insurance and Ellsberg) both show clear declines, while the two Claude cells (insurance and Ellsberg) both show weak or absent effects. This pattern strongly supports the conclusion that:
- **LLM is the primary driver.** The main effect of LLM (GPT-4o vs. Claude) is large and consistent.
- **Task is a secondary factor.** Changing the task ($K = 3$ to $K = 4$) does not eliminate or reverse the effect for GPT-4o, and does not create the effect for Claude.
Note that this report analyses each factorial cell independently; it does not formally model the LLM × Task interaction. The [2×2 Factorial Synthesis](../factorial_synthesis/01_factorial_synthesis.qmd) report quantifies the LLM main effect, task main effect, and their interaction by jointly analysing the posterior draws from all four cells, including a difference-in-differences analysis that addresses the interaction directly.
### Connection to the Ambiguity Aversion Literature
The Ellsberg gamble paradigm was originally designed to reveal violations of subjective expected utility theory — specifically, ambiguity aversion (Ellsberg, 1961; Camerer & Weber, 1992). In the human decision-making literature, the canonical finding is that agents prefer gambles with known probabilities over gambles with equivalent expected values but unknown probabilities (Trautmann & van de Kuilen, 2015). Our gamble set spans three ambiguity tiers precisely to allow for such effects.
In this study, however, we use the Ellsberg gambles primarily as a **task domain** for the factorial design — a set of $K = 4$ decision problems that differs structurally from the insurance triage task. The $\alpha$ parameter captures *overall* sensitivity to expected utility maximisation, not ambiguity-specific processing. Whether GPT-4o exhibits ambiguity aversion per se — i.e., whether its subjective probability estimates ($\psi$) or choice patterns systematically shift as a function of gamble ambiguity tier — is an important question that lies beyond the scope of the current temperature-focused analysis. We note this as a natural direction for future work: examining whether the temperature effect on $\alpha$ is modulated by ambiguity level (e.g., whether high-ambiguity gambles show a larger temperature-induced degradation in choice quality) would connect this line of work more directly to the behavioural decision theory literature.
### Embedding Model and Cross-LLM Comparability
Both GPT-4o cells use OpenAI's `text-embedding-3-small` for feature construction, and both Claude cells also use `text-embedding-3-small` (following the same pipeline). The embedding model is therefore held constant across all four factorial cells, preventing the embedding representation from confounding the LLM main effect. Any differences in the PCA features across cells arise from differences in the LLM's natural-language assessments (which vary with both LLM identity and temperature), not from the embedding step.
### Relation to the Broader Literature
The finding that LLM identity is the dominant factor in the temperature–sensitivity relationship contributes to a growing literature examining LLM decision-making and rationality. Recent work has investigated how LLMs perform on classic decision-theoretic tasks (Binz & Schulz, 2023), how API parameters affect output quality (Renze & Guven, 2024), and whether LLMs exhibit systematic biases analogous to those documented in humans (Hagendorff et al., 2023). Our results suggest that the mapping from temperature to choice quality is not a universal property of LLM decoding but depends on model-specific characteristics — potentially reflecting differences in training, architecture, or the relationship between the temperature parameter and the model's internal representations.
### Limitations
1. **No prior sensitivity analysis.** The m_02 prior on $\alpha$ is Lognormal(3.5, 0.75), with prior 90% CI $\approx [10, 114]$. The posterior at $T = 0.0$ ($\alpha \approx 110$) falls near the upper tail of this prior. While $M \approx 300$ observations provide substantial information, and the parameter recovery analysis confirms good identifiability, we have not verified that the slope sign and $P(\text{slope} < 0)$ are robust to alternative prior specifications (e.g., a wider Lognormal(3.5, 1.0) or a shifted Lognormal(4.0, 0.75)). This is a standard robustness check for Bayesian analyses (Kruschke, 2013) and is planned as a supplementary analysis.
2. **Temperature range mismatch.** The GPT-4o cells sample temperatures up to $T = 1.5$, while Claude is limited to $T = 1.0$. The restricted-range analysis in @sec-comparison confirms that the negative slope holds over $T \in [0.0, 1.0]$, but slope *magnitudes* remain difficult to compare directly given the non-identical temperature grids.
3. **No within-study ambiguity analysis.** The gamble set spans three ambiguity tiers, but the current analysis does not examine whether $\alpha$ or choice patterns vary by tier. This limits the study's contribution to the ambiguity aversion literature specifically.
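On point 1, the prior interval is straightforward to verify. This sketch computes 90% central intervals for the baseline prior and the two alternatives mentioned above (log-scale parameterisation, as in Stan):

```{python}
#| label: prior-ci-check
import numpy as np

Z90 = 1.6449  # standard-normal 5% / 95% quantile

def lognormal_ci90(mu, sigma):
    """90% central interval of Lognormal(mu, sigma), log-scale parameters."""
    return float(np.exp(mu - Z90 * sigma)), float(np.exp(mu + Z90 * sigma))

for mu, sigma, label in [(3.5, 0.75, "m_02 baseline"),
                         (3.5, 1.00, "wider"),
                         (4.0, 0.75, "shifted")]:
    lo, hi = lognormal_ci90(mu, sigma)
    print(f"{label:13s} Lognormal({mu}, {sigma}): 90% CI ≈ [{lo:.0f}, {hi:.0f}]")
```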
## Reproducibility {#sec-reproducibility}
### Data Snapshot
| File | Description |
|------|-------------|
| `alpha_draws_T*.npz` | Posterior draws of α (4,000 per condition) |
| `ppc_T*.json` | Posterior predictive check results |
| `diagnostics_T*.txt` | CmdStan diagnostic output |
| `stan_data_T*.json` | Stan-ready data (for refitting) |
| `fit_summary.json` | Summary statistics across conditions |
| `primary_analysis.json` | Pre-computed monotonicity and slope statistics |
| `run_summary.json` | Pipeline metadata and configuration |
| `study_config.yaml` | Frozen copy of the study configuration |
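The posterior draw files can be inspected directly without refitting. This sketch assumes the `T0_0`-style condition keys used in this report's file names; the array name inside each `.npz` archive is not specified in the table above, so the first stored array is taken:

```{python}
#| label: load-draws-example
import numpy as np
from pathlib import Path

def load_alpha_draws(condition_key, data_dir=Path("data")):
    """Load posterior α draws for one condition, e.g. 'T0_0'."""
    with np.load(data_dir / f"alpha_draws_{condition_key}.npz") as archive:
        return archive[archive.files[0]]  # array name assumed: take the first

path = Path("data") / "alpha_draws_T0_0.npz"
if path.exists():  # guard so the document renders even without the snapshot
    draws = load_alpha_draws("T0_0")
    print(f"T0_0: {draws.size} draws, median α = {np.median(draws):.1f}")
```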
### Refitting from Source
```{python}
#| label: refit-example
#| eval: false
#| echo: true
# Uncomment to refit from source data (requires CmdStanPy)
#
# import cmdstanpy
# model = cmdstanpy.CmdStanModel(stan_file="models/m_02.stan")
#
# for t in [0.0, 0.3, 0.7, 1.0, 1.5]:
# key = f"T{str(t).replace('.', '_')}"
# fit = model.sample(
# data=f"data/stan_data_{key}.json",
# chains=4,
# iter_warmup=1000,
# iter_sampling=1000,
# seed=42,
# )
# print(f"T={t}: alpha median = {np.median(fit.stan_variable('alpha')):.1f}")
```
## References
::: {#refs}
:::