Temperature and SEU Sensitivity: GPT-4o × Ellsberg Study

Application Report: GPT-4o × Ellsberg (Cell 1,2)

applications
temperature
ellsberg
m_02
openai
factorial

An investigation of how LLM sampling temperature affects estimated sensitivity (α) to subjective expected utility maximization, using Ellsberg-style gambles (K=4) and GPT-4o (OpenAI). This study isolates the task effect by pairing GPT-4o with the Ellsberg gamble domain used in the Ellsberg study.

Author
Published

June 27, 2026

0.1 Introduction

The initial temperature study (Report 1) found a clear negative relationship between LLM sampling temperature and estimated sensitivity \(\alpha\), using GPT-4o on insurance claims triage (\(K = 3\)). The Ellsberg study (Report 2) then changed both the LLM (to Claude 3.5 Sonnet) and the task (to Ellsberg gambles, \(K = 4\)) simultaneously, finding a weaker effect — but the confounded design made it impossible to disentangle the two factors.

This study completes the \(2 \times 2\) factorial design by pairing GPT-4o with the Ellsberg gamble task — holding the LLM constant relative to the initial study while varying the task, and holding the task constant relative to the Ellsberg study while varying the LLM.

Insurance (K=3) Ellsberg (K=4)
GPT-4o Initial study ✓ This study
Claude Cell (2,1) — new Ellsberg study ✓
ImportantSummary of Findings

GPT-4o shows a strong negative global temperature–α slope on the Ellsberg gamble task (\(\Delta\alpha / \Delta T \approx -38\), \(P(\text{slope} < 0) \approx 0.98\), 90% CI \([-72, -10]\)), replicating qualitatively the pattern observed in the initial temperature study with insurance claims triage. Strict monotonicity is low (\(P \approx 0.09\)): the global decline is real but is dominated by separation at \(T = 1.5\), with adjacent low/mid-temperature pairs overlapping substantially. Within this cell, the negative global slope is therefore robust but coarse-grained. Cross-cell attribution — the LLM main effect, the task main effect, and any interaction — is reported in the Factorial Synthesis.

0.2 Experimental Design

0.2.1 Task and Conditions

We use the Ellsberg gamble task from the Ellsberg study: GPT-4o evaluates gambles involving urns with unknown color compositions. The alternative pool consists of 30 Ellsberg-style urn gambles, each with \(K = 4\) possible monetary consequences ($0, $1, $2, $3) corresponding to drawing different-colored balls from urns of 60–120 balls. The gambles are organized into three ambiguity tiers:

  • Tier 1 (No ambiguity; E01–E08): 8 gambles with fully specified ball counts — pure risk with known objective probabilities.
  • Tier 2 (Moderate ambiguity; E09–E20): 12 gambles with partially bounded counts (“at least”/“at most” constraints); urn totals always stated.
  • Tier 3 (High ambiguity; E21–E30): 10 gambles where only one color’s count is known; the remainder is an unknown mixture — closest to Ellsberg (1961)’s original setup.

The LLM first assesses each gamble individually, and then makes a choice among alternatives in each problem. The gamble set is identical to that used in the Claude × Ellsberg study, ensuring that any differences in results are attributable to the LLM, not the stimuli.

Five temperature levels define the between-condition factor:

Show code
conditions = pd.DataFrame({
    'Level': [1, 2, 3, 4, 5],
    'Temperature': [0.0, 0.3, 0.7, 1.0, 1.5],
    'Description': [
        'Deterministic (greedy decoding)',
        'Low variance',
        'Moderate variance',
        'High variance',
        'Very high variance'
    ]
})
conditions
Table 1: Experimental conditions. Each temperature level constitutes a separate model fit.
Level Temperature Description
0 1 0.0 Deterministic (greedy decoding)
1 2 0.3 Low variance
2 3 0.7 Moderate variance
3 4 1.0 High variance
4 5 1.5 Very high variance
NoteTemperature Range

The OpenAI API supports temperature values in \([0.0, 2.0]\), allowing a wider range than the \([0.0, 1.0]\) supported by Anthropic. The temperature values \(\{0.0, 0.3, 0.7, 1.0, 1.5\}\) match those used in the initial temperature study (GPT-4o × Insurance), enabling a cleaner within-provider comparison of the temperature grid; α levels and slope magnitudes across cells still require care because the task, model variant (m_01 vs. m_02), prior, and \(K\) also differ.

0.2.2 Design Parameters

Show code
import yaml

with open(data_dir / "study_config.yaml") as f:
    config = yaml.safe_load(f)

with open(data_dir / "run_summary.json") as f:
    run_summary = json.load(f)

print(f"Study Design:")
print(f"  Decision problems (M):      {config['num_problems']} base × {config['num_presentations']} presentations = {config['num_problems'] * config['num_presentations']}")
print(f"  Alternatives per problem:    {config['min_alternatives']}{config['max_alternatives']}")
print(f"  Consequences (K):            {config['K']}")
print(f"  Embedding dimensions (D):    {config['target_dim']}")
print(f"  Distinct alternatives (R):   {run_summary['phases']['phase3_data_prep']['per_temperature']['0.0']['R']}")
print(f"  LLM model:                   {config['llm_model']}")
print(f"  Embedding model:             {config['embedding_model']}")
print(f"  Provider:                    {config['provider']}")
Study Design:
  Decision problems (M):      100 base × 3 presentations = 300
  Alternatives per problem:    2–4
  Consequences (K):            4
  Embedding dimensions (D):    32
  Distinct alternatives (R):   30
  LLM model:                   gpt-4o
  Embedding model:             text-embedding-3-small
  Provider:                    openai

Each of the 100 base problems is presented \(P = 3\) times with alternatives shuffled to different positions, yielding approximately \(M = 300\) observations per temperature condition. The position counterbalancing mitigates any systematic position bias (e.g., a tendency to select the first-listed alternative), which was identified as a potential confound in early pilot work.

0.2.3 Feature Construction

Alternative features are constructed through a two-stage embedding process. First, GPT-4o assesses each Ellsberg gamble at the relevant temperature, producing a natural-language evaluation. These assessments are embedded using text-embedding-3-small (OpenAI), yielding high-dimensional vectors. All embeddings across temperature conditions are pooled and projected via PCA to \(D = 32\) dimensions.

PCA Summary:
  Components retained: 32
  Total variance explained: 84.1%
  First 5 components: 47.4%
  First 10 components: 62.1%

The PCA reduction to \(D = 32\) components captures the large majority of variance in GPT-4o’s Ellsberg gamble assessments. The fact that relatively few components account for most of the variance indicates that the embedding space, while nominally high-dimensional, has a moderate effective dimensionality — consistent with the assessment texts varying along a limited number of semantic axes related to gamble value, risk, and ambiguity.

0.2.4 Data Quality

NA Summary:
  Overall: 1 / 1500 (0.1%)
  T=0.0: 0 / 300 (0.0%)
  T=0.3: 0 / 300 (0.0%)
  T=0.7: 0 / 300 (0.0%)
  T=1.0: 0 / 300 (0.0%)
  T=1.5: 1 / 300 (0.3%)

The overall NA rate is low (well under 1%), comparable to the rates observed in the other factorial cells. NA responses (cases where the LLM failed to produce a parseable choice) are excluded from analysis; at these rates, missingness is unlikely to introduce systematic bias.

0.2.5 Comparison with Other Factorial Cells

Show code
comparison = pd.DataFrame({
    'Parameter': ['LLM', 'Task domain', 'Consequences (K)',
                  'Alternatives (R)', 'Observations per T',
                  'Temperature range', 'Stan model'],
    'Cell (1,1) Initial': ['GPT-4o', 'Insurance triage', '3',
                           '30', '~300', '[0.0, 0.3, 0.7, 1.0, 1.5]', 'm_01'],
    'Cell (1,2) This study': ['GPT-4o', 'Ellsberg gambles', '4',
                              '30', '~300', '[0.0, 0.3, 0.7, 1.0, 1.5]', 'm_02'],
    'Cell (2,2) Ellsberg': ['Claude 3.5 Sonnet', 'Ellsberg gambles', '4',
                            '30', '~300', '[0.0, 0.2, 0.5, 0.8, 1.0]', 'm_02'],
})
comparison
Table 2: Design comparison across the 2×2 factorial. This study (Cell 1,2) shares GPT-4o with Cell (1,1) and shares Ellsberg gambles with Cell (2,2).
Parameter Cell (1,1) Initial Cell (1,2) This study Cell (2,2) Ellsberg
0 LLM GPT-4o GPT-4o Claude 3.5 Sonnet
1 Task domain Insurance triage Ellsberg gambles Ellsberg gambles
2 Consequences (K) 3 4 4
3 Alternatives (R) 30 30 30
4 Observations per T ~300 ~300 ~300
5 Temperature range [0.0, 0.3, 0.7, 1.0, 1.5] [0.0, 0.3, 0.7, 1.0, 1.5] [0.0, 0.2, 0.5, 0.8, 1.0]
6 Stan model m_01 m_02 m_02

0.3 Model and Prior Calibration

0.3.1 The m_02 Model Variant

We fit the m_02 model — the same variant used in the Ellsberg study. The prior on \(\alpha\) is calibrated for the Ellsberg gamble task’s \(K = 4\) consequence space:

m_0 (foundational) m_01 (insurance) m_02 (this study & Ellsberg)
\(\alpha\) prior \(\text{Lognormal}(0, 1)\) \(\text{Lognormal}(3.0, 0.75)\) \(\text{Lognormal}(3.5, 0.75)\)
Prior median \(\approx 1\) \(\approx 20\) \(\approx 33\)
Prior 90% CI \([0.19, 5.0]\) \([5.5, 67]\) \([10, 124]\)
\(K\) generic 3 4

Using the same model and prior as the Ellsberg study ensures that any difference in results is attributable to the LLM change (GPT-4o vs. Claude), not to modeling differences.

0.4 Model Validation

0.4.1 Parameter Recovery

We validate that m_02’s parameters are identifiable under this study’s design (\(M \approx 300\), \(K = 4\), \(D = 32\), \(R = 30\)) via 20 iterations of parameter recovery.

Show code
recovery_dir = os.path.join(project_root, "results", "parameter_recovery", "gpt4o_ellsberg_recovery")
recovery_summary_dir = os.path.join(recovery_dir, "recovery_summary")

with open(os.path.join(recovery_summary_dir, "recovery_statistics.json")) as f:
    recovery_stats = json.load(f)

true_params_path = os.path.join(recovery_dir, "all_true_parameters.json")
with open(true_params_path) as f:
    all_true_params = json.load(f)

posterior_summaries = []
true_params_list = []
for i in range(1, 21):
    iter_dir = os.path.join(recovery_dir, f"iteration_{i}")
    summary_path = os.path.join(iter_dir, "posterior_summary.csv")
    if os.path.exists(summary_path):
        df = pd.read_csv(summary_path, index_col=0)
        posterior_summaries.append(df)
        true_params_list.append(all_true_params[i - 1])

n_successful = len(posterior_summaries)
print(f"Loaded {n_successful} recovery iterations")
Show code
alpha_true = np.array([p['alpha'] for p in true_params_list])
alpha_mean = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries])
alpha_lower = np.array([s.loc['alpha', '5%'] for s in posterior_summaries])
alpha_upper = np.array([s.loc['alpha', '95%'] for s in posterior_summaries])

alpha_bias = np.mean(alpha_mean - alpha_true)
alpha_rmse = np.sqrt(np.mean((alpha_mean - alpha_true)**2))
alpha_coverage = np.mean((alpha_true >= alpha_lower) & (alpha_true <= alpha_upper))
alpha_ci_width = np.mean(alpha_upper - alpha_lower)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax = axes[0]
ax.scatter(alpha_true, alpha_mean, alpha=0.7, s=60, c=SEU_COLORS['primary'], edgecolor='white')
lims = [min(alpha_true.min(), alpha_mean.min()) * 0.9,
        max(alpha_true.max(), alpha_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('Estimated α (posterior mean)', fontsize=12)
ax.set_title(f'α Recovery: Bias={alpha_bias:.2f}, RMSE={alpha_rmse:.2f}', fontsize=12)
ax.legend()
ax.set_aspect('equal')

ax = axes[1]
for i in range(len(alpha_true)):
    covered = (alpha_true[i] >= alpha_lower[i]) & (alpha_true[i] <= alpha_upper[i])
    color = 'forestgreen' if covered else 'crimson'
    ax.plot([i, i], [alpha_lower[i], alpha_upper[i]], color=color, linewidth=2, alpha=0.7)
    ax.scatter(i, alpha_mean[i], color=color, s=40, zorder=3)

ax.scatter(np.arange(len(alpha_true)), alpha_true, color='black', s=60, marker='x',
           label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('α', fontsize=12)
ax.set_title(f'α: 90% Credible Intervals (Coverage = {alpha_coverage:.0%})', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Figure 1: Recovery of the sensitivity parameter α under the m_02 prior with the GPT-4o × Ellsberg design (K=4). Left: true vs. estimated values with identity line. Right: 90% credible intervals for each iteration, colored by whether they contain the true value.
Show code
K_val = 4
D_val = 32

all_beta_coverage = []
for k in range(K_val):
    for d in range(D_val):
        param_name = f"beta[{k+1},{d+1}]"
        try:
            bt = np.array([p['beta'][k][d] for p in true_params_list])
            bl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
            bu = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
            all_beta_coverage.append(np.mean((bt >= bl) & (bt <= bu)))
        except (KeyError, IndexError):
            pass

all_delta_coverage = []
for k in range(K_val - 1):
    param_name = f"delta[{k+1}]"
    try:
        dt = np.array([p['delta'][k] for p in true_params_list])
        dl = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
        du = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
        all_delta_coverage.append(np.mean((dt >= dl) & (dt <= du)))
    except (KeyError, IndexError):
        pass

metrics = pd.DataFrame([
    {'Parameter': 'α', 'Bias': f'{alpha_bias:.2f}', 'RMSE': f'{alpha_rmse:.2f}',
     'Coverage (90%)': f'{alpha_coverage:.0%}', 'CI Width': f'{alpha_ci_width:.1f}'},
    {'Parameter': f'β (mean over {K_val*D_val})',
     'Bias': '—', 'RMSE': '—',
     'Coverage (90%)': f'{np.mean(all_beta_coverage):.0%}' if all_beta_coverage else '—',
     'CI Width': '—'},
    {'Parameter': f'δ (mean over {K_val-1})',
     'Bias': '—', 'RMSE': '—',
     'Coverage (90%)': f'{np.mean(all_delta_coverage):.0%}' if all_delta_coverage else '—',
     'CI Width': '—'},
])
metrics
Table 3: Parameter recovery metrics for m_02 with the GPT-4o × Ellsberg design (M≈300, K=4, D=32, R=30).
Parameter Bias RMSE Coverage (90%) CI Width
0 α -0.36 10.41 90% 39.7
1 β (mean over 128) 90%
2 δ (mean over 3) 85%
NoteSBC Inherited, Not Re-Run

Simulation-Based Calibration (SBC) was not performed on this dataset. We rely on the SBC validation from the Ellsberg study report, which used the same m_02.stan model and the same α ~ LogNormal(3.5, 0.75) prior. SBC tests sampler calibration under the prior-and-likelihood geometry, so the existing SBC results are applicable to this application because the model and prior match; this is not a dataset-specific SBC run, and we note that distinction explicitly. The same convention is used (and labeled consistently) in the Claude insurance study limitations section. The parameter recovery analysis above (which was performed with this study’s specific design parameters) provides the complementary, data-specific assurance that the posterior concentrates correctly around known true values under realistic conditions.

0.5 Results

0.5.1 Loading Posterior Draws

Show code
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]
temp_labels = {t: f"T={t}" for t in temperatures}

alpha_draws = {}
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    data = np.load(data_dir / f"alpha_draws_{key}.npz")
    alpha_draws[t] = data['alpha']

with open(data_dir / "primary_analysis.json") as f:
    analysis = json.load(f)

with open(data_dir / "fit_summary.json") as f:
    fit_summary = json.load(f)
  T=0.0: 4,000 posterior draws loaded
  T=0.3: 4,000 posterior draws loaded
  T=0.7: 4,000 posterior draws loaded
  T=1.0: 4,000 posterior draws loaded
  T=1.5: 4,000 posterior draws loaded

0.5.2 MCMC Diagnostics

Show code
diag_rows = []
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    with open(data_dir / f"diagnostics_{key}.txt") as f:
        diag_text = f.read()

    if "No divergent transitions" in diag_text:
        n_div = 0
    else:
        match = re.search(r'(\d+) of (\d+)', diag_text)
        n_div = int(match.group(1)) if match else 0

    # R-hat: satisfactory only if CmdStan reports the satisfactory phrase and
    # no parameters are flagged as exceeding the 1.01 threshold.
    rhat_warn = "R-hat greater than 1.01" in diag_text
    rhat_ok = ("R-hat values satisfactory" in diag_text) and not rhat_warn
    # α-specific check: the CmdStan warning block lists offending parameters by
    # name; if "alpha" never appears in the file, no α R-hat exceeded 1.01.
    rhat_alpha_ok = "alpha" not in diag_text
    ess_ok = "effective sample size satisfactory" in diag_text
    ebfmi_ok = "E-BFMI satisfactory" in diag_text

    diag_rows.append({
        'Temperature': t,
        'Divergences': f"{n_div}/4000",
        'R̂ (α)': '✓' if rhat_alpha_ok else '✗',
        'R̂ (all)': '✓' if rhat_ok else '✗',
        'ESS': '✓' if ess_ok else '✗',
        'E-BFMI': '✓' if ebfmi_ok else '✗',
    })

pd.DataFrame(diag_rows)
Table 4: MCMC diagnostics for all five temperature conditions. All fits used 4 chains with 1,000 warmup and 1,000 sampling iterations each (4,000 post-warmup draws total).
Temperature Divergences R̂ (α) R̂ (all) ESS E-BFMI
0 0.0 4/4000
1 0.3 0/4000
2 0.7 0/4000
3 1.0 0/4000
4 1.5 6/4000

Posterior predictive checks (reported below) are satisfactory across all five conditions, and effective sample size, treedepth, and E-BFMI diagnostics are clean. Two of the five fits show small numbers of divergent transitions (4/4000 at \(T = 0.0\) and 6/4000 at \(T = 1.5\), each \(\le 0.15\%\)), and several fits flag rank-normalized split R-hat \(> 1.01\) for individual entries of the per-item parameters β and ψ — these are nuisance parameters in the SEU model rather than the sensitivity parameter α, and the α posteriors themselves remain well-mixed (R̂ < 1.01 for every α in every condition). We therefore distinguish the α-specific diagnostics (column R̂ (α)) from the all-parameters diagnostics (column R̂ (all)).

0.5.3 Posterior Summaries

Show code
summary = analysis['summary_table']

rows = []
for s in summary:
    rows.append({
        'Temperature': s['temperature'],
        'Median': f"{s['median']:.1f}",
        'Mean': f"{s['mean']:.1f}",
        'SD': f"{s['sd']:.1f}",
        '90% CI': f"[{s['ci_low']:.1f}, {s['ci_high']:.1f}]",
    })

pd.DataFrame(rows)
Table 5: Posterior summaries for the sensitivity parameter α at each temperature level. Intervals are 90% credible intervals.
Temperature Median Mean SD 90% CI
0 0.0 110.4 114.2 28.6 [74.4, 167.2]
1 0.3 106.9 111.1 28.3 [72.6, 163.8]
2 0.7 99.5 103.6 28.0 [65.4, 154.6]
3 1.0 84.0 87.4 22.2 [57.1, 126.1]
4 1.5 52.2 54.4 14.4 [35.5, 80.3]

A clear declining pattern is visible: \(\alpha\) drops from \(\approx 110\) at \(T = 0.0\) to \(\approx 52\) at \(T = 1.5\), a roughly 2:1 ratio across the temperature range.

0.5.4 Forest Plot

Show code
fig, ax = plt.subplots(figsize=(8, 5))

y_positions = np.arange(len(temperatures))[::-1]

for i, t in enumerate(temperatures):
    draws = alpha_draws[t]
    median = np.median(draws)
    q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])

    y = y_positions[i]
    ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
    ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
    ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
            markeredgecolor='white', markeredgewidth=1.5, zorder=5)

ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('Sensitivity (α)')
ax.set_title('Posterior Distributions of α by Temperature')
ax.grid(axis='x', alpha=0.3)
ax.grid(axis='y', alpha=0)

plt.tight_layout()
plt.show()
Figure 2: Forest plot of posterior α distributions across temperature conditions. Points show posterior medians; thick bars span the 50% credible interval; thin bars span the 90% credible interval.

0.5.5 Posterior Densities

Show code
from scipy.stats import gaussian_kde

fig, ax = plt.subplots(figsize=(8, 5))

for i, t in enumerate(temperatures):
    draws = alpha_draws[t]
    kde = gaussian_kde(draws)
    x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
    ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
    ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
            label=f'T = {t} (median = {np.median(draws):.0f})')

ax.set_xlabel('Sensitivity (α)')
ax.set_ylabel('Density')
ax.set_title('Posterior Density of α')
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()
Figure 3: Kernel density estimates of the posterior α distributions. The highest temperature (\(T = 1.5\)) separates clearly from the rest; the four lower temperatures overlap substantially, consistent with the modest adjacent pairwise probabilities reported in Section 0.6.

0.5.6 Posterior Predictive Checks

Show code
ppc_rows = []
for t in temperatures:
    key = f"T{str(t).replace('.', '_')}"
    with open(data_dir / f"ppc_{key}.json") as f:
        ppc = json.load(f)

    pvals = ppc['p_values']
    ppc_rows.append({
        'Temperature': t,
        'Log-likelihood': f"{pvals['ll']:.3f}",
        'Modal frequency': f"{pvals['modal']:.3f}",
        'Mean probability': f"{pvals['prob']:.3f}",
    })

pd.DataFrame(ppc_rows)
Table 6: Posterior predictive check p-values for each temperature condition. Values near 0.5 indicate good calibration.
Temperature Log-likelihood Modal frequency Mean probability
0 0.0 0.448 0.403 0.373
1 0.3 0.456 0.320 0.340
2 0.7 0.436 0.370 0.397
3 1.0 0.457 0.384 0.381
4 1.5 0.495 0.534 0.466

All posterior predictive p-values fall within an acceptable range, indicating adequate model fit across conditions.

0.6 Monotonicity Analysis

0.6.1 Global Slope

Show code
slopes = analysis['slope']

temp_array = np.array(temperatures)
slope_draws = []
for draw_idx in range(len(alpha_draws[temperatures[0]])):
    alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in temperatures])
    b = np.cov(temp_array, alphas_at_draw)[0, 1] / np.var(temp_array)
    slope_draws.append(b)
slope_draws = np.array(slope_draws)

fig, ax = plt.subplots(figsize=(8, 4))

kde = gaussian_kde(slope_draws)
x_grid = np.linspace(np.percentile(slope_draws, 0.5), np.percentile(slope_draws, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.3, color=SEU_COLORS['primary'])
ax.plot(x_grid, kde(x_grid), color=SEU_COLORS['primary'], linewidth=2)

median_slope = np.median(slope_draws)
ax.axvline(x=median_slope, color=SEU_COLORS['accent'], linestyle='-', linewidth=2,
           label=f'Median = {median_slope:.1f}')
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5, label='No effect')

q05, q95 = np.percentile(slope_draws, [5, 95])
mask = (x_grid >= q05) & (x_grid <= q95)
ax.fill_between(x_grid[mask], kde(x_grid[mask]), alpha=0.15, color=SEU_COLORS['accent'])
ax.axvline(x=q05, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)
ax.axvline(x=q95, color=SEU_COLORS['accent'], linestyle=':', alpha=0.6)

ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Posterior Distribution of Temperature–Sensitivity Slope')
ax.legend()

plt.tight_layout()
plt.show()

print(f"Slope summary:")
print(f"  Median:  {median_slope:.1f}")
print(f"  90% CI:  [{q05:.1f}, {q95:.1f}]")
print(f"  P(slope < 0): {np.mean(slope_draws < 0):.3f}")
Figure 4: Posterior distribution of the slope Δα/ΔT. The bulk of the distribution lies below zero, providing strong evidence for a negative temperature–sensitivity relationship.
Slope summary:
  Median:  -48.0
  90% CI:  [-90.2, -12.5]
  P(slope < 0): 0.984

0.6.2 Pairwise Comparisons

Show code
pairs = analysis['pairwise_comparisons']

pair_rows = []
for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    if prob > 0.95:
        strength = '●●● (strong)'
    elif prob > 0.8:
        strength = '●● (moderate)'
    elif prob > 0.65:
        strength = '● (weak)'
    elif prob < 0.35:
        strength = '○ (reversed)'
    else:
        strength = '— (indistinguishable)'
    pair_rows.append({
        'Comparison': f'α(T={t1}) > α(T={t2})',
        'P': f'{prob:.3f}',
        'Evidence': strength,
    })

pd.DataFrame(pair_rows)
Table 7: Posterior probability that α is higher at the lower temperature in each pair.
Comparison P Evidence
0 α(T=0.0) > α(T=0.3) 0.532 — (indistinguishable)
1 α(T=0.0) > α(T=0.7) 0.611 — (indistinguishable)
2 α(T=0.0) > α(T=1.0) 0.786 ● (weak)
3 α(T=0.0) > α(T=1.5) 0.981 ●●● (strong)
4 α(T=0.3) > α(T=0.7) 0.578 — (indistinguishable)
5 α(T=0.3) > α(T=1.0) 0.750 ● (weak)
6 α(T=0.3) > α(T=1.5) 0.978 ●●● (strong)
7 α(T=0.7) > α(T=1.0) 0.678 ● (weak)
8 α(T=0.7) > α(T=1.5) 0.959 ●●● (strong)
9 α(T=1.0) > α(T=1.5) 0.913 ●● (moderate)

0.6.3 Strict Monotonicity

Show code
n_draws = len(alpha_draws[0.0])
strictly_decreasing = 0

for i in range(n_draws):
    vals = [alpha_draws[t][i] for t in temperatures]
    if all(vals[j] > vals[j+1] for j in range(len(vals)-1)):
        strictly_decreasing += 1

p_mono = strictly_decreasing / n_draws
print(f"P(α strictly decreasing across all T): {p_mono:.4f}")
P(α strictly decreasing across all T): 0.0902

The posterior probability of strict monotonicity — that \(\alpha\) decreases at every successive temperature step — provides a stringent test of the ordering. Combined with the pairwise comparisons above, these results indicate that the negative temperature–\(\alpha\) relationship is not driven by any single temperature pair but reflects a consistent ordering across the full range.

0.7 Comparison with Other Factorial Cells

This comparison addresses two questions: 1. Task effect — Is GPT-4o’s temperature–α relationship stable across tasks? (Compare with Cell 1,1: GPT-4o × Insurance) 2. LLM effect — Does the choice of LLM matter for the Ellsberg task? (Compare with Cell 2,2: Claude × Ellsberg)

Show code
initial_data_dir = Path("..") / "temperature_study" / "data"
ellsberg_data_dir = Path("..") / "ellsberg_study" / "data"

with open(initial_data_dir / "primary_analysis.json") as f:
    initial_analysis = json.load(f)

with open(ellsberg_data_dir / "primary_analysis.json") as f:
    ellsberg_analysis = json.load(f)

initial_temps = [s['temperature'] for s in initial_analysis['summary_table']]
initial_medians = [s['median'] for s in initial_analysis['summary_table']]
initial_lows = [s['ci_low'] for s in initial_analysis['summary_table']]
initial_highs = [s['ci_high'] for s in initial_analysis['summary_table']]

this_medians = [s['median'] for s in analysis['summary_table']]
this_lows = [s['ci_low'] for s in analysis['summary_table']]
this_highs = [s['ci_high'] for s in analysis['summary_table']]

ells_temps = [s['temperature'] for s in ellsberg_analysis['summary_table']]
ells_medians = [s['median'] for s in ellsberg_analysis['summary_table']]
ells_lows = [s['ci_low'] for s in ellsberg_analysis['summary_table']]
ells_highs = [s['ci_high'] for s in ellsberg_analysis['summary_table']]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: GPT-4o on both tasks
ax = axes[0]
ax.errorbar(initial_temps, initial_medians,
            yerr=[np.array(initial_medians) - np.array(initial_lows),
                  np.array(initial_highs) - np.array(initial_medians)],
            fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
            capsize=5, capthick=1.5, label='Insurance (K=3)')
ax.errorbar(temperatures, this_medians,
            yerr=[np.array(this_medians) - np.array(this_lows),
                  np.array(this_highs) - np.array(this_medians)],
            fmt='s-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
            capsize=5, capthick=1.5, label='Ellsberg (K=4)')
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('GPT-4o: Task Effect\n(Same LLM, Different Tasks)')
ax.legend()

# Right panel: Ellsberg with both LLMs
ax = axes[1]
ax.errorbar(temperatures, this_medians,
            yerr=[np.array(this_medians) - np.array(this_lows),
                  np.array(this_highs) - np.array(this_medians)],
            fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
            capsize=5, capthick=1.5, label='GPT-4o')
ax.errorbar(ells_temps, ells_medians,
            yerr=[np.array(ells_medians) - np.array(ells_lows),
                  np.array(ells_highs) - np.array(ells_medians)],
            fmt='s-', color=SEU_COLORS['accent'], linewidth=2, markersize=8,
            capsize=5, capthick=1.5, label='Claude')
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Ellsberg Task: LLM Effect\n(Same Task, Different LLMs)')
ax.legend()

plt.tight_layout()
plt.show()
Figure 5: Cross-study comparisons. Left: GPT-4o across both tasks (isolating the task effect) — both show clear negative slopes. Right: Ellsberg task across both LLMs (isolating the LLM effect) — GPT-4o shows the effect but Claude does not.
Show code
# Handle different field names
initial_slope_val = initial_analysis['slope'].get('median', initial_analysis['slope'].get('slope', 'N/A'))
initial_p_neg = initial_analysis['slope'].get('p_negative', None)

ellsberg_slope_val = ellsberg_analysis['slope'].get('median', ellsberg_analysis['slope'].get('slope', 'N/A'))
ellsberg_p_neg = ellsberg_analysis['slope'].get('p_negative', None)

cross = pd.DataFrame([
    {'Cell': '(1,1) GPT-4o × Insurance',
     'Slope median': f"{initial_slope_val:.1f}" if isinstance(initial_slope_val, (int, float)) else initial_slope_val,
     'P(slope < 0)': f"{initial_p_neg:.3f}" if initial_p_neg else '> 0.99',
     'P(strict mono)': f"{initial_analysis['monotonicity_prob']:.3f}"},
    {'Cell': '(1,2) GPT-4o × Ellsberg (this)',
     'Slope median': f"{analysis['slope']['median']:.1f}",
     'P(slope < 0)': f"{analysis['slope']['p_negative']:.3f}",
     'P(strict mono)': f"{analysis['monotonicity_prob']:.3f}"},
    {'Cell': '(2,2) Claude × Ellsberg',
     'Slope median': f"{ellsberg_slope_val:.1f}" if isinstance(ellsberg_slope_val, (int, float)) else ellsberg_slope_val,
     'P(slope < 0)': f"{ellsberg_p_neg:.3f}" if ellsberg_p_neg else '—',
     'P(strict mono)': f"{ellsberg_analysis['monotonicity_prob']:.3f}"},
])
cross
Table 8: Cross-study comparison across the 2×2 factorial.
Cell Slope median P(slope < 0) P(strict mono)
0 (1,1) GPT-4o × Insurance -24.6 > 0.99 0.125
1 (1,2) GPT-4o × Ellsberg (this) -38.4 0.984 0.090
2 (2,2) Claude × Ellsberg -18.8 0.766 0.009

0.7.1 Key Observations

  1. Task effect (GPT-4o row): GPT-4o shows a clear negative slope on both tasks — approximately \(-25\) on insurance and \(-38\) on Ellsberg. The qualitative pattern (clear negative relationship) replicates across task domains. However, the absolute slope magnitudes and \(\alpha\) levels are not directly comparable across these cells, because they use different model variants (m_01 vs. m_02), different priors (Lognormal(3.0, 0.75) vs. Lognormal(3.5, 0.75)), and different numbers of consequences (\(K = 3\) vs. \(K = 4\)). The \(\alpha\) parameter’s scale is implicitly set by the interaction of prior, consequence space, and utility geometry, so a value of \(\alpha \approx 110\) in the \(K = 4\) Ellsberg cell does not have the same behavioral meaning as \(\alpha \approx 41\) in the \(K = 3\) insurance cell. The key inference is that both cells exhibit a clearly negative slope, not that one slope is “larger” than the other.

  2. LLM effect (Ellsberg column): On the same Ellsberg task, GPT-4o obtains \(P(\text{slope} < 0) \approx 0.98\) while Claude obtains \(P(\text{slope} < 0) \approx 0.77\). The LLM makes a substantial difference. Since both cells use the same m_02 model and the same prior, the \(\alpha\) estimates are directly comparable within the Ellsberg column.

  3. Pattern: GPT-4o consistently shows clear monotonic decline; Claude consistently shows weaker or absent effects.

WarningTemperature Range Mismatch

The GPT-4o cells use temperatures \(\{0.0, 0.3, 0.7, 1.0, 1.5\}\) (\(\Delta T = 1.5\)) while the Claude cells use \(\{0.0, 0.2, 0.5, 0.8, 1.0\}\) (\(\Delta T = 1.0\)) due to API constraints. The slopes (\(\Delta\alpha / \Delta T\)) are estimated over different covariate ranges, which complicates direct magnitude comparison. The GPT-4o slope may appear steeper in part because it includes the \(T = 1.5\) point, where the largest marginal drop occurs. The restricted-range analysis below addresses this by recomputing the GPT-4o slope over \(T \in [0.0, 1.0]\) only.

0.7.2 Restricted-Range Slope (T ∈ [0.0, 1.0])

To enable a fairer cross-LLM comparison, we recompute the GPT-4o slope using only the four temperature levels within \([0.0, 1.0]\), approximately matching Claude’s range.

Show code
restricted_temps = [0.0, 0.3, 0.7, 1.0]
restricted_temp_array = np.array(restricted_temps)

restricted_slope_draws = []
for draw_idx in range(len(alpha_draws[0.0])):
    alphas_at_draw = np.array([alpha_draws[t][draw_idx] for t in restricted_temps])
    b = np.cov(restricted_temp_array, alphas_at_draw)[0, 1] / np.var(restricted_temp_array)
    restricted_slope_draws.append(b)
restricted_slope_draws = np.array(restricted_slope_draws)

r_median = np.median(restricted_slope_draws)
r_q05, r_q95 = np.percentile(restricted_slope_draws, [5, 95])
r_p_neg = np.mean(restricted_slope_draws < 0)

print(f"Restricted-range slope (T ∈ [0.0, 1.0]):")
print(f"  Median:       {r_median:.1f}")
print(f"  90% CI:       [{r_q05:.1f}, {r_q95:.1f}]")
print(f"  P(slope < 0): {r_p_neg:.3f}")
print(f"")
print(f"Full-range slope (T ∈ [0.0, 1.5]):")
print(f"  Median:       {median_slope:.1f}")
print(f"  P(slope < 0): {np.mean(slope_draws < 0):.3f}")
Restricted-range slope (T ∈ [0.0, 1.0]):
  Median:       -33.3
  90% CI:       [-110.8, 37.2]
  P(slope < 0): 0.774

Full-range slope (T ∈ [0.0, 1.5]):
  Median:       -48.0
  P(slope < 0): 0.984

The negative relationship is preserved — and remains strong — even when the high-temperature \(T = 1.5\) condition is excluded. This confirms that the effect is not an artifact of the extended temperature range available for GPT-4o, and supports the cross-LLM comparison on a common temperature interval.

0.7.3 Summary Heatmap

Show code
pairs = analysis['pairwise_comparisons']

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

medians = [np.median(alpha_draws[t]) for t in temperatures]
q05s_plot = [np.percentile(alpha_draws[t], 5) for t in temperatures]
q95s_plot = [np.percentile(alpha_draws[t], 95) for t in temperatures]

axes[0].errorbar(temperatures, medians,
                 yerr=[np.array(medians) - np.array(q05s_plot),
                       np.array(q95s_plot) - np.array(medians)],
                 fmt='o-', color=SEU_COLORS['primary'], linewidth=2, markersize=8,
                 capsize=5, capthick=1.5)
axes[0].set_xlabel('Temperature')
axes[0].set_ylabel('Sensitivity (α)')
axes[0].set_title('α vs. Temperature')
axes[0].set_xticks(temperatures)

n_temps = len(temperatures)
heatmap = np.full((n_temps, n_temps), np.nan)

for key, prob in pairs.items():
    t1, t2 = key.split('_vs_')
    i = temperatures.index(float(t1))
    j = temperatures.index(float(t2))
    heatmap[i, j] = prob
    heatmap[j, i] = 1 - prob

np.fill_diagonal(heatmap, 0.5)

im = axes[1].imshow(heatmap, cmap='RdYlGn', vmin=0, vmax=1, aspect='equal')
axes[1].set_xticks(range(n_temps))
axes[1].set_xticklabels([f'{t}' for t in temperatures])
axes[1].set_yticks(range(n_temps))
axes[1].set_yticklabels([f'{t}' for t in temperatures])
axes[1].set_xlabel('Temperature (column)')
axes[1].set_ylabel('Temperature (row)')
axes[1].set_title('P(α_row > α_col)')

for i in range(n_temps):
    for j in range(n_temps):
        if not np.isnan(heatmap[i, j]):
            color = 'white' if heatmap[i, j] > 0.8 or heatmap[i, j] < 0.2 else 'black'
            axes[1].text(j, i, f'{heatmap[i, j]:.2f}', ha='center', va='center',
                        fontsize=9, color=color)

plt.colorbar(im, ax=axes[1], shrink=0.8)

plt.tight_layout()
plt.show()
Figure 6: Left: α vs. temperature showing the declining pattern. Right: pairwise posterior probabilities P(α_row > α_col).

0.8 Discussion

0.8.1 Summary of Findings

GPT-4o shows a clear negative temperature–sensitivity relationship on the Ellsberg gamble task:

  1. Strong negative global slope. The posterior slope \(\Delta\alpha / \Delta T\) has a median of \(\approx -38\), with \(P(\text{slope} < 0) \approx 0.98\) and a 90% CI of \([-72, -10]\) that excludes zero. By contrast, the probability of strict monotonicity — \(\alpha\) decreasing at every successive temperature step — is only \(\approx 0.09\) (see Section 0.6). The pattern is best described as a strong negative global trend driven especially by the gap between \(T = 1.0\) and \(T = 1.5\), rather than a consistently ordered decline at every adjacent temperature.

  2. Globally declining but not strictly monotone. The \(\alpha\) estimates decline from a median of \(\approx 110\) at \(T = 0.0\) to \(\approx 52\) at \(T = 1.5\), a roughly 2:1 ratio across the full temperature range. Adjacent low/mid-temperature pairs overlap substantially in their posteriors — the pairwise probabilities \(P(\alpha_{0.0} > \alpha_{0.3}) = 0.53\), \(P(\alpha_{0.3} > \alpha_{0.7}) = 0.58\), and \(P(\alpha_{0.7} > \alpha_{1.0}) = 0.68\) are all modest — while the high-temperature pairs separate clearly (\(P(\alpha_{1.0} > \alpha_{1.5}) = 0.91\), \(P(\alpha_{0.0} > \alpha_{1.5}) = 0.98\)). The largest marginal drop is between \(T = 1.0\) and \(T = 1.5\). Whether this reflects a genuine threshold effect or simply the tail of a smooth nonlinear relationship cannot be determined from five temperature points alone. This high-temperature regime (\(T > 1.0\)) is unavailable for Claude due to API constraints, limiting cross-LLM comparability in this range.

  3. Good model fit. Posterior predictive checks indicate adequate calibration at all temperature levels.

  4. Pooled ambiguity tiers. This analysis pools across the three ambiguity tiers of the gamble set. The report therefore addresses overall SEU sensitivity on Ellsberg-style stimuli, not ambiguity aversion or tier-specific processing (see Limitations).

0.8.2 Interpreting α in Behavioral Terms

The \(\alpha\) parameter controls the concentration of the softmax choice rule: higher \(\alpha\) implies sharper concentration of choice probability on the SEU-maximizing alternative. At \(\alpha \approx 110\) (\(T = 0.0\)), the model assigns overwhelming probability to the EU-maximizing gamble in most decision problems — choice behavior is nearly deterministic. At \(\alpha \approx 52\) (\(T = 1.5\)), the concentration is still well above chance (which would correspond to \(\alpha \to 0\)), but the model allocates meaningfully more probability mass to suboptimal alternatives. The 2:1 decline thus represents a shift from near-deterministic SEU maximization to a regime where random or heuristic choice patterns are more frequent, though still well above indifference.

0.8.3 Comparability of α Across Factorial Cells

A recurring issue in cross-cell comparison is the non-comparability of absolute \(\alpha\) values across cells that use different model variants. The insurance cells (\(K = 3\)) use m_01 with a Lognormal(3.0, 0.75) prior, while the Ellsberg cells (\(K = 4\)) use m_02 with a Lognormal(3.5, 0.75) prior. The \(\alpha\) parameter interacts with the dimension of the consequence space and the utility geometry in ways that make raw magnitudes non-portable: a value of \(\alpha = 50\) implies different choice probabilities under \(K = 3\) vs. \(K = 4\) because the softmax operates over different-sized simplex spaces. Consequently, comparisons across the task dimension of the factorial should focus on directional quantities — the sign of the slope, \(P(\text{slope} < 0)\), and strict monotonicity probabilities — rather than absolute \(\alpha\) levels or slope magnitudes. Within-column comparisons (same task, different LLMs) do not have this limitation, since both cells share the same model and prior.

0.8.4 Implications for the Factorial Design

This cell shows that GPT-4o’s temperature–sensitivity relationship replicates qualitatively in the Ellsberg task: both GPT-4o cells (insurance and Ellsberg) show clear negative global slopes, while both Claude cells (insurance and Ellsberg) show weak or absent effects. Taken together, these four exploratory cells support the interpretation that:

  • LLM identity is the leading qualitative differentiator in the temperature–\(\alpha\) relationship across these four cells. The within-cell evidence is strong for GPT-4o on both tasks.
  • Task is a secondary factor. Changing the task (\(K = 3\) to \(K = 4\)) does not eliminate or reverse the effect for GPT-4o, and does not create the effect for Claude.

This four-cell comparison is exploratory rather than confirmatory: the cells were fitted independently, GPT-4o and Claude were sampled on different temperature grids and provider-specific temperature scales, and the design supports directional rather than magnitude comparisons (see also Section 0.8.3). The between-LLM probabilities reported in the synthesis are \(\approx 0.80\)\(0.82\), which is moderate rather than decisive.

Note that this report analyses each factorial cell independently; it does not formally model the LLM × Task interaction. The 2×2 Factorial Synthesis report quantifies the LLM main effect, task main effect, and their interaction by jointly analysing the posterior draws from all four cells, including a difference-in-differences analysis that addresses the interaction directly.

0.8.5 Connection to the Ambiguity Aversion Literature

The Ellsberg gamble paradigm was originally designed to reveal violations of subjective expected utility theory — specifically, ambiguity aversion (Ellsberg 1961; Camerer and Weber 1992). In the human decision-making literature, the canonical finding is that agents prefer gambles with known probabilities over gambles with equivalent expected values but unknown probabilities (Trautmann and Kuilen 2015). Our gamble set spans three ambiguity tiers precisely to allow for such effects.

In this study, however, we use the Ellsberg gambles primarily as a task domain for the factorial design — a set of \(K = 4\) decision problems that differs structurally from the insurance triage task. The \(\alpha\) parameter captures overall sensitivity to expected utility maximization, not ambiguity-specific processing. Whether GPT-4o exhibits ambiguity aversion per se — i.e., whether its subjective probability estimates (\(\psi\)) or choice patterns systematically shift as a function of gamble ambiguity tier — is an important question that lies beyond the scope of the current temperature-focused analysis. We note this as a natural direction for future work: examining whether the temperature effect on \(\alpha\) is modulated by ambiguity level (e.g., whether high-ambiguity gambles show a larger temperature-induced degradation in choice quality) would connect this line of work more directly to the behavioral decision theory literature.

0.8.6 Embedding Model and Cross-LLM Comparability

Both GPT-4o cells use OpenAI’s text-embedding-3-small for feature construction, and both Claude cells also use text-embedding-3-small (following the same pipeline). The embedding model is therefore held constant across all four factorial cells, preventing the embedding representation from confounding the LLM main effect. Any differences in the PCA features across cells arise from differences in the LLM’s natural-language assessments (which vary with both LLM identity and temperature), not from the embedding step.

0.8.7 Relation to the Broader Literature

The finding that LLM identity is the leading qualitative differentiator in the temperature–sensitivity relationship across these four exploratory cells contributes to a growing literature examining LLM decision-making and rationality. Recent work has investigated how LLMs perform on classic decision-theoretic tasks (Binz and Schulz 2023), how API parameters affect output quality (Renze and Guven 2024), and whether LLMs exhibit systematic biases analogous to those documented in humans (Hagendorff et al. 2023). Our results suggest that the mapping from temperature to choice quality is not a universal property of LLM decoding but depends on model-specific characteristics — potentially reflecting differences in training, architecture, or the relationship between the temperature parameter and the model’s internal representations.

0.8.8 Limitations

  1. No prior sensitivity analysis. The m_02 prior on \(\alpha\) is Lognormal(3.5, 0.75), with prior 90% CI \(\approx [10, 124]\). The posterior at \(T = 0.0\) (\(\alpha \approx 110\)) falls near the upper tail of this prior. While \(M \approx 300\) observations provide substantial information, and the parameter recovery analysis confirms good identifiability, we have not verified that the slope sign and \(P(\text{slope} < 0)\) are robust to alternative prior specifications (e.g., a wider Lognormal(3.5, 1.0) or a shifted Lognormal(4.0, 0.75)). This is a standard robustness check for Bayesian analyses (Kruschke 2013) and is planned as a supplementary analysis.

  2. Temperature range mismatch. The GPT-4o cells sample temperatures up to \(T = 1.5\), while Claude is limited to \(T = 1.0\). The restricted-range analysis in Section 0.7 confirms that the negative slope holds over \(T \in [0.0, 1.0]\), but slope magnitudes remain difficult to compare directly given the non-identical temperature grids.

  3. No within-study ambiguity analysis. The gamble set spans three ambiguity tiers, but the current analysis does not examine whether \(\alpha\) or choice patterns vary by tier. This limits the study’s contribution to the ambiguity aversion literature specifically.

0.9 Reproducibility

0.9.1 Data Snapshot

File Description
alpha_draws_T*.npz Posterior draws of α (4,000 per condition)
ppc_T*.json Posterior predictive check results
diagnostics_T*.txt CmdStan diagnostic output
stan_data_T*.json Stan-ready data (for refitting)
fit_summary.json Summary statistics across conditions
primary_analysis.json Pre-computed monotonicity and slope statistics
run_summary.json Pipeline metadata and configuration
study_config.yaml Frozen copy of the study configuration

0.9.2 Refitting from Source

Show code
# Uncomment to refit from source data (requires CmdStanPy)
#
# import cmdstanpy
# model = cmdstanpy.CmdStanModel(stan_file="models/m_02.stan")
#
# for t in [0.0, 0.3, 0.7, 1.0, 1.5]:
#     key = f"T{str(t).replace('.', '_')}"
#     fit = model.sample(
#         data=f"data/stan_data_{key}.json",
#         chains=4,
#         iter_warmup=1000,
#         iter_sampling=1000,
#         seed=42,
#     )
#     print(f"T={t}: alpha median = {np.median(fit.stan_variable('alpha')):.1f}")

0.10 References

Binz, Marcel, and Eric Schulz. 2023. “Using Cognitive Psychology to Understand GPT-3.” Proceedings of the National Academy of Sciences 120 (6): e2218523120.
Camerer, Colin, and Martin Weber. 1992. “Recent Developments in Modeling Preferences: Uncertainty and Ambiguity.” Journal of Risk and Uncertainty 5 (4): 325–70.
Ellsberg, Daniel. 1961. “Risk, Ambiguity, and the Savage Axioms.” The Quarterly Journal of Economics 75 (4): 643–69.
Hagendorff, Thilo, Sarah Fabi, and Michal Kosinski. 2023. “Human-Like Intuitive Behavior and Reasoning Biases Emerged in Large Language Models but Disappeared in ChatGPT.” Nature Computational Science 3 (10): 833–38.
Kruschke, John K. 2013. Bayesian Estimation Supersedes the t Test.” Journal of Experimental Psychology: General 142 (2): 573–603.
Renze, Matthew, and Erhan Guven. 2024. “The Effect of Sampling Temperature on Problem Solving in Large Language Models.” arXiv Preprint arXiv:2402.05201.
Trautmann, Stefan T., and Gijs van de Kuilen. 2015. “Ambiguity Attitudes.” In The Wiley Blackwell Handbook of Judgment and Decision Making, edited by Gideon Keren and George Wu. Wiley-Blackwell.

Reuse

Citation

BibTeX citation:
@online{helzner2026,
  author = {Helzner, Jeff},
  title = {Temperature and {SEU} {Sensitivity:} {GPT-4o} × {Ellsberg}
    {Study}},
  date = {2026-06-27},
  url = {https://jeffhelzner.github.io/seu-sensitivity/applications/gpt4o_ellsberg_study/01_gpt4o_ellsberg_study.html},
  langid = {en}
}
For attribution, please cite this work as:
Helzner, Jeff. 2026. “Temperature and SEU Sensitivity: GPT-4o × Ellsberg Study.” SEU Sensitivity Project, June 27. https://jeffhelzner.github.io/seu-sensitivity/applications/gpt4o_ellsberg_study/01_gpt4o_ellsberg_study.html.