---
title: "2×2 Factorial Synthesis: LLM × Task"
subtitle: "Cross-Study Analysis of Temperature–Sensitivity Effects"
description: |
A synthesis report for the 2×2 factorial design crossing LLM
(GPT-4o vs Claude 3.5 Sonnet) with Task (Insurance triage K=3
vs Ellsberg gambles K=4). Isolates the main effects of LLM and
task on the temperature–α relationship.
categories: [applications, temperature, factorial, synthesis]
execute:
cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
reports_root = os.path.normpath(os.path.join(os.getcwd(), '..', '..'))
project_root = os.path.dirname(reports_root)
sys.path.insert(0, reports_root)
sys.path.insert(0, project_root)
import numpy as np
import json
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import pandas as pd
from scipy.stats import gaussian_kde
from report_utils import set_seu_style, SEU_COLORS, SEU_PALETTE
set_seu_style()
from pathlib import Path
```
```{python}
#| label: load-all-cells
#| include: false
# --- Cell definitions ---
CELLS = {
'(1,1)': {
'label': 'GPT-4o × Insurance',
'llm': 'GPT-4o', 'task': 'Insurance',
'K': 3, 'model': 'm_01',
'temps': [0.0, 0.3, 0.7, 1.0, 1.5],
'data_dir': Path('..') / 'temperature_study' / 'data',
},
'(1,2)': {
'label': 'GPT-4o × Ellsberg',
'llm': 'GPT-4o', 'task': 'Ellsberg',
'K': 4, 'model': 'm_02',
'temps': [0.0, 0.3, 0.7, 1.0, 1.5],
'data_dir': Path('..') / 'gpt4o_ellsberg_study' / 'data',
},
'(2,1)': {
'label': 'Claude × Insurance',
'llm': 'Claude', 'task': 'Insurance',
'K': 3, 'model': 'm_01',
'temps': [0.0, 0.2, 0.5, 0.8, 1.0],
'data_dir': Path('..') / 'claude_insurance_study' / 'data',
},
'(2,2)': {
'label': 'Claude × Ellsberg',
'llm': 'Claude', 'task': 'Ellsberg',
'K': 4, 'model': 'm_02',
'temps': [0.0, 0.2, 0.5, 0.8, 1.0],
'data_dir': Path('..') / 'ellsberg_study' / 'data',
},
}
# --- Load data for each cell ---
for cell_id, cell in CELLS.items():
ddir = cell['data_dir']
# Primary analysis
with open(ddir / 'primary_analysis.json') as f:
cell['analysis'] = json.load(f)
# Alpha draws per temperature
cell['alpha_draws'] = {}
for t in cell['temps']:
key = f"T{str(t).replace('.', '_')}"
data = np.load(ddir / f'alpha_draws_{key}.npz')
cell['alpha_draws'][t] = data['alpha']
# Compute slope draws from posterior
temp_arr = np.array(cell['temps'])
n_draws = len(cell['alpha_draws'][cell['temps'][0]])
slope_draws = np.empty(n_draws)
for i in range(n_draws):
alphas = np.array([cell['alpha_draws'][t][i] for t in cell['temps']])
        slope_draws[i] = np.polyfit(temp_arr, alphas, 1)[0]  # OLS slope of alpha on T for this draw
cell['slope_draws'] = slope_draws
# Normalise slope fields (initial study uses 'slope' instead of 'median')
slope_info = cell['analysis']['slope']
cell['slope_median'] = slope_info.get('median', slope_info.get('slope'))
cell['p_negative'] = slope_info.get('p_negative', float(np.mean(slope_draws < 0)))
cell['mono_prob'] = cell['analysis']['monotonicity_prob']
print("All four cells loaded successfully.")
for cid, c in CELLS.items():
print(f" {cid} {c['label']}: {len(c['alpha_draws'][c['temps'][0]]):,} draws × {len(c['temps'])} temps")
```
## Introduction {#sec-introduction}
The initial temperature study found a clear monotonic negative relationship between LLM sampling temperature and estimated SEU sensitivity $\alpha$, using GPT-4o on insurance claims triage ($K = 3$). When both the LLM and task were changed simultaneously — to Claude 3.5 Sonnet on Ellsberg gambles ($K = 4$) — the relationship was not replicated. Because those two changes were confounded, we could not determine whether the non-replication was driven by the LLM, the task, or their interaction.
This report presents the results of a **$2 \times 2$ factorial design** that disentangles the contributions of each factor by running the two missing cells:
| | Insurance ($K = 3$) | Ellsberg ($K = 4$) |
|---|---|---|
| **GPT-4o** | Initial study | New: GPT-4o × Ellsberg |
| **Claude 3.5 Sonnet** | New: Claude × Insurance | Ellsberg study |
::: {.callout-important}
## Preview of Key Finding
The **LLM factor accounts for most of the qualitative variation** in temperature–sensitivity patterns. GPT-4o shows a clear negative temperature–$\alpha$ relationship on *both* tasks (within-cell $P(\text{slope} < 0) > 0.98$), while Claude 3.5 Sonnet shows weak or absent effects on *both* tasks. The task domain plays a secondary role — Ellsberg gambles may amplify the effect for GPT-4o but do not create it for Claude. The between-LLM comparison is directionally clear but quantitatively weaker (between-cell $P(\text{GPT slope} < \text{Claude slope}) \approx 0.80\text{–}0.82$); see @sec-discussion for calibrated claims and the independent-fits caveat.
:::
## Design Summary {#sec-design}
### Factorial Structure
```{python}
#| label: tbl-factorial-design
#| tbl-cap: "The 2×2 factorial design. Each cell is a separate study with its own data collection, model fit, and analysis."
design = pd.DataFrame({
'Cell': ['(1,1)', '(1,2)', '(2,1)', '(2,2)'],
'LLM': ['GPT-4o', 'GPT-4o', 'Claude 3.5 Sonnet', 'Claude 3.5 Sonnet'],
'Task': ['Insurance triage', 'Ellsberg gambles', 'Insurance triage', 'Ellsberg gambles'],
'K': [3, 4, 3, 4],
'Stan Model': ['m_01', 'm_02', 'm_01', 'm_02'],
'Temperatures': [
'{0.0, 0.3, 0.7, 1.0, 1.5}',
'{0.0, 0.3, 0.7, 1.0, 1.5}',
'{0.0, 0.2, 0.5, 0.8, 1.0}',
'{0.0, 0.2, 0.5, 0.8, 1.0}',
],
'Problems': ['100 × 3', '100 × 3', '100 × 3', '100 × 3'],
})
design
```
::: {.callout-note}
## Temperature Scales Are Not Comparable Across Providers
The GPT-4o cells use temperatures in $[0.0, 1.5]$ (OpenAI range), while Claude cells use $[0.0, 1.0]$ (Anthropic range). The same numerical temperature (e.g., $T = 0.7$) produces different effective randomness levels in different LLMs. Comparisons across LLMs therefore focus on the **qualitative pattern** (monotonic decline vs. flat / non-monotonic) rather than quantitative slope magnitudes.
:::
## Hypotheses and Design Chronology {#sec-hypotheses}
The factorial synthesis tests three predictions about the temperature–$\alpha$ relationship:
1. **LLM main effect (H1):** The probability of a negative temperature–$\alpha$ slope will be higher for GPT-4o than for Claude within both tasks. That is, P(slope < 0) for GPT-4o cells will exceed the corresponding values for Claude cells.
2. **Task secondary effect (H2):** The task effect (Insurance vs. Ellsberg) will be smaller than the LLM effect. Within each LLM, the qualitative pattern will be similar across tasks.
3. **Minimal interaction (H3):** The LLM and task effects will be approximately additive — i.e., the difference-in-differences of slopes will be near zero.
::: {.callout-note}
## Design Chronology
This factorial design was not pre-registered. The initial study (GPT-4o × Insurance) and the Ellsberg study (Claude × Ellsberg) were conducted first. When the replication failed, the confound between LLM and task was identified, and the two missing cells (GPT-4o × Ellsberg, Claude × Insurance) were run *reactively* to disentangle the factors. The factorial framing was thus imposed post-hoc, and the analysis should be understood as **exploratory** rather than confirmatory. Nevertheless, the design logic is sound: the four cells provide the minimal structure needed to decompose the original confound into main effects and an interaction.
:::
## Methods {#sec-methods}
### Analytical Approach
This synthesis loads pre-computed posterior draws from four independently fitted Bayesian models — one per factorial cell. Each cell was modelled using task-appropriate Stan models: `m_01` (softmax with $K = 3$ alternatives) for insurance cells and `m_02` ($K = 4$) for Ellsberg cells. The α parameter was estimated separately at each temperature level within each cell.
**Slope computation.** The temperature–$\alpha$ slope for each posterior draw is computed by ordinary least-squares regression of the five α values on the temperature grid. This is a *derived summary* computed from independent per-temperature posteriors, not a parameter estimated within the Bayesian model. The slope captures the global linear trend but cannot distinguish between linear and non-linear temperature–α relationships. Its uncertainty reflects posterior uncertainty in the per-temperature α estimates but not model uncertainty about the functional form of the temperature–α relationship.
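In symbols (notation introduced here, not taken from the cell reports): if $\alpha_j^{(s)}$ is posterior draw $s$ of $\alpha$ at temperature $T_j$, the per-draw slope is the least-squares estimate over the temperature grid,

$$
\hat{\beta}^{(s)} = \frac{\sum_{j}\bigl(T_j - \bar{T}\bigr)\bigl(\alpha_j^{(s)} - \bar{\alpha}^{(s)}\bigr)}{\sum_{j}\bigl(T_j - \bar{T}\bigr)^{2}},
$$

computed independently for each draw, so the collection $\{\hat{\beta}^{(s)}\}$ forms the slope posterior summarised in the results below.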
**Main effects and interaction.** The LLM main effect is assessed by comparing slope draws between GPT-4o and Claude cells within each task. The task main effect is assessed analogously. The interaction is quantified as the difference-in-differences of slopes: (GPT-Ellsberg − GPT-Insurance) − (Claude-Ellsberg − Claude-Insurance). Because the four cells were fitted independently with no shared parameters, between-cell comparisons combine two sources of posterior uncertainty and are inherently wider than within-cell contrasts.
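With $\beta$ denoting a cell's temperature–$\alpha$ slope and subscripts indicating the cell (shorthand introduced here), the three contrasts are

$$
\begin{aligned}
\text{LLM effect (within task } t\text{):}\quad & \beta_{\text{GPT},\,t} - \beta_{\text{Claude},\,t},\\
\text{Task effect (within LLM } \ell\text{):}\quad & \beta_{\ell,\,\text{Ellsberg}} - \beta_{\ell,\,\text{Insurance}},\\
\text{Interaction:}\quad & \bigl(\beta_{\text{GPT},\text{Ellsberg}} - \beta_{\text{GPT},\text{Insurance}}\bigr) - \bigl(\beta_{\text{Claude},\text{Ellsberg}} - \beta_{\text{Claude},\text{Insurance}}\bigr).
\end{aligned}
$$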
**Limitations of the independent-fits approach.** A more statistically coherent analysis would fit a single hierarchical model with LLM, task, and temperature as factors. The current approach was chosen to maintain consistency with the individual cell reports and because the foundational validation applies to the within-cell models. However, independent fits cannot share information across cells and may underestimate the precision of between-cell contrasts. The cross-cell comparisons should therefore be understood as exploratory summaries rather than formal inferential conclusions.
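For concreteness, the between-cell probabilities reported later are simple Monte Carlo proportions over paired posterior draws, which is valid precisely because the cells were fitted independently. A minimal display-only sketch (the helper name is ours, not part of the analysis code):

```python
import numpy as np

def p_slope_less(slopes_a: np.ndarray, slopes_b: np.ndarray) -> float:
    """Monte Carlo estimate of P(slope_a < slope_b) for independent posteriors.

    Pairing draws by index is valid here because the two slope posteriors come
    from independently fitted models, so any pairing of draws samples the
    product of the two marginals.
    """
    n = min(len(slopes_a), len(slopes_b))
    return float(np.mean(slopes_a[:n] < slopes_b[:n]))

# e.g. p_slope_less(CELLS['(1,1)']['slope_draws'], CELLS['(2,1)']['slope_draws'])
```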
::: {.callout-note}
## Prior Differences Across Tasks
The insurance cells use `m_01` with prior $\alpha \sim \text{Lognormal}(3.0, 0.75)$, while the Ellsberg cells use `m_02` with prior $\alpha \sim \text{Lognormal}(3.5, 0.75)$. This difference reflects the prior predictive calibration for $K = 3$ vs. $K = 4$ settings and is methodologically appropriate for within-task comparisons. However, it means that cross-task comparisons of α *levels* may partly reflect prior differences rather than data differences. The slope analysis (within-task changes across temperature) is less affected, since the prior is constant within each task. Readers should interpret cross-task level differences cautiously.
:::
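To make the prior difference concrete, the display-only sketch below computes the prior medians and central 90% intervals implied by the two Lognormal priors quoted in the callout; these values are properties of the priors themselves, not of any fitted model.

```python
import numpy as np
from scipy import stats

# Alpha priors used by the two Stan models (mu on the log scale; shared sigma)
priors = {'m_01 (Insurance, K=3)': 3.0, 'm_02 (Ellsberg, K=4)': 3.5}
sigma = 0.75
for label, mu in priors.items():
    dist = stats.lognorm(s=sigma, scale=np.exp(mu))
    lo, med, hi = dist.ppf([0.05, 0.5, 0.95])
    print(f"{label}: prior median ≈ {med:.1f}, 90% interval ≈ [{lo:.1f}, {hi:.1f}]")
```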
## Results Matrix {#sec-results}
### Forest Plots
```{python}
#| label: fig-forest-2x2
#| fig-cap: "Forest plots of posterior α distributions for all four cells of the factorial design. Each panel shows the five temperature conditions, with point estimates (medians), 50% credible intervals (thick bars), and 90% credible intervals (thin bars). The GPT-4o row (top) shows clear leftward shifts at higher temperatures; the Claude row (bottom) does not."
#| fig-height: 10
#| fig-width: 14
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
cell_order = [('(1,1)', 0, 0), ('(1,2)', 0, 1), ('(2,1)', 1, 0), ('(2,2)', 1, 1)]
for cell_id, row, col in cell_order:
cell = CELLS[cell_id]
ax = axes[row, col]
temps = cell['temps']
y_positions = np.arange(len(temps))[::-1]
for i, t in enumerate(temps):
draws = cell['alpha_draws'][t]
median = np.median(draws)
q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])
y = y_positions[i]
ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
markeredgecolor='white', markeredgewidth=1.5, zorder=5)
ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temps])
ax.set_xlabel('Sensitivity (α)')
ax.set_title(f'{cell["label"]}', fontsize=13, fontweight='bold')
ax.grid(axis='x', alpha=0.3)
ax.grid(axis='y', alpha=0)
# Share x-axis limits within rows (same LLM)
for row_idx in range(2):
xmin = min(axes[row_idx, c].get_xlim()[0] for c in range(2))
xmax = max(axes[row_idx, c].get_xlim()[1] for c in range(2))
for c in range(2):
axes[row_idx, c].set_xlim(xmin, xmax)
fig.suptitle('Posterior α by Temperature: 2×2 Factorial', fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
```
### Posterior Density Overlays
```{python}
#| label: fig-density-2x2
#| fig-cap: "Kernel density estimates of posterior α for all four cells. Top row: GPT-4o — densities separate clearly, with higher temperatures (warmer colours) shifting left. Bottom row: Claude — densities overlap heavily, with no consistent ordering."
#| fig-height: 10
#| fig-width: 14
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
for cell_id, row, col in cell_order:
cell = CELLS[cell_id]
ax = axes[row, col]
for i, t in enumerate(cell['temps']):
draws = cell['alpha_draws'][t]
kde = gaussian_kde(draws)
x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
label=f'T = {t}')
ax.set_xlabel('Sensitivity (α)')
ax.set_ylabel('Density')
ax.set_title(f'{cell["label"]}', fontsize=13, fontweight='bold')
ax.legend(loc='upper right', fontsize=9)
plt.tight_layout()
plt.show()
```
## Monotonicity Summary {#sec-monotonicity}
```{python}
#| label: tbl-monotonicity
#| tbl-cap: "Summary of temperature–sensitivity relationship across all four cells. P(slope < 0) near 1 indicates strong evidence for a negative relationship; P(strict mono) is the probability that α is strictly decreasing at every consecutive temperature step."
rows = []
for cell_id in ['(1,1)', '(1,2)', '(2,1)', '(2,2)']:
cell = CELLS[cell_id]
s = cell['slope_draws']
q05, q95 = np.percentile(s, [5, 95])
rows.append({
'Cell': cell_id,
'Study': cell['label'],
'Slope median': f"{np.median(s):.1f}",
'Slope 90% CI': f"[{q05:.1f}, {q95:.1f}]",
'P(slope < 0)': f"{cell['p_negative']:.3f}",
'P(strict mono ↓)': f"{cell['mono_prob']:.4f}",
'Pattern': 'Declining' if cell['p_negative'] > 0.9 else
('Weak decline' if cell['p_negative'] > 0.7 else 'Flat / non-monotonic'),
})
pd.DataFrame(rows)
```
The "Pattern" labels in @tbl-monotonicity use the following descriptive scheme: "Declining" for P(slope < 0) > 0.9, "Weak decline" for P(slope < 0) > 0.7, and "Flat / non-monotonic" otherwise. These thresholds are intended as interpretive aids, not formal inferential cutoffs. Readers should attend to the continuous P(slope < 0) values rather than the categorical labels.
Note that the strict monotonicity probabilities — P(strict mono ↓) — are remarkably low even for GPT-4o (0.12 and 0.09), meaning that in roughly 90% of posterior draws, at least one adjacent temperature pair shows a local reversal. The global slope is clearly negative for GPT-4o, but the trajectory is a *noisy decline* rather than a smooth monotonic function. This is consistent with non-monotonic local variation around a global negative trend, and suggests that the temperature–α relationship, while real, is not a simple step-wise degradation. For Claude, the near-zero strict monotonicity probabilities are expected given the absence of a global trend.
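The strict-monotonicity probability is a simple function of the per-temperature draws. A display-only sketch of the computation (the cell reports contain the actual implementation; this is illustrative only):

```python
import numpy as np

def strict_mono_prob(alpha_draws: dict, temps: list) -> float:
    """P(alpha strictly decreases at every consecutive temperature step)."""
    # Stack draws into an (n_draws, n_temps) matrix ordered by temperature,
    # then check that every adjacent difference is negative within each draw.
    mat = np.column_stack([alpha_draws[t] for t in sorted(temps)])
    return float(np.mean(np.all(np.diff(mat, axis=1) < 0, axis=1)))

# e.g. strict_mono_prob(CELLS['(1,1)']['alpha_draws'], CELLS['(1,1)']['temps'])
```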
```{python}
#| label: fig-slope-comparison
#| fig-cap: "Posterior distributions of the global slope Δα/ΔT for all four cells. GPT-4o cells (blue, orange) are concentrated below zero; Claude cells (green, red) straddle zero."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(10, 5))
cell_colors = {
'(1,1)': SEU_PALETTE[0],
'(1,2)': SEU_PALETTE[1],
'(2,1)': SEU_PALETTE[2],
'(2,2)': SEU_PALETTE[3],
}
for cell_id in ['(1,1)', '(1,2)', '(2,1)', '(2,2)']:
cell = CELLS[cell_id]
s = cell['slope_draws']
kde = gaussian_kde(s)
x_grid = np.linspace(np.percentile(s, 0.5), np.percentile(s, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.15, color=cell_colors[cell_id])
ax.plot(x_grid, kde(x_grid), color=cell_colors[cell_id], linewidth=2,
label=f'{cell["label"]} (med={np.median(s):.0f})')
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5, linewidth=1.5)
ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Posterior Slope Distributions: All Four Cells')
ax.legend(loc='upper left', fontsize=10)
plt.tight_layout()
plt.show()
```
## Main Effects Analysis {#sec-main-effects}
### LLM Main Effect {#sec-llm-effect}
The **LLM main effect** asks: holding task constant, does switching from GPT-4o to Claude change the temperature–$\alpha$ relationship?
```{python}
#| label: fig-llm-effect
#| fig-cap: "LLM main effect. Left: Insurance task — GPT-4o shows clear decline, Claude is flat. Right: Ellsberg task — GPT-4o shows clear decline, Claude shows a weak trend. The LLM effect is consistent across both tasks."
#| fig-height: 5
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
llm_colors = {'GPT-4o': SEU_COLORS['primary'], 'Claude': SEU_COLORS['accent']}
# --- Left: Insurance task ---
ax = axes[0]
for cell_id in ['(1,1)', '(2,1)']:
cell = CELLS[cell_id]
medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]
color = llm_colors[cell['llm']]
marker = 'o' if cell['llm'] == 'GPT-4o' else 's'
ax.errorbar(cell['temps'], medians,
yerr=[np.array(medians) - np.array(q05s),
np.array(q95s) - np.array(medians)],
fmt=f'{marker}-', color=color, linewidth=2, markersize=8,
capsize=5, capthick=1.5, label=f"{cell['llm']} (P(−)={cell['p_negative']:.2f})")
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Insurance Task (K=3)\nLLM Comparison', fontsize=12)
ax.legend()
# --- Right: Ellsberg task ---
ax = axes[1]
for cell_id in ['(1,2)', '(2,2)']:
cell = CELLS[cell_id]
medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]
color = llm_colors[cell['llm']]
marker = 'o' if cell['llm'] == 'GPT-4o' else 's'
ax.errorbar(cell['temps'], medians,
yerr=[np.array(medians) - np.array(q05s),
np.array(q95s) - np.array(medians)],
fmt=f'{marker}-', color=color, linewidth=2, markersize=8,
capsize=5, capthick=1.5, label=f"{cell['llm']} (P(−)={cell['p_negative']:.2f})")
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Ellsberg Task (K=4)\nLLM Comparison', fontsize=12)
ax.legend()
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-llm-effect
#| tbl-cap: "LLM main effect: GPT-4o vs Claude within each task."
llm_rows = []
for task, pairs in [('Insurance', [('(1,1)', '(2,1)')]), ('Ellsberg', [('(1,2)', '(2,2)')])]:
gpt_id, claude_id = pairs[0]
gpt = CELLS[gpt_id]
claude = CELLS[claude_id]
# P(GPT-4o slope more negative than Claude slope)
p_gpt_more_neg = np.mean(gpt['slope_draws'] < claude['slope_draws'])
llm_rows.append({
'Task': task,
'GPT-4o slope (med)': f"{np.median(gpt['slope_draws']):.1f}",
'GPT-4o P(−)': f"{gpt['p_negative']:.3f}",
'Claude slope (med)': f"{np.median(claude['slope_draws']):.1f}",
'Claude P(−)': f"{claude['p_negative']:.3f}",
'P(GPT slope < Claude slope)': f"{p_gpt_more_neg:.3f}",
})
pd.DataFrame(llm_rows)
```
Within both tasks, GPT-4o's median slope is more negative than Claude's, and the posterior probability of that ordering is moderately high (see P(GPT slope < Claude slope) in @tbl-llm-effect), indicating a **consistent LLM main effect** in qualitative terms. However, these between-LLM probabilities (~0.80–0.82) are notably weaker than the within-LLM evidence for GPT-4o's negative slope (P > 0.98). The between-cell comparison carries additional uncertainty because the four cells were fitted independently with no shared parameters.
### Task Main Effect {#sec-task-effect}
The **task main effect** asks: holding LLM constant, does switching from insurance to Ellsberg change the temperature–$\alpha$ relationship?
```{python}
#| label: fig-task-effect
#| fig-cap: "Task main effect. Left: GPT-4o — both tasks show declining α, though the Ellsberg task shows a steeper decline. Right: Claude — neither task shows a convincing decline, though Ellsberg has a slightly more negative slope."
#| fig-height: 5
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
task_colors = {'Insurance': SEU_COLORS['primary'], 'Ellsberg': SEU_COLORS['accent']}
# --- Left: GPT-4o ---
ax = axes[0]
for cell_id in ['(1,1)', '(1,2)']:
cell = CELLS[cell_id]
medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]
color = task_colors[cell['task']]
marker = 'o' if cell['task'] == 'Insurance' else 's'
ax.errorbar(cell['temps'], medians,
yerr=[np.array(medians) - np.array(q05s),
np.array(q95s) - np.array(medians)],
fmt=f'{marker}-', color=color, linewidth=2, markersize=8,
capsize=5, capthick=1.5,
label=f"{cell['task']} K={cell['K']} (P(−)={cell['p_negative']:.2f})")
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('GPT-4o\nTask Comparison', fontsize=12)
ax.legend()
# --- Right: Claude ---
ax = axes[1]
for cell_id in ['(2,1)', '(2,2)']:
cell = CELLS[cell_id]
medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]
color = task_colors[cell['task']]
marker = 'o' if cell['task'] == 'Insurance' else 's'
ax.errorbar(cell['temps'], medians,
yerr=[np.array(medians) - np.array(q05s),
np.array(q95s) - np.array(medians)],
fmt=f'{marker}-', color=color, linewidth=2, markersize=8,
capsize=5, capthick=1.5,
label=f"{cell['task']} K={cell['K']} (P(−)={cell['p_negative']:.2f})")
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Claude 3.5 Sonnet\nTask Comparison', fontsize=12)
ax.legend()
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-task-effect
#| tbl-cap: "Task main effect: Insurance vs Ellsberg within each LLM."
task_rows = []
for llm, pairs in [('GPT-4o', [('(1,1)', '(1,2)')]), ('Claude', [('(2,1)', '(2,2)')])]:
ins_id, ells_id = pairs[0]
ins = CELLS[ins_id]
ells = CELLS[ells_id]
# P(Ellsberg slope more negative than Insurance slope)
p_ells_more_neg = np.mean(ells['slope_draws'] < ins['slope_draws'])
task_rows.append({
'LLM': llm,
'Insurance slope (med)': f"{np.median(ins['slope_draws']):.1f}",
'Insurance P(−)': f"{ins['p_negative']:.3f}",
'Ellsberg slope (med)': f"{np.median(ells['slope_draws']):.1f}",
'Ellsberg P(−)': f"{ells['p_negative']:.3f}",
'P(Ellsberg slope < Insurance slope)': f"{p_ells_more_neg:.3f}",
})
pd.DataFrame(task_rows)
```
The task effect is **weaker and less consistent** than the LLM effect. For GPT-4o, Ellsberg gambles may produce a somewhat steeper slope than insurance, but both tasks show clear negative trends. For Claude, neither task produces a convincing slope.
### Interaction {#sec-interaction}
Is the temperature–$\alpha$ relationship **specific to a particular LLM–task combination**, or is it decomposable into additive main effects?
```{python}
#| label: fig-interaction-slopes
#| fig-cap: "Primary interaction plot in terms of slope medians. The GPT-4o line (both cells negative) is well-separated from the near-zero Claude line. The roughly parallel pattern is consistent with an additive structure, though the data cannot rule out moderate interactions (see text)."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(8, 5))
tasks = ['Insurance', 'Ellsberg']
gpt_slopes = [np.median(CELLS['(1,1)']['slope_draws']),
np.median(CELLS['(1,2)']['slope_draws'])]
claude_slopes = [np.median(CELLS['(2,1)']['slope_draws']),
np.median(CELLS['(2,2)']['slope_draws'])]
ax.plot(tasks, gpt_slopes, 'o-', color=SEU_COLORS['primary'], linewidth=2.5,
markersize=10, label='GPT-4o')
ax.plot(tasks, claude_slopes, 's-', color=SEU_COLORS['accent'], linewidth=2.5,
markersize=10, label='Claude')
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax.set_ylabel('Slope median (Δα / ΔT)')
ax.set_title('Interaction Plot: Slope Magnitude')
ax.legend(fontsize=12)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
```
```{python}
#| label: fig-interaction
#| fig-cap: "Supplementary interaction plot using P(slope < 0) as the dependent variable. Note that P(slope < 0) is a tail probability — a non-linear transformation of the slope distribution — so parallel lines here do not strictly imply additive effects on the slope scale. See @fig-interaction-slopes for the primary interaction analysis on the linear slope scale."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(8, 5))
tasks = ['Insurance', 'Ellsberg']
gpt_p_neg = [CELLS['(1,1)']['p_negative'], CELLS['(1,2)']['p_negative']]
claude_p_neg = [CELLS['(2,1)']['p_negative'], CELLS['(2,2)']['p_negative']]
ax.plot(tasks, gpt_p_neg, 'o-', color=SEU_COLORS['primary'], linewidth=2.5,
markersize=10, label='GPT-4o')
ax.plot(tasks, claude_p_neg, 's-', color=SEU_COLORS['accent'], linewidth=2.5,
markersize=10, label='Claude')
ax.set_ylabel('P(slope < 0)')
ax.set_title('Interaction Plot: LLM × Task')
ax.legend(fontsize=12)
ax.set_ylim(0.4, 1.05)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
```
```{python}
#| label: interaction-quantitative
#| echo: true
# Quantitative interaction: difference in slope differences
# If additive: (GPT-Ells − GPT-Ins) ≈ (Claude-Ells − Claude-Ins)
# Interaction = (GPT-Ells − GPT-Ins) − (Claude-Ells − Claude-Ins)
interaction_draws = (
(CELLS['(1,2)']['slope_draws'] - CELLS['(1,1)']['slope_draws']) -
(CELLS['(2,2)']['slope_draws'] - CELLS['(2,1)']['slope_draws'])
)
print(f"Interaction (difference-in-differences of slopes):")
print(f" Median: {np.median(interaction_draws):.1f}")
print(f" 90% CI: [{np.percentile(interaction_draws, 5):.1f}, {np.percentile(interaction_draws, 95):.1f}]")
print(f" P(interaction > 0): {np.mean(interaction_draws > 0):.3f}")
print(f" P(interaction < 0): {np.mean(interaction_draws < 0):.3f}")
print(f"")
print(f"The 90% CI is extremely wide, reflecting the propagation of uncertainty")
print(f"through the difference-in-differences of four independently estimated slopes.")
print(f"The data are uninformative about the presence or magnitude of an interaction.")
```
The difference-in-differences analysis yields an interaction estimate centred near zero, but the 90% credible interval is extremely wide — spanning well over a hundred slope units. This width reflects a fundamental statistical limitation: each slope is derived from regression on five temperature points with substantial posterior uncertainty, and the interaction compounds two such differences. The data therefore **cannot distinguish between additive and non-additive structures**. The correct interpretation is not that the interaction is small, but that the study has limited power to detect it. A formal equivalence claim would require defining a region of practical equivalence (ROPE) and demonstrating that the posterior concentrates within it; the current data do not support such a claim.
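As an illustration only, a ROPE-style check on the interaction draws would look like the sketch below; the half-width of 20 slope units is an arbitrary placeholder, not a defended equivalence margin, and the resulting probability could support an equivalence claim only with a substantively justified ROPE (cf. Kruschke, 2013; Lakens, 2017).

```python
# Display-only sketch: posterior mass of the interaction inside a hypothetical
# ROPE. The half-width below is a placeholder chosen purely for illustration.
rope_halfwidth = 20.0
p_in_rope = float(np.mean(np.abs(interaction_draws) < rope_halfwidth))
print(f"P(|interaction| < {rope_halfwidth:g} slope units) = {p_in_rope:.3f}")
```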
## Pairwise Comparison Heatmaps {#sec-pairwise}
```{python}
#| label: fig-heatmaps-2x2
#| fig-cap: "Pairwise posterior probability heatmaps P(α_row > α_col) for all four cells. Green cells indicate the row temperature has higher α; red cells indicate the column temperature has higher α. GPT-4o cells (top) show strong green in the upper triangle (lower-T rows beat higher-T columns), indicating consistent decline. Claude cells (bottom) show mixed colours, indicating no reliable ordering."
#| fig-height: 10
#| fig-width: 14
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
for cell_id, row, col in cell_order:
cell = CELLS[cell_id]
ax = axes[row, col]
temps = cell['temps']
n_temps = len(temps)
pairs = cell['analysis']['pairwise_comparisons']
heatmap = np.full((n_temps, n_temps), np.nan)
for key, prob in pairs.items():
t1, t2 = key.split('_vs_')
i = temps.index(float(t1))
j = temps.index(float(t2))
heatmap[i, j] = prob
heatmap[j, i] = 1 - prob
np.fill_diagonal(heatmap, 0.5)
im = ax.imshow(heatmap, cmap='RdYlGn', vmin=0, vmax=1, aspect='equal')
ax.set_xticks(range(n_temps))
ax.set_xticklabels([f'{t}' for t in temps], fontsize=9)
ax.set_yticks(range(n_temps))
ax.set_yticklabels([f'{t}' for t in temps], fontsize=9)
ax.set_xlabel('Temperature (col)')
ax.set_ylabel('Temperature (row)')
ax.set_title(f'{cell["label"]}', fontsize=12, fontweight='bold')
for i in range(n_temps):
for j in range(n_temps):
if not np.isnan(heatmap[i, j]):
color = 'white' if heatmap[i, j] > 0.8 or heatmap[i, j] < 0.2 else 'black'
ax.text(j, i, f'{heatmap[i, j]:.2f}', ha='center', va='center',
fontsize=9, color=color)
plt.colorbar(im, ax=axes.ravel().tolist(), shrink=0.6, label='P(α_row > α_col)')
plt.tight_layout()
plt.show()
```
## Summary Visualisation {#sec-summary-viz}
```{python}
#| label: fig-grand-summary
#| fig-cap: "Grand summary: α trajectory (with 90% CIs) for all four factorial cells on a single axis. GPT-4o cells (solid lines) decline clearly; Claude cells (dashed lines) remain flat. **Important**: temperature scales differ between LLMs — GPT-4o uses {0.0, 0.3, 0.7, 1.0, 1.5} while Claude uses {0.0, 0.2, 0.5, 0.8, 1.0}. The x-axis represents each LLM's own grid, so cross-LLM visual comparisons of slope magnitude are not directly valid. The qualitative contrast (declining vs. flat) is interpretable."
#| fig-height: 6
fig, ax = plt.subplots(figsize=(12, 6))
styles = {
'(1,1)': {'color': SEU_PALETTE[0], 'ls': '-', 'marker': 'o'},
'(1,2)': {'color': SEU_PALETTE[1], 'ls': '-', 'marker': 's'},
'(2,1)': {'color': SEU_PALETTE[2], 'ls': '--', 'marker': 'o'},
'(2,2)': {'color': SEU_PALETTE[3], 'ls': '--', 'marker': 's'},
}
for cell_id in ['(1,1)', '(1,2)', '(2,1)', '(2,2)']:
cell = CELLS[cell_id]
s = styles[cell_id]
medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]
ax.errorbar(cell['temps'], medians,
yerr=[np.array(medians) - np.array(q05s),
np.array(q95s) - np.array(medians)],
fmt=f'{s["marker"]}', color=s['color'], linewidth=2, markersize=8,
capsize=4, capthick=1.5, linestyle=s['ls'],
label=f'{cell["label"]}')
ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Temperature–Sensitivity Trajectories: All Factorial Cells')
ax.legend(loc='upper right', fontsize=10)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
```
```{python}
#| label: fig-summary-bars
#| fig-cap: "P(slope < 0) for each cell, arranged in factorial layout. GPT-4o cells are well above 0.9 (strong evidence of decline); Claude cells are near 0.5–0.8 (weak or absent evidence)."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(10, 5))
cell_ids = ['(1,1)', '(1,2)', '(2,1)', '(2,2)']
labels = [CELLS[c]['label'] for c in cell_ids]
p_negs = [CELLS[c]['p_negative'] for c in cell_ids]
colors = [cell_colors[c] for c in cell_ids]
bars = ax.bar(labels, p_negs, color=colors, edgecolor='white', linewidth=1.5, width=0.6)
ax.axhline(y=0.95, color='gray', linestyle='--', alpha=0.5, label='P = 0.95 threshold')
ax.axhline(y=0.5, color='gray', linestyle=':', alpha=0.3)
ax.set_ylabel('P(slope < 0)')
ax.set_title('Evidence for Negative Temperature–Sensitivity Slope')
ax.set_ylim(0, 1.08)
ax.legend()
for bar, p in zip(bars, p_negs):
ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
f'{p:.3f}', ha='center', va='bottom', fontsize=11, fontweight='bold')
plt.tight_layout()
plt.show()
```
## Discussion {#sec-discussion}
### Decomposing the Original Confound
The original question was: why did the Claude × Ellsberg study fail to replicate the GPT-4o × Insurance finding? The factorial design provides a clear answer.
```{python}
#| label: tbl-final-summary
#| tbl-cap: "Final factorial summary. The LLM factor (rows) drives the key distinction: GPT-4o consistently shows declining α with temperature, Claude does not."
final = pd.DataFrame({
'': ['**GPT-4o**', '**Claude**'],
'Insurance (K=3)': [
f"Declining (P = {CELLS['(1,1)']['p_negative']:.2f})",
f"Flat (P = {CELLS['(2,1)']['p_negative']:.2f})",
],
'Ellsberg (K=4)': [
f"Declining (P = {CELLS['(1,2)']['p_negative']:.2f})",
f"Weak (P = {CELLS['(2,2)']['p_negative']:.2f})",
],
})
final
```
### Baseline Sensitivity Levels
Beyond the slope analysis, the four cells differ substantially in their *baseline* α levels. GPT-4o at $T = 0.0$ has α ≈ 128 (Insurance) and α ≈ 110 (Ellsberg), while Claude at $T = 0.0$ has α ≈ 28 (Insurance) and α ≈ 55 (Ellsberg). These 2–5× differences in baseline sensitivity suggest that GPT-4o is generally more SEU-sensitive than Claude at low temperatures, independent of the temperature–slope question. This level effect is orthogonal to the slope effect but equally relevant for understanding how different LLMs implement decision-theoretic reasoning.
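The baseline figures quoted above are posterior medians of α at $T = 0.0$. A display-only sketch of how they are read off the loaded draws (exact values depend on the frozen data files):

```python
# Posterior median of alpha at the lowest temperature for each factorial cell.
for cid, cell in CELLS.items():
    baseline = np.median(cell['alpha_draws'][0.0])
    print(f"{cell['label']}: median alpha at T = 0.0 ≈ {baseline:.0f}")
```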
### The Oscillatory Claude Pattern
Both Claude cells exhibit a non-monotonic, oscillatory α trajectory across temperatures rather than a smooth decline or flat line. If this oscillation is systematic (rather than noise), it may reflect features of Claude's temperature implementation or its RLHF training that interact non-trivially with the softmax sensitivity parameter. The individual cell reports discuss these patterns in more detail. Whether the oscillation patterns align across the two Claude cells — which would suggest a systematic mechanism rather than random variation — is an open question that warrants investigation in future work.
### Main Conclusions
1. **LLM accounts for most of the qualitative variation.** GPT-4o shows a clear negative temperature–$\alpha$ slope ($P > 0.98$) on both tasks. Claude shows at best a weak trend ($P \approx 0.56$–$0.77$) on both tasks. The within-LLM evidence is strong for GPT-4o. The between-LLM comparison is directionally clear — P(GPT-4o slope < Claude slope) is approximately 0.80–0.82 within both tasks — but this between-cell probability is more modest than the within-cell evidence, reflecting the additional uncertainty inherent in comparing independently fitted models.
2. **Task is a secondary factor.** Changing the task from insurance to Ellsberg does not eliminate the effect for GPT-4o, nor does it create the effect for Claude. Within GPT-4o, Ellsberg may produce a slightly steeper decline, but the qualitative pattern is the same.
3. **No strong evidence of interaction, though power is limited.** The difference-in-differences analysis yields an interaction estimate centred near zero, but with a 90% credible interval wide enough to accommodate substantial interactions in either direction. The data are uninformative about whether the factorial structure is additive. The absence of a detected interaction should not be confused with evidence of additivity — a formal equivalence claim would require a ROPE analysis that the current data cannot support.
4. **Temperature–sensitivity is LLM-specific.** The finding that higher temperature reduces estimated SEU sensitivity should be qualified as a property observed in **GPT-4o** but not in Claude. Whether this reflects differences in temperature implementation, RLHF procedures, training data, or other architectural factors cannot be determined from the current design. The attribution to "temperature implementation" specifically is one of several possible explanations.
### Model Adequacy Across Cells
Posterior predictive checks were conducted for each of the four individual cell models and are reported in the respective cell reports. All four models showed adequate fit to the observed choice data, with no systematic evidence of misspecification. Readers are directed to the individual cell reports for full diagnostic details, including R-hat convergence, effective sample sizes, and posterior predictive p-values.
### Temperature Range Confound {#sec-temp-confound}
The GPT-4o cells use temperatures in $\{0.0, 0.3, 0.7, 1.0, 1.5\}$ while the Claude cells use $\{0.0, 0.2, 0.5, 0.8, 1.0\}$. Because the slope $\Delta\alpha / \Delta T$ is computed by regression over the full temperature grid, the GPT-4o slopes are estimated over a wider range ($\Delta T = 1.5$) with different grid spacing than the Claude slopes ($\Delta T = 1.0$). If the decline in $\alpha$ is concentrated at higher temperatures, a slope estimated over the narrower Claude range could appear less negative even if both models shared the same underlying sensitivity function.
To assess whether this confound drives the LLM comparison, we compute **matched-range slopes** for GPT-4o by restricting to $T \in \{0.0, 0.3, 0.7, 1.0\}$ (dropping $T = 1.5$) and compare these to the full-range Claude slopes.
```{python}
#| label: matched-range-analysis
#| echo: true
# Compute matched-range slopes for GPT-4o (T ≤ 1.0, dropping T = 1.5)
matched_temps = [0.0, 0.3, 0.7, 1.0]
matched_temp_arr = np.array(matched_temps)
matched_slopes = {}
for cell_id in ['(1,1)', '(1,2)']:
cell = CELLS[cell_id]
n_draws = len(cell['alpha_draws'][cell['temps'][0]])
slopes = np.empty(n_draws)
for i in range(n_draws):
alphas = np.array([cell['alpha_draws'][t][i] for t in matched_temps])
        slopes[i] = np.polyfit(matched_temp_arr, alphas, 1)[0]  # OLS slope on the restricted grid
matched_slopes[cell_id] = slopes
print("Matched-range slopes for GPT-4o (T ≤ 1.0 only):")
for cell_id in ['(1,1)', '(1,2)']:
cell = CELLS[cell_id]
s = matched_slopes[cell_id]
p_neg = np.mean(s < 0)
print(f" {cell['label']}: median = {np.median(s):.1f}, "
f"90% CI = [{np.percentile(s, 5):.1f}, {np.percentile(s, 95):.1f}], "
f"P(slope < 0) = {p_neg:.3f}")
print()
print("LLM comparison on matched range:")
for task, gpt_id, claude_id in [('Insurance', '(1,1)', '(2,1)'),
('Ellsberg', '(1,2)', '(2,2)')]:
p_gpt_more_neg = np.mean(matched_slopes[gpt_id] < CELLS[claude_id]['slope_draws'])
print(f" {task}: P(GPT matched slope < Claude slope) = {p_gpt_more_neg:.3f}")
```
The matched-range analysis confirms that restricting GPT-4o to $T \leq 1.0$ does not eliminate the negative trend: P(slope < 0) remains high for both GPT-4o cells on the restricted grid. The between-LLM comparison is also robust — P(GPT matched slope < Claude slope) is similar to the full-range values. The qualitative conclusion (GPT-4o declining, Claude flat) is not an artefact of the wider GPT-4o temperature range. Quantitative slope magnitudes still differ across grids due to the non-identical temperature points (e.g., GPT-4o at 0.3 vs. Claude at 0.2), but the range asymmetry is no longer a plausible alternative explanation for the qualitative LLM effect.
### Limitations
- **Exploratory synthesis, not pre-registered.** The factorial structure was imposed post-hoc after the initial non-replication. The analysis is exploratory and the conclusions should be evaluated accordingly.
- **Independent model fits, not a unified hierarchical model.** The four cells were fitted independently, and the factorial analysis operates on combined posterior draws. A hierarchical model estimating LLM and task effects within a single structure would yield tighter between-cell contrasts and formal effect-size estimates. This was not pursued in order to maintain consistency with the individual cell reports and because the foundational model validation applies to within-cell fits, not cross-cell comparisons.
- **Two LLMs.** The factorial examines only GPT-4o and Claude 3.5 Sonnet. Other LLMs (e.g., Llama, Gemini) may show different patterns.
- **Temperature scales differ.** The GPT-4o grid extends to $T = 1.5$ while Claude's maximum is $T = 1.0$. While the qualitative comparison is valid, quantitative slope comparisons across LLMs should be interpreted cautiously (see @sec-temp-confound).
- **Two tasks.** Insurance triage and Ellsberg gambles differ in multiple ways ($K$, semantic content, prior calibration). More tasks would strengthen the conclusion that the task effect is minor.
- **Prior differences across tasks.** The `m_01` and `m_02` models use different α priors calibrated for their respective $K$ values. This is appropriate for within-task analysis but complicates cross-task comparisons of α levels (see @sec-methods).
- **Multiple confounds between LLMs.** GPT-4o and Claude differ not only in their temperature implementations but also in training data, RLHF procedures, and potentially in task-specific fine-tuning. The observed LLM effect could reflect any combination of these factors.
- **Fixed design parameters.** All cells use $M \approx 300$, $D = 32$, $R = 30$. The conclusions may not generalise to designs with substantially different sample sizes or feature spaces.
### Connections to the JDM Literature
The finding that different LLMs show qualitatively different temperature–sensitivity patterns resonates with the broader JDM literature on individual differences in decision quality. Bruhin et al. (2010) documented substantial heterogeneity across human decision-makers in risk preferences and consistency, and Hey and Orme (1994) showed that error structures vary meaningfully across individuals. The present finding — that GPT-4o's estimated decision sensitivity degrades with temperature while Claude's does not — can be viewed as an analog of between-subject variability in decision noise. Whether this analogy is substantive (reflecting genuinely different "decision-making strategies") or superficial (reflecting implementation differences in how temperature modifies token sampling) is an open question that connects to ongoing debates about whether LLMs are useful models of human cognition (Binz & Schulz, 2023).
### Future Directions
- **Additional LLMs.** Extending the factorial to other model families would clarify whether the temperature–sensitivity effect is specific to OpenAI GPT-4o or shared by certain architectures.
- **More tasks.** Including tasks with different $K$ values or semantic structures would strengthen the conclusion about task invariance.
- **Unified hierarchical model.** Fitting a single model with LLM, task, and temperature as factors — potentially using a meta-analytic framework on the per-cell posterior draws — would provide formal effect-size estimates and sharper interaction tests.
- **Longitudinal tracking.** Model updates (GPT-4o versions, Claude updates) could change the temperature–sensitivity relationship — periodic re-assessment would be informative.
- **Mechanistic investigation.** Understanding *why* GPT-4o's temperature affects estimated $\alpha$ while Claude's does not may require probing the internal representations and decoding strategies of each model, connecting to the LLM interpretability literature on temperature scaling and its interaction with RLHF-trained output distributions.
- **Implications for deployment.** The finding that GPT-4o's decision quality (as measured by SEU sensitivity) degrades with temperature while Claude's does not has practical implications for LLM deployment in decision-support systems, suggesting that temperature settings should be tuned with model-specific awareness.
## Reproducibility {#sec-reproducibility}
This report loads pre-computed data from the frozen data directories of all four individual cell reports:
| Cell | Data Directory |
|------|---------------|
| (1,1) GPT-4o × Insurance | `reports/applications/temperature_study/data/` |
| (1,2) GPT-4o × Ellsberg | `reports/applications/gpt4o_ellsberg_study/data/` |
| (2,1) Claude × Insurance | `reports/applications/claude_insurance_study/data/` |
| (2,2) Claude × Ellsberg | `reports/applications/ellsberg_study/data/` |
Each directory contains `primary_analysis.json`, `alpha_draws_T*.npz`, and associated diagnostics. See the individual cell reports for refitting instructions and full methodological details.
To regenerate this synthesis report, render the Quarto document from the project root:
```bash
quarto render reports/applications/factorial_synthesis/01_factorial_synthesis.qmd
```
The report depends on the frozen data files listed above and the `report_utils` module in `reports/`. No additional packages beyond those in the project `environment.yml` are required.
## References {#sec-references}
Binz, M., & Schulz, E. (2023). Using cognitive psychology to understand GPT-3. *Proceedings of the National Academy of Sciences*, 120(6), e2218523120.
Bruhin, A., Fehr-Duda, H., & Epper, T. (2010). Risk and rationality: Uncovering heterogeneity in probability distortion. *Econometrica*, 78(4), 1375–1412.
Ellsberg, D. (1961). Risk, ambiguity, and the Savage axioms. *Quarterly Journal of Economics*, 75(4), 643–669.
Hey, J. D., & Orme, C. (1994). Investigating generalizations of expected utility theory using experimental data. *Econometrica*, 62(6), 1291–1326.
Kruschke, J. K. (2013). Bayesian estimation supersedes the *t* test. *Journal of Experimental Psychology: General*, 142(2), 573–603.
Lakens, D. (2017). Equivalence tests: A practical primer for *t* tests, correlations, and meta-analyses. *Social Psychological and Personality Science*, 8(4), 355–362.
Luce, R. D. (1959). *Individual choice behavior: A theoretical analysis*. Wiley.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), *Frontiers in econometrics* (pp. 105–142). Academic Press.