2×2 Factorial Synthesis: LLM × Task

Cross-Study Analysis of Temperature–Sensitivity Effects

Categories: applications, temperature, factorial, synthesis

A synthesis report for the 2×2 factorial design crossing LLM (GPT-4o vs Claude 3.5 Sonnet) with Task (Insurance triage K=3 vs Ellsberg gambles K=4). Isolates the main effects of LLM and task on the temperature–α relationship.

Published: June 27, 2026

0.1 Introduction

The initial temperature study found a clear monotonic negative relationship between LLM sampling temperature and estimated SEU sensitivity \(\alpha\), using GPT-4o on insurance claims triage (\(K = 3\)). When both the LLM and task were changed simultaneously — to Claude 3.5 Sonnet on Ellsberg gambles (\(K = 4\)) — the relationship was not replicated. Because those two changes were confounded, we could not determine whether the non-replication was driven by the LLM, the task, or their interaction.

This report presents the results of a \(2 \times 2\) factorial design that disentangles the contributions of each factor by running the two missing cells:

                    Insurance (\(K = 3\))      Ellsberg (\(K = 4\))
GPT-4o              Initial study              New: GPT-4o × Ellsberg
Claude 3.5 Sonnet   New: Claude × Insurance    Ellsberg study

Important: Preview of Key Finding

The LLM factor accounts for most of the qualitative variation in temperature–sensitivity patterns. GPT-4o shows a clear negative temperature–\(\alpha\) relationship on both tasks (within-cell \(P(\text{slope} < 0) > 0.98\)), while Claude 3.5 Sonnet shows weak or absent effects on both tasks. The task domain plays a secondary role — Ellsberg gambles may amplify the effect for GPT-4o but do not create it for Claude. The between-LLM comparison is directionally clear but quantitatively weaker (between-cell \(P(\text{GPT slope} < \text{Claude slope}) \approx 0.80\text{–}0.82\)); see Section 0.10 for calibrated claims and the independent-fits caveat.

0.2 Design Summary

0.2.1 Factorial Structure

Show code
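# Imports used by the code in this report. CELLS, SEU_PALETTE, and SEU_COLORS
# are assumed to be defined by a setup step that is not shown here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

design = pd.DataFrame({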
    'Cell': ['(1,1)', '(1,2)', '(2,1)', '(2,2)'],
    'LLM': ['GPT-4o', 'GPT-4o', 'Claude 3.5 Sonnet', 'Claude 3.5 Sonnet'],
    'Task': ['Insurance triage', 'Ellsberg gambles', 'Insurance triage', 'Ellsberg gambles'],
    'K': [3, 4, 3, 4],
    'Stan Model': ['m_01', 'm_02', 'm_01', 'm_02'],
    'Temperatures': [
        '{0.0, 0.3, 0.7, 1.0, 1.5}',
        '{0.0, 0.3, 0.7, 1.0, 1.5}',
        '{0.0, 0.2, 0.5, 0.8, 1.0}',
        '{0.0, 0.2, 0.5, 0.8, 1.0}',
    ],
    'Problems': ['100 × 3', '100 × 3', '100 × 3', '100 × 3'],
})
design
Table 1: The 2×2 factorial design. Each cell is a separate study with its own data collection, model fit, and analysis.
   Cell   LLM                Task              K  Stan Model  Temperatures               Problems
0  (1,1)  GPT-4o             Insurance triage  3  m_01        {0.0, 0.3, 0.7, 1.0, 1.5}  100 × 3
1  (1,2)  GPT-4o             Ellsberg gambles  4  m_02        {0.0, 0.3, 0.7, 1.0, 1.5}  100 × 3
2  (2,1)  Claude 3.5 Sonnet  Insurance triage  3  m_01        {0.0, 0.2, 0.5, 0.8, 1.0}  100 × 3
3  (2,2)  Claude 3.5 Sonnet  Ellsberg gambles  4  m_02        {0.0, 0.2, 0.5, 0.8, 1.0}  100 × 3

Note: Temperature Scales Are Not Comparable Across Providers

The GPT-4o cells use temperatures in \([0.0, 1.5]\) (OpenAI range), while Claude cells use \([0.0, 1.0]\) (Anthropic range). The same numerical temperature (e.g., \(T = 0.7\)) produces different effective randomness levels in different LLMs. Comparisons across LLMs therefore focus on the qualitative pattern (monotonic decline vs. flat / non-monotonic) rather than quantitative slope magnitudes.

0.3 Hypotheses and Design Chronology

The factorial synthesis tests three predictions about the temperature–\(\alpha\) relationship:

  1. LLM main effect (H1): The probability of a negative temperature–\(\alpha\) slope will be higher for GPT-4o than for Claude within both tasks. That is, P(slope < 0) for GPT-4o cells will exceed the corresponding values for Claude cells.

  2. Task secondary effect (H2): The task effect (Insurance vs. Ellsberg) will be smaller than the LLM effect. Within each LLM, the qualitative pattern will be similar across tasks.

  3. Minimal interaction (H3): The LLM and task effects will be approximately additive — i.e., the difference-in-differences of slopes will be near zero.

Note: Design Chronology

This factorial design was not pre-registered. The initial study (GPT-4o × Insurance) and the Ellsberg study (Claude × Ellsberg) were conducted first. When the replication failed, the confound between LLM and task was identified, and the two missing cells (GPT-4o × Ellsberg, Claude × Insurance) were run reactively to disentangle the factors. The factorial framing was thus imposed post-hoc, and the analysis should be understood as exploratory rather than confirmatory. Nevertheless, the design logic is sound: the four cells provide the minimal structure needed to decompose the original confound into main effects and an interaction.

0.4 Methods

0.4.1 Analytical Approach

This synthesis loads pre-computed posterior draws from four independently fitted Bayesian models — one per factorial cell. Each cell was modeled using task-appropriate Stan models: m_01 (softmax with \(K = 3\) alternatives) for insurance cells and m_02 (\(K = 4\)) for Ellsberg cells. The α parameter was estimated separately at each temperature level within each cell.
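To make the role of α concrete, the following is a minimal sketch of a softmax choice rule with a sensitivity parameter. It assumes choice probabilities proportional to exp(α · EU_k) over K alternatives; the exact likelihood used in m_01 and m_02 is not shown in this report, so the function name `choice_probabilities` and the expected-utility values are hypothetical and purely illustrative.

```python
import numpy as np

def choice_probabilities(expected_utilities, alpha):
    """Softmax over expected utilities, scaled by sensitivity alpha.

    Assumed form (not taken from the Stan models): P(k) ∝ exp(alpha * EU_k).
    """
    z = alpha * np.asarray(expected_utilities, dtype=float)
    z = z - z.max()          # subtract the max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()

# Hypothetical expected utilities for K = 3 alternatives (illustrative only).
eu = [0.50, 0.55, 0.60]

p_low = choice_probabilities(eu, alpha=5.0)    # low sensitivity: choices spread out
p_high = choice_probabilities(eu, alpha=80.0)  # high sensitivity: mass on the best option

print("alpha = 5 :", np.round(p_low, 3))
print("alpha = 80:", np.round(p_high, 3))
```

Under this assumed form, α = 0 gives uniform choice over the K alternatives, and larger α concentrates probability on the alternative with the highest expected utility. A drop in α with temperature would therefore correspond to choices that track expected utility less tightly.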

Slope computation. The temperature–\(\alpha\) slope for each posterior draw is computed by ordinary least-squares regression of the five α values on the temperature grid. This is a derived summary computed from independent per-temperature posteriors, not a parameter estimated within the Bayesian model. The slope captures the global linear trend but cannot distinguish between linear and non-linear temperature–α relationships. Its uncertainty reflects posterior uncertainty in the per-temperature α estimates but not model uncertainty about the functional form of the temperature–α relationship.
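The slope computation described above can be sketched as follows. This is an illustration on synthetic draws, not the report's own code: the helper name `slope_per_draw`, the simulated α values, and the pairing of draws across temperatures by row index are all assumptions made for the example.

```python
import numpy as np

def slope_per_draw(alpha_matrix, temps):
    """OLS slope of alpha on temperature, computed separately for each draw.

    alpha_matrix: shape (n_draws, n_temps), one column per temperature.
    temps: shape (n_temps,), the temperature grid.
    """
    temps = np.asarray(temps, dtype=float)
    alpha_matrix = np.asarray(alpha_matrix, dtype=float)
    t_centered = temps - temps.mean()
    # Closed-form OLS slope: sum((t - t_bar) * a) / sum((t - t_bar)^2).
    # Centering alpha is unnecessary because t_centered sums to zero.
    return alpha_matrix @ t_centered / np.sum(t_centered ** 2)

# Synthetic stand-in for per-temperature posterior draws (illustrative only):
# a linear decline of 30 units per unit temperature plus independent noise.
rng = np.random.default_rng(0)
temps = np.array([0.0, 0.3, 0.7, 1.0, 1.5])
alpha_draws = 80.0 - 30.0 * temps + rng.normal(0.0, 10.0, size=(4000, temps.size))

slopes = slope_per_draw(alpha_draws, temps)
p_negative = np.mean(slopes < 0)   # share of draws with a negative slope
q05, q95 = np.percentile(slopes, [5, 95])
print(f"median slope {np.median(slopes):.1f}, 90% CI [{q05:.1f}, {q95:.1f}], "
      f"P(slope < 0) = {p_negative:.3f}")
```

Because each row yields one slope, the resulting array can be summarized like any other posterior quantity. The summary inherits the limitation noted above: a single linear coefficient cannot tell a steady decline from a flat trajectory with one low endpoint.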

Main effects and interaction. The LLM main effect is assessed by comparing slope draws between GPT-4o and Claude cells within each task. The task main effect is assessed analogously. The interaction is quantified as the difference-in-differences of slopes: (GPT-Ellsberg − GPT-Insurance) − (Claude-Ellsberg − Claude-Insurance). Because the four cells were fitted independently with no shared parameters, between-cell comparisons combine two sources of posterior uncertainty and are inherently wider than within-cell contrasts.

Limitations of the independent-fits approach. A more statistically coherent analysis would fit a single hierarchical model with LLM, task, and temperature as factors. The current approach was chosen to maintain consistency with the individual cell reports and because the foundational validation applies to the within-cell models. However, independent fits cannot share information across cells and may underestimate the precision of between-cell contrasts. The cross-cell comparisons should therefore be understood as exploratory summaries rather than formal inferential conclusions.

Note: Prior Differences Across Tasks

The insurance cells use m_01 with prior \(\alpha \sim \text{Lognormal}(3.0, 0.75)\), while the Ellsberg cells use m_02 with prior \(\alpha \sim \text{Lognormal}(3.5, 0.75)\). This difference reflects the prior predictive calibration for \(K = 3\) vs. \(K = 4\) settings and is methodologically appropriate for within-task comparisons. However, it means that cross-task comparisons of α levels may partly reflect prior differences rather than data differences. The slope analysis (within-task changes across temperature) is less affected, since the prior is constant within each task. Readers should interpret cross-task level differences cautiously.
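To give a sense of how far apart the two priors sit, the sketch below computes the median and central 90% interval of each. It assumes the usual log-scale parameterization, in which Lognormal(μ, σ) has median exp(μ); the helper name `lognormal_summary` is hypothetical.

```python
import math

Z90 = 1.6448536269514722  # standard normal 95th percentile (central 90% interval)

def lognormal_summary(mu, sigma):
    """Median and central 90% interval of Lognormal(mu, sigma) on the natural scale."""
    return {
        "median": math.exp(mu),
        "q05": math.exp(mu - Z90 * sigma),
        "q95": math.exp(mu + Z90 * sigma),
    }

prior_m01 = lognormal_summary(3.0, 0.75)  # insurance cells (K = 3)
prior_m02 = lognormal_summary(3.5, 0.75)  # Ellsberg cells (K = 4)

for name, s in [("m_01", prior_m01), ("m_02", prior_m02)]:
    print(f"{name}: median {s['median']:.1f}, 90% interval [{s['q05']:.1f}, {s['q95']:.1f}]")
print(f"ratio of prior medians: {prior_m02['median'] / prior_m01['median']:.2f}")
```

Under that parameterization the prior medians are about 20.1 and 33.1, a ratio of exp(0.5) ≈ 1.65. Because σ is the same in both priors, the entire m_02 prior is the m_01 prior scaled up by that factor, which is why cross-task α levels should be compared with caution.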

0.5 Results Matrix

0.5.1 Forest Plots

Show code
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

cell_order = [('(1,1)', 0, 0), ('(1,2)', 0, 1), ('(2,1)', 1, 0), ('(2,2)', 1, 1)]

for cell_id, row, col in cell_order:
    cell = CELLS[cell_id]
    ax = axes[row, col]
    temps = cell['temps']
    y_positions = np.arange(len(temps))[::-1]

    for i, t in enumerate(temps):
        draws = cell['alpha_draws'][t]
        median = np.median(draws)
        q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])

        y = y_positions[i]
        ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
        ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
        ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
                markeredgecolor='white', markeredgewidth=1.5, zorder=5)

    ax.set_yticks(y_positions)
    ax.set_yticklabels([f'T = {t}' for t in temps])
    ax.set_xlabel('Sensitivity (α)')
    ax.set_title(f'{cell["label"]}', fontsize=13, fontweight='bold')
    ax.grid(axis='x', alpha=0.3)
    ax.grid(axis='y', alpha=0)

# Share x-axis limits within rows (same LLM)
for row_idx in range(2):
    xmin = min(axes[row_idx, c].get_xlim()[0] for c in range(2))
    xmax = max(axes[row_idx, c].get_xlim()[1] for c in range(2))
    for c in range(2):
        axes[row_idx, c].set_xlim(xmin, xmax)

fig.suptitle('Posterior α by Temperature: 2×2 Factorial', fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
Figure 1: Forest plots of posterior α distributions for all four cells of the factorial design. Each panel shows the five temperature conditions, with point estimates (medians), 50% credible intervals (thick bars), and 90% credible intervals (thin bars). The GPT-4o row (top) shifts leftward at the highest temperatures; the Claude row (bottom) does not show a comparable shift. Note that GPT-4o and Claude use different temperature grids and provider-specific scales — interpret qualitative patterns rather than slope magnitudes across rows.

0.5.2 Posterior Density Overlays

Show code
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for cell_id, row, col in cell_order:
    cell = CELLS[cell_id]
    ax = axes[row, col]

    for i, t in enumerate(cell['temps']):
        draws = cell['alpha_draws'][t]
        kde = gaussian_kde(draws)
        x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
        ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
        ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
                label=f'T = {t}')

    ax.set_xlabel('Sensitivity (α)')
    ax.set_ylabel('Density')
    ax.set_title(f'{cell["label"]}', fontsize=13, fontweight='bold')
    ax.legend(loc='upper right', fontsize=9)

plt.tight_layout()
plt.show()
Figure 2: Kernel density estimates of posterior α for all four cells. Top row (GPT-4o): the highest temperature (\(T=1.5\), warmest color) separates clearly, while adjacent low/mid-temperature densities overlap substantially. Bottom row (Claude): densities overlap heavily with no consistent adjacent ordering, although Claude × Ellsberg has a weak negative global slope (\(P(\text{slope} < 0) \approx 0.77\)). GPT-4o and Claude use different temperature grids and provider-specific scales, so interpret qualitative patterns rather than slope magnitudes across rows.

0.6 Monotonicity Summary

Show code
rows = []
for cell_id in ['(1,1)', '(1,2)', '(2,1)', '(2,2)']:
    cell = CELLS[cell_id]
    s = cell['slope_draws']
    q05, q95 = np.percentile(s, [5, 95])
    rows.append({
        'Cell': cell_id,
        'Study': cell['label'],
        'Slope median': f"{np.median(s):.1f}",
        'Slope 90% CI': f"[{q05:.1f}, {q95:.1f}]",
        'P(slope < 0)': f"{cell['p_negative']:.3f}",
        'P(strict mono ↓)': f"{cell['mono_prob']:.4f}",
        'Pattern': 'Strong neg. global slope' if cell['p_negative'] > 0.95 else
                   ('Weak neg. global slope' if cell['p_negative'] > 0.7 else 'Little global-slope evidence'),
    })

pd.DataFrame(rows)
Table 2: Summary of temperature–sensitivity relationship across all four cells. P(slope < 0) near 1 indicates strong evidence for a negative relationship; P(strict mono) is the probability that α is strictly decreasing at every consecutive temperature step.
   Cell   Study               Slope median  Slope 90% CI    P(slope < 0)  P(strict mono ↓)  Pattern
0  (1,1)  GPT-4o × Insurance  -30.8         [-65.5, -8.3]   0.991         0.1247            Strong neg. global slope
1  (1,2)  GPT-4o × Ellsberg   -48.0         [-90.2, -12.5]  0.984         0.0902            Strong neg. global slope
2  (2,1)  Claude × Insurance  -3.6          [-53.6, 38.5]   0.560         0.0077            Little global-slope evidence
3  (2,2)  Claude × Ellsberg   -18.8         [-65.3, 24.5]   0.766         0.0085            Weak neg. global slope

The “Pattern” labels in Table 2 use the following calibrated vocabulary (used consistently across the three cell reports and this synthesis):

  • Strong neg. global slope: \(P(\text{slope} < 0) > 0.95\) and 90% CI excludes zero.
  • Weak neg. global slope: \(P(\text{slope} < 0) \in [0.7, 0.95)\) or 90% CI spans zero.
  • Little global-slope evidence: \(P(\text{slope} < 0) < 0.7\).

These labels deliberately refer to the global slope (a single regression coefficient over the temperature grid). They do not assert strict monotonicity: even a strong negative global slope is compatible with substantial adjacent-pair reversals, as the next paragraph quantifies. Readers should attend to the continuous \(P(\text{slope} < 0)\) and \(P(\text{strict mono} \downarrow)\) columns rather than the categorical label alone.
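The vocabulary above can be written as a small function. This is one possible reading of the three definitions, in which "Strong" requires both conditions and "Little" takes precedence over the interval condition; the Table 2 code itself applies only the \(P(\text{slope} < 0)\) thresholds. The name `pattern_label` is hypothetical.

```python
def pattern_label(p_negative, ci_low, ci_high):
    """Map P(slope < 0) and the 90% CI of the slope to a calibrated label.

    One possible reading of the vocabulary; not the report's own implementation.
    """
    ci_excludes_zero = ci_low > 0 or ci_high < 0
    if p_negative > 0.95 and ci_excludes_zero:
        return "Strong neg. global slope"
    if p_negative < 0.7:
        return "Little global-slope evidence"
    return "Weak neg. global slope"

# Values copied from Table 2: (P(slope < 0), CI lower, CI upper).
table2 = {
    "GPT-4o × Insurance": (0.991, -65.5, -8.3),
    "GPT-4o × Ellsberg": (0.984, -90.2, -12.5),
    "Claude × Insurance": (0.560, -53.6, 38.5),
    "Claude × Ellsberg": (0.766, -65.3, 24.5),
}
labels = {cell: pattern_label(*values) for cell, values in table2.items()}
for cell, label in labels.items():
    print(f"{cell}: {label}")
```

For the four cells in Table 2, this reading and the threshold-only rule give the same labels, since both GPT-4o intervals exclude zero.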

Note that the strict monotonicity probabilities — P(strict mono ↓) — are remarkably low even for GPT-4o (0.12 and 0.09), meaning that in roughly 90% of posterior draws, at least one adjacent temperature pair shows a local reversal. The global slope is clearly negative for GPT-4o, but the trajectory is a noisy decline rather than a smooth monotonic function. This is consistent with non-monotonic local variation around a global negative trend, and suggests that the temperature–α relationship, while real, is not a simple step-wise degradation. For Claude, the near-zero strict monotonicity probabilities are expected given the absence of a global trend.
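A useful reference point for these probabilities is the chance level under no trend. If the five per-temperature values were exchangeable, each of the 5! = 120 orderings would be equally likely, so a strictly decreasing sequence would occur with probability 1/120 ≈ 0.0083, close to the two Claude values in Table 2. The sketch below checks this on synthetic draws; the helper name `p_strict_decreasing` and the simulated data are assumptions for illustration.

```python
import math
import numpy as np

def p_strict_decreasing(alpha_matrix):
    """Share of draws (rows) in which alpha falls at every consecutive temperature step."""
    steps = np.diff(np.asarray(alpha_matrix, dtype=float), axis=1)
    return float(np.mean(np.all(steps < 0, axis=1)))

n_temps = 5
chance_level = 1.0 / math.factorial(n_temps)  # 1/120 under exchangeability

# Synthetic "no trend" case: identical, independent distributions at every temperature.
rng = np.random.default_rng(1)
flat_draws = rng.normal(80.0, 15.0, size=(200_000, n_temps))
p_flat = p_strict_decreasing(flat_draws)

# Synthetic "noisy decline" case: a negative trend with the same noise level.
temps = np.array([0.0, 0.3, 0.7, 1.0, 1.5])
trend_draws = flat_draws - 30.0 * temps
p_trend = p_strict_decreasing(trend_draws)

print(f"chance level 1/120 = {chance_level:.4f}")
print(f"no trend:      P(strict mono) = {p_flat:.4f}")
print(f"noisy decline: P(strict mono) = {p_trend:.4f}")
```

The point of the comparison is that strict monotonicity is a demanding criterion: with noise of this size, even a clear negative trend leaves it well below 1, so the statistic is best read relative to the 1/120 baseline rather than on an absolute scale.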

Show code
fig, ax = plt.subplots(figsize=(10, 5))

cell_colors = {
    '(1,1)': SEU_PALETTE[0],
    '(1,2)': SEU_PALETTE[1],
    '(2,1)': SEU_PALETTE[2],
    '(2,2)': SEU_PALETTE[3],
}

for cell_id in ['(1,1)', '(1,2)', '(2,1)', '(2,2)']:
    cell = CELLS[cell_id]
    s = cell['slope_draws']
    kde = gaussian_kde(s)
    x_grid = np.linspace(np.percentile(s, 0.5), np.percentile(s, 99.5), 300)
    ax.fill_between(x_grid, kde(x_grid), alpha=0.15, color=cell_colors[cell_id])
    ax.plot(x_grid, kde(x_grid), color=cell_colors[cell_id], linewidth=2,
            label=f'{cell["label"]} (med={np.median(s):.0f})')

ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5, linewidth=1.5)
ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Posterior Slope Distributions: All Four Cells')
ax.legend(loc='upper left', fontsize=10)

plt.tight_layout()
plt.show()
Figure 3: Posterior distributions of the global slope Δα/ΔT for all four cells. GPT-4o cells (blue, orange) are concentrated below zero; Claude cells (green, red) straddle zero.

0.7 Main Effects Analysis

0.7.1 LLM Main Effect

The LLM main effect asks: holding task constant, does switching from GPT-4o to Claude change the temperature–\(\alpha\) relationship?

Show code
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

llm_colors = {'GPT-4o': SEU_COLORS['primary'], 'Claude': SEU_COLORS['accent']}

# --- Left: Insurance task ---
ax = axes[0]
for cell_id in ['(1,1)', '(2,1)']:
    cell = CELLS[cell_id]
    medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
    q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
    q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]
    color = llm_colors[cell['llm']]
    marker = 'o' if cell['llm'] == 'GPT-4o' else 's'
    ax.errorbar(cell['temps'], medians,
                yerr=[np.array(medians) - np.array(q05s),
                      np.array(q95s) - np.array(medians)],
                fmt=f'{marker}-', color=color, linewidth=2, markersize=8,
                capsize=5, capthick=1.5, label=f"{cell['llm']} (P(−)={cell['p_negative']:.2f})")

ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Insurance Task (K=3)\nLLM Comparison', fontsize=12)
ax.legend()

# --- Right: Ellsberg task ---
ax = axes[1]
for cell_id in ['(1,2)', '(2,2)']:
    cell = CELLS[cell_id]
    medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
    q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
    q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]
    color = llm_colors[cell['llm']]
    marker = 'o' if cell['llm'] == 'GPT-4o' else 's'
    ax.errorbar(cell['temps'], medians,
                yerr=[np.array(medians) - np.array(q05s),
                      np.array(q95s) - np.array(medians)],
                fmt=f'{marker}-', color=color, linewidth=2, markersize=8,
                capsize=5, capthick=1.5, label=f"{cell['llm']} (P(−)={cell['p_negative']:.2f})")

ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Ellsberg Task (K=4)\nLLM Comparison', fontsize=12)
ax.legend()

plt.tight_layout()
plt.show()
Figure 4: LLM main effect. Left: Insurance task — GPT-4o shows a strong negative global slope, Claude shows little global-slope evidence. Right: Ellsberg task — GPT-4o shows a strong negative global slope, Claude shows a weak negative global slope. The qualitative LLM contrast is consistent across both tasks. Temperature grids and provider scales differ — interpret qualitative patterns rather than slope magnitudes across LLMs.
Show code
llm_rows = []
# Each entry pairs a task with its (GPT-4o cell, Claude cell) IDs.
for task, (gpt_id, claude_id) in [('Insurance', ('(1,1)', '(2,1)')), ('Ellsberg', ('(1,2)', '(2,2)'))]:
    gpt = CELLS[gpt_id]
    claude = CELLS[claude_id]

    # P(GPT-4o slope more negative than Claude slope)
    p_gpt_more_neg = np.mean(gpt['slope_draws'] < claude['slope_draws'])

    llm_rows.append({
        'Task': task,
        'GPT-4o slope (med)': f"{np.median(gpt['slope_draws']):.1f}",
        'GPT-4o P(−)': f"{gpt['p_negative']:.3f}",
        'Claude slope (med)': f"{np.median(claude['slope_draws']):.1f}",
        'Claude P(−)': f"{claude['p_negative']:.3f}",
        'P(GPT slope < Claude slope)': f"{p_gpt_more_neg:.3f}",
    })

pd.DataFrame(llm_rows)
Table 3: LLM main effect: GPT-4o vs Claude within each task.
   Task       GPT-4o slope (med)  GPT-4o P(−)  Claude slope (med)  Claude P(−)  P(GPT slope < Claude slope)
0  Insurance  -30.8               0.991        -3.6                0.560        0.817
1  Ellsberg   -48.0               0.984        -18.8               0.766        0.797

Within both tasks, GPT-4o’s slope is more negative than Claude’s. The probability that GPT-4o’s slope is more negative than Claude’s is moderately high for both tasks (see P(GPT slope < Claude slope) in Table 3), indicating a consistent LLM main effect in qualitative terms. However, these between-LLM probabilities (~0.80–0.82) are notably weaker than the within-LLM evidence for GPT-4o’s negative slope (P > 0.98). The between-cell comparison carries additional uncertainty because the four cells were fitted independently with no shared parameters.
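The gap between strong within-cell evidence and weaker between-cell evidence is what one would expect when a tightly estimated slope is compared against a diffuse one. The sketch below illustrates this with two independent normal distributions. The means and standard deviations are made-up values in the general range of Table 3, not fitted quantities, and the normal approximation is itself an assumption.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_below_zero(mean, sd):
    """P(X < 0) for X ~ Normal(mean, sd)."""
    return normal_cdf(-mean / sd)

def p_a_below_b(mean_a, sd_a, mean_b, sd_b):
    """P(A < B) for independent normals; B - A has sd hypot(sd_a, sd_b)."""
    return normal_cdf((mean_b - mean_a) / math.hypot(sd_a, sd_b))

# Illustrative (mean, sd) pairs only: a well-determined negative slope
# and a diffuse slope centered near zero.
slope_a = (-30.0, 13.0)
slope_b = (-4.0, 28.0)

within = p_below_zero(*slope_a)            # evidence that slope A is negative
between = p_a_below_b(*slope_a, *slope_b)  # evidence that A is more negative than B

print(f"P(A < 0) = {within:.3f}")
print(f"P(A < B) = {between:.3f}")
```

In this toy setup the within-cell probability exceeds 0.98 while the between-cell probability stays near 0.8, because the comparison inherits the uncertainty of both slopes. A between-cell probability near 0.8 is therefore not in tension with strong within-cell evidence for one of the two cells.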

0.7.2 Task Main Effect

The task main effect asks: holding LLM constant, does switching from insurance to Ellsberg change the temperature–\(\alpha\) relationship?

Show code
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

task_colors = {'Insurance': SEU_COLORS['primary'], 'Ellsberg': SEU_COLORS['accent']}

# --- Left: GPT-4o ---
ax = axes[0]
for cell_id in ['(1,1)', '(1,2)']:
    cell = CELLS[cell_id]
    medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
    q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
    q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]
    color = task_colors[cell['task']]
    marker = 'o' if cell['task'] == 'Insurance' else 's'
    ax.errorbar(cell['temps'], medians,
                yerr=[np.array(medians) - np.array(q05s),
                      np.array(q95s) - np.array(medians)],
                fmt=f'{marker}-', color=color, linewidth=2, markersize=8,
                capsize=5, capthick=1.5,
                label=f"{cell['task']} K={cell['K']} (P(−)={cell['p_negative']:.2f})")

ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('GPT-4o\nTask Comparison', fontsize=12)
ax.legend()

# --- Right: Claude ---
ax = axes[1]
for cell_id in ['(2,1)', '(2,2)']:
    cell = CELLS[cell_id]
    medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
    q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
    q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]
    color = task_colors[cell['task']]
    marker = 'o' if cell['task'] == 'Insurance' else 's'
    ax.errorbar(cell['temps'], medians,
                yerr=[np.array(medians) - np.array(q05s),
                      np.array(q95s) - np.array(medians)],
                fmt=f'{marker}-', color=color, linewidth=2, markersize=8,
                capsize=5, capthick=1.5,
                label=f"{cell['task']} K={cell['K']} (P(−)={cell['p_negative']:.2f})")

ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Claude 3.5 Sonnet\nTask Comparison', fontsize=12)
ax.legend()

plt.tight_layout()
plt.show()
Figure 5: Task main effect. Left: GPT-4o — both tasks show declining α, though the Ellsberg task shows a steeper decline. Right: Claude — neither task shows a convincing decline, though Ellsberg has a slightly more negative slope.
Show code
task_rows = []
# Each entry pairs an LLM with its (Insurance cell, Ellsberg cell) IDs.
for llm, (ins_id, ells_id) in [('GPT-4o', ('(1,1)', '(1,2)')), ('Claude', ('(2,1)', '(2,2)'))]:
    ins = CELLS[ins_id]
    ells = CELLS[ells_id]

    # P(Ellsberg slope more negative than Insurance slope)
    p_ells_more_neg = np.mean(ells['slope_draws'] < ins['slope_draws'])

    task_rows.append({
        'LLM': llm,
        'Insurance slope (med)': f"{np.median(ins['slope_draws']):.1f}",
        'Insurance P(−)': f"{ins['p_negative']:.3f}",
        'Ellsberg slope (med)': f"{np.median(ells['slope_draws']):.1f}",
        'Ellsberg P(−)': f"{ells['p_negative']:.3f}",
        'P(Ellsberg slope < Insurance slope)': f"{p_ells_more_neg:.3f}",
    })

pd.DataFrame(task_rows)
Table 4: Task main effect: Insurance vs Ellsberg within each LLM.
   LLM     Insurance slope (med)  Insurance P(−)  Ellsberg slope (med)  Ellsberg P(−)  P(Ellsberg slope < Insurance slope)
0  GPT-4o  -30.8                  0.991           -48.0                 0.984          0.713
1  Claude  -3.6                   0.560           -18.8                 0.766          0.655

The task effect is weaker and less consistent than the LLM effect. For GPT-4o, Ellsberg gambles may produce a somewhat steeper slope than insurance, but both tasks show clear negative trends. For Claude, neither task produces a convincing slope.

0.7.3 Interaction

Is the temperature–\(\alpha\) relationship specific to a particular LLM–task combination, or is it decomposable into additive main effects?

Show code
fig, ax = plt.subplots(figsize=(8, 5))

tasks = ['Insurance', 'Ellsberg']
gpt_slopes = [np.median(CELLS['(1,1)']['slope_draws']),
              np.median(CELLS['(1,2)']['slope_draws'])]
claude_slopes = [np.median(CELLS['(2,1)']['slope_draws']),
                 np.median(CELLS['(2,2)']['slope_draws'])]

ax.plot(tasks, gpt_slopes, 'o-', color=SEU_COLORS['primary'], linewidth=2.5,
        markersize=10, label='GPT-4o')
ax.plot(tasks, claude_slopes, 's-', color=SEU_COLORS['accent'], linewidth=2.5,
        markersize=10, label='Claude')

ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax.set_ylabel('Slope median (Δα / ΔT)')
ax.set_title('Interaction Plot: Slope Magnitude')
ax.legend(fontsize=12)
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()
Figure 6: Primary interaction plot in terms of slope medians. The GPT-4o line (medians -30.8 and -48.0) sits well below the Claude line (medians -3.6 and -18.8). Both lines fall by a similar amount from Insurance to Ellsberg, so they are roughly parallel. This is consistent with an additive structure, though the data cannot rule out moderate interactions (see text).
Show code
fig, ax = plt.subplots(figsize=(8, 5))

tasks = ['Insurance', 'Ellsberg']
gpt_p_neg = [CELLS['(1,1)']['p_negative'], CELLS['(1,2)']['p_negative']]
claude_p_neg = [CELLS['(2,1)']['p_negative'], CELLS['(2,2)']['p_negative']]

ax.plot(tasks, gpt_p_neg, 'o-', color=SEU_COLORS['primary'], linewidth=2.5,
        markersize=10, label='GPT-4o')
ax.plot(tasks, claude_p_neg, 's-', color=SEU_COLORS['accent'], linewidth=2.5,
        markersize=10, label='Claude')

ax.set_ylabel('P(slope < 0)')
ax.set_title('Interaction Plot: LLM × Task')
ax.legend(fontsize=12)
ax.set_ylim(0.4, 1.05)
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()
Figure 7: Supplementary interaction plot using P(slope < 0) as the dependent variable. Note that P(slope < 0) is a tail probability — a non-linear transformation of the slope distribution — so parallel lines here do not strictly imply additive effects on the slope scale. See Figure 6 for the primary interaction analysis on the linear slope scale.
Show code
# Quantitative interaction: difference in slope differences
# If additive: (GPT-Ells − GPT-Ins) ≈ (Claude-Ells − Claude-Ins)
# Interaction = (GPT-Ells − GPT-Ins) − (Claude-Ells − Claude-Ins)

interaction_draws = (
    (CELLS['(1,2)']['slope_draws'] - CELLS['(1,1)']['slope_draws']) -
    (CELLS['(2,2)']['slope_draws'] - CELLS['(2,1)']['slope_draws'])
)

print("Interaction (difference-in-differences of slopes):")
print(f"  Median: {np.median(interaction_draws):.1f}")
q05, q95 = np.percentile(interaction_draws, [5, 95])
print(f"  90% CI: [{q05:.1f}, {q95:.1f}]")
print(f"  P(interaction > 0): {np.mean(interaction_draws > 0):.3f}")
print(f"  P(interaction < 0): {np.mean(interaction_draws < 0):.3f}")
print()
print("The 90% CI is extremely wide, reflecting the propagation of uncertainty")
print("through the difference-in-differences of four independently estimated slopes.")
print("The data are uninformative about the presence or magnitude of an interaction.")
Interaction (difference-in-differences of slopes):
  Median: -1.8
  90% CI: [-87.8, 79.0]
  P(interaction > 0): 0.486
  P(interaction < 0): 0.514

The 90% CI is extremely wide, reflecting the propagation of uncertainty
through the difference-in-differences of four independently estimated slopes.
The data are uninformative about the presence or magnitude of an interaction.

The difference-in-differences analysis yields an interaction estimate centered near zero, but the 90% credible interval is extremely wide — spanning roughly 170 slope units. This width reflects a fundamental statistical limitation: each slope is derived from regression on five temperature points with substantial posterior uncertainty, and the interaction compounds two such differences. The data therefore cannot distinguish between additive and non-additive structures. The correct interpretation is not that the interaction is small, but that the study has limited power to detect it. A formal equivalence claim would require defining a region of practical equivalence (ROPE) and demonstrating that the posterior concentrates within it; the current data do not support such a claim.
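The width of the interaction interval can be roughly reconstructed from the four per-cell slope intervals in Table 2. The sketch below is a back-of-envelope check, assuming the four slope posteriors are independent and approximately normal. Under that assumption, interval widths combine in quadrature, just as standard deviations do. The actual posteriors are skewed, so only approximate agreement should be expected.

```python
import math

# 90% credible intervals for the slope, copied from Table 2.
slope_ci90 = {
    "GPT-4o × Insurance": (-65.5, -8.3),
    "GPT-4o × Ellsberg": (-90.2, -12.5),
    "Claude × Insurance": (-53.6, 38.5),
    "Claude × Ellsberg": (-65.3, 24.5),
}

# Under independence and approximate normality, the width of the interval for a
# signed sum of the four slopes is the root-sum-of-squares of the four widths.
widths = {cell: high - low for cell, (low, high) in slope_ci90.items()}
combined_width = math.sqrt(sum(w ** 2 for w in widths.values()))

# Width of the reported interaction interval [-87.8, 79.0].
reported_width = 79.0 - (-87.8)

# Difference-in-differences of the slope medians from Table 2.
did_of_medians = (-48.0 - (-30.8)) - (-18.8 - (-3.6))

print(f"quadrature width: {combined_width:.1f}")
print(f"reported width:   {reported_width:.1f}")
print(f"DiD of medians:   {did_of_medians:.1f}")
```

The quadrature estimate comes out at about 161 slope units against a reported width of about 167, and the difference-in-differences of the medians is about -2, close to the reported interaction median of -1.8. The wide interval therefore follows from the per-cell slope uncertainty alone and is not specific to the interaction term.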

0.8 Pairwise Comparison Heatmaps

Show code
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for cell_id, row, col in cell_order:
    cell = CELLS[cell_id]
    ax = axes[row, col]
    temps = cell['temps']
    n_temps = len(temps)
    pairs = cell['analysis']['pairwise_comparisons']

    heatmap = np.full((n_temps, n_temps), np.nan)
    for key, prob in pairs.items():
        t1, t2 = key.split('_vs_')
        i = temps.index(float(t1))
        j = temps.index(float(t2))
        heatmap[i, j] = prob
        heatmap[j, i] = 1 - prob
    np.fill_diagonal(heatmap, 0.5)

    im = ax.imshow(heatmap, cmap='RdYlGn', vmin=0, vmax=1, aspect='equal')
    ax.set_xticks(range(n_temps))
    ax.set_xticklabels([f'{t}' for t in temps], fontsize=9)
    ax.set_yticks(range(n_temps))
    ax.set_yticklabels([f'{t}' for t in temps], fontsize=9)
    ax.set_xlabel('Temperature (col)')
    ax.set_ylabel('Temperature (row)')
    ax.set_title(f'{cell["label"]}', fontsize=12, fontweight='bold')

    for i in range(n_temps):
        for j in range(n_temps):
            if not np.isnan(heatmap[i, j]):
                color = 'white' if heatmap[i, j] > 0.8 or heatmap[i, j] < 0.2 else 'black'
                ax.text(j, i, f'{heatmap[i, j]:.2f}', ha='center', va='center',
                        fontsize=9, color=color)

plt.colorbar(im, ax=axes.ravel().tolist(), shrink=0.6, label='P(α_row > α_col)')
plt.tight_layout()
plt.show()
Figure 8: Pairwise posterior probability heatmaps P(α_row > α_col) for all four cells. Green cells indicate the row temperature has higher α; red cells indicate the column temperature has higher α. GPT-4o cells (top) show pronounced green where a low-temperature row is compared against a high-temperature column (the global decline), with the mirrored entries red, but more modest separation at adjacent low/mid-temperature pairs. Claude cells (bottom) show mixed colors, indicating no reliable ordering. Temperature grids and provider scales differ across rows.

0.9 Summary Visualisation

Show code
fig, ax = plt.subplots(figsize=(12, 6))

styles = {
    '(1,1)': {'color': SEU_PALETTE[0], 'ls': '-',  'marker': 'o'},
    '(1,2)': {'color': SEU_PALETTE[1], 'ls': '-',  'marker': 's'},
    '(2,1)': {'color': SEU_PALETTE[2], 'ls': '--', 'marker': 'o'},
    '(2,2)': {'color': SEU_PALETTE[3], 'ls': '--', 'marker': 's'},
}

for cell_id in ['(1,1)', '(1,2)', '(2,1)', '(2,2)']:
    cell = CELLS[cell_id]
    s = styles[cell_id]
    medians = [np.median(cell['alpha_draws'][t]) for t in cell['temps']]
    q05s = [np.percentile(cell['alpha_draws'][t], 5) for t in cell['temps']]
    q95s = [np.percentile(cell['alpha_draws'][t], 95) for t in cell['temps']]

    ax.errorbar(cell['temps'], medians,
                yerr=[np.array(medians) - np.array(q05s),
                      np.array(q95s) - np.array(medians)],
                fmt=f'{s["marker"]}', color=s['color'], linewidth=2, markersize=8,
                capsize=4, capthick=1.5, linestyle=s['ls'],
                label=f'{cell["label"]}')

ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('Temperature–Sensitivity Trajectories: All Factorial Cells')
ax.legend(loc='upper right', fontsize=10)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()
Figure 9: Grand summary: α trajectory (with 90% CIs) for all four factorial cells on a single axis. GPT-4o cells (solid lines) show a negative global trajectory; Claude cells (dashed lines) are approximately flat. Important: temperature grids and provider-specific temperature scales differ between LLMs — GPT-4o uses {0.0, 0.3, 0.7, 1.0, 1.5} on the OpenAI scale while Claude uses {0.0, 0.2, 0.5, 0.8, 1.0} on the Anthropic scale. The x-axis represents each LLM’s own grid, so cross-LLM visual comparisons of slope magnitude are not directly valid. The qualitative contrast (negative global slope vs. little global-slope evidence) is interpretable.
Show code
fig, ax = plt.subplots(figsize=(10, 5))

cell_ids = ['(1,1)', '(1,2)', '(2,1)', '(2,2)']
labels = [CELLS[c]['label'] for c in cell_ids]
p_negs = [CELLS[c]['p_negative'] for c in cell_ids]
colors = [cell_colors[c] for c in cell_ids]

bars = ax.bar(labels, p_negs, color=colors, edgecolor='white', linewidth=1.5, width=0.6)

ax.axhline(y=0.95, color='gray', linestyle='--', alpha=0.5, label='P = 0.95 threshold')
ax.axhline(y=0.5, color='gray', linestyle=':', alpha=0.3)
ax.set_ylabel('P(slope < 0)')
ax.set_title('Evidence for Negative Temperature–Sensitivity Slope')
ax.set_ylim(0, 1.08)
ax.legend()

for bar, p in zip(bars, p_negs):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'{p:.3f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()
Figure 10: P(slope < 0) for each cell, ordered by factorial cell from (1,1) to (2,2). GPT-4o cells are above 0.98 (strong evidence of decline); Claude cells fall between roughly 0.56 and 0.77 (weak or absent evidence).

0.10 Discussion

0.10.1 Decomposing the Original Confound

The original question was: why did the Claude × Ellsberg study fail to replicate the GPT-4o × Insurance finding? The factorial design provides a clear answer.

Show code
final = pd.DataFrame({
    '': ['**GPT-4o**', '**Claude**'],
    'Insurance (K=3)': [
        f"Strong neg. (P = {CELLS['(1,1)']['p_negative']:.2f})",
        f"Little evid. (P = {CELLS['(2,1)']['p_negative']:.2f})",
    ],
    'Ellsberg (K=4)': [
        f"Strong neg. (P = {CELLS['(1,2)']['p_negative']:.2f})",
        f"Weak neg. (P = {CELLS['(2,2)']['p_negative']:.2f})",
    ],
})
final
Table 5: Final factorial summary. Cells are labeled using the calibrated vocabulary defined in Section 0.6 (Strong neg. = \(P(\text{slope}<0) > 0.95\); Weak neg. = \(\in [0.7, 0.95)\); Little evid. = \(< 0.7\)). LLM identity (rows) is the leading qualitative differentiator across these four exploratory cells: GPT-4o shows a strong negative global slope on both tasks, Claude does not.
                 Insurance (K=3)            Ellsberg (K=4)
0   **GPT-4o**   Strong neg. (P = 0.99)     Strong neg. (P = 0.98)
1   **Claude**   Little evid. (P = 0.56)    Weak neg. (P = 0.77)
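The calibrated vocabulary in the Table 5 caption can be written as a small helper, so that each label is derived from its probability instead of being typed by hand. This is a minimal sketch, not the report's code: the function name `label_evidence` is hypothetical, and assigning exactly 0.95 to the "Weak neg." band is an assumption, because the caption's thresholds leave that boundary value unassigned.

```python
def label_evidence(p_negative: float) -> str:
    """Map P(slope < 0) to the calibrated labels quoted in the Table 5 caption."""
    if not 0.0 <= p_negative <= 1.0:
        raise ValueError("p_negative must be a probability in [0, 1]")
    if p_negative > 0.95:
        return "Strong neg."
    if p_negative >= 0.7:
        # Assumption: exactly 0.95 falls in this band.
        return "Weak neg."
    return "Little evid."


# Rounded values as printed in Table 5.
table5_p = {"(1,1)": 0.99, "(1,2)": 0.98, "(2,1)": 0.56, "(2,2)": 0.77}
for cell_id, p in table5_p.items():
    print(cell_id, label_evidence(p))
```

Applied to the matched-range values reported in Section 0.10.6 (0.943 and 0.774), the same helper returns "Weak neg." for both GPT-4o cells.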

0.10.2 Baseline Sensitivity Levels

Show code
baseline_rows = []
for cid in ['(1,1)', '(1,2)', '(2,1)', '(2,2)']:
    c = CELLS[cid]
    draws = c['alpha_draws'][0.0]
    med = np.median(draws)
    q05, q95 = np.percentile(draws, [5, 95])
    baseline_rows.append({
        'Cell': cid,
        'Study': c['label'],
        'Model': c['model'],
        'α median (T=0)': f"{med:.1f}",
        '90% CI': f"[{q05:.1f}, {q95:.1f}]",
    })
pd.DataFrame(baseline_rows)
Table 6: Posterior median α at the lowest temperature (T=0.0) for each cell, computed from the saved alpha_draws artifacts. Cross-task α magnitudes are not directly comparable because Insurance cells use m_01 with Lognormal(3.0, 0.75) and Ellsberg cells use m_02 with Lognormal(3.5, 0.75); K also differs (3 vs 4).
   Cell    Study                 Model   α median (T=0)   90% CI
0  (1,1)   GPT-4o × Insurance    m_01    74.1             [47.3, 121.1]
1  (1,2)   GPT-4o × Ellsberg     m_02    110.4            [74.4, 167.2]
2  (2,1)   Claude × Insurance    m_01    70.5             [44.6, 113.4]
3  (2,2)   Claude × Ellsberg     m_02    85.4             [58.9, 127.2]

Within each task, where the model and prior are held fixed and cross-cell α levels are directly comparable, GPT-4o’s baseline median is at least as high as Claude’s. On Insurance, the two medians at \(T=0.0\) are roughly comparable (74.1 vs. 70.5), and their 90% credible intervals overlap almost entirely. On Ellsberg, GPT-4o’s median is higher (110.4 vs. 85.4), although the 90% intervals ([74.4, 167.2] and [58.9, 127.2]) still overlap substantially. Across tasks the picture is mixed: under the m_02 prior used for Ellsberg, even Claude’s baseline median (85.4) exceeds both Insurance medians under the m_01 prior, so any cross-task α comparison is confounded with the prior/model difference and cannot be attributed to LLM identity. We therefore restrict baseline-level claims to within-task contrasts, and treat α magnitudes as comparable only across cells that share both the model variant and the prior.
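A within-task baseline contrast can also be summarized as a posterior probability instead of by comparing medians and intervals by eye. The sketch below assumes two independent arrays of α draws at \(T = 0.0\) for the same task. The function name `prob_greater` is hypothetical, and the lognormal draws are synthetic placeholders, not the saved alpha_draws artifacts.

```python
import numpy as np


def prob_greater(draws_a, draws_b):
    """Estimate P(a > b) over all pairs of two independent posterior samples."""
    a = np.asarray(draws_a, dtype=float)
    b_sorted = np.sort(np.asarray(draws_b, dtype=float))
    # For each value of a, count how many b values lie strictly below it.
    n_below = np.searchsorted(b_sorted, a, side="left")
    return float(n_below.mean() / b_sorted.size)


rng = np.random.default_rng(0)
# Synthetic placeholders for two same-task cells (NOT project data).
alpha_gpt = rng.lognormal(mean=np.log(110.0), sigma=0.25, size=4000)
alpha_claude = rng.lognormal(mean=np.log(85.0), sigma=0.25, size=4000)

p = prob_greater(alpha_gpt, alpha_claude)
print(f"P(alpha_GPT > alpha_Claude at T=0) = {p:.2f}")
```

Because the cells were fitted independently, every draw from one cell can be paired with every draw from the other. The sorted-search form evaluates all pairs without building the full pairwise matrix.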

0.10.3 The Oscillatory Claude Pattern

Both Claude cells exhibit a non-monotonic, oscillatory α trajectory across temperatures rather than a smooth decline or flat line. If this oscillation is systematic (rather than noise), it may reflect features of Claude’s temperature implementation or its RLHF training that interact non-trivially with the softmax sensitivity parameter. The individual cell reports discuss these patterns in more detail. Whether the oscillation patterns align across the two Claude cells — which would suggest a systematic mechanism rather than random variation — is an open question that warrants investigation in future work.

0.10.4 Main Conclusions

  1. LLM accounts for most of the qualitative variation. Over the full temperature grid, GPT-4o shows a clear negative temperature–\(\alpha\) slope on both tasks (\(P(\text{slope} < 0) \approx 0.98\)–\(0.99\)). Claude shows at best a weak trend on both tasks (\(P \approx 0.56\)–\(0.77\)). The within-LLM evidence is strong for GPT-4o. The between-LLM comparison is directionally clear, with P(GPT-4o slope < Claude slope) approximately 0.80–0.82 within both tasks, but this between-cell probability is more modest than the within-cell evidence, reflecting the additional uncertainty inherent in comparing independently fitted models.

  2. Task is a secondary factor. Changing the task from insurance to Ellsberg does not eliminate the effect for GPT-4o, nor does it create the effect for Claude. Within GPT-4o, Ellsberg may produce a slightly steeper decline, but the qualitative pattern is the same.

  3. No strong evidence of interaction, though power is limited. The difference-in-differences analysis yields an interaction estimate centered near zero, but with a 90% credible interval wide enough to accommodate substantial interactions in either direction. The data are uninformative about whether the factorial structure is additive. The absence of a detected interaction should not be confused with evidence of additivity — a formal equivalence claim would require a ROPE analysis that the current data cannot support.

  4. Temperature–sensitivity is LLM-specific. The finding that higher temperature reduces estimated SEU sensitivity should be qualified as a property observed in GPT-4o but not in Claude. Whether this reflects differences in temperature implementation, RLHF procedures, training data, or model architecture cannot be determined from the current design; “temperature implementation” is only one of several candidate explanations.
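Conclusions 1 and 3 both rest on contrasts between independently fitted cells. The sketch below shows one way to compute such contrasts from per-cell slope draws: a between-LLM probability within a task, and a difference-in-differences interaction with an optional ROPE probability. It is an illustration under stated assumptions, not the report's analysis code. The function names, the sign convention of the interaction, the ROPE bounds, and the normal slope draws are all hypothetical placeholders.

```python
import numpy as np


def summarize(draws, rope=None):
    """Median, 90% interval, P(draw < 0), and optional P(draw inside ROPE)."""
    d = np.asarray(draws, dtype=float)
    q05, q95 = np.percentile(d, [5, 95])
    out = {
        "median": float(np.median(d)),
        "ci90": (float(q05), float(q95)),
        "p_negative": float(np.mean(d < 0)),
    }
    if rope is not None:
        lo, hi = rope
        out["p_in_rope"] = float(np.mean((d >= lo) & (d <= hi)))
    return out


def interaction_draws(s11, s12, s21, s22):
    """Assumed convention: (GPT Ellsberg - GPT Insurance) - (Claude Ellsberg - Claude Insurance)."""
    s11, s12, s21, s22 = (np.asarray(s, dtype=float) for s in (s11, s12, s21, s22))
    return (s12 - s11) - (s22 - s21)


rng = np.random.default_rng(1)
n = 4000
# Synthetic slope draws for the four cells (placeholders, NOT project results).
s11 = rng.normal(-40.0, 18.0, n)  # GPT-4o x Insurance
s12 = rng.normal(-50.0, 25.0, n)  # GPT-4o x Ellsberg
s21 = rng.normal(-5.0, 30.0, n)   # Claude x Insurance
s22 = rng.normal(-20.0, 28.0, n)  # Claude x Ellsberg

# Between-LLM contrast within a task: pair independent draws elementwise.
print("P(GPT slope < Claude slope), Insurance:", np.mean(s11 < s21))
print("P(GPT slope < Claude slope), Ellsberg: ", np.mean(s12 < s22))

# Interaction with a hypothetical ROPE of +/- 10 slope units.
print(summarize(interaction_draws(s11, s12, s21, s22), rope=(-10.0, 10.0)))
```

A contrast of two independent posteriors inherits the variance of both, which is why a between-cell probability can be noticeably lower than either within-cell P(slope < 0). A low ROPE probability alongside a near-zero median signals an uninformative interaction estimate, not evidence of additivity.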

0.10.5 Model Adequacy Across Cells

Posterior predictive checks were conducted for each of the four individual cell models and are reported in the respective cell reports. All four models showed adequate fit to the observed choice data, with no systematic evidence of misspecification. Readers are directed to the individual cell reports for full diagnostic details, including R-hat convergence, effective sample sizes, and posterior predictive p-values.

0.10.6 Temperature Range Confound

The GPT-4o cells use temperatures in \(\{0.0, 0.3, 0.7, 1.0, 1.5\}\) while the Claude cells use \(\{0.0, 0.2, 0.5, 0.8, 1.0\}\). Because the slope \(\Delta\alpha / \Delta T\) is computed by regression over the full temperature grid, the GPT-4o slopes are estimated over a wider range (\(\Delta T = 1.5\)) and with different grid spacing than the Claude slopes (\(\Delta T = 1.0\)). If the true \(\alpha(T)\) relationship is nonlinear, for example if most of the decline occurs above \(T = 1.0\), a regression over the narrower range would yield a less negative slope even if the underlying sensitivity function were identical for both LLMs.
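To illustrate how grid range alone can change a regression slope, the sketch below evaluates one and the same sensitivity curve on both temperature grids and fits a least-squares line to each. The curve \(\alpha(T) = 100 - 20T^2\) is a hypothetical choice made only for illustration and is not fitted to any data; the helper names are likewise assumptions.

```python
import numpy as np


def ols_slope(temps, alphas):
    """Least-squares slope of alphas regressed on temps."""
    t = np.asarray(temps, dtype=float)
    a = np.asarray(alphas, dtype=float)
    t_c = t - t.mean()
    return float(np.sum(t_c * (a - a.mean())) / np.sum(t_c ** 2))


def alpha_curve(t):
    """Hypothetical concave sensitivity curve shared by both 'LLMs'."""
    t = np.asarray(t, dtype=float)
    return 100.0 - 20.0 * t ** 2


gpt_grid = [0.0, 0.3, 0.7, 1.0, 1.5]      # grid used for the GPT-4o cells
claude_grid = [0.0, 0.2, 0.5, 0.8, 1.0]   # grid used for the Claude cells

slope_wide = ols_slope(gpt_grid, alpha_curve(gpt_grid))
slope_narrow = ols_slope(claude_grid, alpha_curve(claude_grid))
print(f"slope on GPT-4o grid: {slope_wide:.1f}")
print(f"slope on Claude grid: {slope_narrow:.1f}")
```

Under this assumed curve the fitted slope is about -29.9 on the wider grid and -20.0 on the narrower one, even though the underlying function is identical. This is the kind of artifact a matched-range comparison is meant to rule out.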

To assess whether this confound drives the LLM comparison, we compute matched-range slopes for GPT-4o by restricting to \(T \in \{0.0, 0.3, 0.7, 1.0\}\) (dropping \(T = 1.5\)) and compare these to the full-range Claude slopes.

Show code
# Compute matched-range slopes for GPT-4o (T ≤ 1.0, dropping T = 1.5)
matched_temps = [0.0, 0.3, 0.7, 1.0]
matched_temp_arr = np.array(matched_temps)

matched_slopes = {}
for cell_id in ['(1,1)', '(1,2)']:
    cell = CELLS[cell_id]
    n_draws = len(cell['alpha_draws'][cell['temps'][0]])
    slopes = np.empty(n_draws)
    for i in range(n_draws):
        alphas = np.array([cell['alpha_draws'][t][i] for t in matched_temps])
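        # Slope as cov/var. np.cov defaults to ddof=1 and np.var to ddof=0, so
        # this ratio is the OLS slope scaled by n/(n-1) (4/3 for four points).
        # The sign, and hence P(slope < 0), is unaffected; pass ddof=1 to
        # np.var to obtain the unscaled OLS slope.
        slopes[i] = np.cov(matched_temp_arr, alphas)[0, 1] / np.var(matched_temp_arr)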
    matched_slopes[cell_id] = slopes

print("Matched-range slopes for GPT-4o (T ≤ 1.0 only):")
for cell_id in ['(1,1)', '(1,2)']:
    cell = CELLS[cell_id]
    s = matched_slopes[cell_id]
    p_neg = np.mean(s < 0)
    print(f"  {cell['label']}: median = {np.median(s):.1f}, "
          f"90% CI = [{np.percentile(s, 5):.1f}, {np.percentile(s, 95):.1f}], "
          f"P(slope < 0) = {p_neg:.3f}")

print()
print("LLM comparison on matched range:")
for task, gpt_id, claude_id in [('Insurance', '(1,1)', '(2,1)'),
                                  ('Ellsberg', '(1,2)', '(2,2)')]:
    p_gpt_more_neg = np.mean(matched_slopes[gpt_id] < CELLS[claude_id]['slope_draws'])
    print(f"  {task}: P(GPT matched slope < Claude slope) = {p_gpt_more_neg:.3f}")
Matched-range slopes for GPT-4o (T ≤ 1.0 only):
  GPT-4o × Insurance: median = -39.9, 90% CI = [-100.2, 1.6], P(slope < 0) = 0.943
  GPT-4o × Ellsberg: median = -33.3, 90% CI = [-110.8, 37.2], P(slope < 0) = 0.774

LLM comparison on matched range:
  Insurance: P(GPT matched slope < Claude slope) = 0.824
  Ellsberg: P(GPT matched slope < Claude slope) = 0.612

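The matched-range analysis gives a mixed picture. Restricting GPT-4o to \(T \leq 1.0\) does not reverse the negative trend: both matched-range median slopes remain negative (\(-39.9\) for Insurance, \(-33.3\) for Ellsberg). The evidence does weaken, however. P(slope < 0) falls to 0.943 on Insurance, just below the 0.95 threshold for “Strong neg.”, and to 0.774 on Ellsberg, which lies in the “Weak neg.” band and is close to the full-range value for Claude × Ellsberg (0.77). Both 90% credible intervals now include zero. The between-LLM comparison is robust on Insurance, where P(GPT matched slope < Claude slope) = 0.824 is similar to the full-range value, but it drops to 0.612 on Ellsberg. On Insurance, then, the qualitative contrast (GPT-4o declining, Claude flat) is not an artifact of the wider GPT-4o temperature range. On Ellsberg, much of the full-range evidence depends on the \(T = 1.5\) point, so the range asymmetry remains a plausible contributor to the apparent LLM effect. Quantitative slope magnitudes also still differ across grids because the temperature points are not identical (e.g., GPT-4o at 0.3 vs. Claude at 0.2).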
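One detail is worth checking when reusing the cov/var ratio from the listing above: `np.cov` defaults to `ddof=1` while `np.var` defaults to `ddof=0`, so the ratio equals the least-squares slope multiplied by \(n/(n-1)\). The sign is unaffected, so P(slope < 0) does not change, but magnitudes do, and the factor depends on the number of grid points. The sketch below is a vectorized per-draw slope with a consistent normalization, cross-checked against `np.polyfit`. The function name and the synthetic draw matrix are assumptions, not project code or data.

```python
import numpy as np


def per_draw_slopes(temps, alpha_matrix):
    """OLS slope of alpha on temperature for every posterior draw.

    alpha_matrix has shape (n_draws, n_temps); column j holds draws at temps[j].
    """
    t = np.asarray(temps, dtype=float)
    a = np.asarray(alpha_matrix, dtype=float)
    t_c = t - t.mean()
    a_c = a - a.mean(axis=1, keepdims=True)
    return (a_c @ t_c) / np.sum(t_c ** 2)


rng = np.random.default_rng(0)
temps = [0.0, 0.3, 0.7, 1.0]
# Synthetic placeholder draws (NOT the saved alpha_draws artifacts).
alpha_matrix = rng.lognormal(mean=4.0, sigma=0.3, size=(1000, len(temps)))

slopes = per_draw_slopes(temps, alpha_matrix)

# Cross-check the first draw against np.polyfit and the mixed-ddof ratio.
n = len(temps)
polyfit_slope = np.polyfit(temps, alpha_matrix[0], 1)[0]
mixed_ratio = np.cov(temps, alpha_matrix[0])[0, 1] / np.var(temps)
print(np.isclose(slopes[0], polyfit_slope))              # same OLS slope
print(np.isclose(mixed_ratio, slopes[0] * n / (n - 1)))  # scaled by n/(n-1)
```

When slopes from a four-point grid are compared with slopes from a five-point grid, the mixed-ddof factors differ (4/3 vs. 5/4), so putting both on the unscaled OLS footing keeps between-cell magnitude comparisons consistent.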

0.10.7 Limitations

  • Exploratory synthesis, not pre-registered. The factorial structure was imposed post-hoc after the initial non-replication. The analysis is exploratory and the conclusions should be evaluated accordingly.
  • Independent model fits, not a unified hierarchical model. The four cells were fitted independently, and the factorial analysis operates on combined posterior draws. A hierarchical model estimating LLM and task effects within a single structure would yield tighter between-cell contrasts and formal effect-size estimates. This was not pursued in order to maintain consistency with the individual cell reports and because the foundational model validation applies to within-cell fits, not cross-cell comparisons.
  • Two LLMs. The factorial examines only GPT-4o and Claude 3.5 Sonnet. Other LLMs (e.g., Llama, Gemini) may show different patterns.
  • Temperature scales differ. The GPT-4o grid extends to \(T = 1.5\) while Claude’s maximum is \(T = 1.0\). The matched-range check in Section 0.10.6 supports the qualitative contrast on Insurance but leaves it weaker on Ellsberg, so quantitative slope comparisons across LLMs should be interpreted cautiously.
  • Two tasks. Insurance triage and Ellsberg gambles differ in multiple ways (\(K\), semantic content, prior calibration). More tasks would strengthen the conclusion that the task effect is minor.
  • Prior differences across tasks. The m_01 and m_02 models use different α priors calibrated for their respective \(K\) values. This is appropriate for within-task analysis but complicates cross-task comparisons of α levels (see Section 0.4).
  • Multiple confounds between LLMs. GPT-4o and Claude differ not only in their temperature implementations but also in training data, RLHF procedures, and potentially in task-specific fine-tuning. The observed LLM effect could reflect any combination of these factors.
  • Fixed design parameters. All cells use \(M \approx 300\), \(D = 32\), \(R = 30\). The conclusions may not generalize to designs with substantially different sample sizes or feature spaces.
  • Ellsberg ambiguity tiers are pooled. Both Ellsberg cells pool across the three ambiguity tiers of the gamble set, so this synthesis speaks to overall SEU sensitivity on Ellsberg-style stimuli, not to ambiguity aversion or tier-specific processing.

0.10.8 Connections to the JDM Literature

The finding that different LLMs show qualitatively different temperature–sensitivity patterns resonates with the broader judgment and decision-making (JDM) literature on individual differences in decision quality. Bruhin et al. (2010) documented substantial heterogeneity across human decision-makers in risk preferences and consistency, and Hey and Orme (1994) showed that error structures vary meaningfully across individuals. The present finding, that GPT-4o’s estimated decision sensitivity degrades with temperature while Claude’s does not, can be viewed as an analog of between-subject variability in decision noise. Whether this analogy is substantive (reflecting genuinely different “decision-making strategies”) or superficial (reflecting implementation differences in how temperature modifies token sampling) is an open question that connects to ongoing debates about whether LLMs are useful models of human cognition (Binz and Schulz 2023).

0.10.9 Future Directions

  • Additional LLMs. Extending the factorial to other model families would clarify whether the temperature–sensitivity effect is specific to OpenAI GPT-4o or shared by certain architectures.
  • More tasks. Including tasks with different \(K\) values or semantic structures would strengthen the conclusion about task invariance.
  • Unified hierarchical model. Fitting a single model with LLM, task, and temperature as factors — potentially using a meta-analytic framework on the per-cell posterior draws — would provide formal effect-size estimates and sharper interaction tests.
  • Longitudinal tracking. Model updates (GPT-4o versions, Claude updates) could change the temperature–sensitivity relationship — periodic re-assessment would be informative.
  • Mechanistic investigation. Understanding why GPT-4o’s temperature affects estimated \(\alpha\) while Claude’s does not may require probing the internal representations and decoding strategies of each model, connecting to the LLM interpretability literature on temperature scaling and its interaction with RLHF-trained output distributions.
  • Implications for deployment. The finding that GPT-4o’s decision quality (as measured by SEU sensitivity) degrades with temperature while Claude’s does not has practical implications for LLM deployment in decision-support systems, suggesting that temperature settings should be tuned with model-specific awareness.

0.11 Reproducibility

This report loads pre-computed data from the frozen data directories of all four individual cell reports:

Cell Data Directory
(1,1) GPT-4o × Insurance reports/applications/temperature_study/data/
(1,2) GPT-4o × Ellsberg reports/applications/gpt4o_ellsberg_study/data/
(2,1) Claude × Insurance reports/applications/claude_insurance_study/data/
(2,2) Claude × Ellsberg reports/applications/ellsberg_study/data/

Each directory contains primary_analysis.json, alpha_draws_T*.npz, and associated diagnostics. See the individual cell reports for refitting instructions and full methodological details.
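Before rendering, it can help to confirm that a frozen data directory contains the expected files. The sketch below inventories one directory using only the file names mentioned above (primary_analysis.json and the alpha_draws_T*.npz pattern). The function name `inventory_cell` is hypothetical, the JSON is assumed to be an object at top level, and the array names inside each .npz are not assumed but simply listed. The demo writes stand-in files to a temporary directory, and those file contents are placeholders.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def inventory_cell(data_dir):
    """List top-level JSON keys and the arrays stored in each alpha-draw file."""
    data_dir = Path(data_dir)
    with open(data_dir / "primary_analysis.json", encoding="utf-8") as fh:
        primary = json.load(fh)  # assumed to be a JSON object
    draw_files = {}
    for path in sorted(data_dir.glob("alpha_draws_T*.npz")):
        with np.load(path) as npz:
            draw_files[path.name] = {name: npz[name].shape for name in npz.files}
    return {"primary_keys": sorted(primary), "draw_files": draw_files}


# Demo on stand-in files; names and contents here are placeholders only.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "primary_analysis.json").write_text(
        json.dumps({"p_negative": 0.5}), encoding="utf-8"
    )
    np.savez(tmp_dir / "alpha_draws_T0.0.npz", alpha=np.zeros(5))
    demo_inventory = inventory_cell(tmp_dir)
    print(demo_inventory)
```

Pointing `inventory_cell` at each of the four directories in the table gives a quick check that every cell has a primary analysis file and one draw file per temperature before the synthesis is rendered.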

To regenerate this synthesis report, render the Quarto document from the project root:

quarto render reports/applications/factorial_synthesis/01_factorial_synthesis.qmd

The report depends on the frozen data files listed above and the report_utils module in reports/. No additional packages beyond those in the project environment.yml are required.

0.12 References

Binz, Marcel, and Eric Schulz. 2023. “Using Cognitive Psychology to Understand GPT-3.” Proceedings of the National Academy of Sciences 120 (6): e2218523120.
Bruhin, Adrian, Helga Fehr-Duda, and Thomas Epper. 2010. “Risk and Rationality: Uncovering Heterogeneity in Probability Distortion.” Econometrica 78 (4): 1375–412.
Hey, John D., and Chris Orme. 1994. “Investigating Generalizations of Expected Utility Theory Using Experimental Data.” Econometrica 62 (6): 1291–326.

McFadden, Daniel. 1974. “Conditional Logit Analysis of Qualitative Choice Behavior.” In Frontiers in Econometrics, edited by Paul Zarembka, 105–42. Academic Press.

Citation

BibTeX citation:
@online{helzner2026,
  author = {Helzner, Jeff},
  title = {2×2 {Factorial} {Synthesis:} {LLM} × {Task}},
  date = {2026-06-27},
  url = {https://jeffhelzner.github.io/seu-sensitivity/applications/factorial_synthesis/01_factorial_synthesis.html},
  langid = {en}
}
For attribution, please cite this work as:
Helzner, Jeff. 2026. “2×2 Factorial Synthesis: LLM × Task.” SEU Sensitivity Project, June 27. https://jeffhelzner.github.io/seu-sensitivity/applications/factorial_synthesis/01_factorial_synthesis.html.