Temperature and SEU Sensitivity: EU-Prompt Variation

Application Report: Temperature Study 3

applications
temperature
m_01
eu-prompt

Tests whether explicitly instructing the LLM to maximize expected utility increases estimated sensitivity (α) compared to the base temperature study. Uses the same problems, assessments, and embeddings — only the choice prompt is modified to include an explicit EU-maximization instruction. The EU-prompt does not increase α; point estimates are lower at most temperatures, though the evidence is suggestive rather than conclusive.

Author
Published

May 12, 2026

0.1 Introduction

The initial temperature study established that higher LLM sampling temperature systematically reduces estimated sensitivity (α) to subjective expected utility maximization. The global slope was Δα/ΔT ≈ −25, with strong pairwise separation at the extremes (T=0.0 vs T≥1.0) and near-equality at moderate levels (T=0.3 ≈ T=0.7).

This study asks a complementary question: does explicitly telling the LLM to maximize expected utility increase α?

Our directional prediction was that the EU-prompt would increase \(\alpha\): if the model is informed of the correct decision criterion, its choices should align more closely with EU maximization. This prediction follows naturally from the assumption that LLM performance improves with more specific instructions — an assumption common in the prompting literature but, as we will see, not obviously correct in this domain. The study is exploratory: the prediction was not pre-registered. The design was motivated by the question of whether explicit instructions can modulate sensitivity, with the directional expectation serving as the natural baseline hypothesis against which to evaluate the results.

The intervention is minimal. The choice prompt is identical to the base study except for one added paragraph:

Select the claim that maximizes your expected utility with respect to the three outcomes above. For each claim, consider the subjective probability you assign to each outcome and your utility for that outcome. Choose the claim for which the probability-weighted utility across the three outcomes is highest.

Everything else is held constant:

  • Same 100 problems × 3 presentations
  • Same assessment texts (reused from the base study, not re-collected)
  • Same PCA-reduced embeddings
  • Same model: m_01 with Lognormal(3.0, 0.75) prior on α
  • Same 5 temperature levels: 0.0, 0.3, 0.7, 1.0, 1.5

This design isolates the prompt intervention to the decision rule while holding belief formation (assessments) constant — a clean decomposition of the two stages of SEU.

0.2 Experimental Design

0.2.1 Prompt Modification

The base study’s choice prompt presents three outcomes and asks the LLM to select a claim. The EU-prompt variant adds an explicit instruction to maximize expected utility with respect to those outcomes — referencing subjective probabilities, utilities, and probability-weighted utility.
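In symbols (notation ours, not taken from the base report), the added paragraph asks the model to compute a probability-weighted utility for each claim \(c\) over the three investigator outcomes and select the maximizer:

```latex
% p_k(c): subjective probability that claim c yields outcome k
% u_k:    utility of outcome k (both agree, one agrees, neither agrees)
\mathrm{EU}(c) \;=\; \sum_{k \in \{\text{best},\,\text{middle},\,\text{worst}\}} p_k(c)\, u_k,
\qquad
c^{*} \;=\; \operatorname*{arg\,max}_{c} \mathrm{EU}(c).
```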

The assessment prompt is not modified. Assessments are reused directly from the base study, ensuring that the feature representation (w vectors) is identical.

The two choice prompts differ by a single paragraph (shown in bold below). The system prompt is identical in both conditions.

System prompt (both conditions):

You are a claims analyst selecting which insurance claim to forward to experienced fraud investigators for further review.

Base study choice prompt:

Based on the following analyst assessments, select ONE claim to send to a team of two experienced fraud investigators.

Your decision will be evaluated based on the investigators’ assessments:

  • Best outcome: Both investigators agree your selection warrants investigation
  • Middle outcome: One investigator agrees, one does not
  • Worst outcome: Neither investigator agrees with your selection

The assessments are:

{assessments_list}

Which claim do you select for investigation?

Respond with ONLY the claim number ({num_range}).

EU-prompt choice prompt:

Based on the following analyst assessments, select ONE claim to send to a team of two experienced fraud investigators.

Your decision will be evaluated based on the investigators’ assessments:

  • Best outcome: Both investigators agree your selection warrants investigation
  • Middle outcome: One investigator agrees, one does not
  • Worst outcome: Neither investigator agrees with your selection

Select the claim that maximizes your expected utility with respect to the three outcomes above. For each claim, consider the subjective probability you assign to each outcome and your utility for that outcome. Choose the claim for which the probability-weighted utility across the three outcomes is highest.

The assessments are:

{assessments_list}

Which claim do you select for investigation?

Respond with ONLY the claim number ({num_range}).

0.2.2 Data Reuse

Component Base Study EU-Prompt Study
Problems Generated Reused
Assessments Collected Reused
Embeddings Computed Reused
Choices Collected New (EU prompt)
Model fitting m_01 m_01 (same prior)

This means the EU-prompt study required only 1,500 new API calls (choices), not the full ~3,150 of the base study.
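The 1,500 figure is just the design arithmetic; a quick check (the base study's ~3,150 total also covered assessment and related calls, whose exact breakdown is not restated here):

```python
# Design constants from the text
n_problems = 100        # problems
n_presentations = 3     # presentations per problem
n_temperatures = 5      # T in {0.0, 0.3, 0.7, 1.0, 1.5}

new_calls = n_problems * n_presentations * n_temperatures
print(new_calls)        # 1500 new choice calls; all other data are reused
```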

0.3 Results

0.3.1 Loading Posterior Draws

Show code
import json
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
from report_utils import set_seu_style, SEU_COLORS, SEU_PALETTE
set_seu_style()

# data_dir (this report's frozen data directory) is assumed to be defined
# in the report's setup cell.

temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]
temp_labels = {t: f"T={t}" for t in temperatures}

def temp_key(t):
    return f"T{str(t).replace('.', '_')}"

# EU-prompt alpha draws
eu_draws = {}
for t in temperatures:
    data = np.load(data_dir / f"alpha_draws_{temp_key(t)}.npz")
    eu_draws[t] = data['alpha']
    print(f"  T={t}: {len(eu_draws[t]):,} posterior draws loaded (EU-prompt)")

# Base study alpha draws (from sibling report's frozen data)
base_data_dir = Path("..") / "temperature_study" / "data"
base_draws = {}
for t in temperatures:
    data = np.load(base_data_dir / f"alpha_draws_{temp_key(t)}.npz")
    base_draws[t] = data['alpha']

# Pre-computed cross-study analysis
with open(data_dir / "cross_study_analysis.json") as f:
    cross = json.load(f)

# Fit summaries
with open(data_dir / "fit_summary.json") as f:
    eu_fit = json.load(f)

with open(base_data_dir / "fit_summary.json") as f:
    base_fit = json.load(f)
  T=0.0: 4,000 posterior draws loaded (EU-prompt)
  T=0.3: 4,000 posterior draws loaded (EU-prompt)
  T=0.7: 4,000 posterior draws loaded (EU-prompt)
  T=1.0: 4,000 posterior draws loaded (EU-prompt)
  T=1.5: 4,000 posterior draws loaded (EU-prompt)

0.3.2 MCMC Diagnostics

Show code
import pandas as pd
import re

diag_rows = []
for t in temperatures:
    with open(data_dir / f"diagnostics_{temp_key(t)}.txt") as f:
        diag_text = f.read()

    if "No divergent transitions" in diag_text or "0 of" in diag_text:
        n_div = 0
    else:
        match = re.search(r'(\d+) of (\d+)', diag_text)
        n_div = int(match.group(1)) if match else 0

    rhat_ok = "R-hat values satisfactory" in diag_text or "split R-hat values satisfactory" in diag_text
    ess_ok = "effective sample size satisfactory" in diag_text
    ebfmi_ok = "E-BFMI satisfactory" in diag_text

    diag_rows.append({
        'Temperature': t,
        'Divergences': f"{n_div}/4000",
        'R̂': '✓' if rhat_ok else '✗',
        'ESS': '✓' if ess_ok else '✗',
        'E-BFMI': '✓' if ebfmi_ok else '✗',
    })

pd.DataFrame(diag_rows)
Table 1: MCMC diagnostics for all five temperature conditions under the EU-prompt. All fits used 4 chains with 1,000 warmup and 1,000 sampling iterations each (4,000 post-warmup draws total).
Temperature Divergences R̂ ESS E-BFMI
0 0.0 2/4000 ✓ ✓ ✓
1 0.3 0/4000 ✓ ✓ ✓
2 0.7 1/4000 ✓ ✓ ✓
3 1.0 0/4000 ✓ ✓ ✓
4 1.5 1/4000 ✓ ✓ ✓

0.3.3 Posterior Summaries

Show code
rows = []
for t in temperatures:
    s = cross['per_temperature'][str(t)]
    rows.append({
        'Temperature': t,
        'Median': f"{s['eu_median']:.1f}",
        'Mean': f"{s['eu_mean']:.1f}",
        'SD': f"{eu_fit[str(t)]['alpha_sd']:.1f}",
        '90% CI': f"[{s['eu_q05']:.1f}, {s['eu_q95']:.1f}]",
    })

pd.DataFrame(rows)
Table 2: Posterior summaries for the sensitivity parameter α at each temperature level under the EU-prompt. Intervals are 90% credible intervals.
Temperature Median Mean SD 90% CI
0 0.0 57.0 59.6 16.0 [38.6, 89.7]
1 0.3 52.8 54.9 15.4 [34.1, 83.6]
2 0.7 54.7 56.9 14.2 [38.3, 83.3]
3 1.0 43.3 44.7 10.5 [30.4, 64.2]
4 1.5 26.3 27.1 5.9 [18.9, 37.9]

The EU-prompt estimates show the same qualitative pattern as the base study: \(\alpha\) is highest at \(T = 0.0\) and declines with increasing temperature. The near-equality of \(T = 0.3\) and \(T = 0.7\) also persists.

0.3.4 Forest Plot

Show code
fig, ax = plt.subplots(figsize=(8, 5))

y_positions = np.arange(len(temperatures))[::-1]

for i, t in enumerate(temperatures):
    draws = eu_draws[t]
    median = np.median(draws)
    q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])

    y = y_positions[i]
    ax.plot([q05, q95], [y, y], color=SEU_PALETTE[i], linewidth=1.5, alpha=0.7)
    ax.plot([q25, q75], [y, y], color=SEU_PALETTE[i], linewidth=4, alpha=0.9)
    ax.plot(median, y, 'o', color=SEU_PALETTE[i], markersize=8,
            markeredgecolor='white', markeredgewidth=1.5, zorder=5)

ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('Sensitivity (α)')
ax.set_title('EU-Prompt: Posterior Distributions of α by Temperature')
ax.grid(axis='x', alpha=0.3)
ax.grid(axis='y', alpha=0)

plt.tight_layout()
plt.show()
Figure 1: Forest plot of posterior α distributions under the EU-prompt. Points show posterior medians; thick bars span the 50% credible interval; thin bars span the 90% credible interval. The temperature gradient is clearly visible: α is highest at T=0.0 and lowest at T=1.5, with substantial overlap between T=0.3 and T=0.7.

0.3.5 Posterior Densities

Show code
from scipy.stats import gaussian_kde

fig, ax = plt.subplots(figsize=(8, 5))

for i, t in enumerate(temperatures):
    draws = eu_draws[t]
    kde = gaussian_kde(draws)
    x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.1, 300)
    ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=SEU_PALETTE[i])
    ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
            label=f'T = {t} (median = {np.median(draws):.0f})')

ax.set_xlabel('Sensitivity (α)')
ax.set_ylabel('Density')
ax.set_title('EU-Prompt: Posterior Density of α')
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()
Figure 2: Kernel density estimates of the posterior α distributions under the EU-prompt.

0.3.6 Posterior Predictive Checks

Show code
ppc_rows = []
for t in temperatures:
    with open(data_dir / f"ppc_{temp_key(t)}.json") as f:
        ppc = json.load(f)

    pvals = ppc['p_values']
    ppc_rows.append({
        'Temperature': t,
        'Log-likelihood': f"{pvals['ll']:.3f}",
        'Modal frequency': f"{pvals['modal']:.3f}",
        'Mean probability': f"{pvals['prob']:.3f}",
    })

pd.DataFrame(ppc_rows)
Table 3: Posterior predictive check p-values for each temperature condition under the EU-prompt. Values near 0.5 indicate good calibration.
Temperature Log-likelihood Modal frequency Mean probability
0 0.0 0.432 0.456 0.423
1 0.3 0.438 0.670 0.531
2 0.7 0.453 0.544 0.498
3 1.0 0.452 0.490 0.476
4 1.5 0.494 0.567 0.506

All PPC p-values fall within \([0.42, 0.67]\), indicating adequate model fit at every temperature level. The m_01 model describes the EU-prompt choice data as well as it describes the base study data. This has an important implication: the model’s softmax choice structure is flexible enough to accommodate whatever the EU-prompt does to choices without requiring a structural change to the decision model. The prompt effect, to the extent it exists, is absorbed entirely by changes in the sensitivity parameter \(\alpha\) (and nuisance parameters \(\beta\), \(\delta\)) rather than manifesting as systematic model misspecification. This confirms that comparing \(\alpha\) across prompt conditions is a meaningful exercise — the parameter retains its interpretation under both conditions.
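The softmax structure referenced here can be sketched as follows. This is a schematic of the choice stage only, with \(V_j\) standing for the model-implied expected utility of claim \(j\); the roles of the nuisance parameters \(\beta\) and \(\delta\) are as specified in the base study's model definition and are omitted here:

```latex
P(\text{choose } j \mid \alpha)
  \;=\;
  \frac{\exp\!\left(\alpha\, V_j\right)}
       {\sum_{k} \exp\!\left(\alpha\, V_k\right)}.
```

As \(\alpha \to \infty\), choices concentrate on the highest-\(V\) claim; as \(\alpha \to 0\), they approach uniform. This is why any prompt effect on choices can be absorbed as a shift in \(\alpha\) without altering the model's structure.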

0.3.7 Prior-to-Posterior Contraction

To assess whether the qualitative findings are robust to the choice of prior, we examine how much the posterior contracts relative to the Lognormal(3.0, 0.75) prior on \(\alpha\). Strong contraction indicates that the data dominate the prior, making the results insensitive to alternative prior specifications.

Show code
from scipy.stats import lognorm

# Lognormal(3.0, 0.75) prior: mean and SD on the natural scale
prior_mu, prior_sigma = 3.0, 0.75
prior_mean = np.exp(prior_mu + prior_sigma**2 / 2)
prior_var = (np.exp(prior_sigma**2) - 1) * np.exp(2 * prior_mu + prior_sigma**2)
prior_sd = np.sqrt(prior_var)

contraction_rows = []
for t in temperatures:
    draws = eu_draws[t]
    post_sd = np.std(draws)
    contraction = 1 - post_sd / prior_sd
    contraction_rows.append({
        'Temperature': t,
        'Prior SD': f"{prior_sd:.1f}",
        'Posterior SD': f"{post_sd:.1f}",
        'Contraction': f"{contraction:.3f}",
    })

pd.DataFrame(contraction_rows)
Table 4: Prior-to-posterior contraction for α at each temperature. Contraction is measured as 1 − (posterior SD / prior SD). Values near 1 indicate strong data dominance over the prior.
Temperature Prior SD Posterior SD Contraction
0 0.0 23.1 16.0 0.309
1 0.3 23.1 15.4 0.333
2 0.7 23.1 14.2 0.384
3 1.0 23.1 10.5 0.544
4 1.5 23.1 5.9 0.743

The posterior is substantially narrower than the prior at every temperature, indicating strong data dominance. This provides confidence that the qualitative findings — the temperature gradient and the direction of prompt effects — are not artifacts of the specific prior specification.

0.4 Cross-Study Comparison

The central question is whether explicit EU-maximization instructions increase or decrease estimated sensitivity. We compare the two studies using the same 4,000 posterior draws per condition.

Important: Methodological Note on Cross-Study Comparisons

The quantities \(P(\alpha_{\text{base}} > \alpha_{\text{EU}})\) reported below are computed by comparing draw \(i\) from the base study’s posterior with draw \(i\) from the EU-prompt study’s posterior. These are independent posteriors from separate model fits — there is no joint posterior over \((\alpha_{\text{base}}, \alpha_{\text{EU}})\). The draw-by-draw comparison is valid insofar as both MCMC chains have mixed well (which diagnostics confirm), but it does not account for the correlation structure induced by shared stimuli. A joint model over prompt conditions would properly account for this shared design structure and yield sharper inference. The per-temperature \(P(\text{base} > \text{EU})\) values should therefore be regarded as approximate measures of directional confidence.

Note: Interpreting Posterior Probabilities

Throughout this section, \(P(\text{base} > \text{EU})\) denotes the proportion of posterior draws in which \(\alpha_{\text{base}}\) exceeds \(\alpha_{\text{EU}}\) — a Bayesian analogue to asking “how probable is it that the true base \(\alpha\) exceeds the true EU \(\alpha\), given the observed data?” Values near 0.5 indicate no directional evidence; values near 0 or 1 indicate strong evidence for a difference. These are not frequentist p-values and do not share their interpretation.
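Concretely, each \(P(\text{base} > \text{EU})\) is a one-line computation over the loaded draw arrays. A minimal sketch, using synthetic normal draws in place of the real `base_draws[t]` and `eu_draws[t]`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for one temperature's posterior draws. The T=0.0
# medians from Tables 2 and 5 are used only to make the illustration
# realistic; the SDs are illustrative.
alpha_base = rng.normal(74.1, 16.0, size=4000)
alpha_eu = rng.normal(57.0, 16.0, size=4000)

# P(base > EU): fraction of draw-by-draw comparisons favoring the base study
p_base_gt_eu = np.mean(alpha_base > alpha_eu)
print(f"P(base > EU) = {p_base_gt_eu:.3f}")
```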

0.4.1 Per-Temperature Comparison

Show code
rows = []
for t in temperatures:
    s = cross['per_temperature'][str(t)]
    rows.append({
        'Temperature': t,
        'Base α (median)': f"{s['base_median']:.1f}",
        'EU α (median)': f"{s['eu_median']:.1f}",
        'Δ (base − EU)': f"{s['diff_median']:+.1f}",
        'P(base > EU)': f"{s['p_base_gt_eu']:.3f}",
    })

pd.DataFrame(rows)
Table 5: Per-temperature comparison of α estimates. P(base > EU) is the posterior probability that the base study’s α exceeds the EU-prompt study’s α at a given temperature.
Temperature Base α (median) EU α (median) Δ (base − EU) P(base > EU)
0 0.0 74.1 57.0 +16.9 0.750
1 0.3 54.9 52.8 +2.2 0.544
2 0.7 56.3 54.7 +1.4 0.532
3 1.0 39.1 43.3 -4.3 0.365
4 1.5 36.0 26.3 +9.7 0.848

The EU-prompt lowers estimated \(\alpha\) at four of five temperatures: modestly at \(T = 0.3\) and \(T = 0.7\), and more substantially at \(T = 0.0\) and \(T = 1.5\); at \(T = 1.0\) it raises \(\alpha\) slightly. However, the 90% credible intervals on \(\alpha_{\text{base}} - \alpha_{\text{EU}}\) contain zero at every temperature, so no individual per-temperature difference is decisive.

0.4.2 Aggregate Test of the Prompt Effect

The individual per-temperature comparisons are suggestive but not individually decisive. To evaluate the pattern-level claim that the EU-prompt lowers \(\alpha\), we compute two aggregate quantities from the existing posterior draws:

  1. Mean difference across temperatures: For each draw \(i\), compute \(\overline{\Delta\alpha}_i = \frac{1}{5}\sum_{t} (\alpha^{(i)}_{\text{base},t} - \alpha^{(i)}_{\text{EU},t})\) and report \(P(\overline{\Delta\alpha} > 0)\).

  2. Sign count: For each draw \(i\), count the number of temperatures at which \(\alpha^{(i)}_{\text{base}} > \alpha^{(i)}_{\text{EU}}\). Under the null hypothesis of no prompt effect, the expected count is 2.5 out of 5.

Show code
# Aggregate test 1: Mean difference across temperatures
n = cross['n_draws']
mean_diffs = np.zeros(n)
for t in temperatures:
    mean_diffs += (base_draws[t][:n] - eu_draws[t][:n])
mean_diffs /= len(temperatures)

p_mean_positive = np.mean(mean_diffs > 0)
print(f"Mean Δα across temperatures:")
print(f"  Median = {np.median(mean_diffs):+.1f}")
print(f"  90% CI = [{np.percentile(mean_diffs, 5):+.1f}, {np.percentile(mean_diffs, 95):+.1f}]")
print(f"  P(mean Δα > 0) = {p_mean_positive:.3f}")
print()

# Aggregate test 2: Sign count per draw
sign_counts = np.zeros(n)
for t in temperatures:
    sign_counts += (base_draws[t][:n] > eu_draws[t][:n]).astype(float)

print(f"Number of temperatures with base > EU (per draw):")
print(f"  Mean = {np.mean(sign_counts):.2f} / 5")
print(f"  P(≥4 of 5 temperatures with base > EU) = {np.mean(sign_counts >= 4):.3f}")
print(f"  P(all 5 temperatures with base > EU) = {np.mean(sign_counts >= 5):.3f}")
Mean Δα across temperatures:
  Median = +5.3
  90% CI = [-8.7, +21.2]
  P(mean Δα > 0) = 0.731

Number of temperatures with base > EU (per draw):
  Mean = 3.04 / 5
  P(≥4 of 5 temperatures with base > EU) = 0.333
  P(all 5 temperatures with base > EU) = 0.065

These aggregate tests provide a formal basis for evaluating the pattern-level claim. If \(P(\overline{\Delta\alpha} > 0)\) is substantially above 0.5, the data favor the interpretation that the EU-prompt reduces sensitivity on average, though this inference remains approximate due to the use of independent posteriors (see note above).

0.4.3 Overlaid Posterior Densities

Show code
fig, axes = plt.subplots(5, 1, figsize=(8, 10), sharex=True)

for i, t in enumerate(temperatures):
    ax = axes[i]
    base = base_draws[t]
    eu = eu_draws[t]

    x_lo = min(base.min(), eu.min()) * 0.8
    x_hi = max(base.max(), eu.max()) * 1.1
    x_grid = np.linspace(x_lo, x_hi, 300)

    kde_base = gaussian_kde(base)
    kde_eu = gaussian_kde(eu)

    ax.fill_between(x_grid, kde_base(x_grid), alpha=0.2, color=SEU_COLORS['primary'])
    ax.plot(x_grid, kde_base(x_grid), color=SEU_COLORS['primary'], linewidth=2,
            label=f'Base (median={np.median(base):.0f})')

    ax.fill_between(x_grid, kde_eu(x_grid), alpha=0.2, color=SEU_COLORS['accent'])
    ax.plot(x_grid, kde_eu(x_grid), color=SEU_COLORS['accent'], linewidth=2,
            linestyle='--', label=f'EU-prompt (median={np.median(eu):.0f})')

    s = cross['per_temperature'][str(t)]
    ax.set_title(f'T = {t}    |    P(base > EU) = {s["p_base_gt_eu"]:.3f}', fontsize=11)
    ax.legend(fontsize=9)
    ax.set_ylabel('Density')
    ax.grid(True, alpha=0.2)

axes[-1].set_xlabel('Sensitivity (α)')

plt.tight_layout()
plt.show()
Figure 3: Overlaid posterior densities of α for the base study (solid) and EU-prompt study (dashed) at each temperature. Shading highlights the region where the base study’s posterior exceeds the EU-prompt study’s.

The overlaid densities show substantial overlap at every temperature. The most visible separation occurs at \(T = 0.0\) (where the base study’s median is \(\approx 17\) points higher) and \(T = 1.5\) (where the gap is \(\approx 10\) points). At the middle temperatures, the two posteriors are nearly indistinguishable.

0.4.4 Summary Plot

Show code
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Left: medians with CIs
ax = axes[0]
offset = 0.02
for label, draws, color, shift in [
    ('Base', base_draws, SEU_COLORS['primary'], -offset),
    ('EU-prompt', eu_draws, SEU_COLORS['accent'], offset),
]:
    medians = [np.median(draws[t]) for t in temperatures]
    q05s = [np.percentile(draws[t], 5) for t in temperatures]
    q95s = [np.percentile(draws[t], 95) for t in temperatures]
    x = np.array(temperatures) + shift
    ax.errorbar(x, medians,
                yerr=[np.array(medians) - np.array(q05s),
                      np.array(q95s) - np.array(medians)],
                fmt='o-', color=color, linewidth=2, markersize=7,
                capsize=5, capthick=1.5, label=label)

ax.set_xlabel('Temperature')
ax.set_ylabel('Sensitivity (α)')
ax.set_title('α vs. Temperature: Base vs. EU-Prompt')
ax.set_xticks(temperatures)
ax.legend()
ax.grid(True, alpha=0.3)

# Right: posterior of the difference
ax = axes[1]
n_draws = cross['n_draws']
for i, t in enumerate(temperatures):
    diff = base_draws[t][:n_draws] - eu_draws[t][:n_draws]
    parts = ax.violinplot([diff], positions=[i], showmedians=True, showextrema=False)
    for pc in parts['bodies']:
        pc.set_facecolor(SEU_PALETTE[i])
        pc.set_alpha(0.5)
    parts['cmedians'].set_color(SEU_PALETTE[i])

ax.axhline(0, color='gray', linestyle='--', alpha=0.5, label='No difference')
ax.set_xticks(range(len(temperatures)))
ax.set_xticklabels([f'{t}' for t in temperatures])
ax.set_xlabel('Temperature')
ax.set_ylabel(r'$\alpha_{\mathrm{base}} - \alpha_{\mathrm{EU}}$')
ax.set_title('Posterior of the Prompt Effect (Δα)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Figure 4: Left: posterior medians with 90% credible intervals for both studies; the EU-prompt estimates (orange) are lower at most temperatures but credible intervals overlap throughout. Right: posterior distribution of α_base − α_EU at each temperature (positive = base yields higher α); distributions are centered near zero with slight positive shifts at T=0.0 and T=1.5.

0.4.5 Slope Comparison

The initial study estimated a global slope of \(\Delta\alpha / \Delta T \approx -25\) (90% CI \([-52, -7]\)). We now compare the draw-wise slope from both studies:

Show code
sc = cross['slope_comparison']

temp_array = np.array(temperatures)
t_var = np.var(temp_array)

base_slopes = []
eu_slopes = []
for i in range(cross['n_draws']):
    ba = np.array([base_draws[t][i] for t in temperatures])
    ea = np.array([eu_draws[t][i] for t in temperatures])
    base_slopes.append(np.cov(temp_array, ba)[0, 1] / t_var)
    eu_slopes.append(np.cov(temp_array, ea)[0, 1] / t_var)

base_slopes = np.array(base_slopes)
eu_slopes = np.array(eu_slopes)

fig, ax = plt.subplots(figsize=(8, 4))

for slopes, label, color, ls in [
    (base_slopes, 'Base', SEU_COLORS['primary'], '-'),
    (eu_slopes, 'EU-prompt', SEU_COLORS['accent'], '--'),
]:
    kde = gaussian_kde(slopes)
    x_grid = np.linspace(np.percentile(slopes, 0.5), np.percentile(slopes, 99.5), 300)
    ax.fill_between(x_grid, kde(x_grid), alpha=0.2, color=color)
    ax.plot(x_grid, kde(x_grid), color=color, linewidth=2, linestyle=ls,
            label=f'{label} (median = {np.median(slopes):.1f})')

ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_ylabel('Density')
ax.set_title('Temperature–Sensitivity Slope: Base vs. EU-Prompt')
ax.legend()
plt.tight_layout()
plt.show()

print(f"Base slope:  median = {np.median(base_slopes):.1f},  90% CI "
      f"[{np.percentile(base_slopes, 5):.1f}, {np.percentile(base_slopes, 95):.1f}]")
print(f"EU slope:    median = {np.median(eu_slopes):.1f},  90% CI "
      f"[{np.percentile(eu_slopes, 5):.1f}, {np.percentile(eu_slopes, 95):.1f}]")
print(f"P(EU slope steeper than base): {np.mean(eu_slopes < base_slopes):.3f}")
Figure 5: Posterior distributions of the temperature–sensitivity slope (Δα/ΔT) for the base study and EU-prompt study. Both slopes are clearly negative; their distributions overlap substantially.
Base slope:  median = -30.8,  90% CI [-65.5, -8.3]
EU slope:    median = -24.7,  90% CI [-48.6, -7.0]
P(EU slope steeper than base): 0.390

Both slopes are clearly negative, confirming that the monotone temperature effect is preserved under the EU-prompt. The EU-prompt slope is slightly shallower (median \(\approx -25\) vs \(-31\)), but the difference is not credible—\(P(\text{EU slope steeper}) \approx 0.39\), far from decisive.

0.4.6 Monotonicity

Show code
m = cross['monotonicity']
print(f"P(strict monotonicity):")
print(f"  Base:      {m['base_strict_monotonicity']:.3f}")
print(f"  EU-prompt: {m['eu_strict_monotonicity']:.3f}")
P(strict monotonicity):
  Base:      0.125
  EU-prompt: 0.087

Strict monotonicity probabilities are comparable and modest in both cases, driven by the \(T = 0.3 \approx T = 0.7\) plateau that persists across prompt conditions. This mid-range plateau also appears in the risky alternatives study, suggesting it is a structural feature of the temperature–sensitivity relationship rather than a task- or prompt-specific artifact.
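The strict-monotonicity probability is presumably computed as the fraction of joint draws in which \(\alpha\) decreases at every step of the temperature ladder. A sketch of that computation, using synthetic normal draws loosely matching Table 2 in place of the real per-temperature posteriors (the exact definition used in `cross_study_analysis.json` may differ in detail, and these synthetic numbers will not reproduce the reported values):

```python
import numpy as np

rng = np.random.default_rng(1)
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]

# Synthetic stand-ins for the per-temperature posterior draws (medians/SDs
# loosely matching Table 2); the real draws come from alpha_draws_*.npz.
medians = {0.0: 57.0, 0.3: 52.8, 0.7: 54.7, 1.0: 43.3, 1.5: 26.3}
sds = {0.0: 16.0, 0.3: 15.4, 0.7: 14.2, 1.0: 10.5, 1.5: 5.9}
draws = {t: rng.normal(medians[t], sds[t], size=4000) for t in temperatures}

# Strict monotonicity: alpha decreases at every step T_i -> T_{i+1}
stacked = np.stack([draws[t] for t in temperatures])        # shape (5, 4000)
strictly_decreasing = np.all(np.diff(stacked, axis=0) < 0, axis=0)
print(f"P(strict monotonicity) = {strictly_decreasing.mean():.3f}")
```

The low probability is driven by the \(T = 0.3\)/\(T = 0.7\) step, where the posteriors overlap almost completely.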

0.5 Discussion

0.5.1 Summary of Findings

The EU-prompt study yields three main findings:

  1. The temperature–sensitivity gradient is preserved. Explicit EU-maximization instructions do not disrupt the monotone relationship between temperature and \(\alpha\). The global slope, pairwise ordering, and approximate magnitude are reproduced.

  2. The EU-prompt does not increase α, and may decrease it. Contrary to our directional prediction that telling the LLM to maximize expected utility would make its choices more EU-aligned, the point estimates of \(\alpha\) are lower under the EU-prompt at four of five temperatures. The aggregate test (§4) yields \(P(\overline{\Delta\alpha} > 0) \approx 0.73\) — the data lean toward the EU-prompt reducing sensitivity, but the evidence is modest rather than compelling. At the per-temperature level, the differences remain individually non-decisive. This level of evidence is consistent with either a small genuine effect or sampling variability, and does not warrant strong causal claims about the EU-prompt’s direction of influence.

  3. The evidence is suggestive but not conclusive. The 90% credible intervals on \(\alpha_{\text{base}} - \alpha_{\text{EU}}\) contain zero at every temperature, and no single \(P(\text{base} > \text{EU})\) exceeds 0.85. The aggregate test provides a pattern-level assessment, but the approximate nature of the cross-study comparison (independent posteriors) means that even an apparently strong aggregate result should be interpreted with caution until confirmed by a joint hierarchical model.

Show code
fig, ax = plt.subplots(figsize=(6, 1.5))

probs = [cross['per_temperature'][str(t)]['p_base_gt_eu'] for t in temperatures]
probs_arr = np.array(probs).reshape(1, -1)

im = ax.imshow(probs_arr, cmap='RdBu', vmin=0.0, vmax=1.0, aspect='auto')
ax.set_xticks(range(len(temperatures)))
ax.set_xticklabels([f'{t}' for t in temperatures])
ax.set_yticks([])
ax.set_xlabel('Temperature')

for j, p in enumerate(probs):
    color = 'white' if p > 0.7 or p < 0.3 else 'black'
    ax.text(j, 0, f'{p:.2f}', ha='center', va='center', fontsize=11, fontweight='bold',
            color=color)

plt.colorbar(im, ax=ax, shrink=0.8, label='P(α_base > α_EU)')
ax.set_title('P(base > EU) by Temperature')
plt.tight_layout()
plt.show()
Figure 6: Summary heatmap of P(α_base > α_EU) at each temperature. Values above 0.5 indicate the base study yields higher sensitivity. The pattern favors base > EU at 4/5 temperatures, with T=1.0 as the exception, though no individual comparison is decisive. Scale spans the full [0, 1] range.

0.5.2 Why Might Explicit EU Instructions Reduce Sensitivity?

The finding that explicit EU-maximization instructions decrease rather than increase estimated \(\alpha\) admits several interpretations, all of them speculative and conditional on the directional pattern being genuine — a proposition the current data support at a suggestive but not decisive level. We consider four possibilities.

0. No robust effect. The most parsimonious interpretation is that there is no genuine EU-prompt effect: the observed 4-out-of-5 directional pattern is sampling variability. Under the null hypothesis, observing \(\geq 4\) same-sign differences among 5 comparisons has probability \(\approx 0.19\) by a binomial sign test — suggestive but not conventionally significant. The aggregate test yields \(P(\overline{\Delta\alpha} > 0) \approx 0.73\), which is consistent with this interpretation: the data modestly favor a directional effect but do not rule out noise. Distinguishing a small true effect from noise will likely require either the joint hierarchical model described in Next Steps or replication in additional study conditions. Given the modest aggregate evidence, this null interpretation should be weighed seriously.

1. Prompt interference with implicit competence. The base study’s high \(\alpha\) values (median \(\approx 74\) at \(T = 0.0\)) suggest that GPT-4o already approximates EU-maximizing behavior without being told to do so. Adding explicit EU instructions may interfere with well-functioning implicit processes by forcing the model into a more deliberate reasoning mode. This is analogous to phenomena well-documented in human cognition, including verbal overshadowing (Schooler and Engstler-Schooler 1990) — where verbalizing a perceptual judgment degrades subsequent recognition — and the broader finding that explicit rule-following can impair automatic, well-practiced skills (Nisbett and Wilson 1977). While these are human-cognition results, the structural parallel is apt: asking the model to articulate a decision process may disrupt an effective implicit one.

2. Imperfect explicit calculation. The EU-prompt asks the model to “consider the subjective probability you assign to each outcome and your utility for that outcome” and “choose the claim for which the probability-weighted utility across the three outcomes is highest.” This invites the model to perform an explicit numerical computation — assigning probabilities, utilities, multiplying, and summing. If the model performs this computation imprecisely (e.g., miscalibrated probabilities, crude utility estimates), the resulting choices could be less EU-aligned than those produced by the model’s default holistic assessment.

3. Frame narrowing. The EU-prompt constrains the model’s attention to a specific decision-theoretic framework. The base prompt, by contrast, leaves the decision criterion open — the model is free to draw on whatever internal reasoning best discriminates among claims. The EU-prompt may cause the model to attend less to features of the claims that are relevant to the choice (and captured in the embedding-derived \(w\) vectors) and more to the abstract probability-utility calculus.

These interpretations are not mutually exclusive. Interpretations 1–3 all point to a common theme: the gap between competence and articulable reasoning. The model appears to “know” how to choose well but cannot reliably improve its choices when asked to explain or formalize its decision process.
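The binomial figure in interpretation 0 can be checked with a two-line computation (standard binomial tail, no project-specific assumptions):

```python
from math import comb

# P(X >= 4) for X ~ Binomial(5, 0.5): the chance that at least 4 of 5
# independent sign comparisons agree under the no-effect null
p = sum(comb(5, k) for k in (4, 5)) / 2**5
print(p)   # 0.1875, the ~0.19 quoted above
```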

0.5.3 Implications for the Sensitivity Framework

The finding has a natural interpretation within the SEU sensitivity framework developed in the foundational reports:

  • The sensitivity parameter \(\alpha\) captures how sharply choices track expected utility differences. A high \(\alpha\) does not require the agent to perform explicit EU calculations—it only requires that the agent’s choices are as if it were maximizing EU with some noise.

  • The EU-prompt intervention targets the articulable commitment to EU maximization without necessarily improving the performance. In the commitment–performance distinction of Report 1, the prompt modifies the commitment framing but may degrade performance by disrupting well-functioning implicit processes.

  • This result is consistent with the view that \(\alpha\) measures a behavioral property of the agent—the degree to which its choices covary with EU rankings—rather than a cognitive property (whether the agent explicitly represents probabilities and utilities). An agent can have high \(\alpha\) without any explicit EU reasoning, and explicit EU reasoning can coexist with lower \(\alpha\).
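The "as if" reading of \(\alpha\) can be made concrete with a minimal logit choice rule, in which choice probabilities sharpen around the EU-maximal option as \(\alpha\) grows. The EU values below are invented for illustration.

```python
import numpy as np

def choice_probs(eu_values, alpha):
    """Logit ('as if') choice rule: P(choose i) proportional to exp(alpha * EU_i)."""
    z = alpha * np.asarray(eu_values, dtype=float)
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

eus = [0.55, 0.50, 0.40]              # hypothetical EUs of three claims
for a in (0.0, 10.0, 40.0):
    print(a, np.round(choice_probs(eus, a), 3))
```

At \(\alpha = 0\) the rule chooses uniformly at random; at large \(\alpha\) it nearly always picks the EU-maximal claim. Nothing in the rule requires the agent to represent probabilities or utilities explicitly.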

0.5.4 The T = 1.0 Exception

At \(T = 1.0\), the EU-prompt yields slightly higher \(\alpha\) than the base study (median 43.3 vs 39.1, \(P(\text{base} > \text{EU}) = 0.37\)). While not statistically decisive, this reversal weakens confidence in the pattern-level claim and is consistent with the “no robust effect” interpretation (§5.2, item 0). If the directional pattern is genuine, one speculative possibility is that the EU-prompt provides anchoring benefit at temperatures where implicit heuristics are already substantially disrupted — but the current data are insufficient to test this conjecture.
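A comparison probability like \(P(\text{base} > \text{EU}) = 0.37\) is a Monte Carlo quantity computed from independent posterior draws. The sketch below uses normal stand-ins with assumed spreads; the report itself uses the frozen alpha_draws_T*.npz files.

```python
import numpy as np

# Stand-in posteriors: the medians match the report, but the sd values
# are assumptions. The real computation uses the frozen snapshot draws.
rng = np.random.default_rng(1)
base_draws = rng.normal(39.1, 6.0, size=4000)  # base-study alpha at T=1.0
eu_draws = rng.normal(43.3, 6.0, size=4000)    # EU-prompt alpha at T=1.0

# Fraction of paired draws in which the base-study alpha exceeds the EU-prompt alpha
p_base_gt_eu = float(np.mean(base_draws > eu_draws))
print(f"P(base > EU) ~ {p_base_gt_eu:.2f}")
```

Because the two posteriors come from independent fits, this pairing of draws is itself an approximation — the caveat taken up in §0.5.5.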

0.5.5 Limitations

This study has several limitations that constrain the strength of its conclusions:

  1. Independent posteriors. As noted in §4, the base and EU-prompt \(\alpha\) values are estimated from independent model fits. The aggregate and per-temperature comparisons are approximate; a joint model that shares structure across prompt conditions would provide sharper inference and properly account for the shared design structure.

  2. Single model, single task. All conclusions are conditional on the m_01 model, the insurance triage task, and GPT-4o. The EU-prompt effect may differ with other model specifications, task domains, or LLM architectures.

  3. Prompt specificity. The EU-prompt uses one particular formulation of the EU-maximization instruction. Alternative phrasings—more or less detailed, referencing specific numerical formats, or using chain-of-thought scaffolding—could produce different results.

  4. No assessment variation. By design, assessments are reused from the base study. The EU-prompt modifies only the choice prompt. If the instruction to maximize EU also affected how the model assesses claims (which it might, via priming effects on the assessment prompt), then the current design underestimates the full impact of the EU-maximization framing.

0.6 Reproducibility

0.6.1 Data Snapshot

All results are loaded from a frozen data snapshot in the data/ subdirectory. This snapshot is version-controlled and unaffected by future pipeline re-runs.

File                        Description
alpha_draws_T*.npz          Posterior draws of α (4,000 per condition)
ppc_T*.json                 Posterior predictive check results
diagnostics_T*.txt          CmdStan diagnostic output
stan_data_T*.json           Stan-ready data (for refitting)
fit_summary.json            Summary statistics across conditions
cross_study_analysis.json   Pre-computed comparison with the base study
run_summary.json            Pipeline metadata and configuration
study_config.yaml           Frozen copy of the study configuration
prompts.yaml                Frozen copy of the EU-prompt templates
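A snapshot file can be loaded directly with NumPy. The sketch below fabricates a stand-in .npz so it runs anywhere; in the report, point data_dir at the frozen data/ directory. The array key 'alpha' is an assumption — inspect npz.files if the actual snapshot uses a different name.

```python
import pathlib
import tempfile

import numpy as np

# Stand-in for data/: fabricate one snapshot file so the snippet is
# self-contained. Replace data_dir with the real data/ path in practice.
data_dir = pathlib.Path(tempfile.mkdtemp())
np.savez(data_dir / "alpha_draws_T1.0.npz",
         alpha=np.random.default_rng(0).normal(40.0, 6.0, 4000))

npz = np.load(data_dir / "alpha_draws_T1.0.npz")
alpha_draws = npz["alpha"]               # 4,000 posterior draws of alpha at T=1.0
print(alpha_draws.shape, round(float(np.median(alpha_draws)), 1))
```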

0.6.2 Running the Study

# Validate configuration and base study data availability
python -m applications.temperature_study_with_eu_prompt validate

# Estimate API costs (choices only)
python -m applications.temperature_study_with_eu_prompt estimate-cost

# Run choice collection (skip model fitting for now)
python -m applications.temperature_study_with_eu_prompt run --skip-fitting

# Fit models on collected data
python -m applications.temperature_study_with_eu_prompt fit

# Freeze data snapshot for this report
python scripts/freeze_eu_prompt_report_data.py

0.7 Next Steps

  1. Joint hierarchical model. A model that includes prompt condition as a covariate would provide direct inference on the EU-prompt effect and its interaction with temperature, without relying on independent posterior comparisons.

  2. Alternative EU-prompt formulations. Testing whether the sensitivity-reducing effect is robust to different wordings of the EU-maximization instruction, including chain-of-thought variants that ask the model to show its probability and utility estimates before choosing.

  3. Assessment-level intervention. Extending the EU-prompt to the assessment phase (where the model evaluates individual claims) would test whether the EU framing affects belief formation, not only the decision rule.

  4. Cross-architecture replication. Repeating with a different LLM (e.g., Claude) would test whether the prompt interference effect is specific to GPT-4o’s architecture or reflects a more general phenomenon.
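For next step 1, one natural parameterization is a log-link regression of \(\alpha\) on temperature, prompt condition, and their interaction. The sketch below shows only the design structure; the temperature grid and coefficient values are invented for illustration.

```python
import numpy as np

# Illustrative design matrix for a joint model: log(alpha) regressed on
# temperature (t), prompt condition (c: 0 = base, 1 = EU-prompt), and
# their interaction. All numbers here are assumptions, not estimates.
temps = np.array([0.0, 0.3, 0.7, 1.0])                 # illustrative grid
conditions = [(t, c) for t in temps for c in (0, 1)]

X = np.array([[1.0, t, c, t * c] for t, c in conditions])
beta = np.array([np.log(60.0), -0.6, -0.1, 0.15])      # intercept, T, prompt, interaction
alpha_by_condition = np.exp(X @ beta)                   # log link keeps alpha positive

for (t, c), a in zip(conditions, alpha_by_condition):
    print(f"T={t:.1f} prompt={c} alpha={a:.1f}")
```

A single fit of this form yields the EU-prompt effect and its interaction with temperature as posterior quantities directly, replacing the independent-posterior comparisons used here.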

0.8 References

Nisbett, Richard E., and Timothy D. Wilson. 1977. “Telling More Than We Can Know: Verbal Reports on Mental Processes.” Psychological Review 84 (3): 231–59.
Schooler, Jonathan W., and Tonya Y. Engstler-Schooler. 1990. “Verbal Overshadowing of Visual Memories: Some Things Are Better Left Unsaid.” Cognitive Psychology 22 (1): 36–71.


Citation

BibTeX citation:
@online{helzner2026,
  author = {Helzner, Jeff},
  title = {Temperature and {SEU} {Sensitivity:} {EU-Prompt} {Variation}},
  date = {2026-05-12},
  url = {https://jeffhelzner.github.io/seu-sensitivity/applications/temperature_study_with_eu_prompt/01_eu_prompt_study.html},
  langid = {en}
}
For attribution, please cite this work as:
Helzner, Jeff. 2026. “Temperature and SEU Sensitivity: EU-Prompt Variation.” SEU Sensitivity Project, May 12. https://jeffhelzner.github.io/seu-sensitivity/applications/temperature_study_with_eu_prompt/01_eu_prompt_study.html.