---
title: "Temperature and SEU Sensitivity: Risky Alternatives Extension"
subtitle: "Application Report: Temperature Study 2"
description: |
Extends the initial temperature study by introducing risky alternatives
alongside the original uncertain alternatives, enabling joint estimation
under models m_11, m_21, and m_31.
categories: [applications, temperature, m_11, m_21, m_31, risky]
execute:
cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
reports_root = os.path.normpath(os.path.join(os.getcwd(), '..', '..'))
project_root = os.path.dirname(reports_root)
sys.path.insert(0, reports_root)
sys.path.insert(0, project_root)
import numpy as np
import json
import yaml
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import pandas as pd
# Use project plotting style
from report_utils import set_seu_style, SEU_COLORS, SEU_PALETTE
set_seu_style()
# Data directory (frozen snapshot)
from pathlib import Path
data_dir = Path("data")
# Temperatures and labels used throughout
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]
temp_labels = {t: f"T={t}" for t in temperatures}
temp_keys = ["T0_0", "T0_3", "T0_7", "T1_0", "T1_5"]
temp_key_map = dict(zip(temperatures, temp_keys))
```
## Introduction
[Report 1](../temperature_study/01_initial_study.qmd) established that the estimated sensitivity parameter $\alpha$ of GPT-4o decreases with sampling temperature, using the **m_01** model — a single-context softmax choice model fit to uncertain alternatives whose probabilities are inferred from embedded text features. That study demonstrated a clear negative relationship between temperature and EU sensitivity, with a posterior probability exceeding 0.99 that the global slope is negative.
This report extends that analysis in two ways:
1. **Adding risky alternatives.** We collect a new set of choice data in which the LLM chooses among alternatives whose outcome probabilities are *stated explicitly* (risky alternatives), rather than inferred from features (uncertain alternatives). This creates a paired dataset: each temperature condition now has $M = 300$ uncertain decisions and $N = 300$ risky decisions ($N = 299$ at $T = 1.5$ due to one unparseable response).
2. **A family of augmented models.** We fit three models that jointly estimate sensitivity from both decision contexts:
- **m_11** — shared sensitivity $\alpha$ across uncertain and risky choices
- **m_21** — separate sensitivities $\alpha$ (uncertain) and $\omega$ (risky)
- **m_31** — proportional sensitivities with $\omega = \kappa \cdot \alpha$
The central questions are: *(i)* Does the temperature–sensitivity relationship replicate when risky alternatives are included? *(ii)* Does sensitivity differ between uncertain and risky contexts? *(iii)* If so, is the difference proportional across temperatures?
::: {.callout-note}
## What This Report Covers
This report presents data collection, prior calibration, model fitting, posterior predictive checks, and monotonicity analysis for the augmented models. It builds directly on the uncertain-choice data and m_01 results from [Report 1](../temperature_study/01_initial_study.qmd).
:::
::: {.callout-important}
## Construct Validity: Why Adding Risky Alternatives Matters
[Report 1](../temperature_study/01_initial_study.qmd) flagged a
fundamental interpretive limit of the m_0 / m_01 family: because the
design uses only *uncertain* (assessment-elicited) choices, the
feature-to-probability weights $\beta$ and the utility increments
$\delta$ are not separately identified. Operationally, $\alpha$ in
m_01 measures the consistency with which choices align with the
*model-implied* utility ranking — not necessarily with the agent's
"true" subjective expected utility. In the three-layer construct-
validity scheme of the initial-study report, m_01 supports
**comparative** claims across conditions sharing a stimulus pool
(layer 2), but **not absolute, agent-level rationality** claims
(layer 3).
Risky alternatives are the principled relaxation of that limitation.
By presenting probabilities *explicitly* — as a $K$-simplex over
consequences — risky choices give $\delta$ direct identifying
information that uncertain choices alone cannot provide:
$\eta^{(r)} = x^\top \upsilon$ depends on $\delta$ through
$\upsilon = \mathrm{cumsum}([0,\delta])$ but no longer through $\beta$,
so risky-choice likelihood contributions identify $\delta$ separately
from the uncertain-choice $\beta$. This is exactly the missing
identifying information called for in §Construct Validity of
[Report 1](../temperature_study/01_initial_study.qmd) and motivates
the m_1 / m_2 / m_3 ladder that this report instantiates as
m_11 / m_21 / m_31.
The three augmented models map onto the construct-validity layers
as follows:
* **m_11 (shared $\alpha$).** Stays at layer (2) but with **tighter
posteriors**: the risky data adds a second source of evidence
about the same sensitivity parameter, sharpening within-condition
precision without introducing new claim types.
* **m_21 (separate $\alpha$ and $\omega$).** Opens a new layer-(2)
contrast — *between contexts* (uncertain vs risky) — that the
m_0 / m_01 family cannot estimate at all.
* **m_31 ($\omega = \kappa\alpha$).** The proportionality parameter
$\kappa$ is the cleanest summary the project produces of
*between-context* sensitivity differences. As a **ratio**, $\kappa$ is
robust to the absolute scaling of the model-implied utility:
whatever residual identifiability concern attaches to the absolute
level of $\alpha$ or $\omega$ cancels in the ratio. This is the
closest the m_1 / m_2 / m_3 family comes to a layer-(3)–adjacent
quantity.
This setup motivates the three central questions of this report
listed above. It also explains the methodological status of this
report in the broader applications programme: it is the empirical
proof-of-concept for the m_1 / m_2 / m_3 follow-up sequencing flagged
in §0.5 of
[`prompts/hierarchical_alignment_study_plan.md`](../../../prompts/hierarchical_alignment_study_plan.md)
— that is, what an alignment-style follow-up would look like once
the construct-validity caveats of m_01 / h_m01 motivate the move to
designs with risky alternatives.
:::
## Experimental Design {#sec-design}
### Risky Alternatives
The uncertain alternatives from the initial study use the same insurance claims triage task: embedded natural-language assessments produce features $w_r \in \mathbb{R}^D$, and the model infers subjective probabilities $\psi_r = \text{softmax}(\beta \cdot w_r)$ over $K = 3$ consequences. The *risky* alternatives replace this inference step with explicitly stated probability simplexes:
```{python}
#| label: risky-pool
#| echo: false
with open(data_dir / "risky_alternatives.json") as f:
risky_pool = json.load(f)
alts = risky_pool["risky_alternatives"]
S = len(alts)
print(f"Risky alternatives pool: S = {S}")
print(f"Each alternative specifies a simplex over K = 3 consequences.")
print()
# Show a sample
sample_alts = alts[:6]
rows = []
for a in sample_alts:
rows.append({
"ID": a["id"],
"p(neither)": a["probabilities"][0],
"p(one)": a["probabilities"][1],
"p(both)": a["probabilities"][2],
})
pd.DataFrame(rows)
```
The $S = 30$ risky alternatives span a range of probability profiles: corner alternatives concentrating mass on a single consequence (e.g., $[0.90, 0.05, 0.05]$), balanced alternatives (e.g., $[1/3, 1/3, 1/3]$), and intermediate cases. Each risky decision problem draws 2–4 alternatives uniformly at random (without replacement) from the pool of 30, with a fresh draw for each of the 100 base problems. The same position-counterbalancing design is used as in the uncertain case: 100 base problems $\times$ 3 presentations with shuffled orderings. All draws use the study-level random seed recorded in the frozen data snapshot, ensuring exact reproducibility.
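The sketch below illustrates this sampling scheme. It is expository only: the seed, variable names, and record structure are assumptions for illustration, not the generation code behind the frozen snapshot.

```{python}
#| eval: false
# Illustrative sketch of the risky problem-generation scheme described above.
# The seed and record structure are assumptions for exposition; the actual
# study uses the study-level seed recorded in the frozen data snapshot.
rng = np.random.default_rng(0)            # hypothetical seed
n_base_problems, n_presentations, pool_size = 100, 3, 30

risky_problems = []
for base_id in range(n_base_problems):
    n_alts = rng.integers(2, 5)                           # 2-4 alternatives
    alt_ids = rng.choice(pool_size, size=n_alts, replace=False)
    for presentation in range(n_presentations):
        order = rng.permutation(alt_ids)                  # position counterbalancing
        risky_problems.append({
            "base_problem": base_id,
            "presentation": presentation,
            "alternative_ids": order.tolist(),
        })

len(risky_problems)   # 100 × 3 = 300 risky decision problems per temperature
```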
::: {.callout-note}
## Terminology: "Uncertain" vs. "Ambiguous"
Throughout this report, we use "uncertain" to describe the decision context in which probabilities must be inferred from natural-language features, and "risky" for the context in which probabilities are stated explicitly. In the JDM literature, the former is closer to what is typically called "ambiguity" (unknown or imprecise probabilities) and the latter to "risk" (known probabilities). We retain "uncertain" for consistency with [Report 1](../temperature_study/01_initial_study.qmd) and the model notation, but the connection to the classic risk–ambiguity distinction (Ellsberg, 1961) is substantive and is discussed in @sec-discussion.
:::
### Design Parameters
```{python}
#| label: design-params
#| echo: true
with open(data_dir / "study_config.yaml") as f:
config = yaml.safe_load(f)
with open(data_dir / "run_summary.json") as f:
run_summary = json.load(f)
t0_info = run_summary['phases']['phase3_data_prep']['per_temperature']['0.0']
print(f"Study Design:")
print(f" Uncertain problems (M): {t0_info['M']} (100 base × 3 presentations)")
print(f" Risky problems (N): {t0_info['N']} (100 base × 3 presentations)")
print(f" Uncertain alternatives: R = {t0_info['R']}")
print(f" Risky alternatives: S = {t0_info['S']}")
print(f" Alternatives per problem: {config['min_alternatives']}–{config['max_alternatives']}")
print(f" Consequences (K): {config['K']}")
print(f" Embedding dimensions (D): {t0_info['D']}")
print(f" LLM model: {config['llm_model']}")
print(f" Temperature conditions: {config['temperatures']}")
```
```{python}
#| label: tbl-design-comparison
#| tbl-cap: "Comparison of uncertain and risky decision contexts."
#| echo: false
design_df = pd.DataFrame({
"": ["Uncertain (m_01 data)", "Risky (new data)"],
"Observations": ["M = 300 per temp", "N = 300 per temp (299 at T=1.5)"],
"Alternatives": ["R = 30 distinct", "S = 30 distinct"],
"Per problem": ["2–4 alternatives", "2–4 alternatives"],
"Probabilities": ["Inferred via β·w → softmax", "Stated explicitly (simplexes)"],
"Sensitivity": ["α (all models)", "α (m_11) / ω (m_21, m_31)"],
})
design_df
```
### Data Quality
```{python}
#| label: data-quality
#| echo: false
na = run_summary['phases']['phase2_risky_choices']['na_summary']
print(f"Risky Choice NA Summary:")
print(f" Overall: {na['overall']['na']} / {na['overall']['total']} ({na['overall']['na_rate']:.2%})")
for key, val in na['per_temperature'].items():
print(f" {key}: {val['na']} / {val['total']} ({val['na_rate']:.2%})")
```
Data quality is excellent: only 1 unparseable response out of 1,500 risky choices (at $T = 1.5$), matching the near-perfect parsing observed in the initial uncertain study.
::: {.callout-note}
## Sample Size
The sample sizes ($M = 300$ uncertain, $N = 300$ risky per temperature) were chosen to match the initial study's design while remaining computationally tractable for the augmented models. No formal power analysis was conducted to determine the sample size needed for precise estimation of $\kappa$ or for discriminating between m_11 and m_31. As will become apparent (@sec-results), the credible intervals on $\kappa$ are wide enough that a larger study would improve precision. The sample size adequacy for the primary finding—replication of the temperature–$\alpha$ relationship—is supported by the strong posterior separation observed.
:::
## Model Family {#sec-models}
All three augmented models share: (1) the same utility function $\upsilon = \text{cumulative\_sum}([0, \delta])$ with $\delta \sim \text{Dirichlet}(\mathbf{1})$, (2) subjective probabilities $\psi_r = \text{softmax}(\beta \cdot w_r)$ for uncertain alternatives with $\beta \sim \mathcal{N}(0, 1)$, and (3) the calibrated prior $\alpha \sim \text{Lognormal}(3.0, 0.75)$ from [Report 1](../temperature_study/01_initial_study.qmd). They differ only in how sensitivity governs risky choices.
### Model Specifications
```{python}
#| label: tbl-model-specs
#| tbl-cap: "Parameter specifications for the three augmented models. All models share the same utility and subjective probability structure."
#| echo: false
specs = pd.DataFrame({
"Model": ["m_11", "m_21", "m_31"],
"Uncertain sensitivity": [
"α ~ LN(3.0, 0.75)",
"α ~ LN(3.0, 0.75)",
"α ~ LN(3.0, 0.75)",
],
"Risky sensitivity": [
"α (shared with uncertain)",
"ω ~ LN(3.0, 0.75)",
"ω = κ·α, κ ~ LN(0, 0.5)",
],
"Free parameters": ["α, β, δ", "α, ω, β, δ", "α, κ, β, δ"],
"Interpretation": [
"Single sensitivity governs both contexts",
"Contexts have independent sensitivities",
"Sensitivities are proportionally linked",
],
})
specs
```
The key structural differences:
- **m_11** forces the same $\alpha$ to explain both uncertain and risky choices. The risky data provides additional constraint, yielding tighter posteriors.
- **m_21** gives each context its own sensitivity parameter. If $\omega \neq \alpha$, the LLM processes explicit probabilities differently from inferred ones.
- **m_31** nests between the other two: when $\kappa = 1$, it reduces to m_11; when $\kappa$ deviates from 1, it captures a multiplicative scaling of sensitivity in the risky context. The prior $\kappa \sim \text{Lognormal}(0, 0.5)$ centers at 1 with 90% CI $\approx [0.44, 2.28]$.
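As a quick, self-contained check of that prior's implied scale (shown but not executed, and not part of any fitted model), those quantiles can be reproduced directly:

```{python}
#| eval: false
# Sanity check of the κ prior: Lognormal(0, 0.5) has median exp(0) = 1 and a
# 90% interval of exp(±1.645 × 0.5) ≈ [0.44, 2.28].
from scipy.stats import lognorm

kappa_prior = lognorm(s=0.5, scale=np.exp(0.0))   # s = sdlog, scale = exp(meanlog)
print("median:            ", kappa_prior.median())
print("90% interval:      ", kappa_prior.ppf([0.05, 0.95]))
print("P(κ < 1) a priori: ", kappa_prior.cdf(1.0))   # 0.5 by symmetry on the log scale
```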
### Relationship to m_01
The m_01 model from [Report 1](../temperature_study/01_initial_study.qmd) fits only uncertain choices: $y_m \sim \text{Categorical}(\text{softmax}(\alpha \cdot \eta^{(u)}_m))$. The augmented models extend this by adding a second likelihood for risky choices: $z_n \sim \text{Categorical}(\text{softmax}(\alpha_{\text{risky}} \cdot \eta^{(r)}_n))$, where $\eta^{(r)}_n = x_s^\top \upsilon$ uses the *stated* probability simplexes $x_s$ rather than inferred subjective probabilities. All parameters are estimated jointly: the utility vector $\upsilon$ is informed by both data sources, while $\beta$ is informed only by the uncertain choices.
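A minimal numpy sketch of this two-likelihood structure is given below. It is expository only (the fitted models are implemented in Stan), and the parameter values, array shapes, and names are assumptions, not estimates.

```{python}
#| eval: false
# Expository sketch of the shared structure and the two likelihood pieces.
# Example values only; the actual models are implemented in Stan.
def softmax(v, axis=-1):
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Shared utility: δ is a (K-1)-simplex of increments, so υ runs from 0 to 1.
delta = np.array([0.3, 0.7])                               # example Dirichlet(1) draw
upsilon = np.concatenate(([0.0], np.cumsum(delta)))        # υ = cumulative_sum([0, δ])

# Uncertain alternative: features w_r → subjective probabilities ψ_r.
beta = np.zeros((3, 32))                                   # K × D (example values)
w_r = np.ones(32) / 32
psi_r = softmax(beta @ w_r)                                # ψ_r = softmax(β · w_r)

# Risky alternative: the stated simplex x_s enters the EU calculation directly.
x_s = np.array([0.90, 0.05, 0.05])

# Two-alternative toy problems in each context, with m_21-style sensitivities.
alpha, omega = 40.0, 30.0
eta_uncertain = np.array([psi_r @ upsilon, 0.55])          # second EU is illustrative
eta_risky = np.array([x_s @ upsilon, np.full(3, 1/3) @ upsilon])
p_uncertain = softmax(alpha * eta_uncertain)               # y_m ~ Categorical(·)
p_risky = softmax(omega * eta_risky)                       # z_n ~ Categorical(·)
```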
## Prior Predictive Calibration {#sec-prior-calibration}
Prior predictive simulation was performed using the `_sim.stan` variants of each model on the actual augmented study design. For each candidate prior, we drew parameter values and simulated choices, computing the SEU-maximizer selection rate separately for uncertain and risky contexts.
```{python}
#| label: fig-prior-grid-m1
#| fig-cap: "Prior predictive SEU-maximizer selection rates for m_1 across candidate α priors. Risky alternatives yield consistently higher SEU-max rates than uncertain alternatives at the same prior, reflecting the easier decision structure when probabilities are known. The selected prior lognormal(3.0, 0.75) produces combined rates near 0.82."
#| fig-height: 5
with open(data_dir / "prior_predictive" / "m_1_grid_results.json") as f:
m1_grid = json.load(f)
results = m1_grid['results']
fig, ax = plt.subplots(figsize=(10, 5))
labels = [r['prior_label'] for r in results]
y_pos = np.arange(len(labels))
unc_means = [r['seu_rate_uncertain_mean'] for r in results]
risky_means = [r['seu_rate_risky_mean'] for r in results]
bar_height = 0.35
bars_unc = ax.barh(y_pos + bar_height/2, unc_means, bar_height,
color=SEU_COLORS['primary'], alpha=0.8, label='Uncertain')
bars_risky = ax.barh(y_pos - bar_height/2, risky_means, bar_height,
color=SEU_COLORS['accent'], alpha=0.8, label='Risky')
# Highlight selected prior
selected_idx = [i for i, l in enumerate(labels) if 'lognormal(3.0, 0.75)' in l][0]
bars_unc[selected_idx].set_edgecolor('black')
bars_unc[selected_idx].set_linewidth(2)
bars_risky[selected_idx].set_edgecolor('black')
bars_risky[selected_idx].set_linewidth(2)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels, fontsize=9)
ax.set_xlabel('SEU-Maximizer Selection Rate')
ax.set_title('m_1: Prior-Implied SEU-Max Rate by Context')
ax.legend(fontsize=10)
ax.set_xlim(0, 1)
plt.tight_layout()
plt.show()
```
The prior predictive analysis reveals a consistent pattern across all three models: risky alternatives produce higher SEU-max rates than uncertain alternatives at the same sensitivity level. This is expected—when probabilities are stated explicitly, there is no estimation error in $\psi$; the only source of suboptimality is the softmax noise governed by the sensitivity parameter.
For m_21 and m_31, the joint prior space was searched over 2D grids. The selected priors — $\alpha \sim \text{LN}(3.0, 0.75)$ and $\omega \sim \text{LN}(3.0, 0.75)$ for m_21, $\kappa \sim \text{LN}(0.0, 0.5)$ for m_31 — yield prior-implied SEU-max rates of approximately 0.77 (uncertain) and 0.85 (risky), a sensible range for GPT-4o behavior.
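As an illustration of the quantity being tuned in this calibration (not the actual `_sim.stan` pipeline), a toy prior predictive computation of the SEU-max rate for one candidate $\alpha$ prior might look like this; the problems, utility vector, and draw counts below are placeholders:

```{python}
#| eval: false
# Toy illustration of the prior predictive SEU-max rate for one candidate α
# prior. The real grid search uses the _sim.stan programs on the actual study
# design; the problems and utility vector below are placeholders.
rng = np.random.default_rng(0)
upsilon = np.array([0.0, 0.4, 1.0])                        # example utility vector
problems = [rng.dirichlet(np.ones(3), size=rng.integers(2, 5))
            for _ in range(300)]                           # toy risky problems

def seu_max_rate(alpha, problems, rng):
    hits = 0
    for x in problems:                                     # x: n_alts × K simplexes
        eu = x @ upsilon
        p = np.exp(alpha * eu)
        p /= p.sum()
        choice = rng.choice(len(eu), p=p)
        hits += (choice == eu.argmax())
    return hits / len(problems)

alpha_prior_draws = rng.lognormal(mean=3.0, sigma=0.75, size=200)
rates = [seu_max_rate(a, problems, rng) for a in alpha_prior_draws]
print(f"prior-implied SEU-max rate (toy risky design): {np.mean(rates):.2f}")
```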
## Results {#sec-results}
### Loading Posterior Draws
```{python}
#| label: load-posteriors
#| output: false
# Load all draws
alpha_draws = {model: {} for model in ['m_11', 'm_21', 'm_31']}
omega_draws = {model: {} for model in ['m_21', 'm_31']}
kappa_draws = {}
for model in ['m_11', 'm_21', 'm_31']:
for t, tk in zip(temperatures, temp_keys):
d = np.load(data_dir / f"alpha_draws_{model}_{tk}.npz")
alpha_draws[model][t] = d['alpha']
for t, tk in zip(temperatures, temp_keys):
d = np.load(data_dir / f"omega_draws_m_21_{tk}.npz")
omega_draws['m_21'][t] = d['omega']
d = np.load(data_dir / f"omega_draws_m_31_{tk}.npz")
omega_draws['m_31'][t] = d['omega']
d = np.load(data_dir / f"kappa_draws_m_31_{tk}.npz")
kappa_draws[t] = d['kappa']
with open(data_dir / "parameter_summary.json") as f:
param_summary = json.load(f)
with open(data_dir / "ppc_summary.json") as f:
ppc_summary = json.load(f)
with open(data_dir / "fit_summary.json") as f:
fit_summary = json.load(f)
# Load m_01 comparison data
with open(data_dir / "m01_fit_summary.json") as f:
m01_summary = json.load(f)
with open(data_dir / "m01_primary_analysis.json") as f:
m01_analysis = json.load(f)
```
```{python}
#| echo: false
for model in ['m_11', 'm_21', 'm_31']:
n = len(alpha_draws[model][temperatures[0]])
print(f" {model}: {n:,} posterior draws per temperature")
```
### MCMC Diagnostics
All 15 fits (3 models × 5 temperatures) achieved clean MCMC diagnostics: no divergences, no treedepth warnings, satisfactory E-BFMI, and $\hat{R} < 1.005$ for all parameters.
```{python}
#| label: tbl-diagnostics
#| tbl-cap: "MCMC diagnostics for key parameters across all models and temperatures. All fits used 4 chains × 1,000 warmup + 1,000 sampling iterations."
#| echo: false
diag_rows = []
for model in ['m_11', 'm_21', 'm_31']:
for t in temperatures:
tk = temp_key_map[t]
p = param_summary[model][tk]
row = {
'Model': model,
'T': t,
'α R̂': f"{p['alpha_rhat']:.4f}",
'α ESS': f"{p['alpha_ess_bulk']:.0f}",
}
if model == 'm_21':
row['ω R̂'] = f"{p.get('omega_rhat', float('nan')):.4f}"
row['ω ESS'] = f"{p.get('omega_ess_bulk', float('nan')):.0f}"
elif model == 'm_31':
row['κ R̂'] = f"{p.get('kappa_rhat', float('nan')):.4f}"
row['κ ESS'] = f"{p.get('kappa_ess_bulk', float('nan')):.0f}"
diag_rows.append(row)
pd.DataFrame(diag_rows).fillna('')
```
### Posterior Summaries: m_11 (Shared α)
```{python}
#| label: tbl-m11-posteriors
#| tbl-cap: "m_11: Posterior summaries for the shared sensitivity parameter α. Intervals are 90% credible intervals."
#| echo: false
rows = []
for t in temperatures:
tk = temp_key_map[t]
p = param_summary['m_11'][tk]
rows.append({
'Temp': t,
'Median': f"{p['alpha_median']:.1f}",
'Mean': f"{p['alpha_mean']:.1f}",
'SD': f"{p['alpha_sd']:.1f}",
'90% CI': f"[{p['alpha_q05']:.1f}, {p['alpha_q95']:.1f}]",
})
pd.DataFrame(rows)
```
### Posterior Summaries: m_21 (Separate α, ω)
```{python}
#| label: tbl-m21-posteriors
#| tbl-cap: "m_21: Posterior summaries for α (uncertain) and ω (risky). The separate parametrization reveals systematically lower sensitivity in the risky context."
#| echo: false
rows = []
for t in temperatures:
tk = temp_key_map[t]
p = param_summary['m_21'][tk]
rows.append({
'Temp': t,
'α median': f"{p['alpha_median']:.1f}",
'α 90% CI': f"[{p['alpha_q05']:.1f}, {p['alpha_q95']:.1f}]",
'ω median': f"{p['omega_median']:.1f}",
'ω 90% CI': f"[{p['omega_q05']:.1f}, {p['omega_q95']:.1f}]",
})
pd.DataFrame(rows)
```
### Posterior Summaries: m_31 (ω = κ·α)
```{python}
#| label: tbl-m31-posteriors
#| tbl-cap: "m_31: Posterior summaries for α, κ, and the derived ω = κ·α. The proportionality parameter κ clusters below 1.0, confirming reduced risky sensitivity."
#| echo: false
rows = []
for t in temperatures:
tk = temp_key_map[t]
p = param_summary['m_31'][tk]
rows.append({
'Temp': t,
'α median': f"{p['alpha_median']:.1f}",
'α 90% CI': f"[{p['alpha_q05']:.1f}, {p['alpha_q95']:.1f}]",
'κ median': f"{p['kappa_median']:.3f}",
'κ 90% CI': f"[{p['kappa_q05']:.3f}, {p['kappa_q95']:.3f}]",
'ω median': f"{p['omega_median']:.1f}",
'ω 90% CI': f"[{p['omega_q05']:.1f}, {p['omega_q95']:.1f}]",
})
pd.DataFrame(rows)
```
### Forest Plot: α Across Models
```{python}
#| label: fig-forest-alpha
#| fig-cap: "Forest plot of posterior α distributions across models and temperatures. m_11 (shared α) produces the tightest posteriors. m_21 and m_31 (which allow risky sensitivity to vary) estimate higher α for the uncertain context, reflecting the additional degrees of freedom."
#| fig-height: 7
from scipy.stats import gaussian_kde
fig, axes = plt.subplots(1, 3, figsize=(14, 6), sharey=True)
model_names = ['m_11', 'm_21', 'm_31']
model_titles = ['m₁₁ (shared α)', 'm₂₁ (α for uncertain)', 'm₃₁ (α for uncertain)']
for ax, model, title in zip(axes, model_names, model_titles):
y_positions = np.arange(len(temperatures))[::-1]
for i, t in enumerate(temperatures):
draws = alpha_draws[model][t]
median = np.median(draws)
q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])
y = y_positions[i]
color = SEU_PALETTE[i]
# Thin bar: 90% CI
ax.plot([q05, q95], [y, y], color=color, linewidth=1.5, solid_capstyle='round')
# Thick bar: 50% CI
ax.plot([q25, q75], [y, y], color=color, linewidth=4, solid_capstyle='round')
# Point: median
ax.plot(median, y, 'o', color=color, markersize=8, zorder=5)
ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('α')
ax.set_title(title)
ax.grid(axis='x', alpha=0.3)
plt.suptitle('Posterior α by Temperature and Model', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
```
### Risky Sensitivity: ω and κ
```{python}
#| label: fig-omega-kappa
#| fig-cap: "Left: Posterior ω (risky sensitivity) from m_21 and m_31 across temperatures. The two models produce consistent ω estimates, both showing the temperature–sensitivity decline. Right: Posterior κ from m_31. The proportionality parameter clusters below 1.0, indicating that the LLM is systematically less sensitive to EU in the risky context. The 90% CIs include 1.0 at some temperatures."
#| fig-height: 5
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Left: ω comparison
y_positions = np.arange(len(temperatures))[::-1]
for i, t in enumerate(temperatures):
# m_21 omega
draws_21 = omega_draws['m_21'][t]
med_21 = np.median(draws_21)
q05_21, q25_21, q75_21, q95_21 = np.percentile(draws_21, [5, 25, 75, 95])
y = y_positions[i] + 0.15
ax1.plot([q05_21, q95_21], [y, y], color=SEU_COLORS['primary'], linewidth=1.5)
ax1.plot([q25_21, q75_21], [y, y], color=SEU_COLORS['primary'], linewidth=4)
ax1.plot(med_21, y, 'o', color=SEU_COLORS['primary'], markersize=7)
# m_31 omega (derived)
draws_31 = omega_draws['m_31'][t]
med_31 = np.median(draws_31)
q05_31, q25_31, q75_31, q95_31 = np.percentile(draws_31, [5, 25, 75, 95])
y = y_positions[i] - 0.15
ax1.plot([q05_31, q95_31], [y, y], color=SEU_COLORS['accent'], linewidth=1.5)
ax1.plot([q25_31, q75_31], [y, y], color=SEU_COLORS['accent'], linewidth=4)
ax1.plot(med_31, y, 's', color=SEU_COLORS['accent'], markersize=7)
ax1.set_yticks(y_positions)
ax1.set_yticklabels([f'T = {t}' for t in temperatures])
ax1.set_xlabel('ω (risky sensitivity)')
ax1.set_title('Posterior ω by Temperature')
from matplotlib.lines import Line2D
legend_handles = [
    Line2D([0], [0], color=SEU_COLORS['primary'], marker='o', linewidth=2, label='m₂₁ (free ω)'),
    Line2D([0], [0], color=SEU_COLORS['accent'], marker='s', linewidth=2, label='m₃₁ (ω = κ·α)'),
]
ax1.legend(handles=legend_handles, loc='upper right', fontsize=9)
ax1.grid(axis='x', alpha=0.3)
# Right: κ from m_31
for i, t in enumerate(temperatures):
draws = kappa_draws[t]
median = np.median(draws)
q05, q25, q75, q95 = np.percentile(draws, [5, 25, 75, 95])
y = y_positions[i]
color = SEU_PALETTE[i]
ax2.plot([q05, q95], [y, y], color=color, linewidth=1.5)
ax2.plot([q25, q75], [y, y], color=color, linewidth=4)
ax2.plot(median, y, 'o', color=color, markersize=8, zorder=5)
ax2.axvline(x=1.0, color='gray', linestyle='--', alpha=0.5, label='κ = 1 (m₁₁ equiv.)')
ax2.set_yticks(y_positions)
ax2.set_yticklabels([f'T = {t}' for t in temperatures])
ax2.set_xlabel('κ')
ax2.set_title('m₃₁: Posterior κ (ω/α ratio)')
ax2.legend(fontsize=9)
ax2.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
```
### Posterior Densities
```{python}
#| label: fig-density-alpha
#| fig-cap: "Posterior density of α for each temperature under m_11 (shared α). The clear separation between low-temperature (T ≤ 0.7) and high-temperature (T ≥ 1.0) conditions replicates the pattern from the m_01 analysis, but with substantially tighter posteriors owing to the doubled data."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(8, 5))
for i, t in enumerate(temperatures):
draws = alpha_draws['m_11'][t]
kde = gaussian_kde(draws)
x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.2, 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.15, color=SEU_PALETTE[i])
ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
label=f'T = {t}')
ax.set_xlabel('α')
ax.set_ylabel('Density')
ax.set_title('m₁₁: Posterior Density of α')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()
```
```{python}
#| label: fig-density-omega
#| fig-cap: "Posterior density of ω (risky sensitivity) from m_21 at each temperature. The patterns mirror α—declining with temperature—but at lower absolute levels."
#| fig-height: 5
fig, ax = plt.subplots(figsize=(8, 5))
for i, t in enumerate(temperatures):
draws = omega_draws['m_21'][t]
kde = gaussian_kde(draws)
x_grid = np.linspace(draws.min() * 0.8, draws.max() * 1.2, 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.15, color=SEU_PALETTE[i])
ax.plot(x_grid, kde(x_grid), color=SEU_PALETTE[i], linewidth=2,
label=f'T = {t}')
ax.set_xlabel('ω')
ax.set_ylabel('Density')
ax.set_title('m₂₁: Posterior Density of ω (Risky Sensitivity)')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()
```
## Posterior Predictive Checks {#sec-ppc}
The augmented models produce separate posterior predictive check statistics for uncertain and risky choices. For each context, we compute three test statistics:
- **Log-likelihood (ll):** The total log-likelihood of the observed choices under the model — i.e., $\sum_i \log p(y_i^{\text{obs}} \mid \theta)$ — computed separately for uncertain and risky observations.
- **Modal choice frequency (modal):** The fraction of decision problems in which the alternative assigned the highest predicted probability by the model is the one actually chosen by the LLM.
- **Mean choice probability (prob):** The average predicted probability assigned to the observed choice across all problems — i.e., $\frac{1}{N}\sum_i p(y_i^{\text{obs}} \mid \theta)$.
The posterior predictive p-value is the proportion of replicated datasets where the statistic equals or exceeds the observed value; 0.5 indicates perfect calibration.
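The following sketch shows how these statistics and their p-values are computed in general; the arrays `p_pred`, `y`, and `stat_rep` are hypothetical, since the frozen snapshot retains only the summarized p-values reported in the tables below.

```{python}
#| eval: false
# Sketch of the three test statistics and the posterior predictive p-value.
# `p_pred`, `y`, and `stat_rep` are hypothetical arrays; the frozen snapshot
# retains only the summarized p-values.
def choice_statistics(p_pred, y):
    """p_pred: (n_problems, n_alts) predicted probabilities; y: chosen indices."""
    chosen_p = p_pred[np.arange(len(y)), y]
    return {
        "ll": np.log(chosen_p).sum(),                   # total log-likelihood
        "modal": np.mean(p_pred.argmax(axis=1) == y),   # modal choice frequency
        "prob": chosen_p.mean(),                        # mean choice probability
    }

def ppc_pvalue(stat_obs, stat_rep):
    """Proportion of replicated datasets whose statistic >= the observed value."""
    return np.mean(np.asarray(stat_rep) >= stat_obs)
```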
### m_11 PPCs
```{python}
#| label: tbl-ppc-m11
#| tbl-cap: "m_11 posterior predictive p-values. The uncertain-choice statistics are well-calibrated. The risky-choice modal and prob statistics run high, suggesting the model's shared α may be somewhat too low to fully account for the risky context's regularity."
#| echo: false
rows = []
for t in temperatures:
tk = temp_key_map[t]
p = ppc_summary['m_11'][tk]
rows.append({
'T': t,
'LL unc': f"{p['ppc_ll_uncertain']:.3f}",
'Modal unc': f"{p['ppc_modal_uncertain']:.3f}",
'Prob unc': f"{p['ppc_prob_uncertain']:.3f}",
'LL risky': f"{p['ppc_ll_risky']:.3f}",
'Modal risky': f"{p['ppc_modal_risky']:.3f}",
'Prob risky': f"{p['ppc_prob_risky']:.3f}",
'LL combined': f"{p['ppc_ll_combined']:.3f}",
})
pd.DataFrame(rows)
```
### m_21 PPCs
```{python}
#| label: tbl-ppc-m21
#| tbl-cap: "m_21 posterior predictive p-values. With a separate ω for risky choices, the risky PPCs are better calibrated than under m_11—particularly the modal and prob statistics, which no longer show systematic upward bias."
#| echo: false
rows = []
for t in temperatures:
tk = temp_key_map[t]
p = ppc_summary['m_21'][tk]
rows.append({
'T': t,
'LL unc': f"{p['ppc_ll_uncertain']:.3f}",
'Modal unc': f"{p['ppc_modal_uncertain']:.3f}",
'Prob unc': f"{p['ppc_prob_uncertain']:.3f}",
'LL risky': f"{p['ppc_ll_risky']:.3f}",
'Modal risky': f"{p['ppc_modal_risky']:.3f}",
'Prob risky': f"{p['ppc_prob_risky']:.3f}",
'LL combined': f"{p['ppc_ll_combined']:.3f}",
})
pd.DataFrame(rows)
```
### m_31 PPCs
```{python}
#| label: tbl-ppc-m31
#| tbl-cap: "m_31 posterior predictive p-values. The proportional model (ω = κ·α) produces PPC calibration intermediate between m_11 and m_21, consistent with its intermediate structural flexibility."
#| echo: false
rows = []
for t in temperatures:
tk = temp_key_map[t]
p = ppc_summary['m_31'][tk]
rows.append({
'T': t,
'LL unc': f"{p['ppc_ll_uncertain']:.3f}",
'Modal unc': f"{p['ppc_modal_uncertain']:.3f}",
'Prob unc': f"{p['ppc_prob_uncertain']:.3f}",
'LL risky': f"{p['ppc_ll_risky']:.3f}",
'Modal risky': f"{p['ppc_modal_risky']:.3f}",
'Prob risky': f"{p['ppc_prob_risky']:.3f}",
'LL combined': f"{p['ppc_ll_combined']:.3f}",
})
pd.DataFrame(rows)
```
### PPC Comparison
```{python}
#| label: fig-ppc-comparison
#| fig-cap: "PPC p-values across models and contexts. Left: uncertain choices. Right: risky choices. The ideal calibration line at 0.5 is shown as a dashed line. m_21 (separate ω) achieves the best risky-choice calibration."
#| fig-height: 5
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
model_colors = {'m_11': SEU_COLORS['primary'], 'm_21': SEU_COLORS['accent'], 'm_31': SEU_COLORS['success']}
model_markers = {'m_11': 'o', 'm_21': 's', 'm_31': 'D'}
# Uncertain PPCs
for model in ['m_11', 'm_21', 'm_31']:
ll_vals = [ppc_summary[model][tk]['ppc_ll_uncertain'] for tk in temp_keys]
prob_vals = [ppc_summary[model][tk]['ppc_prob_uncertain'] for tk in temp_keys]
ax1.scatter(temperatures, ll_vals, color=model_colors[model],
marker=model_markers[model], s=60, label=f'{model} (ll)', alpha=0.8)
ax1.scatter(temperatures, prob_vals, color=model_colors[model],
marker=model_markers[model], s=60, alpha=0.4, facecolors='none',
edgecolors=model_colors[model], linewidths=1.5)
ax1.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax1.set_xlabel('Temperature')
ax1.set_ylabel('PPC p-value')
ax1.set_title('Uncertain Choices')
ax1.set_ylim(0, 1)
ax1.legend(fontsize=8)
# Risky PPCs
for model in ['m_11', 'm_21', 'm_31']:
ll_vals = [ppc_summary[model][tk]['ppc_ll_risky'] for tk in temp_keys]
prob_vals = [ppc_summary[model][tk]['ppc_prob_risky'] for tk in temp_keys]
ax2.scatter(temperatures, ll_vals, color=model_colors[model],
marker=model_markers[model], s=60, label=f'{model} (ll)', alpha=0.8)
ax2.scatter(temperatures, prob_vals, color=model_colors[model],
marker=model_markers[model], s=60, alpha=0.4, facecolors='none',
edgecolors=model_colors[model], linewidths=1.5)
ax2.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Temperature')
ax2.set_ylabel('PPC p-value')
ax2.set_title('Risky Choices')
ax2.set_ylim(0, 1)
ax2.legend(fontsize=8)
plt.tight_layout()
plt.show()
```
The PPC analysis reveals a clear model-adequacy story:
- **Uncertain choices** are well-described by all three models, with p-values mostly in $[0.2, 0.6]$. This is expected — the uncertain likelihood shares the same structure as m_01, which showed good fit in [Report 1](../temperature_study/01_initial_study.qmd).
- **Risky choices under m_11** show systematically elevated `ppc_modal_risky` (0.71–0.96) and `ppc_prob_risky` (0.60–0.86): the model assigns even higher probability to observed choices than expected under its own generative process. This occurs because m_11's shared $\alpha$ is pulled toward a compromise between the two contexts.
- **m_21 resolves this miscalibration** by giving risky choices their own $\omega$, bringing all risky PPC p-values into the well-calibrated range.
- **m_31 falls between** the other two, consistent with its intermediate structural flexibility.
::: {.callout-important}
## Limitation: No Formal Information-Theoretic Model Comparison
The PPC analysis provides evidence of model *adequacy* — whether each model can reproduce observed data patterns — but does not quantify the predictive performance trade-off between models of different complexity. A formal model comparison using leave-one-out cross-validation (LOO-CV via Pareto-smoothed importance sampling) or the widely applicable information criterion (WAIC) would complement the PPC evidence, particularly since m_21 has an additional free parameter relative to m_11 and m_31. All models output pointwise log-likelihood values (`log_lik_uncertain`, `log_lik_risky`), making PSIS-LOO straightforward to compute. We note this as an important gap: the PPC-based preference for m_21 is supported by the specific pattern of misfit in m_11's risky-choice statistics, but formal information-theoretic comparison would strengthen the model selection conclusion. Future revisions of this report should include LOO-CV with elpd differences and standard errors across models.
:::
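A minimal sketch of how that comparison could be assembled with ArviZ is shown below, assuming the pointwise log-likelihood arrays (shape `chains × draws × observations`) had been retained from the Stan fits; they are not part of the frozen snapshot, so the sketch is not executed.

```{python}
#| eval: false
# Hypothetical PSIS-LOO comparison across m_11 / m_21 / m_31 with ArviZ,
# assuming per-model pointwise log-likelihood arrays were retained.
import arviz as az

def to_idata(log_lik_uncertain, log_lik_risky):
    # Concatenate the two contexts into a single pointwise log-likelihood array.
    log_lik = np.concatenate([log_lik_uncertain, log_lik_risky], axis=-1)
    return az.from_dict(log_likelihood={"y": log_lik})

# idatas = {m: to_idata(*pointwise_loglik[m]) for m in ["m_11", "m_21", "m_31"]}
# az.compare(idatas, ic="loo")   # elpd differences with standard errors
```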
## Monotonicity Analysis {#sec-monotonicity}
### Global Slope: α
We replicate the draw-wise slope analysis from [Report 1](../temperature_study/01_initial_study.qmd). For each posterior draw, we regress $\alpha$ on temperature across the five conditions and collect the slope coefficient.
```{python}
#| label: fig-slope-alpha
#| fig-cap: "Posterior distribution of the slope Δα/ΔT for each model. All three models place virtually all posterior mass below zero, confirming the temperature–sensitivity relationship is robust to model specification. m_11 (shared α) yields the tightest slope posterior."
#| fig-height: 5
temp_array = np.array(temperatures)
fig, axes = plt.subplots(1, 3, figsize=(14, 4), sharey=True)
for ax, model, title in zip(axes, ['m_11', 'm_21', 'm_31'],
['m₁₁ (shared α)', 'm₂₁ (α uncertain)', 'm₃₁ (α uncertain)']):
n_draws = len(alpha_draws[model][temperatures[0]])
slope_draws = []
for draw_idx in range(n_draws):
alpha_vec = np.array([alpha_draws[model][t][draw_idx] for t in temperatures])
# OLS: b = cov(T, alpha) / var(T)
b = np.polyfit(temp_array, alpha_vec, 1)[0]
slope_draws.append(b)
slope_draws = np.array(slope_draws)
kde = gaussian_kde(slope_draws)
x_grid = np.linspace(np.percentile(slope_draws, 0.5),
np.percentile(slope_draws, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.3, color=SEU_COLORS['primary'])
ax.plot(x_grid, kde(x_grid), color=SEU_COLORS['primary'], linewidth=2)
median_slope = np.median(slope_draws)
q05, q95 = np.percentile(slope_draws, [5, 95])
ax.axvline(x=median_slope, color=SEU_COLORS['accent'], linestyle='-', linewidth=2)
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
prob_neg = np.mean(slope_draws < 0)
ax.set_xlabel('Slope (Δα / ΔT)')
ax.set_title(f'{title}\nmed={median_slope:.1f}, P(<0)={prob_neg:.4f}')
ax.grid(alpha=0.2)
axes[0].set_ylabel('Density')
plt.suptitle('Temperature–Sensitivity Slope (α)', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()
```
### Global Slope: ω
```{python}
#| label: fig-slope-omega
#| fig-cap: "Posterior distribution of the slope Δω/ΔT from m_21 (free ω) and m_31 (derived ω = κ·α). Both models confirm that risky sensitivity also declines with temperature."
#| fig-height: 4
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
for ax, model, title, color in zip(
[ax1, ax2], ['m_21', 'm_31'],
['m₂₁ (free ω)', 'm₃₁ (ω = κ·α)'],
[SEU_COLORS['accent'], SEU_COLORS['success']]):
n_draws = len(omega_draws[model][temperatures[0]])
slope_draws = []
for draw_idx in range(n_draws):
omega_vec = np.array([omega_draws[model][t][draw_idx] for t in temperatures])
b = np.polyfit(temp_array, omega_vec, 1)[0]
slope_draws.append(b)
slope_draws = np.array(slope_draws)
kde = gaussian_kde(slope_draws)
x_grid = np.linspace(np.percentile(slope_draws, 0.5),
np.percentile(slope_draws, 99.5), 300)
ax.fill_between(x_grid, kde(x_grid), alpha=0.3, color=color)
ax.plot(x_grid, kde(x_grid), color=color, linewidth=2)
median_slope = np.median(slope_draws)
prob_neg = np.mean(slope_draws < 0)
ax.axvline(x=median_slope, color='black', linestyle='-', linewidth=1.5)
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax.set_xlabel('Slope (Δω / ΔT)')
ax.set_title(f'{title}\nmed={median_slope:.1f}, P(<0)={prob_neg:.4f}')
ax.grid(alpha=0.2)
ax1.set_ylabel('Density')
plt.tight_layout()
plt.show()
```
### Pairwise Comparisons (m_11)
```{python}
#| label: tbl-pairwise-m11
#| tbl-cap: "m_11: Posterior probability that α is higher at the lower temperature. The strong-separation / weak-separation pattern from the m_01 analysis (Report 1) replicates exactly."
#| echo: false
from itertools import combinations
pair_rows = []
for t_low, t_high in combinations(temperatures, 2):
draws_low = alpha_draws['m_11'][t_low]
draws_high = alpha_draws['m_11'][t_high]
prob = np.mean(draws_low > draws_high)
pair_rows.append({
'Pair': f'T={t_low} vs T={t_high}',
'P(α_low > α_high)': f'{prob:.4f}',
'Strength': '●●●' if prob > 0.95 else ('●●' if prob > 0.80 else ('●' if prob > 0.65 else '○')),
})
pd.DataFrame(pair_rows)
```
The pairwise structure exactly replicates [Report 1](../temperature_study/01_initial_study.qmd): strong separation between $T = 0.0$ and $T \geq 1.0$, moderate separation between middle and high temperatures, and near-indistinguishability between $T = 0.3$ and $T = 0.7$.
## Formal Context Comparison: Uncertain vs. Risky Sensitivity {#sec-context-comparison}
The most novel finding of this report is that the LLM exhibits lower EU sensitivity in the risky context than in the uncertain context. This section provides formal quantification of that claim.
### Posterior Probability of α > ω (m_21)
```{python}
#| label: tbl-alpha-gt-omega
#| tbl-cap: "m_21: Posterior probability that uncertain sensitivity α exceeds risky sensitivity ω at each temperature, with the median difference Δ = α − ω."
#| echo: false
context_rows = []
for t in temperatures:
tk = temp_key_map[t]
a_draws = alpha_draws['m_21'][t]
o_draws = omega_draws['m_21'][t]
diff = a_draws - o_draws
prob_gt = np.mean(a_draws > o_draws)
context_rows.append({
'Temp': t,
'P(α > ω)': f'{prob_gt:.4f}',
'Median Δ': f'{np.median(diff):.1f}',
'90% CI of Δ': f'[{np.percentile(diff, 5):.1f}, {np.percentile(diff, 95):.1f}]',
})
pd.DataFrame(context_rows)
```
### Aggregate Test: Mean Difference Across Temperatures (m_21)
```{python}
#| label: tbl-aggregate-context
#| tbl-cap: "Aggregate measure of the context-dependent sensitivity gap. For each posterior draw, the five per-temperature α − ω differences are averaged, yielding a single summary of the overall gap."
#| echo: false
n_draws = len(alpha_draws['m_21'][temperatures[0]])
mean_diffs = []
for draw_idx in range(n_draws):
diffs = [alpha_draws['m_21'][t][draw_idx] - omega_draws['m_21'][t][draw_idx]
for t in temperatures]
mean_diffs.append(np.mean(diffs))
mean_diffs = np.array(mean_diffs)
print(f"Aggregate mean(α − ω) across temperatures:")
print(f" Median: {np.median(mean_diffs):.1f}")
print(f" 90% CI: [{np.percentile(mean_diffs, 5):.1f}, {np.percentile(mean_diffs, 95):.1f}]")
print(f" P(mean > 0): {np.mean(mean_diffs > 0):.4f}")
```
::: {.callout-note}
## Interpreting the Aggregate Test
The aggregate P(mean(α − ω) > 0) provides a single summary of the strength of evidence for context-dependent sensitivity across all temperatures. This draw-wise averaging treats the per-temperature estimates as independent (since they are fit separately), which is appropriate given the study design but does not pool information across temperatures.
:::
### Posterior Probability of κ < 1 (m_31)
```{python}
#| label: tbl-kappa-lt-1
#| tbl-cap: "m_31: Posterior probability that the proportionality parameter κ is below 1.0 at each temperature. κ < 1 indicates lower risky sensitivity relative to uncertain sensitivity."
#| echo: false
kappa_rows = []
for t in temperatures:
draws = kappa_draws[t]
prob_lt1 = np.mean(draws < 1.0)
kappa_rows.append({
'Temp': t,
'κ median': f'{np.median(draws):.3f}',
'P(κ < 1)': f'{prob_lt1:.4f}',
'90% CI': f'[{np.percentile(draws, 5):.3f}, {np.percentile(draws, 95):.3f}]',
})
pd.DataFrame(kappa_rows)
```
### Pairwise Comparisons: ω (m_21)
```{python}
#| label: tbl-pairwise-omega
#| tbl-cap: "m_21: Posterior probability that ω (risky sensitivity) is higher at the lower temperature. The temperature–sensitivity gradient is present but may differ in strength from the α gradient."
#| echo: false
omega_pair_rows = []
for t_low, t_high in combinations(temperatures, 2):
draws_low = omega_draws['m_21'][t_low]
draws_high = omega_draws['m_21'][t_high]
prob = np.mean(draws_low > draws_high)
omega_pair_rows.append({
'Pair': f'T={t_low} vs T={t_high}',
'P(ω_low > ω_high)': f'{prob:.4f}',
'Strength': '●●●' if prob > 0.95 else ('●●' if prob > 0.80 else ('●' if prob > 0.65 else '○')),
})
pd.DataFrame(omega_pair_rows)
```
The pairwise ω comparison clarifies whether the temperature–sensitivity gradient operates similarly in the risky context. If the pairwise probabilities are systematically lower for ω than for α (@tbl-pairwise-m11), this would indicate that risky sensitivity is not only lower in level but also less responsive to temperature — a more nuanced finding.
## Cross-Study Comparison {#sec-cross-study}
```{python}
#| label: fig-cross-study
#| fig-cap: "Cross-model consistency check of α posteriors from the initial m_01 study and the augmented m_11, m_21, and m_31 models *fit to the same uncertain-choice data*. Labels: m_01 (Report 1, uncertain only), m_11 (shared α, this report), m_21 (separate α and ω, this report), m_31 (proportional ω = κ·α, this report). Because all four fits share the uncertain-choice data, the agreement across models is a within-dataset consistency check on the augmented model structure, not an independent replication of the temperature pattern."
#| fig-height: 6
fig, ax = plt.subplots(figsize=(10, 6))
model_configs = [
('m_01', 'm_01 (Report 1, uncertain only)', m01_analysis['summary_table'], SEU_COLORS['secondary'], 'o', 0.45),
('m_11', 'm_11 (shared α, this report)', None, SEU_COLORS['primary'], 's', 0.15),
('m_21', 'm_21 (separate α, ω)', None, SEU_COLORS['accent'], 'D', -0.15),
('m_31', 'm_31 (ω = κ·α)', None, SEU_COLORS['success'], '^', -0.45),
]
y_positions = np.arange(len(temperatures))[::-1]
for model_key, model_label, summary_data, color, marker, offset in model_configs:
for i, t in enumerate(temperatures):
y = y_positions[i] + offset
if model_key == 'm_01':
entry = summary_data[i]
median = entry['median']
q05 = entry['ci_low']
q95 = entry['ci_high']
else:
tk = temp_key_map[t]
p = param_summary[model_key][tk]
median = p['alpha_median']
q05 = p['alpha_q05']
q95 = p['alpha_q95']
ax.plot([q05, q95], [y, y], color=color, linewidth=1.5, alpha=0.7)
ax.plot(median, y, marker, color=color, markersize=7, zorder=5,
label=model_label if i == 0 else '')
ax.set_yticks(y_positions)
ax.set_yticklabels([f'T = {t}' for t in temperatures])
ax.set_xlabel('α (sensitivity)')
ax.set_title('Cross-Study Comparison: α Posteriors')
ax.legend(loc='upper right', fontsize=10)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
```
Several patterns emerge from the cross-study comparison. An important caveat: the uncertain-choice data in the augmented models (m_11, m_21, m_31) are *the same data* as in m_01, so the α estimates from the augmented models are not independent of the m_01 estimates. This comparison is therefore a *consistency check*—verifying that the augmented model structure does not distort the uncertain-context estimates—rather than an independent replication.
1. **Qualitative consistency.** All four models agree on the direction and approximate shape of the temperature–sensitivity relationship. The ordering $\alpha(T\!=\!0.0) > \alpha(T\!=\!0.3) \approx \alpha(T\!=\!0.7) > \alpha(T\!=\!1.0) > \alpha(T\!=\!1.5)$ is preserved.
2. **m_11 produces the tightest posteriors.** By forcing a single $\alpha$ to explain both 300 uncertain and 300 risky choices, m_11 achieves SDs roughly half those of m_01 (which used 300 uncertain choices alone). The medians are somewhat lower than m_01's — a consequence of the shared $\alpha$ being a compromise between the two contexts, with the risky data favoring lower values.
3. **m_21 and m_31 recover m_01-like α values.** When the risky context is given its own sensitivity parameter ($\omega$ or $\kappa \cdot \alpha$), the uncertain-context $\alpha$ estimates closely match the m_01 values. This is expected as a necessary consequence of the shared uncertain-choice likelihood: the m_01 and m_21/m_31 models fit the same uncertain observations with the same structural assumptions. The consistency confirms that the augmented model does not introduce distortion, rather than providing independent confirmation of the m_01 estimates.
4. **The $\alpha/\omega$ gap is substantive.** Under m_21 at $T = 0.0$, the median $\alpha \approx 70$ while the median $\omega \approx 41$. As quantified formally in @sec-context-comparison, the LLM is approximately 1.7× more sensitive to EU differences in the uncertain context than in the risky context.
## Discussion {#sec-discussion}
### Summary of Findings
This study demonstrates three key results:
1. **Temperature–sensitivity replication.** The negative association between sampling temperature and estimated SEU sensitivity $\alpha$ — first established in [Report 1](../temperature_study/01_initial_study.qmd) using the m_01 model — is robust to the inclusion of risky alternatives and to the choice of augmented model (m_11, m_21, m_31). The qualitative pattern — high sensitivity at greedy decoding, a marked decrease between $T = 0.7$ and $T = 1.0$, and the near-indistinguishability of $T = 0.3$ and $T = 0.7$ — replicates exactly.
2. **Context-dependent sensitivity.** Under m_21 and m_31, the LLM's sensitivity to EU maximization is consistently *lower* in the risky context (where probabilities are stated explicitly) than in the uncertain context (where probabilities are inferred from features). The m_31 proportionality parameter $\kappa$ clusters below 1.0 at every temperature level, with medians ranging from 0.71 to 0.94.
3. **Model adequacy.** Posterior predictive checks support m_21 as the best-calibrated model: its separate $\omega$ parameter resolves the upward bias in risky-choice PPC statistics that m_11 exhibits. The proportional model m_31 provides a reasonable compromise, but does not match m_21's calibration at all temperatures.
### Interpretation: Why Is Risky Sensitivity Lower?
The finding that $\omega < \alpha$ — now formally quantified in @sec-context-comparison — is a robust empirical pattern, but its *explanation* remains open. The interpretations below are **post hoc**: they were formulated after observing the data and cannot be discriminated by the current design. The data establish the descriptive fact; mechanistic explanation requires follow-up study.
- **Format effect.** When probabilities are stated numerically (risky context), the LLM may process them less effectively than when probability-relevant information is embedded in natural-language descriptions (uncertain context). The softmax token sampling introduces noise at the token level, which may compound differently across the two representations. This hypothesis could in principle be tested by presenting risky alternatives in natural-language format (e.g., "about a 90% chance of neither claim being approved") and observing whether sensitivity rises to uncertain-context levels.
- **Calibration asymmetry.** The feature-to-probability mapping $\psi = \text{softmax}(\beta \cdot w)$ is learned jointly with $\alpha$, and may effectively "sharpen" the inferred probability distributions in ways that favor EU-aligned choices. The risky context has no such adaptive layer. This is partially testable: if the estimated subjective probabilities $\psi_r$ under the fitted model are more "peaked" (lower entropy) than the stated risky simplexes, this would be consistent with the β layer acting as an adaptive sharpening mechanism.
- **Utility estimation precision.** In the risky context, the expected utilities $\eta^{(r)} = x^\top \upsilon$ are exact given $\upsilon$, creating very fine EU differences between alternatives with similar probability profiles. When many alternatives have nearly equal EU, even moderate sensitivity $\omega$ produces near-uniform choice probabilities. This could be assessed by comparing the distribution of EU differences among alternatives in risky vs. uncertain problems.
Any of these mechanisms — or some combination — could be operative; the current data do not discriminate among them.
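The calibration-asymmetry hypothesis is the most directly checkable with artifacts of this kind; a hypothetical sketch of the entropy comparison it calls for is given below. The fitted $\psi_r$ values are not exported in the frozen snapshot, so the sketch is not executed.

```{python}
#| eval: false
# Hypothetical check of the calibration-asymmetry interpretation: compare the
# Shannon entropy of the fitted subjective probabilities ψ_r (not exported in
# the frozen snapshot) with the entropy of the stated risky simplexes.
from scipy.stats import entropy

stated_entropy = np.array([entropy(a["probabilities"]) for a in alts])

# psi_posterior_mean would be an R × K array of posterior-mean ψ_r, if exported:
# inferred_entropy = entropy(psi_posterior_mean, axis=1)
# print("mean entropy, stated simplexes :", stated_entropy.mean())
# print("mean entropy, inferred ψ (fit) :", inferred_entropy.mean())
```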
### Confounds in the Uncertain/Risky Comparison
Beyond the post hoc nature of the interpretations above, the uncertain and risky contexts differ in ways that go beyond probability format:
- **Stimulus complexity.** Uncertain alternatives are derived from natural-language claim descriptions processed through embedding and PCA ($D = 32$ features per alternative), while risky alternatives present explicit probability simplexes ($K = 3$ values).
- **Dimensionality and estimation burden.** The uncertain model estimates a $K \times D$ matrix $\beta$ jointly with $\alpha$; the risky model takes probabilities as given.
- **Representation pathway.** Uncertain probabilities pass through a learned softmax mapping; risky probabilities enter the EU calculation directly.
These differences mean that the $\alpha > \omega$ finding could reflect task structure rather than probability-format effects per se. A matched design — where the same 30 stimulus profiles are used in both contexts, with probabilities either inferred or stated — would provide cleaner causal attribution. The current finding should be interpreted as: *as operationalized in this design*, risky choices show lower sensitivity to EU differences.
### Connection to the JDM Risk–Ambiguity Literature
The distinction between our "uncertain" and "risky" contexts maps directly onto the risk–ambiguity distinction that has been central to JDM since Ellsberg (1961). In the uncertain context, the LLM must infer probabilities from text features — analogous to the "ambiguity" condition where probabilities are unknown or imprecise. In the risky context, probabilities are stated explicitly — the canonical "risk" condition.
A large body of research on human decision-making has documented *ambiguity aversion*: people tend to prefer options with known probabilities over options with unknown probabilities, even when expected values are equivalent (Camerer & Weber, 1992; Trautmann & van de Kuilen, 2015). This typically manifests as more conservative choice under ambiguity.
The finding that the LLM shows *higher* EU sensitivity under uncertainty/ambiguity than under risk is interesting in light of this literature, though direct comparison requires caution. Higher $\alpha$ means choices are more tightly aligned with EU maximization — which could be seen as *more rational* rather than more conservative. Whether this pattern reflects something analogous to human ambiguity attitudes, or is an artifact of the adaptive $\beta$ layer, remains an open question. The [Ellsberg study](../ellsberg_study/01_ellsberg_study.qmd) in this series engages more directly with classic Ellsberg-style ambiguity manipulations; cross-referencing those findings with the present results may shed light on whether the sensitivity asymmetry reflects a general feature of LLM probability processing.
### Practical Implications
The finding that temperature affects LLM rationality has implications for AI deployment. At greedy decoding ($T = 0.0$), the LLM's choices are most closely aligned with EU maximization — potentially desirable in applications where consistent, utility-maximizing decisions are valued (e.g., automated triage, recommendation systems) but potentially undesirable where diversity of response or exploration is needed. The additional finding that context format affects sensitivity suggests that how probabilities are presented to an LLM may matter for the quality of its decisions, independent of the temperature setting.
### Limitations and Next Steps
**Independent temperature fits.** The current analysis fits each temperature condition independently. A hierarchical model that pools information across temperatures — e.g., $\log \alpha(T) = a + bT$, $\log \omega(T) = c + dT$ — would directly estimate slope parameters, test whether the temperature effect on $\omega$ parallels that on $\alpha$, and obviate the need for draw-wise slope computation across independent fits. The m_31 structure ($\omega = \kappa \alpha$) would be particularly amenable to a hierarchical extension where $\kappa$ is allowed to vary with temperature. The near-equality of $T = 0.3$ and $T = 0.7$ estimates motivates investigation of whether the relationship is piecewise or smoothly nonlinear.
**Precision on κ.** The $\kappa < 1$ finding, while consistent across temperatures, has wide credible intervals; a study with larger $N$ (more risky problems per temperature) would improve precision on this parameter and enable sharper discrimination between m_11 and m_31.
**Prior sensitivity.** The $\alpha$ prior — Lognormal(3.0, 0.75) — is carried forward from [Report 1](../temperature_study/01_initial_study.qmd) without robustness checking, and the $\omega$ prior in m_21 adopts the same hyperparameters by symmetry. The $\kappa$ prior in m_31 — Lognormal(0, 0.5) — is moderately informative. Refitting under alternative priors (e.g., $\alpha, \omega \sim \text{Lognormal}(2.5, 1.0)$) and verifying that the qualitative findings — the temperature gradient and the $\omega < \alpha$ pattern — are preserved would strengthen the robustness of the conclusions. Prior-to-posterior contraction ratios for the key parameters would further quantify the data's influence relative to the prior.
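A minimal sketch of that contraction computation, using the m_11 $\alpha$ draws loaded above and the analytic variance of the Lognormal(3.0, 0.75) prior, is shown (but not executed) below:

```{python}
#| eval: false
# Prior-to-posterior contraction, contraction = 1 - Var(posterior) / Var(prior),
# for the m_11 α posteriors against the Lognormal(3.0, 0.75) prior. Values near
# 1 indicate the data, rather than the prior, dominate the posterior.
mu, sigma = 3.0, 0.75
prior_var = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)   # lognormal variance

for t in temperatures:
    post_var = np.var(alpha_draws['m_11'][t])
    print(f"T = {t}: contraction = {1 - post_var / prior_var:.3f}")
```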
**Single LLM and task domain.** All results are from GPT-4o on the insurance triage task. The [Ellsberg study](../ellsberg_study/01_ellsberg_study.qmd) and [GPT-4o Ellsberg study](../gpt4o_ellsberg_study/01_gpt4o_ellsberg_study.qmd) in this series provide some cross-task context, while the [Claude Insurance study](../claude_insurance_study/01_claude_insurance_study.qmd) provides cross-LLM context on the same task domain. The [factorial synthesis](../factorial_synthesis/01_factorial_synthesis.qmd) formally disentangles LLM and task effects.
### Construct Validity Revisited
Returning to the construct-validity framing introduced at the top of
this report: the empirical results of this study illustrate, in
miniature, exactly what is gained by moving from the m_0 / m_01
family to a design with risky alternatives. The temperature–$\alpha$
relationship from [Report 1](../temperature_study/01_initial_study.qmd)
is a layer-(2) finding that survives the move (it replicates under
all three augmented models), but the m_31 estimate of $\kappa$ is a
layer-(3)–adjacent quantity that the m_0 / m_01 family **cannot
produce at all** — and the data do indeed locate $\kappa$
substantively below 1.0 across all temperatures. That this is
possible only because risky choices give $\delta$ direct identifying
information is the methodological point.
For the planned alignment study (see
[`prompts/hierarchical_alignment_study_plan.md`](../../../prompts/hierarchical_alignment_study_plan.md)),
the implications are concrete:
1. **The h_m01-based first wave is layer (2) by construction** —
contrasts on $\log\alpha$ across alignment manipulations — and
inherits the m_01 caveats spelled out in [Report 1](../temperature_study/01_initial_study.qmd).
2. **A second wave that adds risky alternatives** would lift the
alignment study into the same identification regime as the
present m_11 / m_21 / m_31 family, allowing context-comparison
parameters analogous to $\kappa$ — for example, prompt-condition
ratios of risky-vs-uncertain sensitivity — that are robust to the
absolute scaling of $u_\theta$.
3. **The wide credible intervals on $\kappa$ here at $N = 300$** are
a direct planning input for any such follow-up: precise
estimation of context-comparison parameters demands more risky
problems per cell than precise estimation of $\alpha$ alone.
In short, the present study should be read both as a substantive
extension of the temperature finding *and* as the methodological
template for the m_1 / m_2 / m_3-family follow-up that the
construct-validity discussion in [Report 1](../temperature_study/01_initial_study.qmd)
identifies as the principled way to move beyond the m_0 / m_01
identification limit.
### Transparency Note
The decision to fit three models (m_11, m_21, m_31) was pre-specified as part of the study design — the model family was defined in advance based on the nesting structure established in the [foundational reports](../../foundations/07_generalizing_sensitivity.qmd). The specific finding that m_21 shows better PPC calibration than m_11, and the subsequent focus on the $\alpha > \omega$ pattern, are data-driven and should be regarded as exploratory rather than confirmatory. The formal context comparison (@sec-context-comparison) was added during revision to provide rigorous quantification of a pattern that was initially presented only qualitatively.