Does m_1 Actually Identify δ?

Jeff Helzner

Foundational Report 14

foundations

validation

identification

m_0

m_1

Matched-design parameter recovery testing whether adding risky choices (model m_1) delivers the δ identification gain over uncertain-only choices (model m_0) that the theoretical argument in Report 5 predicts.

Author

Jeff Helzner

Published

June 27, 2026

0.1 Introduction

Report 5 argues, on identification-theoretic grounds, that adding risky choices to an uncertain-choice study should sharpen the recovery of the utility increments $\boldsymbol{\delta}$ in model m_1 relative to model m_0. The argument rests on the structural observation that risky-choice expected utilities depend only on $\boldsymbol{\upsilon}$ (and hence $\boldsymbol{\delta}$), not on $\boldsymbol{\beta}$, so risky choices break the multiplicative $(\boldsymbol{\beta}, \boldsymbol{\delta})$ coupling that limits learning about $\boldsymbol{\delta}$ from uncertain choices alone.
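The decoupling claim can be illustrated with a minimal sketch. It rests on two assumptions that this report does not spell out: that $\boldsymbol{\upsilon}$ is the running sum of the increments $\boldsymbol{\delta}$ anchored at 0, and that uncertain-choice subjective probabilities are a softmax of $\boldsymbol{\beta}$ applied to a feature vector. All function names and numbers are hypothetical.

```python
import math

def utilities_from_increments(delta):
    """ASSUMPTION: utilities are cumulative increments, upsilon = (0, d1, d1 + d2, ...)."""
    upsilon = [0.0]
    for d in delta:
        upsilon.append(upsilon[-1] + d)
    return upsilon

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def risky_eu(pi, delta):
    """Expected utility of a lottery with stated probabilities pi: no beta involved."""
    return sum(p * u for p, u in zip(pi, utilities_from_increments(delta)))

def uncertain_eu(features, beta, delta):
    """ASSUMPTION: subjective probabilities are softmax(beta @ features)."""
    logits = [sum(b * x for b, x in zip(row, features)) for row in beta]
    psi = softmax(logits)
    return sum(p * u for p, u in zip(psi, utilities_from_increments(delta)))

# Illustrative values only (K = 3 consequences, D = 5 features).
delta = [0.3, 0.7]
pi = [0.2, 0.5, 0.3]
features = [1.0, -0.5, 0.25, 0.0, 2.0]
beta_a = [[0.1] * 5, [0.4, -0.2, 0.0, 0.3, 0.1], [-0.3, 0.2, 0.5, 0.0, -0.1]]
beta_b = [[2.0 * b for b in row] for row in beta_a]  # a different beta

print("risky EU (same for any beta):", risky_eu(pi, delta))
print("uncertain EU under beta_a:   ", uncertain_eu(features, beta_a, delta))
print("uncertain EU under beta_b:   ", uncertain_eu(features, beta_b, delta))
```

Under these assumptions, changing $\boldsymbol{\beta}$ moves the uncertain-choice expected utility but leaves the risky one untouched, which is the structural point the Report 5 argument relies on.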

This is a claim about likelihood structure. Its empirical force depends on whether the additional information that risky choices provide is substantial at the sample sizes and lottery designs actually used in practice. The recovery study reported in Report 5 was incomplete — the four-way comparison it scoped (m_0 at M=25; m_0 at M=50 with same alternatives; m_0 at M=50 with new alternatives; m_1 at M=25+N=25) was never fully executed. The headline m_1 number in results/parameter_recovery/m1_recovery/ was computed against m_0 fits using different study designs and different true-parameter draws, so the comparison was not strictly matched.

Why this matters now

Report 13 ruled out Route 2 (concentrated δ prior) as an identification improvement: it is prior regularization, not identification. Route 3 (hierarchical pooling, implemented in h_m01) does not break the single-cell likelihood coupling. That leaves Route 1 (adding risky choices) as the only proposed remedy that could improve identification in the likelihood-based sense. Before committing to a hierarchical Route 1 model (h_m11) for the alignment study, we need empirical confirmation that the single-level Route 1 model actually delivers the predicted δ-identification gain.

0.1.1 What this report does

This report runs a strictly matched-design parameter recovery study. For each of 30 iterations:

1. A single true parameter vector $(\alpha, \boldsymbol{\beta}, \boldsymbol{\delta})$ is drawn from the m_1 simulation priors.
2. A shared study design ($M = 50$ uncertain problems built on $R = 15$ alternatives, $N = 50$ risky problems built on $S = 15$ lotteries) is held fixed across all iterations and conditions.
3. The same simulated choice vectors $\mathbf{y}$ (uncertain) and $\mathbf{z}$ (risky) are sliced four ways:

Condition	Model	Uncertain problems	Risky problems	Total choices
A	`m_0`	25	—	25
B	`m_0`	50	—	50
C	`m_1`	25	25	50
D	`m_1`	50	50	100

The central comparison is B vs C: same true parameters, same total choice count, only the model and the type of choices differ. If the Route 1 identification argument has empirical force at this sample size, C should sharpen $\boldsymbol{\delta}$ recovery materially relative to B. The A vs B comparison serves as a data-quantity control inside the m_0 family; the C vs D comparison is the same control inside m_1.

The driver script is scripts/run_m1_matched_recovery.py; results live under results/parameter_recovery/m1_matched_comparison/.
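The four-way slicing described above might look roughly like the sketch below. It is not taken from the driver script: the prefix-slicing rule, the binary placeholder choices, and the helper name `slice_condition` are all assumptions made for illustration.

```python
import random

M_FULL, N_FULL = 50, 50
rng = random.Random(2026)

# Placeholder choices; in the real study y and z come from the m_1 simulator.
y = [rng.randint(0, 1) for _ in range(M_FULL)]  # uncertain choices
z = [rng.randint(0, 1) for _ in range(N_FULL)]  # risky choices

# (number of uncertain problems, number of risky problems) per condition
CONDITION_SLICES = {
    "A_m0_M25": (25, 0),
    "B_m0_M50": (50, 0),
    "C_m1_M25N25": (25, 25),
    "D_m1_M50N50": (50, 50),
}

def slice_condition(y, z, n_uncertain, n_risky):
    """ASSUMPTION: each condition uses the first n problems of the shared design."""
    return {"y": y[:n_uncertain], "z": z[:n_risky]}

data_by_cond = {
    cond: slice_condition(y, z, m, n) for cond, (m, n) in CONDITION_SLICES.items()
}

for cond, data in data_by_cond.items():
    print(cond, "total choices:", len(data["y"]) + len(data["z"]))
```

Because every condition reads from the same `y` and `z`, any difference between conditions comes from the model and the slice, not from a fresh simulation draw.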

0.2 Aggregate Recovery Metrics

fmt = {col: (lambda x: f"{x:.3f}") for col in metrics_df.columns
       if col not in ("Condition", "n_iter")}
display_df = metrics_df.copy()
for col, f in fmt.items():
    display_df[col] = display_df[col].apply(f)
print(display_df.to_string(index=False))

Table 1: Aggregate parameter-recovery metrics across the four matched-design conditions. CI columns report mean 90% credible-interval width; cov columns report fraction of iterations whose 90% interval covers the true value (nominal = 0.90, MC SE for n=30 ≈ 5.5 percentage points).

        Condition  n_iter α RMSE  α CI α cov β RMSE  β CI β cov δ RMSE  δ CI δ cov
     A: m_0, M=25      30  0.656 2.811 0.967  0.970 3.171 0.889  0.305 0.894 0.833
     B: m_0, M=50      30  0.691 2.500 0.933  0.948 3.077 0.896  0.294 0.864 0.833
C: m_1, M=25+N=25      30  0.590 2.519 0.967  0.966 3.162 0.887  0.289 0.848 0.833
D: m_1, M=50+N=50      30  0.518 2.062 0.933  0.942 3.070 0.893  0.286 0.800 0.867
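The Monte Carlo standard error quoted in the caption follows from the binomial formula, and the coverage values in Table 1 can be checked against it. The sketch below uses only numbers copied from the table; treating the β and δ coverages as single binomial proportions is a simplification, since they are averaged over several components.

```python
import math

nominal, n_iter = 0.90, 30
mc_se = math.sqrt(nominal * (1.0 - nominal) / n_iter)  # binomial SE of a coverage rate

# Coverage values copied from Table 1 (conditions A, B, C, D).
coverage = {
    "alpha": [0.967, 0.933, 0.967, 0.933],
    "beta": [0.889, 0.896, 0.887, 0.893],
    "delta": [0.833, 0.833, 0.833, 0.867],
}

# Distance from nominal coverage in units of the MC standard error.
z_scores = {
    name: [(c - nominal) / mc_se for c in values] for name, values in coverage.items()
}

print(f"MC SE = {mc_se:.4f}")
for name, zs in z_scores.items():
    print(name, [f"{z:+.2f}" for z in zs])
```

On this rough yardstick, every coverage entry sits within about 1.2 standard errors of the nominal 0.90.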

fig, axes = plt.subplots(2, 3, figsize=(14, 7), sharex=True)
metrics_for_param = {
    'α': ('α RMSE', 'α CI'),
    'β (avg K×D)': ('β RMSE', 'β CI'),
    'δ (avg K-1)': ('δ RMSE', 'δ CI'),
}
xs = np.arange(len(CONDITIONS))
labels = [COND_LABELS[c] for c in CONDITIONS]
colors = [COND_COLORS[c] for c in CONDITIONS]

for col_ix, (pname, (rmse_col, ci_col)) in enumerate(metrics_for_param.items()):
    ax_top = axes[0, col_ix]
    ax_bot = axes[1, col_ix]
    rmses = metrics_df[rmse_col].to_numpy()
    cis = metrics_df[ci_col].to_numpy()
    ax_top.bar(xs, rmses, color=colors, edgecolor='black', linewidth=0.5)
    ax_bot.bar(xs, cis, color=colors, edgecolor='black', linewidth=0.5)
    ax_top.set_title(pname, fontsize=12)
    if col_ix == 0:
        ax_top.set_ylabel('RMSE', fontsize=11)
        ax_bot.set_ylabel('mean 90% CI width', fontsize=11)
    ax_bot.set_xticks(xs)
    ax_bot.set_xticklabels(labels, rotation=25, ha='right', fontsize=9)
    for ax in (ax_top, ax_bot):
        ax.grid(True, axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

Figure 1: Aggregate α, β, and δ recovery across the four matched-design conditions. Bars show RMSE (top) and mean 90% CI width (bottom). The central test is B vs C: same total choice count, same true δ per iteration, only the model and the type of choices differ. If Route 1’s likelihood structure delivers an identification gain, the C bars in the δ column should be visibly below the B bars.

Headline finding

At equal total choice count (B vs C, 50 choices each), m_1’s risky-choice information delivers a δ-RMSE reduction of roughly 2% and a δ-CI-width reduction of roughly 2%. Doubling both the uncertain and the risky choice counts within m_1 (C to D) yields a further improvement of about 1% in RMSE and 6% in CI width. The predicted “qualitative identification advantage” of risky choices over uncertain choices is, at this sample size and lottery design, a small quantitative improvement, not a qualitative one.
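The percentages in the headline can be reproduced directly from the δ columns of Table 1. The short sketch below uses only the tabulated values; the helper name `pct_reduction` is hypothetical.

```python
# δ RMSE and mean 90% CI width, copied from Table 1.
delta_metrics = {
    "B": {"rmse": 0.294, "ci": 0.864},
    "C": {"rmse": 0.289, "ci": 0.848},
    "D": {"rmse": 0.286, "ci": 0.800},
}

def pct_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0

for left, right in [("B", "C"), ("C", "D")]:
    rmse_red = pct_reduction(delta_metrics[left]["rmse"], delta_metrics[right]["rmse"])
    ci_red = pct_reduction(delta_metrics[left]["ci"], delta_metrics[right]["ci"])
    print(f"{left} -> {right}: δ RMSE -{rmse_red:.1f}%, δ CI width -{ci_red:.1f}%")
```

Since the table is rounded to three decimals, these percentages are themselves only accurate to a few tenths of a point.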

0.3 Within-Iteration B vs C Comparison

The aggregate numbers above could in principle hide a real B vs C effect if the variance across iterations is large relative to the mean difference. To rule that out, we compare B and C within each iteration — the matched design lets us hold true parameters fixed and just vary the model + data slice.

fig, axes = plt.subplots(1, 2, figsize=(11, 5))

ci_B, ci_C, err2_B, err2_C = [], [], [], []
for tp, smB, smC in zip(true_params, summaries_by_cond["B_m0_M50"], summaries_by_cond["C_m1_M25N25"]):
    for k in range(K-1):
        dt = tp["delta"][k]
        ci_B.append(smB.loc[f"delta[{k+1}]", "95%"] - smB.loc[f"delta[{k+1}]", "5%"])
        ci_C.append(smC.loc[f"delta[{k+1}]", "95%"] - smC.loc[f"delta[{k+1}]", "5%"])
        err2_B.append((smB.loc[f"delta[{k+1}]", "Mean"] - dt)**2)
        err2_C.append((smC.loc[f"delta[{k+1}]", "Mean"] - dt)**2)
ci_B, ci_C, err2_B, err2_C = map(np.array, (ci_B, ci_C, err2_B, err2_C))

ax = axes[0]
lim = max(ci_B.max(), ci_C.max()) * 1.05
ax.scatter(ci_B, ci_C, s=45, alpha=0.75, c='#1f77b4', edgecolor='white')
ax.plot([0, lim], [0, lim], 'r--', linewidth=1.5)
ax.set_xlim(0, lim); ax.set_ylim(0, lim); ax.set_aspect('equal')
ax.set_xlabel('B (m_0, M=50) δ CI width', fontsize=11)
ax.set_ylabel('C (m_1, M=25+N=25) δ CI width', fontsize=11)
ax.set_title(f'90% CI widths: C narrower than B in {(ci_C < ci_B).mean():.0%} of points', fontsize=11)
ax.grid(True, alpha=0.3)

ax = axes[1]
lim = max(err2_B.max(), err2_C.max()) * 1.05
ax.scatter(err2_B, err2_C, s=45, alpha=0.75, c='#1f77b4', edgecolor='white')
ax.plot([0, lim], [0, lim], 'r--', linewidth=1.5)
ax.set_xlim(0, lim); ax.set_ylim(0, lim); ax.set_aspect('equal')
ax.set_xlabel('B (m_0, M=50) δ squared error', fontsize=11)
ax.set_ylabel('C (m_1, M=25+N=25) δ squared error', fontsize=11)
ax.set_title(f'Squared error: C smaller than B in {(err2_C < err2_B).mean():.0%} of points', fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Figure 2: Within-iteration comparison of B (m_0, M=50) vs C (m_1, M=25+N=25). Each point is one (iteration, δ component) pair (n = 30 iterations × 2 components = 60 points). Left: 90% CI widths under B vs C; points below the diagonal favor C. Right: squared posterior-mean errors under B vs C; points below the diagonal favor C. With matched true parameters and matched total choice count, a sustained C-advantage would show as systematic displacement below the diagonal.

Within-iteration B vs C (matched true parameters, matched total choice count):
  median δ CI width:   B = 0.892,  C = 0.879
  median (C - B):       -0.0123  (-1.4% of B)
  C narrower than B:    80.0% of points
  δ RMSE:               B = 0.294,  C = 0.289
  C lower squared err:  53.3% of points

Wilcoxon signed-rank test on CI width (B - C, paired):
  W = 302.0,  p = 0.0000

The within-iteration analysis confirms the aggregate picture. C produces narrower δ credible intervals than B in a large majority of cases (80% of points, and the Wilcoxon test is significant), but the size of the C-advantage is small: a median improvement of roughly 1–2% of CI width. The squared-error advantage is even more marginal, with C ahead in only 53% of points. Risky choices, in this design, improve on uncertain choices for δ recovery by a margin that is statistically real but practically negligible.
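As a rough cross-check on the direction of the effect, the reported share of points where C is narrower (80.0% of 60) can be run through an exact two-sided sign test. This is a cruder, swapped-in technique rather than the Wilcoxon test used above: it ignores magnitudes, assumes no ties, and treats the 60 points as independent even though each iteration contributes two δ components.

```python
from math import comb

def sign_test_two_sided(n_favorable, n_total):
    """Exact two-sided sign-test p-value under H0: P(favorable) = 0.5, no ties."""
    k = max(n_favorable, n_total - n_favorable)
    upper_tail = sum(comb(n_total, j) for j in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2.0 * upper_tail)

n_points = 60                          # 30 iterations x 2 δ components
n_c_narrower = round(0.80 * n_points)  # 80.0% of points, from the summary output
p_value = sign_test_two_sided(n_c_narrower, n_points)

print(f"C narrower in {n_c_narrower}/{n_points} points, sign-test p = {p_value:.2e}")
```

A sign test only speaks to how often C wins, not by how much, which is exactly the distinction the paragraph above draws between statistical and practical significance.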

0.4 Where Does the m_1 Advantage Actually Live?

If m_1 is not buying us δ identification, what is it buying? Figure 1 already hints at the answer: the largest single improvement across the four conditions is in $\alpha$ recovery. The 25 → 50 doubling within m_0 (A → B) actually slightly worsens α RMSE; switching from B to C cuts α RMSE noticeably; doubling within m_1 (C → D) cuts it further. This is consistent with risky choices providing a clean signal about the choice-sensitivity parameter — they remove the $\boldsymbol{\beta}$-induced uncertainty in expected utilities, leaving a cleaner softmax — but contributing relatively little new information about the utility scale itself.
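The α pattern described here can be read off Table 1 with the same kind of arithmetic. The sketch below uses only the tabulated α RMSE values; `pct_change` is a hypothetical helper.

```python
# α RMSE per condition, copied from Table 1.
alpha_rmse = {"A": 0.656, "B": 0.691, "C": 0.590, "D": 0.518}

def pct_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

alpha_changes = {
    f"{left}->{right}": pct_change(alpha_rmse[left], alpha_rmse[right])
    for left, right in [("A", "B"), ("B", "C"), ("C", "D")]
}

for step, change in alpha_changes.items():
    print(f"{step}: α RMSE change {change:+.1f}%")
```

The B to C step is the matched-count comparison, and it is where the roughly 15% α RMSE reduction cited later in the report comes from.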

fig, axes = plt.subplots(2, 2, figsize=(11, 9))
def ci_for(cond, name):
    sm = summaries_by_cond[cond]
    return np.array([s.loc[name, "95%"] - s.loc[name, "5%"] for s in sm])

pairs = [
    ("A_m0_M25", "B_m0_M50", "A → B: same model, +25 uncertain"),
    ("B_m0_M50", "C_m1_M25N25", "B → C: matched count, m_0 → m_1"),
]
for row_ix, (left, right, title) in enumerate(pairs):
    for col_ix, (param_name, ax_title) in enumerate([("alpha", "α"), ("delta[1]", "δ₁")]):
        ax = axes[row_ix, col_ix]
        L = ci_for(left, param_name)
        R = ci_for(right, param_name)
        lim = max(L.max(), R.max()) * 1.05
        ax.scatter(L, R, s=50, alpha=0.75,
                   c='#1f77b4' if col_ix == 0 else '#2ca02c', edgecolor='white')
        ax.plot([0, lim], [0, lim], 'r--', linewidth=1.5)
        ax.set_xlim(0, lim); ax.set_ylim(0, lim); ax.set_aspect('equal')
        ax.set_xlabel(f'{COND_LABELS[left]} CI width', fontsize=10)
        ax.set_ylabel(f'{COND_LABELS[right]} CI width', fontsize=10)
        ax.set_title(f'{ax_title}  ({title})', fontsize=10)
        # mean relative change in CI width, in percent (negative = narrower under the right-hand condition)
        mean_pct = (R.mean() - L.mean()) / L.mean() * 100
        ax.text(0.05, 0.95, f'mean Δ = {mean_pct:+.1f}%',
                transform=ax.transAxes, fontsize=9, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Figure 3: Within-iteration A→B (m_0 data quantity) and B→C (model + data type, matched count) effects on α and δ posterior CI widths. Each point is one iteration. The B→C panels are the central test: same total choice count, the only change is what kind of choices and which model. The α panel shows a noticeable downward shift; the δ panel does not.

0.5 Why the Gain on δ Is So Small

The theoretical argument in Report 5 is correct as far as it goes: a risky-choice expected utility $\eta^{(r)} = \boldsymbol{\pi}^\top \boldsymbol{\upsilon}(\boldsymbol{\delta})$ depends on $\boldsymbol{\delta}$ but not on $\boldsymbol{\beta}$. What the empirical result reveals is that decoupling alone is not the same thing as informativeness. A choice between two risky lotteries with probability vectors (simplexes) $\boldsymbol{\pi}^{(1)}, \boldsymbol{\pi}^{(2)}$ provides one noisy bit of information about the sign of

\[ (\boldsymbol{\pi}^{(1)})^\top \boldsymbol{\upsilon}(\boldsymbol{\delta}) - (\boldsymbol{\pi}^{(2)})^\top \boldsymbol{\upsilon}(\boldsymbol{\delta}) = (\boldsymbol{\pi}^{(1)} - \boldsymbol{\pi}^{(2)})^\top \boldsymbol{\upsilon}(\boldsymbol{\delta}), \]

weighted by the sensitivity $\alpha$. Two factors damp the resulting δ-precision:

1. The choice signal is small for moderate α. The m_1 simulation prior puts $\alpha \sim \text{lognormal}(0, 1)$, whose median is 1, so about half of the iterations are expected to have $\alpha < 1$; in those iterations the softmax is shallow and each choice is close to a coin flip. With $N = 25$ near-coin-flips, the Fisher information about a 2-dimensional $\boldsymbol{\delta}$ is just not large.
2. The lottery design is moderately, not optimally, informative. The risky_probs="fixed" setting uses 8 simplexes spanning a reasonable range, but it is not a δ-information-optimized design. An optimal risky-choice design for K=3 would systematically present pairs that maximally vary $\boldsymbol{\pi}^{(1)} - \boldsymbol{\pi}^{(2)}$ along the directions that distinguish candidate δ values; the current design samples lotteries roughly uniformly across the “common probability” set.
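The first factor can be made concrete with a small sketch. It assumes a binary logistic choice rule, $P(\text{choose lottery 1}) = \sigma(\alpha \cdot \Delta\eta)$, which may differ in detail from the softmax actually used in m_1, and the expected-utility gap of 0.2 is purely illustrative.

```python
import math

def lognormal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a lognormal(mu, sigma) distribution."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def choice_prob(alpha, eu_gap):
    """ASSUMPTION: binary logistic choice rule, P(choose 1) = sigmoid(alpha * eu_gap)."""
    return 1.0 / (1.0 + math.exp(-alpha * eu_gap))

def fisher_info_per_choice(alpha, eu_gap):
    """Fisher information about eu_gap carried by one choice: alpha^2 * p * (1 - p)."""
    p = choice_prob(alpha, eu_gap)
    return alpha ** 2 * p * (1.0 - p)

share_below_one = lognormal_cdf(1.0)  # prior mass on alpha < 1 under lognormal(0, 1)
print(f"P(alpha < 1) = {share_below_one:.2f}")

eu_gap = 0.2  # illustrative expected-utility difference between two lotteries
for alpha in (0.5, 1.0, 3.0):
    p = choice_prob(alpha, eu_gap)
    info = fisher_info_per_choice(alpha, eu_gap)
    print(f"alpha = {alpha}: P(choose 1) = {p:.3f}, info per choice = {info:.3f}")
```

Under this rule the per-choice information scales with $\alpha^2$, so low-α iterations contribute very little about the expected-utility gap, and hence about $\boldsymbol{\delta}$, no matter how cleanly the risky choices are decoupled from $\boldsymbol{\beta}$.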

Decoupling vs. informativeness

Report 13 showed that prior regularization on δ (Route 2) achieves a δ RMSE of about 0.10 at α₀=10, well under the ~0.29 floor that data-only Routes 1 and 0 hit at this sample size in this design. That is not an apples-to-apples comparison — the Route 2 study uses a concentrated prior that also concentrates the true δ values near the centroid, so RMSE shrinks for both reasons — but it underlines that the data, at this sample size and lottery design, is not the dominant signal about δ; the prior is.

0.6 Implications for the Alignment Study

The alignment study currently uses h_m01 (hierarchical extension of uncertain-only m_0). Three observations bear on next steps:

1. A hierarchical extension of m_1 (h_m11) is unlikely to deliver substantial δ identification in the cell-level design contemplated for the alignment study. The single-level m_1 advantage on δ is too small to reasonably expect hierarchical pooling to magnify it into something practically meaningful, and hierarchical pooling does not change the within-cell likelihood structure that this report has just shown to be only marginally improved by risky choices.
2. The α-recovery improvement of m_1 over m_0 is real and non-trivial (~15% RMSE reduction at matched choice count). If the alignment study’s primary inferential target is log-α contrasts across cells (and the current scoping document frames it that way), adding risky-choice elicitation to the alignment-study protocol would tighten the contrast estimates, even though it would not solve the δ identification problem.
3. The δ identification problem will not be solved by adding more data of either type in the regimes contemplated for the alignment study. If accurate δ recovery is genuinely needed, the operative levers are (a) samples much larger than those regimes contemplate; (b) a δ-information-optimal lottery design (which requires substantive work beyond the current risky_probs options); or (c) accepting Route 2 (prior regularization) on substantive grounds for the alignment-study consequences. In the contrast-study framing of the alignment study, accurate δ recovery is not actually needed: what matters is that log-α is identified, which both m_0 and m_1 deliver.

0.7 Conclusion

The Route 1 identification argument from Report 5 is structurally correct but, in the design and sample sizes investigated here, only marginally relevant in practice. At equal total choice count and matched true parameters, switching from uncertain-only m_0 to mixed-data m_1 reduces the δ posterior CI width by roughly 2% and δ RMSE by roughly 2%. The Wilcoxon test confirms this advantage is statistically real, but it is small enough that no realistic alignment-study design could rely on it to achieve identification of utility increments. The real benefit of m_1 over m_0 in this study is on α recovery (~15% improvement at matched count), not on δ.

The practical implication for the alignment study is straightforward: do not build h_m11 on the premise that risky choices will identify δ. Either (a) ship the alignment study with the current h_m01 and the contrast-study framing already documented; (b) add risky-choice elicitation to the alignment-study protocol if the marginal α-precision gain is judged worth the data-collection cost, but expect no δ-identification benefit; or (c) revisit Route 2 (concentrated δ prior) as a substantive modeling choice if and only if equal-spacing of consequences is defensible for the alignment-study domain. The “build h_m11 to fix identification” path that the routes inventory in Report 4 suggested is not, on the present evidence, available.

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:

@online{helzner2026,
  author = {Helzner, Jeff},
  title = {Does m\_1 {Actually} {Identify} δ?},
  date = {2026-06-27},
  url = {https://jeffhelzner.github.io/seu-sensitivity/foundations/14_does_m1_identify_delta.html},
  langid = {en}
}

For attribution, please cite this work as:

Helzner, Jeff. 2026. “Does m_1 Actually Identify δ?” SEU Sensitivity Project, June 27. https://jeffhelzner.github.io/seu-sensitivity/foundations/14_does_m1_identify_delta.html.

--- title: "Does m_1 Actually Identify δ?" subtitle: "Foundational Report 14" description: | Matched-design parameter recovery testing whether adding risky choices (model m_1) delivers the δ identification gain over uncertain-only choices (model m_0) that the theoretical argument in Report 5 predicts. categories: [foundations, validation, identification, m_0, m_1] execute: cache: true --- ```{python} #| label: setup #| include: false import sys, os, json, glob, warnings warnings.filterwarnings('ignore') sys.path.insert(0, os.path.join(os.getcwd(), '..')) project_root = os.path.dirname(os.path.dirname(os.getcwd())) sys.path.insert(0, project_root) import numpy as np import pandas as pd import matplotlib.pyplot as plt from scipy import stats np.random.seed(2026) ROOT = os.path.join(project_root, "results", "parameter_recovery", "m1_matched_comparison") CONDITIONS = ["A_m0_M25", "B_m0_M50", "C_m1_M25N25", "D_m1_M50N50"] COND_LABELS = { "A_m0_M25": "A: m_0, M=25", "B_m0_M50": "B: m_0, M=50", "C_m1_M25N25":"C: m_1, M=25+N=25", "D_m1_M50N50":"D: m_1, M=50+N=50", } COND_COLORS = { "A_m0_M25": "#cccccc", "B_m0_M50": "#7f7f7f", "C_m1_M25N25": "#1f77b4", "D_m1_M50N50": "#08306b", } K, D_dim = 3, 5 ``` ## Introduction [Report 5](05_adding_risky_choices.qmd) argues, on identification-theoretic grounds, that adding risky choices to an uncertain-choice study should sharpen the recovery of the utility increments $\boldsymbol{\delta}$ in model `m_1` relative to model `m_0`. The argument rests on the structural observation that risky-choice expected utilities depend only on $\boldsymbol{\upsilon}$ (and hence $\boldsymbol{\delta}$), not on $\boldsymbol{\beta}$, so risky choices break the multiplicative $(\boldsymbol{\beta}, \boldsymbol{\delta})$ coupling that limits learning about $\boldsymbol{\delta}$ from uncertain choices alone. This is a claim about likelihood structure. 
Its empirical force depends on whether the additional information that risky choices provide is substantial *at the sample sizes and lottery designs actually used in practice*. The recovery study reported in Report 5 was incomplete — the four-way comparison it scoped (m_0 at M=25; m_0 at M=50 with same alternatives; m_0 at M=50 with new alternatives; m_1 at M=25+N=25) was never fully executed. The headline m_1 number in [results/parameter_recovery/m1_recovery/](../../results/parameter_recovery/m1_recovery/) was computed against m_0 fits using different study designs and different true-parameter draws, so the comparison was not strictly matched. ::: {.callout-important} ## Why this matters now [Report 13](13_concentrated_delta_prior.qmd) ruled out Route 2 (concentrated δ prior) as an identification improvement: it is prior regularization, not identification. Route 3 (hierarchical pooling, implemented in `h_m01`) does not break the single-cell likelihood coupling. That leaves **Route 1 (adding risky choices) as the only proposed remedy that could improve identification in the likelihood-based sense.** Before committing to a hierarchical Route 1 model (`h_m11`) for the alignment study, we need empirical confirmation that the single-level Route 1 model actually delivers the predicted δ-identification gain. ::: ### What this report does This report runs a strictly matched-design parameter recovery study. For each of 30 iterations: 1. A single true parameter vector $(\alpha, \boldsymbol{\beta}, \boldsymbol{\delta})$ is drawn from the `m_1` simulation priors. 2. A shared study design — $M = 50$ uncertain problems built on $R = 15$ alternatives, $N = 50$ risky problems built on $S = 15$ lotteries — is held fixed across all iterations and conditions. 3. 
The same simulated choice vectors $\mathbf{y}$ (uncertain) and $\mathbf{z}$ (risky) are sliced four ways: | Condition | Model | Uncertain problems | Risky problems | Total choices | |---|---|---|---|---| | **A** | `m_0` | 25 | — | 25 | | **B** | `m_0` | 50 | — | 50 | | **C** | `m_1` | 25 | 25 | 50 | | **D** | `m_1` | 50 | 50 | 100 | The central comparison is **B vs C**: same true parameters, same total choice count, only the model and the *type* of choices differ. If the Route 1 identification argument has empirical force at this sample size, C should sharpen $\boldsymbol{\delta}$ recovery materially relative to B. The A vs B comparison serves as a data-quantity control inside the `m_0` family; the C vs D comparison is the same control inside `m_1`. The driver script is `scripts/run_m1_matched_recovery.py`; results live under `results/parameter_recovery/m1_matched_comparison/`. ```{python} #| label: load-results #| include: false def load_all(): it_dirs = sorted(glob.glob(os.path.join(ROOT, "iteration_*")), key=lambda p: int(p.rsplit('_', 1)[-1])) tp_list, sm_lists = [], {c: [] for c in CONDITIONS} for d in it_dirs: with open(os.path.join(d, "true_parameters.json")) as f: tp = json.load(f) summaries = {} ok = True for c in CONDITIONS: p = os.path.join(d, f"summary_{c}.csv") if not os.path.exists(p): ok = False; break summaries[c] = pd.read_csv(p, index_col=0) if not ok: continue tp_list.append(tp) for c in CONDITIONS: sm_lists[c].append(summaries[c]) return tp_list, sm_lists true_params, summaries_by_cond = load_all() n_iter = len(true_params) print(f"Iterations completing all 4 conditions: {n_iter}") ``` ## Aggregate Recovery Metrics ```{python} #| label: build-metrics #| include: false def per_cond_metrics(): rows = [] for c in CONDITIONS: sm = summaries_by_cond[c] a_true = np.array([p["alpha"] for p in true_params]) a_mean = np.array([s.loc["alpha", "Mean"] for s in sm]) a_low = np.array([s.loc["alpha", "5%"] for s in sm]) a_up = np.array([s.loc["alpha", "95%"] 
for s in sm]) rmses, ciws, covs = [], [], [] for k in range(K): for d_ix in range(D_dim): bt = np.array([p["beta"][k][d_ix] for p in true_params]) bm = np.array([s.loc[f"beta[{k+1},{d_ix+1}]", "Mean"] for s in sm]) bl = np.array([s.loc[f"beta[{k+1},{d_ix+1}]", "5%"] for s in sm]) bu = np.array([s.loc[f"beta[{k+1},{d_ix+1}]", "95%"] for s in sm]) rmses.append(np.sqrt(np.mean((bm-bt)**2))) ciws.append(np.mean(bu-bl)) covs.append(np.mean((bt>=bl)&(bt<=bu))) b_rmse, b_ci, b_cov = float(np.mean(rmses)), float(np.mean(ciws)), float(np.mean(covs)) d_rmse_list, d_ci_list, d_cov_list = [], [], [] for k in range(K-1): dt = np.array([p["delta"][k] for p in true_params]) dm = np.array([s.loc[f"delta[{k+1}]", "Mean"] for s in sm]) dl = np.array([s.loc[f"delta[{k+1}]", "5%"] for s in sm]) du = np.array([s.loc[f"delta[{k+1}]", "95%"] for s in sm]) d_rmse_list.append(np.sqrt(np.mean((dm-dt)**2))) d_ci_list.append(np.mean(du-dl)) d_cov_list.append(np.mean((dt>=dl)&(dt<=du))) rows.append({ "Condition": COND_LABELS[c], "n_iter": len(sm), "α RMSE": float(np.sqrt(np.mean((a_mean-a_true)**2))), "α CI": float(np.mean(a_up-a_low)), "α cov": float(np.mean((a_true>=a_low)&(a_true<=a_up))), "β RMSE": b_rmse, "β CI": b_ci, "β cov": b_cov, "δ RMSE": float(np.mean(d_rmse_list)), "δ CI": float(np.mean(d_ci_list)), "δ cov": float(np.mean(d_cov_list)), }) return pd.DataFrame(rows) metrics_df = per_cond_metrics() ``` ```{python} #| label: tbl-aggregate #| tbl-cap: "Aggregate parameter-recovery metrics across the four matched-design conditions. CI columns report mean 90% credible-interval width; cov columns report fraction of iterations whose 90% interval covers the true value (nominal = 0.90, MC SE for n=30 ≈ 5 points)." 
fmt = {col: (lambda x: f"{x:.3f}") for col in metrics_df.columns if col not in ("Condition", "n_iter")} display_df = metrics_df.copy() for col, f in fmt.items(): display_df[col] = display_df[col].apply(f) print(display_df.to_string(index=False)) ``` ```{python} #| label: fig-aggregate-metrics #| fig-cap: "Aggregate δ and β recovery across the four matched-design conditions. Bars show RMSE (top) and mean 90% CI width (bottom). The central test is B vs C — same total choice count, same true δ per iteration, only the model and the type of choices differ. If Route 1's likelihood structure delivers an identification gain, C should be visibly below B." fig, axes = plt.subplots(2, 3, figsize=(14, 7), sharex=True) metrics_for_param = { 'α': ('α RMSE', 'α CI'), 'β (avg K×D)': ('β RMSE', 'β CI'), 'δ (avg K-1)': ('δ RMSE', 'δ CI'), } xs = np.arange(len(CONDITIONS)) labels = [COND_LABELS[c] for c in CONDITIONS] colors = [COND_COLORS[c] for c in CONDITIONS] for col_ix, (pname, (rmse_col, ci_col)) in enumerate(metrics_for_param.items()): ax_top = axes[0, col_ix] ax_bot = axes[1, col_ix] rmses = metrics_df[rmse_col].to_numpy() cis = metrics_df[ci_col].to_numpy() ax_top.bar(xs, rmses, color=colors, edgecolor='black', linewidth=0.5) ax_bot.bar(xs, cis, color=colors, edgecolor='black', linewidth=0.5) ax_top.set_title(pname, fontsize=12) if col_ix == 0: ax_top.set_ylabel('RMSE', fontsize=11) ax_bot.set_ylabel('mean 90% CI width', fontsize=11) ax_bot.set_xticks(xs) ax_bot.set_xticklabels(labels, rotation=25, ha='right', fontsize=9) for ax in (ax_top, ax_bot): ax.grid(True, axis='y', alpha=0.3) plt.tight_layout() plt.show() ``` ::: {.callout-important} ## Headline finding At equal total choice count (B vs C, 50 choices each), m_1's risky-choice information delivers a δ-RMSE reduction of roughly 2% and a δ-CI-width reduction of roughly 2%. Doubling the number of risky choices (C to D) yields a further 1% / 6% improvement. 
The predicted "qualitative identification advantage" of risky choices over uncertain choices is, at this sample size and lottery design, **a small quantitative improvement, not a qualitative one.** ::: ## Within-Iteration B vs C Comparison The aggregate numbers above could in principle hide a real B vs C effect if the variance across iterations is large relative to the mean difference. To rule that out, we compare B and C *within each iteration* — the matched design lets us hold true parameters fixed and just vary the model + data slice. ```{python} #| label: fig-bc-within-iter #| fig-cap: "Within-iteration comparison of B (m_0, M=50) vs C (m_1, M=25+N=25). Each point is one (iteration, δ component) pair (n = 30 iterations × 2 components = 60 points). Top: 90% CI widths under B vs C; points below the diagonal favor C. Bottom: squared posterior-mean errors under B vs C; points below the diagonal favor C. With matched true parameters and matched total choice count, a sustained C-advantage would show as systematic displacement below the diagonal." 
fig, axes = plt.subplots(1, 2, figsize=(11, 5)) ci_B, ci_C, err2_B, err2_C = [], [], [], [] for tp, smB, smC in zip(true_params, summaries_by_cond["B_m0_M50"], summaries_by_cond["C_m1_M25N25"]): for k in range(K-1): dt = tp["delta"][k] ci_B.append(smB.loc[f"delta[{k+1}]", "95%"] - smB.loc[f"delta[{k+1}]", "5%"]) ci_C.append(smC.loc[f"delta[{k+1}]", "95%"] - smC.loc[f"delta[{k+1}]", "5%"]) err2_B.append((smB.loc[f"delta[{k+1}]", "Mean"] - dt)**2) err2_C.append((smC.loc[f"delta[{k+1}]", "Mean"] - dt)**2) ci_B, ci_C, err2_B, err2_C = map(np.array, (ci_B, ci_C, err2_B, err2_C)) ax = axes[0] lim = max(ci_B.max(), ci_C.max()) * 1.05 ax.scatter(ci_B, ci_C, s=45, alpha=0.75, c='#1f77b4', edgecolor='white') ax.plot([0, lim], [0, lim], 'r--', linewidth=1.5) ax.set_xlim(0, lim); ax.set_ylim(0, lim); ax.set_aspect('equal') ax.set_xlabel('B (m_0, M=50) δ CI width', fontsize=11) ax.set_ylabel('C (m_1, M=25+N=25) δ CI width', fontsize=11) ax.set_title(f'90% CI widths: C narrower than B in {(ci_C < ci_B).mean():.0%} of points', fontsize=11) ax.grid(True, alpha=0.3) ax = axes[1] lim = max(err2_B.max(), err2_C.max()) * 1.05 ax.scatter(err2_B, err2_C, s=45, alpha=0.75, c='#1f77b4', edgecolor='white') ax.plot([0, lim], [0, lim], 'r--', linewidth=1.5) ax.set_xlim(0, lim); ax.set_ylim(0, lim); ax.set_aspect('equal') ax.set_xlabel('B (m_0, M=50) δ squared error', fontsize=11) ax.set_ylabel('C (m_1, M=25+N=25) δ squared error', fontsize=11) ax.set_title(f'Squared error: C smaller than B in {(err2_C < err2_B).mean():.0%} of points', fontsize=11) ax.grid(True, alpha=0.3) plt.tight_layout() plt.show() ``` ```{python} #| label: bc-summary #| echo: false med_diff_ci = np.median(ci_C - ci_B) med_pct = med_diff_ci / np.median(ci_B) * 100 print(f"Within-iteration B vs C (matched true parameters, matched total choice count):") print(f" median δ CI width: B = {np.median(ci_B):.3f}, C = {np.median(ci_C):.3f}") print(f" median (C - B): {med_diff_ci:+.4f} ({med_pct:+.1f}% of B)") print(f" C narrower 
than B: {(ci_C < ci_B).mean():.1%} of points") print(f" δ RMSE: B = {np.sqrt(np.mean(err2_B)):.3f}, C = {np.sqrt(np.mean(err2_C)):.3f}") print(f" C lower squared err: {(err2_C < err2_B).mean():.1%} of points") print(f"\nWilcoxon signed-rank test on CI width (B - C, paired):") w_stat, p_val = stats.wilcoxon(ci_B - ci_C) print(f" W = {w_stat:.1f}, p = {p_val:.4f}") ``` The within-iteration analysis confirms the aggregate picture. C produces narrower δ credible intervals than B in a large majority of cases — the Wilcoxon test is significant — but the *size* of the C-advantage is small: a median improvement of roughly 1–2% of CI width. The squared-error advantage is even more marginal. Risky choices, in this design, displace uncertain choices on δ recovery by a margin that is statistically real but practically negligible. ## Where Does the m_1 Advantage Actually Live? If m_1 is not buying us δ identification, what is it buying? @fig-aggregate-metrics already hints at the answer: the largest single improvement across the four conditions is in $\alpha$ recovery. The 25 → 50 doubling within `m_0` (A → B) actually slightly *worsens* α RMSE; switching from B to C cuts α RMSE noticeably; doubling within `m_1` (C → D) cuts it further. This is consistent with risky choices providing a clean signal about the choice-sensitivity parameter — they remove the $\boldsymbol{\beta}$-induced uncertainty in expected utilities, leaving a cleaner softmax — but contributing relatively little new information about the utility scale itself. ```{python} #| label: fig-where-the-gain-lives #| fig-cap: "Within-iteration A→B (m_0 data quantity) and B→C (model + data type, matched count) effects on α and δ posterior CI widths. Each point is one iteration. The B→C panels are the central test: same total choice count, the only change is what kind of choices and which model. The α panel shows a noticeable downward shift; the δ panel does not." 
fig, axes = plt.subplots(2, 2, figsize=(11, 9))

def ci_for(cond, name):
    """Return the 90% CI width of parameter `name` for every iteration of `cond`."""
    sm = summaries_by_cond[cond]
    return np.array([s.loc[name, "95%"] - s.loc[name, "5%"] for s in sm])

pairs = [
    ("A_m0_M25", "B_m0_M50", "A → B: same model, +25 uncertain"),
    ("B_m0_M50", "C_m1_M25N25", "B → C: matched count, m_0 → m_1"),
]

for row_ix, (left, right, title) in enumerate(pairs):
    for col_ix, (param_name, ax_title) in enumerate([("alpha", "α"), ("delta[1]", "δ₁")]):
        ax = axes[row_ix, col_ix]
        L = ci_for(left, param_name)
        R = ci_for(right, param_name)
        lim = max(L.max(), R.max()) * 1.05
        ax.scatter(L, R, s=50, alpha=0.75,
                   c='#1f77b4' if col_ix == 0 else '#2ca02c', edgecolor='white')
        ax.plot([0, lim], [0, lim], 'r--', linewidth=1.5)
        ax.set_xlim(0, lim); ax.set_ylim(0, lim); ax.set_aspect('equal')
        ax.set_xlabel(f'{COND_LABELS[left]} CI width', fontsize=10)
        ax.set_ylabel(f'{COND_LABELS[right]} CI width', fontsize=10)
        ax.set_title(f'{ax_title} ({title})', fontsize=10)
        # relative change in mean CI width (right vs. left condition)
        mean_pct = (R.mean() - L.mean()) / L.mean() * 100
        ax.text(0.05, 0.95, f'mean Δ = {mean_pct:+.1f}%',
                transform=ax.transAxes, fontsize=9, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

## Why the Gain on δ Is So Small

The theoretical argument in Report 5 is correct as far as it goes: a risky-choice expected utility $\eta^{(r)} = \boldsymbol{\pi}^\top \boldsymbol{\upsilon}(\boldsymbol{\delta})$ depends on $\boldsymbol{\delta}$ but not on $\boldsymbol{\beta}$.
What the empirical result reveals is that **decoupling alone is not the same thing as informativeness.** A choice between two risky lotteries with probability vectors $\boldsymbol{\pi}^{(1)}, \boldsymbol{\pi}^{(2)}$ provides one noisy bit of information about the sign of

$$
\boldsymbol{\pi}^{(1)\top} \boldsymbol{\upsilon}(\boldsymbol{\delta}) - \boldsymbol{\pi}^{(2)\top} \boldsymbol{\upsilon}(\boldsymbol{\delta}) = \left(\boldsymbol{\pi}^{(1)} - \boldsymbol{\pi}^{(2)}\right)^\top \boldsymbol{\upsilon}(\boldsymbol{\delta}),
$$

weighted by the sensitivity $\alpha$. Two factors damp the resulting δ-precision:

1. **The choice signal is small for moderate α.** The `m_1` simulation prior puts $\alpha \sim \text{lognormal}(0, 1)$, whose median is 1, so about half of the iterations have $\alpha < 1$: the softmax is shallow and each choice is close to a coin flip. With $N = 25$ near-coin-flips, the Fisher information about a 2-dimensional $\boldsymbol{\delta}$ is just not large.

2. **The lottery design is moderately, not optimally, informative.** The `risky_probs="fixed"` setting uses 8 probability vectors spanning a reasonable range of the simplex, but it is not a δ-information-optimized design. An optimal risky-choice design for K=3 would systematically present pairs that maximally vary $\boldsymbol{\pi}^{(1)} - \boldsymbol{\pi}^{(2)}$ along the directions that distinguish candidate δ values; the current design samples lotteries roughly uniformly across the "common probability" set.

::: {.callout-note}
## Decoupling vs. informativeness

[Report 13](13_concentrated_delta_prior.qmd) showed that prior regularization on δ (Route 2) achieves a δ RMSE of about 0.10 at α₀=10, well under the ~0.29 floor that data-only Routes 1 and 0 hit at this sample size in this design.
That is *not* an apples-to-apples comparison (the Route 2 study uses a concentrated prior that also concentrates the true δ values near the centroid, so RMSE shrinks for both reasons), but it underlines that the data, at this sample size and lottery design, is not the dominant signal about δ; the prior is.
:::

## Implications for the Alignment Study

The alignment study currently uses `h_m01` (hierarchical extension of uncertain-only `m_0`). Three observations bear on next steps:

1. **A hierarchical extension of m_1 (`h_m11`) is unlikely to deliver substantial δ identification in the cell-level design contemplated for the alignment study.** The single-level m_1 advantage on δ is too small to reasonably expect hierarchical pooling to magnify it into something practically meaningful, and hierarchical pooling does not change the within-cell likelihood structure that this report has just shown to be only marginally improved by risky choices.

2. **The α-recovery improvement of m_1 over m_0 is real and non-trivial** (~15% RMSE reduction at matched choice count). If the alignment study's primary inferential target is *log-α contrasts across cells* (and the current scoping document frames it that way), adding risky-choice elicitation to the alignment-study protocol would tighten the contrast estimates, even though it would not solve the δ identification problem.

3. **The δ identification problem will not be solved by adding more data of either type in the regimes contemplated for the alignment study.** If accurate δ recovery is genuinely needed, the operative levers are (a) much larger samples; (b) a δ-information-optimal lottery design (which requires substantive work beyond the current `risky_probs` options); or (c) accepting Route 2 (prior regularization) on substantive grounds for the alignment-study consequences.
   In the contrast-study framing of the alignment study, accurate δ recovery is not actually needed: what matters is that log-α is identified, which both m_0 and m_1 deliver.

## Conclusion

The Route 1 identification argument from [Report 5](05_adding_risky_choices.qmd) is structurally correct but, in the design and sample sizes investigated here, only marginally relevant in practice. At equal total choice count and matched true parameters, switching from uncertain-only `m_0` to mixed-data `m_1` reduces the δ posterior CI width by roughly 2% and δ RMSE by roughly 2%. The Wilcoxon test confirms this advantage is statistically real, but it is small enough that no realistic alignment-study design could rely on it to achieve identification of utility increments. The real benefit of `m_1` over `m_0` in this study is on α recovery (~15% improvement at matched count), not on δ.

The practical implication for the alignment study is straightforward: **do not build `h_m11` on the premise that risky choices will identify δ.** Either (a) ship the alignment study with the current `h_m01` and the contrast-study framing already documented; (b) add risky-choice elicitation to the alignment-study protocol if the marginal α-precision gain is judged worth the data-collection cost, but expect no δ-identification benefit; or (c) revisit Route 2 (concentrated δ prior) as a substantive modeling choice if and only if equal-spacing of consequences is defensible for the alignment-study domain. The "build h_m11 to fix identification" path that the routes inventory in Report 4 suggested is not, on the present evidence, available.
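
As a supplement to the two damping factors listed under "Why the Gain on δ Is So Small", the following is a minimal sketch of how little a single risky choice can say about the utility vector. It assumes, rather than reproduces from the report's model code, that a two-lottery choice follows a logistic rule with $P(\text{choose lottery 1}) = \sigma\big(\alpha\,(\boldsymbol{\pi}^{(1)} - \boldsymbol{\pi}^{(2)})^\top \boldsymbol{\upsilon}\big)$. The function name `choice_information` and the numeric values for `upsilon`, `pi_1`, and `pi_2` are hypothetical and chosen only for illustration.

```python
import numpy as np

def choice_information(pi_1, pi_2, upsilon, alpha):
    """Fisher information about upsilon from one binary risky choice.

    ASSUMPTION (not taken from the report's model code): the choice is
    logistic, P(choose lottery 1) = sigmoid(alpha * d), where
    d = (pi_1 - pi_2) @ upsilon.
    """
    d_pi = np.asarray(pi_1, dtype=float) - np.asarray(pi_2, dtype=float)
    d = d_pi @ np.asarray(upsilon, dtype=float)
    p = 1.0 / (1.0 + np.exp(-alpha * d))
    # The score w.r.t. upsilon is alpha * (y - p) * d_pi, so its expected
    # outer product is alpha^2 * p * (1 - p) * d_pi d_pi^T.
    return alpha**2 * p * (1.0 - p) * np.outer(d_pi, d_pi)

# Hypothetical K = 3 example: utilities and two lotteries (illustrative only).
upsilon = np.array([0.0, 0.4, 1.0])
pi_1 = np.array([0.2, 0.6, 0.2])
pi_2 = np.array([0.5, 0.0, 0.5])

info_low = choice_information(pi_1, pi_2, upsilon, alpha=0.5)   # shallow softmax
info_high = choice_information(pi_1, pi_2, upsilon, alpha=2.0)  # steeper softmax

print("trace of information, alpha = 0.5:", np.trace(info_low))
print("trace of information, alpha = 2.0:", np.trace(info_high))
```

Under this assumed choice rule, the per-choice information scales with $\alpha^2$ while $p(1-p)$ can never exceed $1/4$, which is one way to read the first damping factor. The matrix is also rank one along $\boldsymbol{\pi}^{(1)} - \boldsymbol{\pi}^{(2)}$, so a design accumulates information only in the directions its lottery differences span, which is the second factor.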