---
title: "Parameter Recovery Analysis"
subtitle: "Foundational Report 4"
description: |
Validation that model parameters can be accurately recovered from
simulated data, demonstrating model identifiability.
categories: [foundations, validation, m_0]
execute:
cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
# Add parent directories to path
sys.path.insert(0, os.path.join(os.getcwd(), '..'))
project_root = os.path.dirname(os.path.dirname(os.getcwd()))
sys.path.insert(0, project_root)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import json
import warnings
warnings.filterwarnings('ignore')
# Set random seed for reproducibility
np.random.seed(42)
```
## Introduction
Having examined the prior predictive distribution ([Report 3](03_prior_analysis.qmd)), we now ask: **can we recover the true parameters when we simulate data from our model with known parameters?** This is the fundamental question of parameter recovery analysis.
Parameter recovery is a critical validation step for any Bayesian model:
1. **Identifiability**: Can the model distinguish between different parameter configurations?
2. **Estimation accuracy**: Are posterior means close to true values?
3. **Uncertainty calibration**: Do 90% credible intervals contain the true value ~90% of the time?
4. **Precision**: How narrow are the credible intervals?
::: {.callout-warning}
## Preview: A Key Finding
This report reveals an asymmetry in parameter recovery: while α (sensitivity) is recovered well in the simulation regime examined here, the pair (β, δ) that determines the expected utility function shows weaker recovery, with credible intervals that narrow more slowly than those for α. As shown formally in [Report 5](05_adding_risky_choices.qmd), this reflects a **structural identification challenge**: β and δ interact multiplicatively in the expected utility, so uncertainty in one propagates to the other. Importantly, this does not prevent good recovery of α, which remains our primary parameter of interest within the prior range examined here.
:::
::: {.callout-note}
## The Recovery Paradigm
Parameter recovery follows a simple logic:
1. **Simulate**: Generate data from `m_0_sim.stan` with known ("true") parameter values
2. **Estimate**: Fit `m_0.stan` to the simulated data
3. **Compare**: Assess how well posterior estimates match the true values
4. **Repeat**: Do this many times to characterize recovery performance
If the model is well-specified and identifiable, the posterior should concentrate around the true values.
:::
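The loop below sketches this paradigm directly in CmdStanPy. It is illustrative only: the analyses in this report use the project's `ParameterRecovery` class, `draw_prior_params`, `make_sim_data`, and `make_fit_data` are hypothetical placeholders for the project-specific prior draws and Stan data assembly, `study_design` stands in for the design object generated below, and the generated-quantities name `y` is likewise assumed.

```python
# Illustrative sketch of the simulate -> estimate -> compare loop.
# draw_prior_params, make_sim_data, and make_fit_data are hypothetical helpers;
# the report itself uses the ParameterRecovery class for this.
from cmdstanpy import CmdStanModel

sim_model = CmdStanModel(stan_file="m_0_sim.stan")   # simulation model
inf_model = CmdStanModel(stan_file="m_0.stan")       # inference model

results = []
for it in range(50):
    true_params = draw_prior_params()                      # draw "true" values from the prior
    sim = sim_model.sample(data=make_sim_data(study_design, true_params),
                           fixed_param=True, chains=1, iter_sampling=1)
    choices = sim.stan_variable("y")[0]                    # one simulated dataset ("y" is an assumed name)
    fit = inf_model.sample(data=make_fit_data(study_design, choices),
                           chains=4, iter_sampling=1000)
    results.append((true_params, fit.summary()))           # compare estimates to truth later
```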
## Study Design
We use the same study design as in [Report 3](03_prior_analysis.qmd) to maintain consistency across our foundational analyses:
```{python}
#| label: study-design-config
#| echo: true
# Study design configuration (same as Report 3)
config = {
"M": 25, # Number of decision problems
"K": 3, # Number of consequences
"D": 5, # Feature dimensions
"R": 15, # Number of distinct alternatives
"min_alts_per_problem": 2, # Minimum alternatives per problem
"max_alts_per_problem": 5, # Maximum alternatives per problem
"feature_dist": "normal", # Feature distribution
"feature_params": {"loc": 0, "scale": 1}
}
print(f"Study Design Configuration:")
print(f" Decision problems (M): {config['M']}")
print(f" Consequences (K): {config['K']}")
print(f" Feature dimensions (D): {config['D']}")
print(f" Distinct alternatives (R): {config['R']}")
print(f" Alternatives per problem: {config['min_alts_per_problem']}-{config['max_alts_per_problem']}")
```
```{python}
#| label: generate-study-design
#| output: false
from utils.study_design import StudyDesign
# Create and generate the study design
study = StudyDesign(
M=config["M"],
K=config["K"],
D=config["D"],
R=config["R"],
min_alts_per_problem=config["min_alts_per_problem"],
max_alts_per_problem=config["max_alts_per_problem"],
feature_dist=config["feature_dist"],
feature_params=config["feature_params"],
design_name="parameter_recovery"
)
study.generate()
```
```{python}
#| label: design-summary
#| echo: false
# Extract design properties
w = np.array(study.w)
I = np.array(study.I)
N = I.sum(axis=1)
print(f"\nGenerated Study Design:")
print(f" Total alternatives across problems: {I.sum()}")
print(f" Alternatives per problem: min={N.min()}, max={N.max()}, mean={N.mean():.1f}")
```
## Parameter Recovery Analysis
We use the `ParameterRecovery` class to systematically evaluate how well parameters can be recovered. This class:
1. Uses `m_0_sim.stan` to generate data with known parameters drawn from the prior
2. Fits `m_0.stan` to each simulated dataset
3. Compares posterior estimates to true values across many iterations
```{python}
#| label: run-recovery
#| output: false
from analysis.parameter_recovery import ParameterRecovery
import tempfile
# Create output directory for this analysis
output_dir = tempfile.mkdtemp(prefix="param_recovery_")
# Initialize parameter recovery analysis
recovery = ParameterRecovery(
inference_model_path=None, # Uses default m_0.stan
sim_model_path=None, # Uses default m_0_sim.stan
study_design=study,
output_dir=output_dir,
n_mcmc_samples=1000, # Samples per chain
n_mcmc_chains=4, # Number of chains
n_iterations=50 # Number of simulation-recovery iterations
)
# Run the analysis
true_params_list, posterior_summaries = recovery.run()
```
```{python}
#| label: recovery-summary
#| echo: false
n_successful = len(true_params_list)
print(f"Parameter Recovery Analysis Complete:")
print(f" Successful iterations: {n_successful}")
print(f" Parameters recovered per iteration:")
print(f" - α (sensitivity): 1 parameter")
print(f" - β (feature weights): {config['K']} × {config['D']} = {config['K'] * config['D']} parameters")
print(f" - δ (utility increments): {config['K'] - 1} parameters")
```
### MCMC Diagnostics
Before interpreting recovery statistics, we verify that the MCMC sampler converged properly across all iterations. CmdStanPy's `summary()` method provides $\hat{R}$ (split R-hat) and effective sample size (ESS) for each parameter, and diagnostics are saved per iteration by the `ParameterRecovery` class.
```{python}
#| label: mcmc-diagnostics
#| echo: false
# Aggregate MCMC diagnostics across all iterations
rhat_max_list = []
ess_min_list = []
# Detect ESS column name (varies across CmdStanPy versions)
ess_col = None
sample_cols = posterior_summaries[0].columns
for candidate in ['ESS_bulk', 'N_Eff', 'ESS']:
if candidate in sample_cols:
ess_col = candidate
break
for s in posterior_summaries:
# Filter to model parameters only (exclude lp__, etc.)
param_rows = s[s.index.str.match(r'^(alpha|beta\[|delta\[|upsilon\[)')]
if 'R_hat' in param_rows.columns:
rhat_max_list.append(param_rows['R_hat'].max())
if ess_col is not None and ess_col in param_rows.columns:
ess_min_list.append(param_rows[ess_col].min())
rhat_max = np.array(rhat_max_list)
ess_min = np.array(ess_min_list)
print("MCMC Diagnostic Summary Across All Recovery Iterations:")
print(f" Max R-hat per iteration:")
print(f" Median: {np.median(rhat_max):.4f}")
print(f" 95th percentile: {np.percentile(rhat_max, 95):.4f}")
print(f" Maximum: {rhat_max.max():.4f}")
print(f" Iterations with max R-hat > 1.01: {(rhat_max > 1.01).sum()}/{len(rhat_max)}")
if len(ess_min) > 0:
print(f" Min ESS ({ess_col}) per iteration:")
print(f" Median: {np.median(ess_min):.0f}")
print(f" 5th percentile: {np.percentile(ess_min, 5):.0f}")
print(f" Minimum: {ess_min.min():.0f}")
else:
print(f" ESS: column not found in summary (available: {list(sample_cols)})")
```
::: {.callout-note}
## Diagnostic Assessment
The diagnostics above confirm that the sampler converged reliably across recovery iterations. All iterations meet standard convergence thresholds ($\hat{R} < 1.01$, ESS well above recommended minimums), so the recovery statistics below reflect genuine posterior behavior rather than MCMC artifacts. Per-iteration diagnostic files (including divergence checks) are saved by the `ParameterRecovery` class and can be inspected individually if needed.
:::
## Recovery Metrics
We evaluate parameter recovery using four key metrics:
| Metric | Definition | Target |
|--------|------------|--------|
| **Bias** | Mean(estimate - true) | ≈ 0 |
| **RMSE** | √Mean((estimate - true)²) | Small |
| **Coverage** | P(true ∈ 90% CI) | ≈ 0.90 |
| **CI Width** | Mean(upper - lower) | Small but calibrated |
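For reference, these metrics reduce to a few NumPy one-liners. The helper below is a minimal sketch (not a project function), assuming arrays over recovery iterations of true values, posterior means, and posterior 5% / 95% quantiles for a single parameter; the per-parameter computations later in this report follow exactly this form.

```python
# Minimal sketch of the four recovery metrics for a single parameter.
# Inputs are arrays over recovery iterations; not part of the project codebase.
def recovery_metrics(true_vals, post_mean, post_lower, post_upper):
    return {
        "bias": np.mean(post_mean - true_vals),
        "rmse": np.sqrt(np.mean((post_mean - true_vals) ** 2)),
        "coverage": np.mean((true_vals >= post_lower) & (true_vals <= post_upper)),
        "ci_width": np.mean(post_upper - post_lower),
    }
```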
### Sensitivity Parameter ($\alpha$)
```{python}
#| label: fig-alpha-recovery
#| fig-cap: "Recovery of the sensitivity parameter α. Left: True vs. estimated values with identity line. Right: 90% credible intervals for each iteration, colored by whether they contain the true value."
# Extract alpha values
alpha_true = np.array([p['alpha'] for p in true_params_list])
alpha_mean = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries])
alpha_lower = np.array([s.loc['alpha', '5%'] for s in posterior_summaries])
alpha_upper = np.array([s.loc['alpha', '95%'] for s in posterior_summaries])
# Calculate metrics
alpha_bias = np.mean(alpha_mean - alpha_true)
alpha_rmse = np.sqrt(np.mean((alpha_mean - alpha_true)**2))
alpha_coverage = np.mean((alpha_true >= alpha_lower) & (alpha_true <= alpha_upper))
alpha_ci_width = np.mean(alpha_upper - alpha_lower)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# True vs Estimated
ax = axes[0]
ax.scatter(alpha_true, alpha_mean, alpha=0.7, s=60, c='steelblue', edgecolor='white')
lims = [min(alpha_true.min(), alpha_mean.min()) * 0.9,
max(alpha_true.max(), alpha_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('Estimated α (posterior mean)', fontsize=12)
ax.set_title(f'α Recovery: Bias={alpha_bias:.3f}, RMSE={alpha_rmse:.3f}', fontsize=12)
ax.legend()
ax.set_aspect('equal')
# Coverage plot
ax = axes[1]
iterations = np.arange(len(alpha_true))
for i in range(len(alpha_true)):
covered = (alpha_true[i] >= alpha_lower[i]) & (alpha_true[i] <= alpha_upper[i])
color = 'forestgreen' if covered else 'crimson'
ax.plot([i, i], [alpha_lower[i], alpha_upper[i]], color=color, linewidth=2, alpha=0.7)
ax.scatter(i, alpha_mean[i], color=color, s=40, zorder=3)
ax.scatter(iterations, alpha_true, color='black', s=60, marker='x',
label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('α', fontsize=12)
ax.set_title(f'α: 90% Credible Intervals (Coverage = {alpha_coverage:.0%})', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
alpha_coverage_se = np.sqrt(alpha_coverage * (1 - alpha_coverage) / len(alpha_true))
print(f"\nα Recovery Statistics:")
print(f" Bias: {alpha_bias:.4f}")
print(f" RMSE: {alpha_rmse:.4f}")
print(f" 90% CI Coverage: {alpha_coverage:.1%} (SE = {alpha_coverage_se:.1%})")
print(f" Mean CI Width: {alpha_ci_width:.3f}")
```
### Feature-to-Probability Weights ($\boldsymbol{\beta}$)
The β matrix has K × D = 15 parameters (3 consequences × 5 features). We examine recovery across all of them:
```{python}
#| label: compute-beta-recovery
#| output: false
# Compute recovery metrics for all beta parameters
K, D = config['K'], config['D']
beta_recovery = []
for k in range(K):
for d in range(D):
param_name = f"beta[{k+1},{d+1}]"
beta_true = np.array([p['beta'][k][d] for p in true_params_list])
beta_mean = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
beta_lower = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
beta_upper = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
bias = np.mean(beta_mean - beta_true)
rmse = np.sqrt(np.mean((beta_mean - beta_true)**2))
coverage = np.mean((beta_true >= beta_lower) & (beta_true <= beta_upper))
ci_width = np.mean(beta_upper - beta_lower)
beta_recovery.append({
'parameter': param_name,
'k': k + 1,
'd': d + 1,
'bias': bias,
'rmse': rmse,
'coverage': coverage,
'ci_width': ci_width,
'true': beta_true,
'mean': beta_mean,
'lower': beta_lower,
'upper': beta_upper
})
beta_df = pd.DataFrame(beta_recovery)
```
```{python}
#| label: fig-beta-recovery-summary
#| fig-cap: "Summary of β parameter recovery. Left: RMSE for each β coefficient. Right: Coverage rates (target = 90%, shown as dashed line)."
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# RMSE heatmap
ax = axes[0]
rmse_matrix = beta_df.pivot(index='k', columns='d', values='rmse')
im = ax.imshow(rmse_matrix.values, cmap='Blues', aspect='auto')
ax.set_xlabel('Feature Dimension (d)', fontsize=12)
ax.set_ylabel('Consequence (k)', fontsize=12)
ax.set_title('β RMSE by Position', fontsize=12)
ax.set_xticks(range(D))
ax.set_xticklabels([str(d+1) for d in range(D)])
ax.set_yticks(range(K))
ax.set_yticklabels([str(k+1) for k in range(K)])
for i in range(K):
for j in range(D):
ax.text(j, i, f'{rmse_matrix.values[i, j]:.2f}', ha='center', va='center', fontsize=10)
plt.colorbar(im, ax=ax, label='RMSE')
# Coverage bar plot
ax = axes[1]
coverage_values = beta_df['coverage'].values
x = np.arange(len(coverage_values))
colors = ['forestgreen' if c >= 0.85 else 'orange' if c >= 0.75 else 'crimson'
for c in coverage_values]
ax.bar(x, coverage_values, color=colors, alpha=0.7, edgecolor='white')
ax.axhline(y=0.90, color='red', linestyle='--', linewidth=2, label='Target (90%)')
ax.set_xlabel('β Parameter Index', fontsize=12)
ax.set_ylabel('Coverage', fontsize=12)
ax.set_title('β 90% CI Coverage by Parameter', fontsize=12)
ax.set_ylim(0, 1.05)
ax.legend()
plt.tight_layout()
plt.show()
print(f"\nβ Recovery Summary:")
print(f" Mean Bias: {beta_df['bias'].mean():.4f}")
print(f" Mean RMSE: {beta_df['rmse'].mean():.4f}")
mean_cov = beta_df['coverage'].mean()
mean_cov_se = np.sqrt(mean_cov * (1 - mean_cov) / len(true_params_list))
print(f" Mean Coverage: {mean_cov:.1%} (SE ≈ {mean_cov_se:.1%})")
print(f" Mean CI Width: {beta_df['ci_width'].mean():.3f}")
```
::: {.callout-note}
## Interpreting β Coverage
The β coverage panel summarizes 15 coefficients, each estimated over 50 recovery iterations. Individual coverage rates therefore have substantial Monte Carlo variation, and the full set of bars should be read as a pattern-level diagnostic rather than as precise evidence about any single β component.
:::
```{python}
#| label: fig-beta-true-vs-estimated
#| fig-cap: "True vs. estimated values for all β parameters pooled together (left), and for the best- and worst-recovered individual components (right panels). The identity line shows perfect recovery."
# Identify best and worst recovered beta components by RMSE
best_idx = beta_df['rmse'].idxmin()
worst_idx = beta_df['rmse'].idxmax()
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Panel 1: Pool all beta values
ax = axes[0]
all_beta_true = np.concatenate([r['true'] for r in beta_recovery])
all_beta_mean = np.concatenate([r['mean'] for r in beta_recovery])
ax.scatter(all_beta_true, all_beta_mean, alpha=0.4, s=30, c='purple', edgecolor='none')
lims = [min(all_beta_true.min(), all_beta_mean.min()) * 1.1,
max(all_beta_true.max(), all_beta_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True β', fontsize=12)
ax.set_ylabel('Estimated β (posterior mean)', fontsize=12)
ax.set_title(f'β Recovery (all {K*D} parameters pooled)', fontsize=12)
ax.legend()
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
# Add correlation
corr = np.corrcoef(all_beta_true, all_beta_mean)[0, 1]
ax.text(0.05, 0.95, f'r = {corr:.3f}', transform=ax.transAxes, fontsize=12,
verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
# Panel 2: Best-recovered component
ax = axes[1]
best_rec = beta_recovery[best_idx]
ax.scatter(best_rec['true'], best_rec['mean'], alpha=0.7, s=50, c='steelblue', edgecolor='white')
b_lims = [min(best_rec['true'].min(), best_rec['mean'].min()) * 1.1 - 0.1,
max(best_rec['true'].max(), best_rec['mean'].max()) * 1.1 + 0.1]
ax.plot(b_lims, b_lims, 'r--', linewidth=2)
ax.set_xlim(b_lims)
ax.set_ylim(b_lims)
ax.set_xlabel(f'True {best_rec["parameter"]}', fontsize=12)
ax.set_ylabel(f'Estimated (post. mean)', fontsize=12)
ax.set_title(f'Best recovered: {best_rec["parameter"]} (RMSE={best_rec["rmse"]:.3f})', fontsize=11)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
# Panel 3: Worst-recovered component
ax = axes[2]
worst_rec = beta_recovery[worst_idx]
ax.scatter(worst_rec['true'], worst_rec['mean'], alpha=0.7, s=50, c='coral', edgecolor='white')
w_lims = [min(worst_rec['true'].min(), worst_rec['mean'].min()) * 1.1 - 0.1,
max(worst_rec['true'].max(), worst_rec['mean'].max()) * 1.1 + 0.1]
ax.plot(w_lims, w_lims, 'r--', linewidth=2)
ax.set_xlim(w_lims)
ax.set_ylim(w_lims)
ax.set_xlabel(f'True {worst_rec["parameter"]}', fontsize=12)
ax.set_ylabel(f'Estimated (post. mean)', fontsize=12)
ax.set_title(f'Worst recovered: {worst_rec["parameter"]} (RMSE={worst_rec["rmse"]:.3f})', fontsize=11)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
### Utility Differences ($\boldsymbol{\delta}$)
```{python}
#| label: fig-delta-recovery
#| fig-cap: "Recovery of utility difference parameters δ. Each panel shows true vs. estimated values and coverage for one δ component."
K_minus_1 = config['K'] - 1
fig, axes = plt.subplots(2, K_minus_1, figsize=(6 * K_minus_1, 10))
if K_minus_1 == 1:
axes = axes.reshape(2, 1)
delta_stats = []
for k in range(K_minus_1):
param_name = f"delta[{k+1}]"
delta_true = np.array([p['delta'][k] for p in true_params_list])
delta_mean = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
delta_lower = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
delta_upper = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
# Metrics
bias = np.mean(delta_mean - delta_true)
rmse = np.sqrt(np.mean((delta_mean - delta_true)**2))
coverage = np.mean((delta_true >= delta_lower) & (delta_true <= delta_upper))
ci_width = np.mean(delta_upper - delta_lower)
delta_stats.append({
'parameter': f'δ_{k+1}',
'bias': bias,
'rmse': rmse,
'coverage': coverage,
'ci_width': ci_width
})
# True vs Estimated
ax = axes[0, k]
ax.scatter(delta_true, delta_mean, alpha=0.7, s=60, c='forestgreen', edgecolor='white')
ax.plot([0, 1], [0, 1], 'r--', linewidth=2)
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.set_xlabel(f'True δ_{k+1}', fontsize=11)
ax.set_ylabel(f'Estimated δ_{k+1}', fontsize=11)
ax.set_title(f'δ_{k+1}: Bias={bias:.3f}, RMSE={rmse:.3f}', fontsize=11)
ax.set_aspect('equal')
# Coverage
ax = axes[1, k]
iterations = np.arange(len(delta_true))
for i in range(len(delta_true)):
covered = (delta_true[i] >= delta_lower[i]) & (delta_true[i] <= delta_upper[i])
color = 'forestgreen' if covered else 'crimson'
ax.plot([i, i], [delta_lower[i], delta_upper[i]], color=color, linewidth=2, alpha=0.7)
ax.scatter(i, delta_mean[i], color=color, s=40, zorder=3)
ax.scatter(iterations, delta_true, color='black', s=60, marker='x',
label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=11)
ax.set_ylabel(f'δ_{k+1}', fontsize=11)
ax.set_title(f'δ_{k+1}: Coverage = {coverage:.0%}', fontsize=11)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
delta_df = pd.DataFrame(delta_stats)
print(f"\nδ Recovery Summary:")
for _, row in delta_df.iterrows():
cov_se = np.sqrt(row['coverage'] * (1 - row['coverage']) / len(true_params_list))
print(f" {row['parameter']}: Bias={row['bias']:.4f}, RMSE={row['rmse']:.4f}, Coverage={row['coverage']:.0%} (SE={cov_se:.1%})")
```
::: {.callout-note}
## Precision of Coverage Estimates
With $n = 50$ recovery iterations, the standard error for a coverage rate near the nominal 90% is $\text{SE} = \sqrt{0.9 \times 0.1 / 50} \approx 0.042$, giving an approximate 95% confidence interval of [82%, 98%]. Thus, an observed coverage as low as 82% is still consistent with nominal calibration. The definitive assessment of calibration with much greater precision ($n = 999$ iterations) is provided by the simulation-based calibration (SBC) analysis in [Report 6](06_sbc_validation.qmd).
:::
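The arithmetic in the callout can be reproduced in two lines; the snippet below is just that check, using the normal approximation for a binomial proportion.

```python
# Monte Carlo precision of an observed coverage rate with n recovery iterations
n, nominal = 50, 0.90
se = np.sqrt(nominal * (1 - nominal) / n)                      # ~0.042
print(f"SE ≈ {se:.3f}; ~95% CI: [{nominal - 1.96*se:.0%}, {nominal + 1.96*se:.0%}]")
# SE ≈ 0.042; ~95% CI: [82%, 98%]
```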
## Summary Table
```{python}
#| label: tbl-recovery-summary
#| tbl-cap: "Parameter recovery statistics for all parameters in model m_0."
# Create comprehensive summary table
summary_rows = []
# Alpha
summary_rows.append({
'Parameter': 'α',
'Bias': alpha_bias,
'RMSE': alpha_rmse,
'Coverage': alpha_coverage,
'CI Width': alpha_ci_width,
'Notes': 'Sensitivity'
})
# Beta (aggregated)
summary_rows.append({
'Parameter': 'β (all)',
'Bias': beta_df['bias'].mean(),
'RMSE': beta_df['rmse'].mean(),
'Coverage': beta_df['coverage'].mean(),
'CI Width': beta_df['ci_width'].mean(),
'Notes': f'{K}×{D} parameters'
})
# Delta (individual)
for _, row in delta_df.iterrows():
summary_rows.append({
'Parameter': row['parameter'],
'Bias': row['bias'],
'RMSE': row['rmse'],
'Coverage': row['coverage'],
'CI Width': row['ci_width'],
'Notes': 'Utility increment'
})
summary_df = pd.DataFrame(summary_rows)
# Format for display
display_df = summary_df.copy()
display_df['Bias'] = display_df['Bias'].apply(lambda x: f'{x:.4f}')
display_df['RMSE'] = display_df['RMSE'].apply(lambda x: f'{x:.4f}')
display_df['Coverage'] = display_df['Coverage'].apply(lambda x: f'{x:.0%}')
display_df['CI Width'] = display_df['CI Width'].apply(lambda x: f'{x:.3f}')
print(display_df.to_string(index=False))
```
### Recovery Conditional on True $\alpha$ {#sec-conditional-recovery}
The marginal recovery statistics above average over all true parameter values drawn from the prior. Because the lognormal(0, 1) prior on α spans a wide range—from near-zero (nearly random choices) to large values (near-deterministic choices)—recovery difficulty varies considerably across iterations. To reveal this heterogeneity, we examine recovery metrics as a function of the true α value.
```{python}
#| label: fig-conditional-recovery
#| fig-cap: "Recovery metrics stratified by true α. Left: absolute error in α recovery vs. true α. Center: absolute error in δ₁ vs. true α — δ recovery degrades when α is small, since choices then carry little information about utilities. Right: posterior 90% credible-interval width for δ₁ vs. true α, used as a proxy for how tightly the data constrain δ₁ in each iteration."
fig, axes = plt.subplots(1, 3, figsize=(17, 5))
# Alpha absolute error vs true alpha
ax = axes[0]
alpha_abs_error = np.abs(alpha_mean - alpha_true)
ax.scatter(alpha_true, alpha_abs_error, alpha=0.7, s=50, c='steelblue', edgecolor='white')
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('|Estimated α − True α|', fontsize=12)
ax.set_title('α Recovery Error vs. True α', fontsize=12)
ax.grid(True, alpha=0.3)
# Delta absolute error vs true alpha
ax = axes[1]
delta_true_vals = np.array([p['delta'][0] for p in true_params_list])
delta_mean_vals = np.array([s.loc['delta[1]', 'Mean'] for s in posterior_summaries])
delta_abs_error = np.abs(delta_mean_vals - delta_true_vals)
ax.scatter(alpha_true, delta_abs_error, alpha=0.7, s=50, c='forestgreen', edgecolor='white')
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('|Estimated δ₁ − True δ₁|', fontsize=12)
ax.set_title('δ₁ Recovery Error vs. True α', fontsize=12)
ax.grid(True, alpha=0.3)
# Delta uncertainty vs true alpha
ax = axes[2]
# We compute CI width of delta as a proxy for how constrained it is.
delta_ci_widths = np.array([s.loc['delta[1]', '95%'] - s.loc['delta[1]', '5%'] for s in posterior_summaries])
ax.scatter(alpha_true, delta_ci_widths, alpha=0.7, s=50, c='coral', edgecolor='white')
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('δ₁ 90% CI Width', fontsize=12)
ax.set_title('δ₁ Posterior Uncertainty vs. True α', fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Report low-alpha vs high-alpha recovery
median_alpha = np.median(alpha_true)
low_mask = alpha_true < median_alpha
high_mask = alpha_true >= median_alpha
print(f"\nRecovery stratified by true α (median split at α = {median_alpha:.2f}):")
print(f" α RMSE (low α): {np.sqrt(np.mean((alpha_mean[low_mask] - alpha_true[low_mask])**2)):.4f}")
print(f" α RMSE (high α): {np.sqrt(np.mean((alpha_mean[high_mask] - alpha_true[high_mask])**2)):.4f}")
print(f" δ₁ RMSE (low α): {np.sqrt(np.mean((delta_mean_vals[low_mask] - delta_true_vals[low_mask])**2)):.4f}")
print(f" δ₁ RMSE (high α): {np.sqrt(np.mean((delta_mean_vals[high_mask] - delta_true_vals[high_mask])**2)):.4f}")
print(f" δ₁ mean CI width (low α): {delta_ci_widths[low_mask].mean():.3f}")
print(f" δ₁ mean CI width (high α): {delta_ci_widths[high_mask].mean():.3f}")
```
When α is small, choices are nearly random and carry little information about the underlying utilities, so δ (and β) become hard to pin down. When α is large, choices are near-deterministic, so the utility *ordering* is revealed clearly but the precise *magnitudes* of utility differences remain only weakly constrained by the choice data. Across the lognormal(0,1) prior range examined here, α recovery error remains modest, while δ posterior uncertainty varies systematically with the choice regime. This α-dependent information content is consistent with the structural coupling discussed in @sec-utility-identification.
## Discussion
### Differential Recovery Across Parameters {#sec-utility-identification}
::: {.callout-note}
## Key Finding: Differential Parameter Recovery
The results above reveal an asymmetry in parameter recovery:
- **α (sensitivity)** is recovered well, with posterior means clustering tightly around true values across the lognormal(0,1) prior range examined here
- **β (feature weights)** and **δ (utility differences)** show materially weaker recovery, with wider credible intervals and lower precision compared to α
Both β and δ show the same qualitative pattern of slower posterior concentration, consistent with their structural coupling; the prose below focuses on δ as the representative utility-function parameter.
This pattern is consistent with the identification structure analyzed formally in [Report 5](05_adding_risky_choices.qmd) (see the proposition there on m_0 vs. m_1 identification): α is identified more directly through choice sensitivity, while β and δ are only **weakly identified** through their composite mapping to expected utilities.
**For applications focused on estimating sensitivity (α):** within the prior range examined here, the weaker identification of β and δ does not substantially impair α recovery. If the primary research question concerns sensitivity to expected-utility differences rather than the decomposition of expected utility into beliefs and values, the identification challenges for (β, δ) may be of secondary concern.
:::
To understand why β and δ are harder to recover than α, consider the structure of the model. In decisions under uncertainty, choices depend on **expected utilities**:
$$
\eta_r = \sum_{k=1}^K \psi_{rk} \upsilon_k
$$
where $\boldsymbol{\psi}_r$ are subjective probabilities (determined by β) and $\boldsymbol{\upsilon}$ are utilities (determined by δ). The key insight is that we only observe *rankings* of expected utilities through choices—not the utilities themselves, and not the separate contributions of beliefs and values.
This creates a multiplicative coupling between β and δ: different (β, δ) pairs can produce similar expected utility functions, making both parameters harder to pin down precisely:
- **α** directly governs how *sensitively* choices respond to expected utility differences; within the prior and design regime examined here, it is recovered well because choice probabilities depend monotonically on α for any fixed expected utility difference
- **β and δ** jointly determine expected utilities through their product; uncertainty in one propagates to the other
Note that this does not mean β and δ are completely uninformed by the data — the credible intervals narrow relative to their prior support, as the CI-width diagnostics above imply — but information accumulates more slowly than for α.
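A small numerical example makes the coupling concrete. It assumes the simplex construction of utilities used in m_0 (for K = 3, υ = (0, δ₁, 1), i.e., δ are the increments between ordered utilities); the probability vectors are made up for illustration. For a single alternative, distinct (ψ, δ) pairs can yield exactly the same expected utility, so choice data that depend only on η cannot tell them apart there; across many alternatives generated by one β the trade-off is only partial, which is why (β, δ) are weakly rather than completely unidentified.

```python
# Two different (subjective-probability, utility-increment) configurations that
# imply the same expected utility for one alternative (K = 3).
# Assumes the simplex construction upsilon = (0, delta_1, delta_1 + delta_2) = (0, delta_1, 1).
def expected_utility(psi, delta1):
    upsilon = np.array([0.0, delta1, 1.0])
    return float(psi @ upsilon)

psi_a, delta1_a = np.array([0.20, 0.50, 0.30]), 0.40   # illustrative values
psi_b, delta1_b = np.array([0.30, 0.25, 0.45]), 0.20

print(expected_utility(psi_a, delta1_a))   # 0.5
print(expected_utility(psi_b, delta1_b))   # 0.5  -> same eta, different (psi, delta)
```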
The multiplicative coupling also leaves an illustrative signature *across* recovery iterations: when the posterior mean for β[1,1] errs in a direction that shifts subjective probability toward higher-utility consequences, the posterior mean for δ₁ may compensate in the opposite direction so that the implied expected utilities stay similar. The figure below visualizes this across-iteration error coupling. (We do not directly visualize the within-posterior correlation between β and δ here, since only marginal posterior summaries are saved by the recovery pipeline; the across-iteration diagnostic should therefore be read as suggestive rather than as direct posterior evidence of compensation.)
```{python}
#| label: fig-beta-delta-correlation
#| fig-cap: "Across-iteration signatures of the β–δ coupling. Left: estimation errors in β[1,1] vs. estimation errors in δ₁ across recovery iterations; a negative Pearson r (reported in the title) is consistent with the predicted compensatory tradeoff. Right: joint posterior uncertainty proxies — δ₁ 90% CI width vs. β[1,1] 90% CI width across iterations, color-coded by the true α; iterations with more informative choices (larger α) tend to cluster toward narrower CI widths for both parameters."
# Since only marginal summaries are saved, use across-iteration errors and
# CI widths as diagnostics rather than direct posterior correlations.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: Distribution of signed errors showing coupling
ax = axes[0]
# For each iteration, compute the bias direction for beta[1,1] and delta[1]
beta_11_errors = np.array([s.loc['beta[1,1]', 'Mean'] - p['beta'][0][0]
for p, s in zip(true_params_list, posterior_summaries)])
delta_1_errors = np.array([s.loc['delta[1]', 'Mean'] - p['delta'][0]
for p, s in zip(true_params_list, posterior_summaries)])
ax.scatter(beta_11_errors, delta_1_errors, alpha=0.7, s=50, c='mediumpurple', edgecolor='white')
ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
corr_errors = np.corrcoef(beta_11_errors, delta_1_errors)[0, 1]
ax.set_xlabel('β[1,1] estimation error', fontsize=12)
ax.set_ylabel('δ₁ estimation error', fontsize=12)
ax.set_title(f'Estimation Error Coupling (r = {corr_errors:.3f})', fontsize=12)
ax.grid(True, alpha=0.3)
# Right: CI width of delta vs CI width of beta (across iterations)
ax = axes[1]
beta_11_ci = np.array([s.loc['beta[1,1]', '95%'] - s.loc['beta[1,1]', '5%'] for s in posterior_summaries])
delta_1_ci = np.array([s.loc['delta[1]', '95%'] - s.loc['delta[1]', '5%'] for s in posterior_summaries])
scatter = ax.scatter(beta_11_ci, delta_1_ci, alpha=0.7, s=50, c=alpha_true,
cmap='viridis', edgecolor='white')
ax.set_xlabel('β[1,1] 90% CI Width', fontsize=12)
ax.set_ylabel('δ₁ 90% CI Width', fontsize=12)
ax.set_title('Joint Posterior Uncertainty (color = true α)', fontsize=12)
ax.grid(True, alpha=0.3)
plt.colorbar(scatter, ax=ax, label='True α')
plt.tight_layout()
plt.show()
```
### Does Increasing Sample Size Help?
A natural question is whether the weaker δ recovery is simply a matter of insufficient data: perhaps a design with only M=25 decision problems does not provide enough information. (As noted in the Discussion, β follows the same pattern as δ due to their structural coupling; we focus on δ as the representative utility-function parameter.) We test this by doubling the sample size to M=50.
To ensure a valid comparison, we use the `extend()` method to create the larger design from the original M=25 design. This keeps the same alternatives (features) and original problems, adding 25 new problems that use those same alternatives. Any differences in recovery can then be attributed to sample size rather than different feature configurations.
::: {.callout-tip collapse="true"}
## How `extend()` preserves the base design
The `StudyDesign.extend()` method creates a new design object that:
1. **Copies the identical feature matrix** `w` (all R=15 alternatives retain their original feature vectors)
2. **Preserves the original M=25 problems** (the first 25 rows of the indicator matrix `I` are unchanged)
3. **Appends new problems** by randomly assigning subsets of the *same* R=15 alternatives to each new problem
This guarantees that any change in recovery quality is due to having more choice observations, not to different feature configurations. The companion method `extend_with_alternatives()` (used below) additionally generates new feature vectors, enabling comparisons where the feature space itself expands.
:::
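The sketch below illustrates the kind of extension described above: the feature matrix `w` is untouched, the original rows of the indicator matrix `I` are kept, and new rows are appended that reuse the same alternatives. It is a conceptual illustration only, not the actual `StudyDesign.extend()` implementation.

```python
# Conceptual illustration of extending a design with new problems over the
# same alternatives. Not the actual StudyDesign.extend() implementation.
def extend_indicator(I_old, additional_M, min_alts=2, max_alts=5, seed=0):
    rng = np.random.default_rng(seed)
    M_old, R = I_old.shape
    new_rows = np.zeros((additional_M, R), dtype=int)
    for m in range(additional_M):
        n_alts = rng.integers(min_alts, max_alts + 1)          # problem size
        new_rows[m, rng.choice(R, size=n_alts, replace=False)] = 1
    return np.vstack([I_old, new_rows])                        # feature matrix w stays the same

# e.g. I_m50 = extend_indicator(I, additional_M=25)  -> shape (50, 15)
```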
```{python}
#| label: config-m50
#| echo: true
# Extend the original study design to M=50
# This preserves the same alternatives and original problems
study_m50 = study.extend(additional_M=25, design_name="parameter_recovery_m50")
print(f"Extended Study Design (M=50):")
print(f" Decision problems: {study_m50.M} (original {study.M} + 25 new)")
print(f" Same R={study_m50.R} alternatives with identical features")
print(f" Total expected observations: ~{study_m50.M * 3.5:.0f} choices")
```
```{python}
#| label: run-recovery-m50
#| output: false
# Create output directory
output_dir_m50 = tempfile.mkdtemp(prefix="param_recovery_m50_")
# Run parameter recovery with larger sample
recovery_m50 = ParameterRecovery(
inference_model_path=None,
sim_model_path=None,
study_design=study_m50,
output_dir=output_dir_m50,
n_mcmc_samples=1000,
n_mcmc_chains=4,
n_iterations=50
)
true_params_m50, posterior_summaries_m50 = recovery_m50.run()
```
```{python}
#| label: compare-sample-sizes
#| echo: false
# Compute recovery metrics for M=50
# Alpha
alpha_true_m50 = np.array([p['alpha'] for p in true_params_m50])
alpha_mean_m50 = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries_m50])
alpha_lower_m50 = np.array([s.loc['alpha', '5%'] for s in posterior_summaries_m50])
alpha_upper_m50 = np.array([s.loc['alpha', '95%'] for s in posterior_summaries_m50])
alpha_rmse_m50 = np.sqrt(np.mean((alpha_mean_m50 - alpha_true_m50)**2))
alpha_coverage_m50 = np.mean((alpha_true_m50 >= alpha_lower_m50) & (alpha_true_m50 <= alpha_upper_m50))
alpha_ci_width_m50 = np.mean(alpha_upper_m50 - alpha_lower_m50)
# Delta (using config from original study)
K_minus_1 = config['K'] - 1
delta_stats_m50 = []
for k in range(K_minus_1):
param_name = f"delta[{k+1}]"
delta_true_m50 = np.array([p['delta'][k] for p in true_params_m50])
delta_mean_m50 = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries_m50])
delta_lower_m50 = np.array([s.loc[param_name, '5%'] for s in posterior_summaries_m50])
delta_upper_m50 = np.array([s.loc[param_name, '95%'] for s in posterior_summaries_m50])
delta_stats_m50.append({
'parameter': f'δ_{k+1}',
'rmse': np.sqrt(np.mean((delta_mean_m50 - delta_true_m50)**2)),
'coverage': np.mean((delta_true_m50 >= delta_lower_m50) & (delta_true_m50 <= delta_upper_m50)),
'ci_width': np.mean(delta_upper_m50 - delta_lower_m50)
})
delta_df_m50 = pd.DataFrame(delta_stats_m50)
```
```{python}
#| label: fig-sample-size-comparison
#| fig-cap: "Comparison of parameter recovery between M=25 and M=50. Doubling the number of decision problems improves α recovery but has a more modest effect on δ recovery. (β exhibits similar patterns to δ due to their structural coupling.)"
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# RMSE comparison
ax = axes[0]
params = ['α', 'δ₁', 'δ₂']
rmse_m25 = [alpha_rmse, delta_df.iloc[0]['rmse'], delta_df.iloc[1]['rmse']]
rmse_m50 = [alpha_rmse_m50, delta_df_m50.iloc[0]['rmse'], delta_df_m50.iloc[1]['rmse']]
x = np.arange(len(params))
width = 0.35
bars1 = ax.bar(x - width/2, rmse_m25, width, label='M=25', color='steelblue', alpha=0.7)
bars2 = ax.bar(x + width/2, rmse_m50, width, label='M=50', color='coral', alpha=0.7)
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('RMSE', fontsize=12)
ax.set_title('RMSE by Sample Size', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# Coverage comparison
ax = axes[1]
cov_m25 = [alpha_coverage, delta_df.iloc[0]['coverage'], delta_df.iloc[1]['coverage']]
cov_m50 = [alpha_coverage_m50, delta_df_m50.iloc[0]['coverage'], delta_df_m50.iloc[1]['coverage']]
bars1 = ax.bar(x - width/2, cov_m25, width, label='M=25', color='steelblue', alpha=0.7)
bars2 = ax.bar(x + width/2, cov_m50, width, label='M=50', color='coral', alpha=0.7)
ax.axhline(y=0.90, color='red', linestyle='--', linewidth=2, label='Target (90%)')
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('Coverage', fontsize=12)
ax.set_title('90% CI Coverage by Sample Size', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.set_ylim(0, 1.05)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# CI Width comparison
ax = axes[2]
ci_m25 = [alpha_ci_width, delta_df.iloc[0]['ci_width'], delta_df.iloc[1]['ci_width']]
ci_m50 = [alpha_ci_width_m50, delta_df_m50.iloc[0]['ci_width'], delta_df_m50.iloc[1]['ci_width']]
bars1 = ax.bar(x - width/2, ci_m25, width, label='M=25', color='steelblue', alpha=0.7)
bars2 = ax.bar(x + width/2, ci_m50, width, label='M=50', color='coral', alpha=0.7)
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('CI Width', fontsize=12)
ax.set_title('90% CI Width by Sample Size', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-sample-size-comparison
#| echo: false
# Create comparison table
comparison_data = []
# Alpha
comparison_data.append({
'Parameter': 'α',
'RMSE (M=25)': f'{alpha_rmse:.4f}',
'RMSE (M=50)': f'{alpha_rmse_m50:.4f}',
'Δ RMSE': f'{(alpha_rmse_m50 - alpha_rmse)/alpha_rmse * 100:+.1f}%',
'CI Width (M=25)': f'{alpha_ci_width:.3f}',
'CI Width (M=50)': f'{alpha_ci_width_m50:.3f}'
})
# Delta
for i in range(K_minus_1):
comparison_data.append({
'Parameter': f'δ_{i+1}',
'RMSE (M=25)': f'{delta_df.iloc[i]["rmse"]:.4f}',
'RMSE (M=50)': f'{delta_df_m50.iloc[i]["rmse"]:.4f}',
'Δ RMSE': f'{(delta_df_m50.iloc[i]["rmse"] - delta_df.iloc[i]["rmse"])/delta_df.iloc[i]["rmse"] * 100:+.1f}%',
'CI Width (M=25)': f'{delta_df.iloc[i]["ci_width"]:.3f}',
'CI Width (M=50)': f'{delta_df_m50.iloc[i]["ci_width"]:.3f}'
})
comparison_df = pd.DataFrame(comparison_data)
print("Sample Size Comparison (M=25 vs M=50):")
print(comparison_df.to_string(index=False))
```
In these runs, the table shows the asymmetric sample-size pattern anticipated in @sec-utility-identification: doubling M produces a larger relative reduction in α RMSE and CI width than in δ RMSE and CI width. (β follows the same pattern as δ, as noted in the Discussion.) This is consistent with α being informed more directly by choice sensitivity while (β, δ) are weakly identified through their composite mapping to expected utilities — additional uncertain-choice data sharpens the parameter that the data most directly inform, while improving (β, δ) more slowly.
### Does Adding New Alternatives Help?
The M=50 extension above adds problems over the **same** alternatives — it does not expand the feature space. We might hypothesize that adding **new** alternatives (with new feature vectors) provides qualitatively different information that could help with δ recovery.
To test this, we use `extend_with_alternatives()` to add 15 new alternatives along with 25 new problems that use those alternatives. This parallels how model m_1 (in [Report 5](05_adding_risky_choices.qmd)) adds new risky alternatives along with new risky problems, enabling a fairer comparison.
```{python}
#| label: config-m50-new-alts
#| echo: true
# Extend with both new alternatives AND new problems
# New problems only use the new alternatives (paralleling m_1's risky component)
study_m50_new_alts = study.extend_with_alternatives(
additional_M=25,
additional_R=15, # Same number of new alternatives as m_1 has risky alternatives
design_name="parameter_recovery_m50_new_alts",
new_problems_use_new_alts_only=True
)
print(f"Extended Study Design (M=50 with new alternatives):")
print(f" Decision problems: {study_m50_new_alts.M} (original {study.M} + 25 new)")
print(f" Alternatives: R={study_m50_new_alts.R} (original {study.R} + 15 new)")
print(f" Original problems (1-25) use original alternatives (1-15)")
print(f" New problems (26-50) use new alternatives (16-30)")
```
::: {.callout-tip collapse="true"}
## Why `new_problems_use_new_alts_only=True`?
Setting `new_problems_use_new_alts_only=True` creates a **block structure**: original problems draw from original alternatives, and new problems draw exclusively from new alternatives. This design choice mirrors the structure of model m_1 in [Report 5](05_adding_risky_choices.qmd), where risky problems use only risky alternatives (lotteries). Maintaining this parallel enables a fair comparison: any difference between "M=50 with new uncertain alternatives" and "M=25 + N=25 with risky alternatives" can be attributed to the *type* of data (risk vs. uncertainty) rather than to differences in how alternatives are assigned to problems.
An alternative design—mixing old and new alternatives within new problems—would provide cross-referencing information that might improve identification. However, this would break the structural parallel with m_1 and make the comparison less clean. We prioritize the controlled comparison here; designs that mix alternatives are worth exploring in future work.
:::
```{python}
#| label: run-recovery-m50-new-alts
#| output: false
# Create output directory
output_dir_m50_new_alts = tempfile.mkdtemp(prefix="param_recovery_m50_new_alts_")
# Run parameter recovery with new alternatives
recovery_m50_new_alts = ParameterRecovery(
inference_model_path=None,
sim_model_path=None,
study_design=study_m50_new_alts,
output_dir=output_dir_m50_new_alts,
n_mcmc_samples=1000,
n_mcmc_chains=4,
n_iterations=50
)
true_params_m50_new_alts, posterior_summaries_m50_new_alts = recovery_m50_new_alts.run()
```
```{python}
#| label: compare-all-extensions
#| echo: false
# Compute recovery metrics for M=50 with new alternatives
# Alpha
alpha_true_m50_new = np.array([p['alpha'] for p in true_params_m50_new_alts])
alpha_mean_m50_new = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries_m50_new_alts])
alpha_lower_m50_new = np.array([s.loc['alpha', '5%'] for s in posterior_summaries_m50_new_alts])
alpha_upper_m50_new = np.array([s.loc['alpha', '95%'] for s in posterior_summaries_m50_new_alts])
alpha_rmse_m50_new = np.sqrt(np.mean((alpha_mean_m50_new - alpha_true_m50_new)**2))
alpha_coverage_m50_new = np.mean((alpha_true_m50_new >= alpha_lower_m50_new) & (alpha_true_m50_new <= alpha_upper_m50_new))
alpha_ci_width_m50_new = np.mean(alpha_upper_m50_new - alpha_lower_m50_new)
# Delta
delta_stats_m50_new = []
for k in range(K_minus_1):
param_name = f"delta[{k+1}]"
delta_true_m50_new = np.array([p['delta'][k] for p in true_params_m50_new_alts])
delta_mean_m50_new = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries_m50_new_alts])
delta_lower_m50_new = np.array([s.loc[param_name, '5%'] for s in posterior_summaries_m50_new_alts])
delta_upper_m50_new = np.array([s.loc[param_name, '95%'] for s in posterior_summaries_m50_new_alts])
delta_stats_m50_new.append({
'parameter': f'δ_{k+1}',
'rmse': np.sqrt(np.mean((delta_mean_m50_new - delta_true_m50_new)**2)),
'coverage': np.mean((delta_true_m50_new >= delta_lower_m50_new) & (delta_true_m50_new <= delta_upper_m50_new)),
'ci_width': np.mean(delta_upper_m50_new - delta_lower_m50_new)
})
delta_df_m50_new = pd.DataFrame(delta_stats_m50_new)
```
```{python}
#| label: fig-three-way-comparison
#| fig-cap: "Three-way comparison of δ recovery: M=25 (base), M=50 (same alternatives), and M=50 (new alternatives). Adding new alternatives may provide different information than simply adding more problems over existing alternatives. (β exhibits similar patterns to δ due to their structural coupling.)"
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
params = ['δ₁', 'δ₂']
x = np.arange(len(params))
width = 0.25
# RMSE comparison
ax = axes[0]
rmse_m25_d = [delta_df.iloc[k]['rmse'] for k in range(K_minus_1)]
rmse_m50_d = [delta_df_m50.iloc[k]['rmse'] for k in range(K_minus_1)]
rmse_m50_new_d = [delta_stats_m50_new[k]['rmse'] for k in range(K_minus_1)]
bars1 = ax.bar(x - width, rmse_m25_d, width, label='M=25 (base)', color='steelblue', alpha=0.7)
bars2 = ax.bar(x, rmse_m50_d, width, label='M=50 (same alts)', color='coral', alpha=0.7)
bars3 = ax.bar(x + width, rmse_m50_new_d, width, label='M=50 (new alts)', color='mediumseagreen', alpha=0.7)
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('RMSE', fontsize=12)
ax.set_title('δ RMSE by Extension Type', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3, axis='y')
# Coverage comparison
ax = axes[1]
cov_m25_d = [delta_df.iloc[k]['coverage'] for k in range(K_minus_1)]
cov_m50_d = [delta_df_m50.iloc[k]['coverage'] for k in range(K_minus_1)]
cov_m50_new_d = [delta_stats_m50_new[k]['coverage'] for k in range(K_minus_1)]
bars1 = ax.bar(x - width, cov_m25_d, width, label='M=25 (base)', color='steelblue', alpha=0.7)
bars2 = ax.bar(x, cov_m50_d, width, label='M=50 (same alts)', color='coral', alpha=0.7)
bars3 = ax.bar(x + width, cov_m50_new_d, width, label='M=50 (new alts)', color='mediumseagreen', alpha=0.7)
ax.axhline(y=0.90, color='red', linestyle='--', linewidth=2, label='Target (90%)')
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('Coverage', fontsize=12)
ax.set_title('δ 90% CI Coverage by Extension Type', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.set_ylim(0, 1.05)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3, axis='y')
# CI Width comparison
ax = axes[2]
ci_m25_d = [delta_df.iloc[k]['ci_width'] for k in range(K_minus_1)]
ci_m50_d = [delta_df_m50.iloc[k]['ci_width'] for k in range(K_minus_1)]
ci_m50_new_d = [delta_stats_m50_new[k]['ci_width'] for k in range(K_minus_1)]
bars1 = ax.bar(x - width, ci_m25_d, width, label='M=25 (base)', color='steelblue', alpha=0.7)
bars2 = ax.bar(x, ci_m50_d, width, label='M=50 (same alts)', color='coral', alpha=0.7)
bars3 = ax.bar(x + width, ci_m50_new_d, width, label='M=50 (new alts)', color='mediumseagreen', alpha=0.7)
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('CI Width', fontsize=12)
ax.set_title('δ 90% CI Width by Extension Type', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-three-way-comparison
#| echo: false
# Create three-way comparison table using pandas (no extra dependencies)
three_way_rows = []
# Alpha
three_way_rows.append({
'Param': 'α',
'RMSE M=25': f'{alpha_rmse:.4f}',
'RMSE M=50 (same)': f'{alpha_rmse_m50:.4f}',
'RMSE M=50 (new)': f'{alpha_rmse_m50_new:.4f}',
'CI Width M=25': f'{alpha_ci_width:.3f}',
'CI Width M=50 (same)': f'{alpha_ci_width_m50:.3f}',
'CI Width M=50 (new)': f'{alpha_ci_width_m50_new:.3f}'
})
# Delta
for i in range(K_minus_1):
three_way_rows.append({
'Param': f'δ_{i+1}',
'RMSE M=25': f'{delta_df.iloc[i]["rmse"]:.4f}',
'RMSE M=50 (same)': f'{delta_df_m50.iloc[i]["rmse"]:.4f}',
'RMSE M=50 (new)': f'{delta_stats_m50_new[i]["rmse"]:.4f}',
'CI Width M=25': f'{delta_df.iloc[i]["ci_width"]:.3f}',
'CI Width M=50 (same)': f'{delta_df_m50.iloc[i]["ci_width"]:.3f}',
'CI Width M=50 (new)': f'{delta_stats_m50_new[i]["ci_width"]:.3f}'
})
three_way_df = pd.DataFrame(three_way_rows)
print("Three-Way Extension Comparison:")
print(three_way_df.to_string(index=False))
```
### Interpretation: Sample Size and Alternative Effects
The three-way comparison reveals important patterns about what helps δ recovery (β exhibits similar patterns due to the structural coupling between these parameters):
::: {.callout-note}
## Extension Type Matters
Comparing the three conditions:
1. **M=25 (base)**: Baseline recovery with modest δ precision
2. **M=50 (same alternatives)**: Adding 25 problems over the same R=15 alternatives
3. **M=50 (new alternatives)**: Adding 25 problems using 15 **new** alternatives (R increases to 30)
Directional findings from the three-way table:
- **α recovery** improves with more data under either extension; the two M=50 conditions are comparable for α.
- **δ recovery** suggests a possible modest additional benefit from new alternatives over same alternatives, consistent with the idea that expanding the feature space provides qualitatively different information about how features map to subjective probabilities (and thus indirectly about δ through the multiplicative coupling).
These directional comparisons should be read with the precision caveat in mind: with $n = 50$ recovery iterations, the standard error on coverage is roughly four percentage points, and small differences in RMSE or CI width should not be over-interpreted. More direct across-model comparisons — with matched designs and/or a calibration-focused framing — are provided in [Report 5](05_adding_risky_choices.qmd) and [Report 6](06_sbc_validation.qmd).
:::
This three-way analysis sets up a crucial comparison for [Report 5](05_adding_risky_choices.qmd), where we add risky alternatives (lotteries with known probabilities) instead of additional uncertain alternatives. By comparing:
- **M=50 with new uncertain alternatives** (this report): Expands uncertain feature space
- **M=25 + N=25 with new risky alternatives** (Report 5): Adds known-probability lotteries
...we can more cleanly distinguish whether improvements come from the *type* of data (risk vs. uncertainty) or simply from adding more diverse choice problems.
The asymmetry in learning rates for (β, δ) versus α reflects the model structure: in decisions under uncertainty, utilities and subjective probabilities enter the likelihood through their product (expected utilities). This creates a multiplicative coupling that makes (β, δ) harder to pin down than α, which more directly governs choice sensitivity.
### Study Design Parameters in m_0
The `m_0.stan` data block defines several parameters that characterize study size:
| Parameter | Description | Current Value |
|-----------|-------------|---------------|
| M | Number of decision problems | 25 → 50 |
| K | Number of consequences | 3 |
| D | Feature dimensions | 5 |
| R | Distinct alternatives | 15 |
We might consider varying other parameters:
- **Increasing K** (more consequences): Would require estimating more δ parameters, potentially making recovery harder
- **Increasing R** (more alternatives): Provides more information about β, and may indirectly help δ through better-constrained expected utilities
- **Increasing D** (more features): Similar to increasing R—primarily helps with β
The effectiveness of these changes for (β, δ) recovery is an empirical question worth exploring in future work.
### Alternatives to More Uncertain-Choice Data
Beyond simply collecting more uncertain-choice observations, three further routes can sharpen recovery of (β, δ). They differ in *where* the additional information comes from — the design, the prior, or cross-agent structure — and have very different epistemic standing.
#### Route 1: Risky alternatives as a direct probe of utility
The history of decision theory suggests the use of alternatives with known probabilities — *risky alternatives* — as a direct probe of the utility function. The mechanical reason is transparent in our notation: for a risky alternative the probability vector $\boldsymbol{\psi}_r$ is fixed exogenously by the experimenter rather than being a function of $\boldsymbol{\beta}$ and the feature vector. Conditional on those observations, the expected utility $\eta_r = \boldsymbol{\psi}_r^\top \boldsymbol{\upsilon}$ is *linear* in $\boldsymbol{\upsilon}$ (and hence in $\boldsymbol{\delta}$), so the multiplicative β–δ coupling that hampers recovery in m_0 is broken for the risky portion of the data.
The use of objectively risky prospects to identify utilities has a long lineage — running through Ramsey, von Neumann–Morgenstern, and the Anscombe–Aumann construction that combines risky and uncertain prospects in a single domain. We do not need the formal Anscombe–Aumann representation theorem to motivate the design choice here; the identification gain is mechanical, as just described. The formal relationship between our parameter-recovery argument and the classical representation result is examined in [Report 5](05_adding_risky_choices.qmd).
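The sketch below illustrates the mechanical point with made-up numbers, reusing the υ = (0, δ₁, 1) construction from the coupling example above: once ψ is fixed by design, expected utility is a strictly increasing linear function of δ₁, so different δ₁ values imply different risky-choice probabilities and there is no β left to trade off against.

```python
# With a risky alternative, psi is set by the experimenter, so expected utility
# is a known linear function of delta_1 (illustrative numbers, K = 3).
psi_known = np.array([0.20, 0.50, 0.30])          # lottery probabilities, fixed by design
for delta1 in (0.20, 0.40, 0.60):
    eta = psi_known @ np.array([0.0, delta1, 1.0])
    print(f"delta_1 = {delta1:.2f} -> eta = {eta:.2f}")
# delta_1 = 0.20 -> eta = 0.40
# delta_1 = 0.40 -> eta = 0.50
# delta_1 = 0.60 -> eta = 0.60
```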
#### Route 2: A more concentrated Dirichlet prior on δ
A different lever is available without changing the design at all. In `m_0.stan`, $\boldsymbol{\delta}$ is a `simplex[K-1]` with prior $\boldsymbol{\delta} \sim \text{Dirichlet}(\mathbf{1})$ — i.e., a flat prior over admissible utility differences. Replacing the concentration vector $\mathbf{1}$ with $\alpha_0 \mathbf{1}$ for some $\alpha_0 > 1$ tightens the marginal prior on each $\delta_k$ around $1/(K-1)$. Through the multiplicative coupling between β and δ, this also indirectly tightens β: posterior uncertainty in δ that the likelihood cannot resolve is now absorbed by the prior, leaving less slack for β to drift.
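As a quick illustration of the effect, the snippet below compares prior draws of δ₁ under Dirichlet(1, 1) and Dirichlet(α₀·**1**) with α₀ = 5 for K = 3; the value 5 is an arbitrary example, not a recommendation.

```python
# How the Dirichlet concentration tightens the prior on delta (K = 3).
# alpha_0 = 5 is an arbitrary illustrative value, not a recommendation.
rng = np.random.default_rng(1)
flat = rng.dirichlet(np.ones(2), size=10_000)[:, 0]          # Dirichlet(1, 1): uniform marginal
tight = rng.dirichlet(5.0 * np.ones(2), size=10_000)[:, 0]   # Dirichlet(5, 5): Beta(5, 5) marginal
print(f"delta_1 prior SD (alpha_0 = 1): {flat.std():.3f}")   # ~0.29
print(f"delta_1 prior SD (alpha_0 = 5): {tight.std():.3f}")  # ~0.15
```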
The trade-offs deserve emphasis:
- The identification gain here comes from prior information, not data. It is appropriate only when the analyst has substantive grounds — domain knowledge, pilot data, or a deliberate modeling commitment that consequences are roughly equally spaced on the unit utility scale — for a more concentrated prior.
- A misspecified concentrated prior biases $\boldsymbol{\delta}$ (and, via coupling, $\boldsymbol{\beta}$) toward the prior mean and shrinks credible intervals below their nominal calibration. Calibration under the concentrated prior should be re-checked via SBC.
- Unlike adding risky choices, this route does not improve identification *of the data-generating process*; it improves *posterior precision* by importing assumptions.
We do not pursue this route empirically in this report series; we flag it as a methodologically natural option whose evaluation is left to future work.
#### Route 3: Hierarchical modeling across agents or cells
A third route becomes available when data are collected from multiple agents, experimental cells, or related contexts. If those agents can defensibly be modeled as sharing a utility scale — or as drawing utilities from a common population distribution — a hierarchical model can borrow strength across datasets. Data from one agent then help constrain the utility parameters for others through the shared or partially pooled structure.
This is the logic developed in the hierarchical sequence ([Reports 08–12](08_hierarchical_formulation.qmd)). In the implemented `h_m01` model, cell-level sensitivities and belief mappings vary across cells, while the utility simplex $\boldsymbol{\delta}$ is shared across the stacked data. The hierarchical recovery and SBC analyses ([Report 11](11_hierarchical_parameter_recovery.qmd), [Report 12](12_hierarchical_sbc_validation.qmd)) validate that structure at the alignment-study scale.
The caveat is important: hierarchical modeling does not mechanically break the single-agent β–δ multiplicative coupling in the way risky choices do. Its gain comes from pooling information under a cross-agent or cross-cell assumption. If the shared-utility or population-utility assumption is substantively wrong, the model can import bias just as a too-concentrated prior can. For that reason, hierarchical variants require the same prior predictive, recovery, and SBC validation used elsewhere in this report series.
::: {.callout-note}
## Three complementary routes beyond more uncertain choices
1. **Add risky-choice data** (model m_1, [Report 5](05_adding_risky_choices.qmd)). Adds new identifying information by fixing $\boldsymbol{\psi}_r$ exogenously, breaking the β–δ multiplicative coupling for those observations.
2. **Tighten the Dirichlet concentration on δ.** Adds prior information rather than data; cheap but assumption-laden, and only as defensible as the substantive grounds for the tighter prior.
3. **Use hierarchical modeling across agents or cells** ([Reports 08–12](08_hierarchical_formulation.qmd)). Adds cross-agent structure and partial pooling; powerful when shared-utility or population-utility assumptions are defensible, but not a substitute for design-based identification.
The three routes are not mutually exclusive and can be combined.
:::
## Conclusion
Parameter recovery analysis for model m_0 reveals differential recovery across parameters:
1. **Well-recovered parameter**: α (sensitivity) can be recovered with good precision across the lognormal(0,1) prior range examined here.
2. **Slower-learning parameters**: β (feature weights) and δ (utility differences) show wider uncertainty; credible intervals do narrow relative to the prior support, but more slowly than for α.
3. **Sample size helps, but differentially**: in these runs, doubling M improves α recovery more than (β, δ) recovery, and adding *new* alternatives suggests a possible modest additional benefit over adding more problems over the same alternatives.
This pattern reflects the identification structure analyzed formally in [Report 5](05_adding_risky_choices.qmd): α is informed more directly by choice sensitivity, while β and δ interact multiplicatively in the expected utility, causing uncertainty in one to propagate to the other.
**Implications for applications:**
- If the primary research question concerns sensitivity (α) — how strongly choices respond to expected utility differences — model m_0 performs well even with modest sample sizes.
- If precise decomposition of expected utility into beliefs (β) and values (δ) is needed, additional information is required.
**Four routes for sharpening (β, δ):**
1. **More uncertain-choice data**: larger M (and perhaps R) eventually yields more precise (β, δ) estimates, though the rate of improvement is slower than for α.
2. **Risky-choice data** (model m_1, [Report 5](05_adding_risky_choices.qmd)): adds new identifying information by fixing $\boldsymbol{\psi}_r$ exogenously, breaking the β–δ multiplicative coupling for those observations.
3. **A more concentrated Dirichlet prior on δ**: imports prior information rather than data; cheap but assumption-laden, and only as defensible as the substantive grounds for the tighter prior. Not pursued empirically in this report series; flagged as future work.
4. **Hierarchical modeling across agents or cells** ([Reports 08–12](08_hierarchical_formulation.qmd)): borrows strength through shared or partially pooled structure when multiple related datasets are available. This can sharpen utility recovery under defensible cross-agent assumptions, but should be validated with prior predictive checks, recovery, and SBC.
The choice among these routes — or their combination — depends on practical considerations like study feasibility, the relative importance of precise utility estimation for the research question, the availability of related agents or experimental cells, and the strength of substantive prior information available.
```{python}
#| label: cleanup
#| include: false
# Clean up temporary directories
import shutil
for d in [output_dir, output_dir_m50, output_dir_m50_new_alts]:
if os.path.exists(d):
shutil.rmtree(d)
```