---
title: "Parameter Recovery Analysis"
subtitle: "Foundational Report 4"
description: |
Validation that model parameters can be accurately recovered from
simulated data, demonstrating model identifiability.
categories: [foundations, validation, m_0]
execute:
cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
# Add parent directories to path
sys.path.insert(0, os.path.join(os.getcwd(), '..'))
project_root = os.path.dirname(os.path.dirname(os.getcwd()))
sys.path.insert(0, project_root)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import json
import warnings
warnings.filterwarnings('ignore')
# Set random seed for reproducibility
np.random.seed(42)
```
## Introduction
Having examined the prior predictive distribution ([Report 3](03_prior_analysis.qmd)), we now ask: **can we recover the true parameters when we simulate data from our model with known parameters?** This is the fundamental question of parameter recovery analysis.
Parameter recovery is a critical validation step for any Bayesian model:
1. **Identifiability**: Can the model distinguish between different parameter configurations?
2. **Estimation accuracy**: Are posterior means close to true values?
3. **Uncertainty calibration**: Do 90% credible intervals contain the true value ~90% of the time?
4. **Precision**: How narrow are the credible intervals?
::: {.callout-warning}
## Preview: A Key Finding
This report reveals an asymmetry in parameter recovery: while α (sensitivity) is recovered well in the simulation regime examined here, the pair (β, δ) that determines the expected utility function shows weaker recovery, with credible intervals that narrow more slowly than those for α. As shown formally in [Report 5](05_adding_risky_choices.qmd), this reflects a **structural identification challenge**: β and δ interact multiplicatively in the expected utility, so uncertainty in one propagates to the other. Importantly, this does not prevent good recovery of α, which remains our primary parameter of interest within the prior range examined here.
:::
::: {.callout-note}
## The Recovery Paradigm
Parameter recovery follows a simple logic:
1. **Simulate**: Generate data from `m_0_sim.stan` with known ("true") parameter values
2. **Estimate**: Fit `m_0.stan` to the simulated data
3. **Compare**: Assess how well posterior estimates match the true values
4. **Repeat**: Do this many times to characterize recovery performance
If the model is well-specified and identifiable, the posterior should concentrate around the true values.
:::
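The loop below sketches this paradigm directly in CmdStanPy. It is illustrative only: the analyses in this report use the project's `ParameterRecovery` class, `draw_prior_params`, `make_sim_data`, and `make_fit_data` are hypothetical placeholders for the project-specific prior draws and Stan data assembly, `study_design` stands in for the design object generated below, and the generated-quantities name `y` is likewise assumed.

```python
# Illustrative sketch of the simulate -> estimate -> compare loop.
# draw_prior_params, make_sim_data, and make_fit_data are hypothetical helpers;
# the report itself uses the ParameterRecovery class for this.
from cmdstanpy import CmdStanModel

sim_model = CmdStanModel(stan_file="m_0_sim.stan")   # simulation model
inf_model = CmdStanModel(stan_file="m_0.stan")       # inference model

results = []
for it in range(50):
    true_params = draw_prior_params()                      # draw "true" values from the prior
    sim = sim_model.sample(data=make_sim_data(study_design, true_params),
                           fixed_param=True, chains=1, iter_sampling=1)
    choices = sim.stan_variable("y")[0]                    # one simulated dataset ("y" is an assumed name)
    fit = inf_model.sample(data=make_fit_data(study_design, choices),
                           chains=4, iter_sampling=1000)
    results.append((true_params, fit.summary()))           # compare estimates to truth later
```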
## Study Design
We use the same study design as in [Report 3](03_prior_analysis.qmd) to maintain consistency across our foundational analyses:
```{python}
#| label: study-design-config
#| echo: true
# Study design configuration (same as Report 3)
config = {
"M": 25, # Number of decision problems
"K": 3, # Number of consequences
"D": 5, # Feature dimensions
"R": 15, # Number of distinct alternatives
"min_alts_per_problem": 2, # Minimum alternatives per problem
"max_alts_per_problem": 5, # Maximum alternatives per problem
"feature_dist": "normal", # Feature distribution
"feature_params": {"loc": 0, "scale": 1}
}
print(f"Study Design Configuration:")
print(f" Decision problems (M): {config['M']}")
print(f" Consequences (K): {config['K']}")
print(f" Feature dimensions (D): {config['D']}")
print(f" Distinct alternatives (R): {config['R']}")
print(f" Alternatives per problem: {config['min_alts_per_problem']}-{config['max_alts_per_problem']}")
```
```{python}
#| label: generate-study-design
#| output: false
from utils.study_design import StudyDesign
# Create and generate the study design
study = StudyDesign(
M=config["M"],
K=config["K"],
D=config["D"],
R=config["R"],
min_alts_per_problem=config["min_alts_per_problem"],
max_alts_per_problem=config["max_alts_per_problem"],
feature_dist=config["feature_dist"],
feature_params=config["feature_params"],
design_name="parameter_recovery"
)
study.generate()
```
```{python}
#| label: design-summary
#| echo: false
# Extract design properties
w = np.array(study.w)
I = np.array(study.I)
N = I.sum(axis=1)
print(f"\nGenerated Study Design:")
print(f" Total alternatives across problems: {I.sum()}")
print(f" Alternatives per problem: min={N.min()}, max={N.max()}, mean={N.mean():.1f}")
```
## Parameter Recovery Analysis
We use the `ParameterRecovery` class to systematically evaluate how well parameters can be recovered. This class:
1. Uses `m_0_sim.stan` to generate data with known parameters drawn from the prior
2. Fits `m_0.stan` to each simulated dataset
3. Compares posterior estimates to true values across many iterations
```{python}
#| label: run-recovery
#| output: false
from analysis.parameter_recovery import ParameterRecovery
import tempfile
# Create output directory for this analysis
output_dir = tempfile.mkdtemp(prefix="param_recovery_")
# Initialize parameter recovery analysis
recovery = ParameterRecovery(
inference_model_path=None, # Uses default m_0.stan
sim_model_path=None, # Uses default m_0_sim.stan
study_design=study,
output_dir=output_dir,
n_mcmc_samples=1000, # Samples per chain
n_mcmc_chains=4, # Number of chains
n_iterations=50 # Number of simulation-recovery iterations
)
# Run the analysis
true_params_list, posterior_summaries = recovery.run()
```
```{python}
#| label: recovery-summary
#| echo: false
n_successful = len(true_params_list)
print(f"Parameter Recovery Analysis Complete:")
print(f" Successful iterations: {n_successful}")
print(f" Parameters recovered per iteration:")
print(f" - α (sensitivity): 1 parameter")
print(f" - β (feature weights): {config['K']} × {config['D']} = {config['K'] * config['D']} parameters")
print(f" - δ (utility increments): {config['K'] - 1} parameters")
```
### MCMC Diagnostics
Before interpreting recovery statistics, we verify that the MCMC sampler converged properly across all iterations. CmdStanPy's `summary()` method provides $\hat{R}$ (split R-hat) and effective sample size (ESS) for each parameter, and diagnostics are saved per iteration by the `ParameterRecovery` class.
```{python}
#| label: mcmc-diagnostics
#| echo: false
# Aggregate MCMC diagnostics across all iterations
rhat_max_list = []
ess_min_list = []
# Detect ESS column name (varies across CmdStanPy versions)
ess_col = None
sample_cols = posterior_summaries[0].columns
for candidate in ['ESS_bulk', 'N_Eff', 'ESS']:
if candidate in sample_cols:
ess_col = candidate
break
for s in posterior_summaries:
# Filter to model parameters only (exclude lp__, etc.)
param_rows = s[s.index.str.match(r'^(alpha|beta\[|delta\[|upsilon\[)')]
if 'R_hat' in param_rows.columns:
rhat_max_list.append(param_rows['R_hat'].max())
if ess_col is not None and ess_col in param_rows.columns:
ess_min_list.append(param_rows[ess_col].min())
rhat_max = np.array(rhat_max_list)
ess_min = np.array(ess_min_list)
print("MCMC Diagnostic Summary Across All Recovery Iterations:")
print(f" Max R-hat per iteration:")
print(f" Median: {np.median(rhat_max):.4f}")
print(f" 95th percentile: {np.percentile(rhat_max, 95):.4f}")
print(f" Maximum: {rhat_max.max():.4f}")
print(f" Iterations with max R-hat > 1.01: {(rhat_max > 1.01).sum()}/{len(rhat_max)}")
if len(ess_min) > 0:
print(f" Min ESS ({ess_col}) per iteration:")
print(f" Median: {np.median(ess_min):.0f}")
print(f" 5th percentile: {np.percentile(ess_min, 5):.0f}")
print(f" Minimum: {ess_min.min():.0f}")
else:
print(f" ESS: column not found in summary (available: {list(sample_cols)})")
```
::: {.callout-note}
## Diagnostic Assessment
The diagnostics above confirm that the sampler converged reliably across recovery iterations. All iterations meet standard convergence thresholds ($\hat{R} < 1.01$, ESS well above recommended minimums), so the recovery statistics below reflect genuine posterior behavior rather than MCMC artifacts. Per-iteration diagnostic files (including divergence checks) are saved by the `ParameterRecovery` class and can be inspected individually if needed.
:::
## Recovery Metrics
We evaluate parameter recovery using four key metrics:
| Metric | Definition | Target |
|--------|------------|--------|
| **Bias** | Mean(estimate - true) | ≈ 0 |
| **RMSE** | √Mean((estimate - true)²) | Small |
| **Coverage** | P(true ∈ 90% CI) | ≈ 0.90 |
| **CI Width** | Mean(upper - lower) | Small but calibrated |
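For reference, these metrics reduce to a few NumPy one-liners. The helper below is a minimal sketch (not a project function), assuming arrays over recovery iterations of true values, posterior means, and posterior 5% / 95% quantiles for a single parameter; the per-parameter computations later in this report follow exactly this form.

```python
# Minimal sketch of the four recovery metrics for a single parameter.
# Inputs are arrays over recovery iterations; not part of the project codebase.
def recovery_metrics(true_vals, post_mean, post_lower, post_upper):
    return {
        "bias": np.mean(post_mean - true_vals),
        "rmse": np.sqrt(np.mean((post_mean - true_vals) ** 2)),
        "coverage": np.mean((true_vals >= post_lower) & (true_vals <= post_upper)),
        "ci_width": np.mean(post_upper - post_lower),
    }
```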
### Sensitivity Parameter ($\alpha$)
```{python}
#| label: fig-alpha-recovery
#| fig-cap: "Recovery of the sensitivity parameter α. Left: True vs. estimated values with identity line. Right: 90% credible intervals for each iteration, colored by whether they contain the true value."
# Extract alpha values
alpha_true = np.array([p['alpha'] for p in true_params_list])
alpha_mean = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries])
alpha_lower = np.array([s.loc['alpha', '5%'] for s in posterior_summaries])
alpha_upper = np.array([s.loc['alpha', '95%'] for s in posterior_summaries])
# Calculate metrics
alpha_bias = np.mean(alpha_mean - alpha_true)
alpha_rmse = np.sqrt(np.mean((alpha_mean - alpha_true)**2))
alpha_coverage = np.mean((alpha_true >= alpha_lower) & (alpha_true <= alpha_upper))
alpha_ci_width = np.mean(alpha_upper - alpha_lower)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# True vs Estimated
ax = axes[0]
ax.scatter(alpha_true, alpha_mean, alpha=0.7, s=60, c='steelblue', edgecolor='white')
lims = [min(alpha_true.min(), alpha_mean.min()) * 0.9,
max(alpha_true.max(), alpha_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('Estimated α (posterior mean)', fontsize=12)
ax.set_title(f'α Recovery: Bias={alpha_bias:.3f}, RMSE={alpha_rmse:.3f}', fontsize=12)
ax.legend()
ax.set_aspect('equal')
# Coverage plot
ax = axes[1]
iterations = np.arange(len(alpha_true))
for i in range(len(alpha_true)):
covered = (alpha_true[i] >= alpha_lower[i]) & (alpha_true[i] <= alpha_upper[i])
color = 'forestgreen' if covered else 'crimson'
ax.plot([i, i], [alpha_lower[i], alpha_upper[i]], color=color, linewidth=2, alpha=0.7)
ax.scatter(i, alpha_mean[i], color=color, s=40, zorder=3)
ax.scatter(iterations, alpha_true, color='black', s=60, marker='x',
label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('α', fontsize=12)
ax.set_title(f'α: 90% Credible Intervals (Coverage = {alpha_coverage:.0%})', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
alpha_coverage_se = np.sqrt(alpha_coverage * (1 - alpha_coverage) / len(alpha_true))
print(f"\nα Recovery Statistics:")
print(f" Bias: {alpha_bias:.4f}")
print(f" RMSE: {alpha_rmse:.4f}")
print(f" 90% CI Coverage: {alpha_coverage:.1%} (SE = {alpha_coverage_se:.1%})")
print(f" Mean CI Width: {alpha_ci_width:.3f}")
```
### Feature-to-Probability Weights ($\boldsymbol{\beta}$)
The β matrix has K × D = 15 parameters (3 consequences × 5 features). We examine recovery across all of them:
```{python}
#| label: compute-beta-recovery
#| output: false
# Compute recovery metrics for all beta parameters
K, D = config['K'], config['D']
beta_recovery = []
for k in range(K):
for d in range(D):
param_name = f"beta[{k+1},{d+1}]"
beta_true = np.array([p['beta'][k][d] for p in true_params_list])
beta_mean = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
beta_lower = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
beta_upper = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
bias = np.mean(beta_mean - beta_true)
rmse = np.sqrt(np.mean((beta_mean - beta_true)**2))
coverage = np.mean((beta_true >= beta_lower) & (beta_true <= beta_upper))
ci_width = np.mean(beta_upper - beta_lower)
beta_recovery.append({
'parameter': param_name,
'k': k + 1,
'd': d + 1,
'bias': bias,
'rmse': rmse,
'coverage': coverage,
'ci_width': ci_width,
'true': beta_true,
'mean': beta_mean,
'lower': beta_lower,
'upper': beta_upper
})
beta_df = pd.DataFrame(beta_recovery)
```
```{python}
#| label: fig-beta-recovery-summary
#| fig-cap: "Summary of β parameter recovery. Left: RMSE for each β coefficient. Right: Coverage rates (target = 90%, shown as dashed line)."
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# RMSE heatmap
ax = axes[0]
rmse_matrix = beta_df.pivot(index='k', columns='d', values='rmse')
im = ax.imshow(rmse_matrix.values, cmap='Blues', aspect='auto')
ax.set_xlabel('Feature Dimension (d)', fontsize=12)
ax.set_ylabel('Consequence (k)', fontsize=12)
ax.set_title('β RMSE by Position', fontsize=12)
ax.set_xticks(range(D))
ax.set_xticklabels([str(d+1) for d in range(D)])
ax.set_yticks(range(K))
ax.set_yticklabels([str(k+1) for k in range(K)])
for i in range(K):
for j in range(D):
ax.text(j, i, f'{rmse_matrix.values[i, j]:.2f}', ha='center', va='center', fontsize=10)
plt.colorbar(im, ax=ax, label='RMSE')
# Coverage bar plot
ax = axes[1]
coverage_values = beta_df['coverage'].values
x = np.arange(len(coverage_values))
colors = ['forestgreen' if c >= 0.85 else 'orange' if c >= 0.75 else 'crimson'
for c in coverage_values]
ax.bar(x, coverage_values, color=colors, alpha=0.7, edgecolor='white')
ax.axhline(y=0.90, color='red', linestyle='--', linewidth=2, label='Target (90%)')
ax.set_xlabel('β Parameter Index', fontsize=12)
ax.set_ylabel('Coverage', fontsize=12)
ax.set_title('β 90% CI Coverage by Parameter', fontsize=12)
ax.set_ylim(0, 1.05)
ax.legend()
plt.tight_layout()
plt.show()
print(f"\nβ Recovery Summary:")
print(f" Mean Bias: {beta_df['bias'].mean():.4f}")
print(f" Mean RMSE: {beta_df['rmse'].mean():.4f}")
mean_cov = beta_df['coverage'].mean()
mean_cov_se = np.sqrt(mean_cov * (1 - mean_cov) / len(true_params_list))
print(f" Mean Coverage: {mean_cov:.1%} (SE ≈ {mean_cov_se:.1%})")
print(f" Mean CI Width: {beta_df['ci_width'].mean():.3f}")
```
::: {.callout-note}
## Interpreting β Coverage
The β coverage panel summarizes 15 coefficients, each estimated over 50 recovery iterations. Individual coverage rates therefore have substantial Monte Carlo variation, and the full set of bars should be read as a pattern-level diagnostic rather than as precise evidence about any single β component.
:::
```{python}
#| label: fig-beta-true-vs-estimated
#| fig-cap: "True vs. estimated values for all β parameters pooled together (left), and for the best- and worst-recovered individual components (right panels). The identity line shows perfect recovery."
# Identify best and worst recovered beta components by RMSE
best_idx = beta_df['rmse'].idxmin()
worst_idx = beta_df['rmse'].idxmax()
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Panel 1: Pool all beta values
ax = axes[0]
all_beta_true = np.concatenate([r['true'] for r in beta_recovery])
all_beta_mean = np.concatenate([r['mean'] for r in beta_recovery])
ax.scatter(all_beta_true, all_beta_mean, alpha=0.4, s=30, c='purple', edgecolor='none')
lims = [min(all_beta_true.min(), all_beta_mean.min()) * 1.1,
max(all_beta_true.max(), all_beta_mean.max()) * 1.1]
ax.plot(lims, lims, 'r--', linewidth=2, label='Identity line')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('True β', fontsize=12)
ax.set_ylabel('Estimated β (posterior mean)', fontsize=12)
ax.set_title(f'β Recovery (all {K*D} parameters pooled)', fontsize=12)
ax.legend()
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
# Add correlation
corr = np.corrcoef(all_beta_true, all_beta_mean)[0, 1]
ax.text(0.05, 0.95, f'r = {corr:.3f}', transform=ax.transAxes, fontsize=12,
verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
# Panel 2: Best-recovered component
ax = axes[1]
best_rec = beta_recovery[best_idx]
ax.scatter(best_rec['true'], best_rec['mean'], alpha=0.7, s=50, c='steelblue', edgecolor='white')
b_lims = [min(best_rec['true'].min(), best_rec['mean'].min()) * 1.1 - 0.1,
max(best_rec['true'].max(), best_rec['mean'].max()) * 1.1 + 0.1]
ax.plot(b_lims, b_lims, 'r--', linewidth=2)
ax.set_xlim(b_lims)
ax.set_ylim(b_lims)
ax.set_xlabel(f'True {best_rec["parameter"]}', fontsize=12)
ax.set_ylabel(f'Estimated (post. mean)', fontsize=12)
ax.set_title(f'Best recovered: {best_rec["parameter"]} (RMSE={best_rec["rmse"]:.3f})', fontsize=11)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
# Panel 3: Worst-recovered component
ax = axes[2]
worst_rec = beta_recovery[worst_idx]
ax.scatter(worst_rec['true'], worst_rec['mean'], alpha=0.7, s=50, c='coral', edgecolor='white')
w_lims = [min(worst_rec['true'].min(), worst_rec['mean'].min()) * 1.1 - 0.1,
max(worst_rec['true'].max(), worst_rec['mean'].max()) * 1.1 + 0.1]
ax.plot(w_lims, w_lims, 'r--', linewidth=2)
ax.set_xlim(w_lims)
ax.set_ylim(w_lims)
ax.set_xlabel(f'True {worst_rec["parameter"]}', fontsize=12)
ax.set_ylabel(f'Estimated (post. mean)', fontsize=12)
ax.set_title(f'Worst recovered: {worst_rec["parameter"]} (RMSE={worst_rec["rmse"]:.3f})', fontsize=11)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
### Utility Differences ($\boldsymbol{\delta}$)
```{python}
#| label: fig-delta-recovery
#| fig-cap: "Recovery of utility difference parameters δ. Each panel shows true vs. estimated values and coverage for one δ component."
K_minus_1 = config['K'] - 1
fig, axes = plt.subplots(2, K_minus_1, figsize=(6 * K_minus_1, 10))
if K_minus_1 == 1:
axes = axes.reshape(2, 1)
delta_stats = []
for k in range(K_minus_1):
param_name = f"delta[{k+1}]"
delta_true = np.array([p['delta'][k] for p in true_params_list])
delta_mean = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries])
delta_lower = np.array([s.loc[param_name, '5%'] for s in posterior_summaries])
delta_upper = np.array([s.loc[param_name, '95%'] for s in posterior_summaries])
# Metrics
bias = np.mean(delta_mean - delta_true)
rmse = np.sqrt(np.mean((delta_mean - delta_true)**2))
coverage = np.mean((delta_true >= delta_lower) & (delta_true <= delta_upper))
ci_width = np.mean(delta_upper - delta_lower)
delta_stats.append({
'parameter': f'δ_{k+1}',
'bias': bias,
'rmse': rmse,
'coverage': coverage,
'ci_width': ci_width
})
# True vs Estimated
ax = axes[0, k]
ax.scatter(delta_true, delta_mean, alpha=0.7, s=60, c='forestgreen', edgecolor='white')
ax.plot([0, 1], [0, 1], 'r--', linewidth=2)
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.set_xlabel(f'True δ_{k+1}', fontsize=11)
ax.set_ylabel(f'Estimated δ_{k+1}', fontsize=11)
ax.set_title(f'δ_{k+1}: Bias={bias:.3f}, RMSE={rmse:.3f}', fontsize=11)
ax.set_aspect('equal')
# Coverage
ax = axes[1, k]
iterations = np.arange(len(delta_true))
for i in range(len(delta_true)):
covered = (delta_true[i] >= delta_lower[i]) & (delta_true[i] <= delta_upper[i])
color = 'forestgreen' if covered else 'crimson'
ax.plot([i, i], [delta_lower[i], delta_upper[i]], color=color, linewidth=2, alpha=0.7)
ax.scatter(i, delta_mean[i], color=color, s=40, zorder=3)
ax.scatter(iterations, delta_true, color='black', s=60, marker='x',
label='True value', zorder=4, linewidth=2)
ax.set_xlabel('Iteration', fontsize=11)
ax.set_ylabel(f'δ_{k+1}', fontsize=11)
ax.set_title(f'δ_{k+1}: Coverage = {coverage:.0%}', fontsize=11)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
delta_df = pd.DataFrame(delta_stats)
print(f"\nδ Recovery Summary:")
for _, row in delta_df.iterrows():
cov_se = np.sqrt(row['coverage'] * (1 - row['coverage']) / len(true_params_list))
print(f" {row['parameter']}: Bias={row['bias']:.4f}, RMSE={row['rmse']:.4f}, Coverage={row['coverage']:.0%} (SE={cov_se:.1%})")
```
::: {.callout-note}
## Precision of Coverage Estimates
With $n = 50$ recovery iterations, the standard error for a coverage rate near the nominal 90% is $\text{SE} = \sqrt{0.9 \times 0.1 / 50} \approx 0.042$, giving an approximate 95% confidence interval of [82%, 98%]. Thus, an observed coverage as low as 82% is still consistent with nominal calibration. The definitive assessment of calibration with much greater precision ($n = 999$ iterations) is provided by the simulation-based calibration (SBC) analysis in [Report 6](06_sbc_validation.qmd).
:::
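The arithmetic in the callout can be reproduced in two lines; the snippet below is just that check, using the normal approximation for a binomial proportion.

```python
# Monte Carlo precision of an observed coverage rate with n recovery iterations
n, nominal = 50, 0.90
se = np.sqrt(nominal * (1 - nominal) / n)                      # ~0.042
print(f"SE ≈ {se:.3f}; ~95% CI: [{nominal - 1.96*se:.0%}, {nominal + 1.96*se:.0%}]")
# SE ≈ 0.042; ~95% CI: [82%, 98%]
```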
## Summary Table
```{python}
#| label: tbl-recovery-summary
#| tbl-cap: "Parameter recovery statistics for all parameters in model m_0."
# Create comprehensive summary table
summary_rows = []
# Alpha
summary_rows.append({
'Parameter': 'α',
'Bias': alpha_bias,
'RMSE': alpha_rmse,
'Coverage': alpha_coverage,
'CI Width': alpha_ci_width,
'Notes': 'Sensitivity'
})
# Beta (aggregated)
summary_rows.append({
'Parameter': 'β (all)',
'Bias': beta_df['bias'].mean(),
'RMSE': beta_df['rmse'].mean(),
'Coverage': beta_df['coverage'].mean(),
'CI Width': beta_df['ci_width'].mean(),
'Notes': f'{K}×{D} parameters'
})
# Delta (individual)
for _, row in delta_df.iterrows():
summary_rows.append({
'Parameter': row['parameter'],
'Bias': row['bias'],
'RMSE': row['rmse'],
'Coverage': row['coverage'],
'CI Width': row['ci_width'],
'Notes': 'Utility increment'
})
summary_df = pd.DataFrame(summary_rows)
# Format for display
display_df = summary_df.copy()
display_df['Bias'] = display_df['Bias'].apply(lambda x: f'{x:.4f}')
display_df['RMSE'] = display_df['RMSE'].apply(lambda x: f'{x:.4f}')
display_df['Coverage'] = display_df['Coverage'].apply(lambda x: f'{x:.0%}')
display_df['CI Width'] = display_df['CI Width'].apply(lambda x: f'{x:.3f}')
print(display_df.to_string(index=False))
```
### Recovery Conditional on True $\alpha$ {#sec-conditional-recovery}
The marginal recovery statistics above average over all true parameter values drawn from the prior. Because the lognormal(0, 1) prior on α spans a wide range—from near-zero (nearly random choices) to large values (near-deterministic choices)—recovery difficulty varies considerably across iterations. To reveal this heterogeneity, we examine recovery metrics as a function of the true α value.
```{python}
#| label: fig-conditional-recovery
#| fig-cap: "Recovery metrics stratified by true α. Left: absolute error in α recovery vs. true α. Center: absolute error in δ₁ vs. true α — δ recovery degrades when α is small, since choices then carry little information about utilities. Right: posterior 90% credible-interval width for δ₁ vs. true α, used as a proxy for how tightly the data constrain δ₁ in each iteration."
fig, axes = plt.subplots(1, 3, figsize=(17, 5))
# Alpha absolute error vs true alpha
ax = axes[0]
alpha_abs_error = np.abs(alpha_mean - alpha_true)
ax.scatter(alpha_true, alpha_abs_error, alpha=0.7, s=50, c='steelblue', edgecolor='white')
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('|Estimated α − True α|', fontsize=12)
ax.set_title('α Recovery Error vs. True α', fontsize=12)
ax.grid(True, alpha=0.3)
# Delta absolute error vs true alpha
ax = axes[1]
delta_true_vals = np.array([p['delta'][0] for p in true_params_list])
delta_mean_vals = np.array([s.loc['delta[1]', 'Mean'] for s in posterior_summaries])
delta_abs_error = np.abs(delta_mean_vals - delta_true_vals)
ax.scatter(alpha_true, delta_abs_error, alpha=0.7, s=50, c='forestgreen', edgecolor='white')
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('|Estimated δ₁ − True δ₁|', fontsize=12)
ax.set_title('δ₁ Recovery Error vs. True α', fontsize=12)
ax.grid(True, alpha=0.3)
# Delta uncertainty vs true alpha
ax = axes[2]
# We compute CI width of delta as a proxy for how constrained it is.
delta_ci_widths = np.array([s.loc['delta[1]', '95%'] - s.loc['delta[1]', '5%'] for s in posterior_summaries])
ax.scatter(alpha_true, delta_ci_widths, alpha=0.7, s=50, c='coral', edgecolor='white')
ax.set_xlabel('True α', fontsize=12)
ax.set_ylabel('δ₁ 90% CI Width', fontsize=12)
ax.set_title('δ₁ Posterior Uncertainty vs. True α', fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Report low-alpha vs high-alpha recovery
median_alpha = np.median(alpha_true)
low_mask = alpha_true < median_alpha
high_mask = alpha_true >= median_alpha
print(f"\nRecovery stratified by true α (median split at α = {median_alpha:.2f}):")
print(f" α RMSE (low α): {np.sqrt(np.mean((alpha_mean[low_mask] - alpha_true[low_mask])**2)):.4f}")
print(f" α RMSE (high α): {np.sqrt(np.mean((alpha_mean[high_mask] - alpha_true[high_mask])**2)):.4f}")
print(f" δ₁ RMSE (low α): {np.sqrt(np.mean((delta_mean_vals[low_mask] - delta_true_vals[low_mask])**2)):.4f}")
print(f" δ₁ RMSE (high α): {np.sqrt(np.mean((delta_mean_vals[high_mask] - delta_true_vals[high_mask])**2)):.4f}")
print(f" δ₁ mean CI width (low α): {delta_ci_widths[low_mask].mean():.3f}")
print(f" δ₁ mean CI width (high α): {delta_ci_widths[high_mask].mean():.3f}")
```
When α is small, choices are nearly random and carry little information about the underlying utilities, so δ (and β) become hard to pin down. When α is large, choices are near-deterministic, so the utility *ordering* is revealed clearly but the precise *magnitudes* of utility differences remain only weakly constrained by the choice data. Across the lognormal(0,1) prior range examined here, α recovery error remains modest, while δ posterior uncertainty varies systematically with the choice regime. This α-dependent information content is consistent with the structural coupling discussed in @sec-utility-identification.
## Discussion
### Differential Recovery Across Parameters {#sec-utility-identification}
::: {.callout-note}
## Key Finding: Differential Parameter Recovery
The results above reveal an asymmetry in parameter recovery:
- **α (sensitivity)** is recovered well, with posterior means clustering tightly around true values across the lognormal(0,1) prior range examined here
- **β (feature weights)** and **δ (utility differences)** show materially weaker recovery, with wider credible intervals and lower precision compared to α
Both β and δ show the same qualitative pattern of slower posterior concentration, consistent with their structural coupling; the prose below focuses on δ as the representative utility-function parameter.
This pattern is consistent with the identification structure analyzed formally in [Report 5](05_adding_risky_choices.qmd) (see the proposition there on m_0 vs. m_1 identification): α is identified more directly through choice sensitivity, while β and δ are only **weakly identified** through their composite mapping to expected utilities.
**For applications focused on estimating sensitivity (α):** within the prior range examined here, the weaker identification of β and δ does not substantially impair α recovery. If the primary research question concerns sensitivity to expected-utility differences rather than the decomposition of expected utility into beliefs and values, the identification challenges for (β, δ) may be of secondary concern.
:::
To understand why β and δ are harder to recover than α, consider the structure of the model. In decisions under uncertainty, choices depend on **expected utilities**:
$$
\eta_r = \sum_{k=1}^K \psi_{rk} \upsilon_k
$$
where $\boldsymbol{\psi}_r$ are subjective probabilities (determined by β) and $\boldsymbol{\upsilon}$ are utilities (determined by δ). The key insight is that we only observe *rankings* of expected utilities through choices—not the utilities themselves, and not the separate contributions of beliefs and values.
This creates a multiplicative coupling between β and δ: different (β, δ) pairs can produce similar expected utility functions, making both parameters harder to pin down precisely:
- **α** directly governs how *sensitively* choices respond to expected utility differences; within the prior and design regime examined here, it is recovered well because choice probabilities depend monotonically on α for any fixed expected utility difference
- **β and δ** jointly determine expected utilities through their product; uncertainty in one propagates to the other
Note that this does not mean β and δ are completely uninformed by the data — the credible intervals narrow relative to their prior support, as the CI-width diagnostics above imply — but information accumulates more slowly than for α.
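A small numerical example makes the coupling concrete. It assumes the simplex construction of utilities used in m_0 (for K = 3, υ = (0, δ₁, 1), i.e., δ are the increments between ordered utilities); the probability vectors are made up for illustration. For a single alternative, distinct (ψ, δ) pairs can yield exactly the same expected utility, so choice data that depend only on η cannot tell them apart there; across many alternatives generated by one β the trade-off is only partial, which is why (β, δ) are weakly rather than completely unidentified.

```python
# Two different (subjective-probability, utility-increment) configurations that
# imply the same expected utility for one alternative (K = 3).
# Assumes the simplex construction upsilon = (0, delta_1, delta_1 + delta_2) = (0, delta_1, 1).
def expected_utility(psi, delta1):
    upsilon = np.array([0.0, delta1, 1.0])
    return float(psi @ upsilon)

psi_a, delta1_a = np.array([0.20, 0.50, 0.30]), 0.40   # illustrative values
psi_b, delta1_b = np.array([0.30, 0.25, 0.45]), 0.20

print(expected_utility(psi_a, delta1_a))   # 0.5
print(expected_utility(psi_b, delta1_b))   # 0.5  -> same eta, different (psi, delta)
```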
The multiplicative coupling also leaves an illustrative signature *across* recovery iterations: when the posterior mean for β[1,1] errs in a direction that shifts subjective probability toward higher-utility consequences, the posterior mean for δ₁ may compensate in the opposite direction so that the implied expected utilities stay similar. The figure below visualizes this across-iteration error coupling. (We do not directly visualize the within-posterior correlation between β and δ here, since only marginal posterior summaries are saved by the recovery pipeline; the across-iteration diagnostic should therefore be read as suggestive rather than as direct posterior evidence of compensation.)
```{python}
#| label: fig-beta-delta-correlation
#| fig-cap: "Across-iteration signatures of the β–δ coupling. Left: estimation errors in β[1,1] vs. estimation errors in δ₁ across recovery iterations; a negative Pearson r (reported in the title) is consistent with the predicted compensatory tradeoff. Right: joint posterior uncertainty proxies — δ₁ 90% CI width vs. β[1,1] 90% CI width across iterations, color-coded by the true α; iterations with more informative choices (larger α) tend to cluster toward narrower CI widths for both parameters."
# Since only marginal summaries are saved, use across-iteration errors and
# CI widths as diagnostics rather than direct posterior correlations.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: Distribution of signed errors showing coupling
ax = axes[0]
# For each iteration, compute the bias direction for beta[1,1] and delta[1]
beta_11_errors = np.array([s.loc['beta[1,1]', 'Mean'] - p['beta'][0][0]
for p, s in zip(true_params_list, posterior_summaries)])
delta_1_errors = np.array([s.loc['delta[1]', 'Mean'] - p['delta'][0]
for p, s in zip(true_params_list, posterior_summaries)])
ax.scatter(beta_11_errors, delta_1_errors, alpha=0.7, s=50, c='mediumpurple', edgecolor='white')
ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
corr_errors = np.corrcoef(beta_11_errors, delta_1_errors)[0, 1]
ax.set_xlabel('β[1,1] estimation error', fontsize=12)
ax.set_ylabel('δ₁ estimation error', fontsize=12)
ax.set_title(f'Estimation Error Coupling (r = {corr_errors:.3f})', fontsize=12)
ax.grid(True, alpha=0.3)
# Right: CI width of delta vs CI width of beta (across iterations)
ax = axes[1]
beta_11_ci = np.array([s.loc['beta[1,1]', '95%'] - s.loc['beta[1,1]', '5%'] for s in posterior_summaries])
delta_1_ci = np.array([s.loc['delta[1]', '95%'] - s.loc['delta[1]', '5%'] for s in posterior_summaries])
scatter = ax.scatter(beta_11_ci, delta_1_ci, alpha=0.7, s=50, c=alpha_true,
cmap='viridis', edgecolor='white')
ax.set_xlabel('β[1,1] 90% CI Width', fontsize=12)
ax.set_ylabel('δ₁ 90% CI Width', fontsize=12)
ax.set_title('Joint Posterior Uncertainty (color = true α)', fontsize=12)
ax.grid(True, alpha=0.3)
plt.colorbar(scatter, ax=ax, label='True α')
plt.tight_layout()
plt.show()
```
### Does Increasing Sample Size Help?
A natural question is whether the weaker δ recovery is simply a matter of insufficient data: perhaps a design with only M=25 decision problems does not provide enough information. (As noted in the Discussion, β follows the same pattern as δ due to their structural coupling; we focus on δ as the representative utility-function parameter.) We test this by doubling the sample size to M=50.
To ensure a valid comparison, we use the `extend()` method to create the larger design from the original M=25 design. This keeps the same alternatives (features) and original problems, adding 25 new problems that use those same alternatives. Any differences in recovery can then be attributed to sample size rather than different feature configurations.
::: {.callout-tip collapse="true"}
## How `extend()` preserves the base design
The `StudyDesign.extend()` method creates a new design object that:
1. **Copies the identical feature matrix** `w` (all R=15 alternatives retain their original feature vectors)
2. **Preserves the original M=25 problems** (the first 25 rows of the indicator matrix `I` are unchanged)
3. **Appends new problems** by randomly assigning subsets of the *same* R=15 alternatives to each new problem
This guarantees that any change in recovery quality is due to having more choice observations, not to different feature configurations. The companion method `extend_with_alternatives()` (used below) additionally generates new feature vectors, enabling comparisons where the feature space itself expands.
:::
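The sketch below illustrates the kind of extension described above: the feature matrix `w` is untouched, the original rows of the indicator matrix `I` are kept, and new rows are appended that reuse the same alternatives. It is a conceptual illustration only, not the actual `StudyDesign.extend()` implementation.

```python
# Conceptual illustration of extending a design with new problems over the
# same alternatives. Not the actual StudyDesign.extend() implementation.
def extend_indicator(I_old, additional_M, min_alts=2, max_alts=5, seed=0):
    rng = np.random.default_rng(seed)
    M_old, R = I_old.shape
    new_rows = np.zeros((additional_M, R), dtype=int)
    for m in range(additional_M):
        n_alts = rng.integers(min_alts, max_alts + 1)          # problem size
        new_rows[m, rng.choice(R, size=n_alts, replace=False)] = 1
    return np.vstack([I_old, new_rows])                        # feature matrix w stays the same

# e.g. I_m50 = extend_indicator(I, additional_M=25)  -> shape (50, 15)
```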
```{python}
#| label: config-m50
#| echo: true
# Extend the original study design to M=50
# This preserves the same alternatives and original problems
study_m50 = study.extend(additional_M=25, design_name="parameter_recovery_m50")
print(f"Extended Study Design (M=50):")
print(f" Decision problems: {study_m50.M} (original {study.M} + 25 new)")
print(f" Same R={study_m50.R} alternatives with identical features")
print(f" Total expected observations: ~{study_m50.M * 3.5:.0f} choices")
```
```{python}
#| label: run-recovery-m50
#| output: false
# Create output directory
output_dir_m50 = tempfile.mkdtemp(prefix="param_recovery_m50_")
# Run parameter recovery with larger sample
recovery_m50 = ParameterRecovery(
inference_model_path=None,
sim_model_path=None,
study_design=study_m50,
output_dir=output_dir_m50,
n_mcmc_samples=1000,
n_mcmc_chains=4,
n_iterations=50
)
true_params_m50, posterior_summaries_m50 = recovery_m50.run()
```
```{python}
#| label: compare-sample-sizes
#| echo: false
# Compute recovery metrics for M=50
# Alpha
alpha_true_m50 = np.array([p['alpha'] for p in true_params_m50])
alpha_mean_m50 = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries_m50])
alpha_lower_m50 = np.array([s.loc['alpha', '5%'] for s in posterior_summaries_m50])
alpha_upper_m50 = np.array([s.loc['alpha', '95%'] for s in posterior_summaries_m50])
alpha_rmse_m50 = np.sqrt(np.mean((alpha_mean_m50 - alpha_true_m50)**2))
alpha_coverage_m50 = np.mean((alpha_true_m50 >= alpha_lower_m50) & (alpha_true_m50 <= alpha_upper_m50))
alpha_ci_width_m50 = np.mean(alpha_upper_m50 - alpha_lower_m50)
# Delta (using config from original study)
K_minus_1 = config['K'] - 1
delta_stats_m50 = []
for k in range(K_minus_1):
param_name = f"delta[{k+1}]"
delta_true_m50 = np.array([p['delta'][k] for p in true_params_m50])
delta_mean_m50 = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries_m50])
delta_lower_m50 = np.array([s.loc[param_name, '5%'] for s in posterior_summaries_m50])
delta_upper_m50 = np.array([s.loc[param_name, '95%'] for s in posterior_summaries_m50])
delta_stats_m50.append({
'parameter': f'δ_{k+1}',
'rmse': np.sqrt(np.mean((delta_mean_m50 - delta_true_m50)**2)),
'coverage': np.mean((delta_true_m50 >= delta_lower_m50) & (delta_true_m50 <= delta_upper_m50)),
'ci_width': np.mean(delta_upper_m50 - delta_lower_m50)
})
delta_df_m50 = pd.DataFrame(delta_stats_m50)
```
```{python}
#| label: fig-sample-size-comparison
#| fig-cap: "Comparison of parameter recovery between M=25 and M=50. Doubling the number of decision problems improves α recovery but has a more modest effect on δ recovery. (β exhibits similar patterns to δ due to their structural coupling.)"
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# RMSE comparison
ax = axes[0]
params = ['α', 'δ₁', 'δ₂']
rmse_m25 = [alpha_rmse, delta_df.iloc[0]['rmse'], delta_df.iloc[1]['rmse']]
rmse_m50 = [alpha_rmse_m50, delta_df_m50.iloc[0]['rmse'], delta_df_m50.iloc[1]['rmse']]
x = np.arange(len(params))
width = 0.35
bars1 = ax.bar(x - width/2, rmse_m25, width, label='M=25', color='steelblue', alpha=0.7)
bars2 = ax.bar(x + width/2, rmse_m50, width, label='M=50', color='coral', alpha=0.7)
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('RMSE', fontsize=12)
ax.set_title('RMSE by Sample Size', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# Coverage comparison
ax = axes[1]
cov_m25 = [alpha_coverage, delta_df.iloc[0]['coverage'], delta_df.iloc[1]['coverage']]
cov_m50 = [alpha_coverage_m50, delta_df_m50.iloc[0]['coverage'], delta_df_m50.iloc[1]['coverage']]
bars1 = ax.bar(x - width/2, cov_m25, width, label='M=25', color='steelblue', alpha=0.7)
bars2 = ax.bar(x + width/2, cov_m50, width, label='M=50', color='coral', alpha=0.7)
ax.axhline(y=0.90, color='red', linestyle='--', linewidth=2, label='Target (90%)')
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('Coverage', fontsize=12)
ax.set_title('90% CI Coverage by Sample Size', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.set_ylim(0, 1.05)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# CI Width comparison
ax = axes[2]
ci_m25 = [alpha_ci_width, delta_df.iloc[0]['ci_width'], delta_df.iloc[1]['ci_width']]
ci_m50 = [alpha_ci_width_m50, delta_df_m50.iloc[0]['ci_width'], delta_df_m50.iloc[1]['ci_width']]
bars1 = ax.bar(x - width/2, ci_m25, width, label='M=25', color='steelblue', alpha=0.7)
bars2 = ax.bar(x + width/2, ci_m50, width, label='M=50', color='coral', alpha=0.7)
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('CI Width', fontsize=12)
ax.set_title('90% CI Width by Sample Size', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-sample-size-comparison
#| echo: false
# Create comparison table
comparison_data = []
# Alpha
comparison_data.append({
'Parameter': 'α',
'RMSE (M=25)': f'{alpha_rmse:.4f}',
'RMSE (M=50)': f'{alpha_rmse_m50:.4f}',
'Δ RMSE': f'{(alpha_rmse_m50 - alpha_rmse)/alpha_rmse * 100:+.1f}%',
'CI Width (M=25)': f'{alpha_ci_width:.3f}',
'CI Width (M=50)': f'{alpha_ci_width_m50:.3f}'
})
# Delta
for i in range(K_minus_1):
comparison_data.append({
'Parameter': f'δ_{i+1}',
'RMSE (M=25)': f'{delta_df.iloc[i]["rmse"]:.4f}',
'RMSE (M=50)': f'{delta_df_m50.iloc[i]["rmse"]:.4f}',
'Δ RMSE': f'{(delta_df_m50.iloc[i]["rmse"] - delta_df.iloc[i]["rmse"])/delta_df.iloc[i]["rmse"] * 100:+.1f}%',
'CI Width (M=25)': f'{delta_df.iloc[i]["ci_width"]:.3f}',
'CI Width (M=50)': f'{delta_df_m50.iloc[i]["ci_width"]:.3f}'
})
comparison_df = pd.DataFrame(comparison_data)
print("Sample Size Comparison (M=25 vs M=50):")
print(comparison_df.to_string(index=False))
```
In these runs, the table shows the asymmetric sample-size pattern anticipated in @sec-utility-identification: doubling M produces a larger relative reduction in α RMSE and CI width than in δ RMSE and CI width. (β follows the same pattern as δ, as noted in the Discussion.) This is consistent with α being informed more directly by choice sensitivity while (β, δ) are weakly identified through their composite mapping to expected utilities — additional uncertain-choice data sharpens the parameter that the data most directly inform, while improving (β, δ) more slowly.
### Does Adding New Alternatives Help?
The M=50 extension above adds problems over the **same** alternatives — it does not expand the feature space. We might hypothesize that adding **new** alternatives (with new feature vectors) provides qualitatively different information that could help with δ recovery.
To test this, we use `extend_with_alternatives()` to add 15 new alternatives along with 25 new problems that use those alternatives. This parallels how model m_1 (in [Report 5](05_adding_risky_choices.qmd)) adds new risky alternatives along with new risky problems, enabling a fairer comparison.
```{python}
#| label: config-m50-new-alts
#| echo: true
# Extend with both new alternatives AND new problems
# New problems only use the new alternatives (paralleling m_1's risky component)
study_m50_new_alts = study.extend_with_alternatives(
additional_M=25,
additional_R=15, # Same number of new alternatives as m_1 has risky alternatives
design_name="parameter_recovery_m50_new_alts",
new_problems_use_new_alts_only=True
)
print(f"Extended Study Design (M=50 with new alternatives):")
print(f" Decision problems: {study_m50_new_alts.M} (original {study.M} + 25 new)")
print(f" Alternatives: R={study_m50_new_alts.R} (original {study.R} + 15 new)")
print(f" Original problems (1-25) use original alternatives (1-15)")
print(f" New problems (26-50) use new alternatives (16-30)")
```
::: {.callout-tip collapse="true"}
## Why `new_problems_use_new_alts_only=True`?
Setting `new_problems_use_new_alts_only=True` creates a **block structure**: original problems draw from original alternatives, and new problems draw exclusively from new alternatives. This design choice mirrors the structure of model m_1 in [Report 5](05_adding_risky_choices.qmd), where risky problems use only risky alternatives (lotteries). Maintaining this parallel enables a fair comparison: any difference between "M=50 with new uncertain alternatives" and "M=25 + N=25 with risky alternatives" can be attributed to the *type* of data (risk vs. uncertainty) rather than to differences in how alternatives are assigned to problems.
An alternative design—mixing old and new alternatives within new problems—would provide cross-referencing information that might improve identification. However, this would break the structural parallel with m_1 and make the comparison less clean. We prioritize the controlled comparison here; designs that mix alternatives are worth exploring in future work.
:::
```{python}
#| label: run-recovery-m50-new-alts
#| output: false
# Create output directory
output_dir_m50_new_alts = tempfile.mkdtemp(prefix="param_recovery_m50_new_alts_")
# Run parameter recovery with new alternatives
recovery_m50_new_alts = ParameterRecovery(
inference_model_path=None,
sim_model_path=None,
study_design=study_m50_new_alts,
output_dir=output_dir_m50_new_alts,
n_mcmc_samples=1000,
n_mcmc_chains=4,
n_iterations=50
)
true_params_m50_new_alts, posterior_summaries_m50_new_alts = recovery_m50_new_alts.run()
```
```{python}
#| label: compare-all-extensions
#| echo: false
# Compute recovery metrics for M=50 with new alternatives
# Alpha
alpha_true_m50_new = np.array([p['alpha'] for p in true_params_m50_new_alts])
alpha_mean_m50_new = np.array([s.loc['alpha', 'Mean'] for s in posterior_summaries_m50_new_alts])
alpha_lower_m50_new = np.array([s.loc['alpha', '5%'] for s in posterior_summaries_m50_new_alts])
alpha_upper_m50_new = np.array([s.loc['alpha', '95%'] for s in posterior_summaries_m50_new_alts])
alpha_rmse_m50_new = np.sqrt(np.mean((alpha_mean_m50_new - alpha_true_m50_new)**2))
alpha_coverage_m50_new = np.mean((alpha_true_m50_new >= alpha_lower_m50_new) & (alpha_true_m50_new <= alpha_upper_m50_new))
alpha_ci_width_m50_new = np.mean(alpha_upper_m50_new - alpha_lower_m50_new)
# Delta
delta_stats_m50_new = []
for k in range(K_minus_1):
param_name = f"delta[{k+1}]"
delta_true_m50_new = np.array([p['delta'][k] for p in true_params_m50_new_alts])
delta_mean_m50_new = np.array([s.loc[param_name, 'Mean'] for s in posterior_summaries_m50_new_alts])
delta_lower_m50_new = np.array([s.loc[param_name, '5%'] for s in posterior_summaries_m50_new_alts])
delta_upper_m50_new = np.array([s.loc[param_name, '95%'] for s in posterior_summaries_m50_new_alts])
delta_stats_m50_new.append({
'parameter': f'δ_{k+1}',
'rmse': np.sqrt(np.mean((delta_mean_m50_new - delta_true_m50_new)**2)),
'coverage': np.mean((delta_true_m50_new >= delta_lower_m50_new) & (delta_true_m50_new <= delta_upper_m50_new)),
'ci_width': np.mean(delta_upper_m50_new - delta_lower_m50_new)
})
delta_df_m50_new = pd.DataFrame(delta_stats_m50_new)
```
```{python}
#| label: fig-three-way-comparison
#| fig-cap: "Three-way comparison of δ recovery: M=25 (base), M=50 (same alternatives), and M=50 (new alternatives). Adding new alternatives may provide different information than simply adding more problems over existing alternatives. (β exhibits similar patterns to δ due to their structural coupling.)"
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
params = ['δ₁', 'δ₂']
x = np.arange(len(params))
width = 0.25
# RMSE comparison
ax = axes[0]
rmse_m25_d = [delta_df.iloc[k]['rmse'] for k in range(K_minus_1)]
rmse_m50_d = [delta_df_m50.iloc[k]['rmse'] for k in range(K_minus_1)]
rmse_m50_new_d = [delta_stats_m50_new[k]['rmse'] for k in range(K_minus_1)]
bars1 = ax.bar(x - width, rmse_m25_d, width, label='M=25 (base)', color='steelblue', alpha=0.7)
bars2 = ax.bar(x, rmse_m50_d, width, label='M=50 (same alts)', color='coral', alpha=0.7)
bars3 = ax.bar(x + width, rmse_m50_new_d, width, label='M=50 (new alts)', color='mediumseagreen', alpha=0.7)
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('RMSE', fontsize=12)
ax.set_title('δ RMSE by Extension Type', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3, axis='y')
# Coverage comparison
ax = axes[1]
cov_m25_d = [delta_df.iloc[k]['coverage'] for k in range(K_minus_1)]
cov_m50_d = [delta_df_m50.iloc[k]['coverage'] for k in range(K_minus_1)]
cov_m50_new_d = [delta_stats_m50_new[k]['coverage'] for k in range(K_minus_1)]
bars1 = ax.bar(x - width, cov_m25_d, width, label='M=25 (base)', color='steelblue', alpha=0.7)
bars2 = ax.bar(x, cov_m50_d, width, label='M=50 (same alts)', color='coral', alpha=0.7)
bars3 = ax.bar(x + width, cov_m50_new_d, width, label='M=50 (new alts)', color='mediumseagreen', alpha=0.7)
ax.axhline(y=0.90, color='red', linestyle='--', linewidth=2, label='Target (90%)')
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('Coverage', fontsize=12)
ax.set_title('δ 90% CI Coverage by Extension Type', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.set_ylim(0, 1.05)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3, axis='y')
# CI Width comparison
ax = axes[2]
ci_m25_d = [delta_df.iloc[k]['ci_width'] for k in range(K_minus_1)]
ci_m50_d = [delta_df_m50.iloc[k]['ci_width'] for k in range(K_minus_1)]
ci_m50_new_d = [delta_stats_m50_new[k]['ci_width'] for k in range(K_minus_1)]
bars1 = ax.bar(x - width, ci_m25_d, width, label='M=25 (base)', color='steelblue', alpha=0.7)
bars2 = ax.bar(x, ci_m50_d, width, label='M=50 (same alts)', color='coral', alpha=0.7)
bars3 = ax.bar(x + width, ci_m50_new_d, width, label='M=50 (new alts)', color='mediumseagreen', alpha=0.7)
ax.set_xlabel('Parameter', fontsize=12)
ax.set_ylabel('CI Width', fontsize=12)
ax.set_title('δ 90% CI Width by Extension Type', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(params)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
```
```{python}
#| label: tbl-three-way-comparison
#| echo: false
# Create three-way comparison table using pandas (no extra dependencies)
three_way_rows = []
# Alpha
three_way_rows.append({
'Param': 'α',
'RMSE M=25': f'{alpha_rmse:.4f}',
'RMSE M=50 (same)': f'{alpha_rmse_m50:.4f}',
'RMSE M=50 (new)': f'{alpha_rmse_m50_new:.4f}',
'CI Width M=25': f'{alpha_ci_width:.3f}',
'CI Width M=50 (same)': f'{alpha_ci_width_m50:.3f}',
'CI Width M=50 (new)': f'{alpha_ci_width_m50_new:.3f}'
})
# Delta
for i in range(K_minus_1):
three_way_rows.append({
'Param': f'δ_{i+1}',
'RMSE M=25': f'{delta_df.iloc[i]["rmse"]:.4f}',
'RMSE M=50 (same)': f'{delta_df_m50.iloc[i]["rmse"]:.4f}',
'RMSE M=50 (new)': f'{delta_stats_m50_new[i]["rmse"]:.4f}',
'CI Width M=25': f'{delta_df.iloc[i]["ci_width"]:.3f}',
'CI Width M=50 (same)': f'{delta_df_m50.iloc[i]["ci_width"]:.3f}',
'CI Width M=50 (new)': f'{delta_stats_m50_new[i]["ci_width"]:.3f}'
})
three_way_df = pd.DataFrame(three_way_rows)
print("Three-Way Extension Comparison:")
print(three_way_df.to_string(index=False))
```
### Interpretation: Sample Size and Alternative Effects
The three-way comparison reveals important patterns about what helps δ recovery (β exhibits similar patterns due to the structural coupling between these parameters):
::: {.callout-note}
## Extension Type Matters
Comparing the three conditions:
1. **M=25 (base)**: Baseline recovery with modest δ precision
2. **M=50 (same alternatives)**: Adding 25 problems over the same R=15 alternatives
3. **M=50 (new alternatives)**: Adding 25 problems using 15 **new** alternatives (R increases to 30)
Directional findings from the three-way table:
- **α recovery** improves with more data under either extension; the two M=50 conditions are comparable for α.
- **δ recovery** suggests a possible modest additional benefit from new alternatives over same alternatives, consistent with the idea that expanding the feature space provides qualitatively different information about how features map to subjective probabilities (and thus indirectly about δ through the multiplicative coupling).
These directional comparisons should be read with the precision caveat in mind: with $n = 50$ recovery iterations, the standard error on coverage is roughly four percentage points, and small differences in RMSE or CI width should not be over-interpreted. More direct across-model comparisons — with matched designs and/or a calibration-focused framing — are provided in [Report 5](05_adding_risky_choices.qmd) and [Report 6](06_sbc_validation.qmd).
:::
This three-way analysis sets up a crucial comparison for [Report 5](05_adding_risky_choices.qmd), where we add risky alternatives (lotteries with known probabilities) instead of additional uncertain alternatives. By comparing:
- **M=50 with new uncertain alternatives** (this report): Expands uncertain feature space
- **M=25 + N=25 with new risky alternatives** (Report 5): Adds known-probability lotteries
...we can more cleanly distinguish whether improvements come from the *type* of data (risk vs. uncertainty) or simply from adding more diverse choice problems.
The asymmetry in learning rates for (β, δ) versus α reflects the model structure: in decisions under uncertainty, utilities and subjective probabilities enter the likelihood through their product (expected utilities). This creates a multiplicative coupling that makes (β, δ) harder to pin down than α, which more directly governs choice sensitivity.
### Study Design Parameters in m_0
The `m_0.stan` data block defines several parameters that characterize study size:
| Parameter | Description | Current Value |
|-----------|-------------|---------------|
| M | Number of decision problems | 25 → 50 |
| K | Number of consequences | 3 |
| D | Feature dimensions | 5 |
| R | Distinct alternatives | 15 |
We might consider varying other parameters:
- **Increasing K** (more consequences): Would require estimating more δ parameters, potentially making recovery harder
- **Increasing R** (more alternatives): Provides more information about β, and may indirectly help δ through better-constrained expected utilities
- **Increasing D** (more features): Similar to increasing R—primarily helps with β
The effectiveness of these changes for (β, δ) recovery is an empirical question worth exploring in future work.
### Alternatives to More Uncertain-Choice Data
Beyond simply collecting more uncertain-choice observations, three further routes can sharpen recovery of (β, δ). They differ in *where* the additional information comes from — the design, the prior, or cross-agent structure — and have very different epistemic standing.
#### Route 1: Risky alternatives as a direct probe of utility
The history of decision theory suggests the use of alternatives with known probabilities — *risky alternatives* — as a direct probe of the utility function. The mechanical reason is transparent in our notation: for a risky alternative the probability vector $\boldsymbol{\psi}_r$ is fixed exogenously by the experimenter rather than being a function of $\boldsymbol{\beta}$ and the feature vector. Conditional on those observations, the expected utility $\eta_r = \boldsymbol{\psi}_r^\top \boldsymbol{\upsilon}$ is *linear* in $\boldsymbol{\upsilon}$ (and hence in $\boldsymbol{\delta}$), so the multiplicative β–δ coupling that hampers recovery in m_0 is broken for the risky portion of the data.
The use of objectively risky prospects to identify utilities has a long lineage — running through Ramsey, von Neumann–Morgenstern, and the Anscombe–Aumann construction that combines risky and uncertain prospects in a single domain. We do not need the formal Anscombe–Aumann representation theorem to motivate the design choice here; the identification gain is mechanical, as just described. The formal relationship between our parameter-recovery argument and the classical representation result is examined in [Report 5](05_adding_risky_choices.qmd).
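The sketch below illustrates the mechanical point with made-up numbers, reusing the υ = (0, δ₁, 1) construction from the coupling example above: once ψ is fixed by design, expected utility is a strictly increasing linear function of δ₁, so different δ₁ values imply different risky-choice probabilities and there is no β left to trade off against.

```python
# With a risky alternative, psi is set by the experimenter, so expected utility
# is a known linear function of delta_1 (illustrative numbers, K = 3).
psi_known = np.array([0.20, 0.50, 0.30])          # lottery probabilities, fixed by design
for delta1 in (0.20, 0.40, 0.60):
    eta = psi_known @ np.array([0.0, delta1, 1.0])
    print(f"delta_1 = {delta1:.2f} -> eta = {eta:.2f}")
# delta_1 = 0.20 -> eta = 0.40
# delta_1 = 0.40 -> eta = 0.50
# delta_1 = 0.60 -> eta = 0.60
```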
#### Route 2: A more concentrated Dirichlet prior on δ
A different lever is available without changing the design at all. In `m_0.stan`, $\boldsymbol{\delta}$ is a `simplex[K-1]` with prior $\boldsymbol{\delta} \sim \text{Dirichlet}(\mathbf{1})$ — i.e., a flat prior over admissible utility differences. Replacing the concentration vector $\mathbf{1}$ with $\alpha_0 \mathbf{1}$ for some $\alpha_0 > 1$ tightens the marginal prior on each $\delta_k$ around $1/(K-1)$. Through the multiplicative coupling between β and δ, this also indirectly tightens β: posterior uncertainty in δ that the likelihood cannot resolve is now absorbed by the prior, leaving less slack for β to drift.
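As a quick illustration of the effect, the snippet below compares prior draws of δ₁ under Dirichlet(1, 1) and Dirichlet(α₀·**1**) with α₀ = 5 for K = 3; the value 5 is an arbitrary example, not a recommendation.

```python
# How the Dirichlet concentration tightens the prior on delta (K = 3).
# alpha_0 = 5 is an arbitrary illustrative value, not a recommendation.
rng = np.random.default_rng(1)
flat = rng.dirichlet(np.ones(2), size=10_000)[:, 0]          # Dirichlet(1, 1): uniform marginal
tight = rng.dirichlet(5.0 * np.ones(2), size=10_000)[:, 0]   # Dirichlet(5, 5): Beta(5, 5) marginal
print(f"delta_1 prior SD (alpha_0 = 1): {flat.std():.3f}")   # ~0.29
print(f"delta_1 prior SD (alpha_0 = 5): {tight.std():.3f}")  # ~0.15
```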
The trade-offs deserve emphasis:
- The identification gain here comes from prior information, not data. It is appropriate only when the analyst has substantive grounds — domain knowledge, pilot data, or a deliberate modeling commitment that consequences are roughly equally spaced on the unit utility scale — for a more concentrated prior.
- A misspecified concentrated prior biases $\boldsymbol{\delta}$ (and, via coupling, $\boldsymbol{\beta}$) toward the prior mean and shrinks credible intervals below their nominal calibration. Calibration under the concentrated prior should be re-checked via SBC.
- Unlike adding risky choices, this route does not improve identification *of the data-generating process*; it improves *posterior precision* by importing assumptions.
We do not pursue this route empirically in this report series; we flag it as a methodologically natural option whose evaluation is left to future work.
#### Route 3: Hierarchical modeling across agents or cells
A third route becomes available when data are collected from multiple agents, experimental cells, or related contexts. If those agents can defensibly be modeled as sharing a utility scale — or as drawing utilities from a common population distribution — a hierarchical model can borrow strength across datasets. Data from one agent then help constrain the utility parameters for others through the shared or partially pooled structure.
This is the logic developed in the hierarchical sequence ([Reports 08–12](08_hierarchical_formulation.qmd)). In the implemented `h_m01` model, cell-level sensitivities and belief mappings vary across cells, while the utility simplex $\boldsymbol{\delta}$ is shared across the stacked data. The hierarchical recovery and SBC analyses ([Report 11](11_hierarchical_parameter_recovery.qmd), [Report 12](12_hierarchical_sbc_validation.qmd)) validate that structure at the alignment-study scale.
The caveat is important: hierarchical modeling does not mechanically break the single-agent β–δ multiplicative coupling in the way risky choices do. Its gain comes from pooling information under a cross-agent or cross-cell assumption. If the shared-utility or population-utility assumption is substantively wrong, the model can import bias just as a too-concentrated prior can. For that reason, hierarchical variants require the same prior predictive, recovery, and SBC validation used elsewhere in this report series.
::: {.callout-note}
## Three complementary routes beyond more uncertain choices
1. **Add risky-choice data** (model m_1, [Report 5](05_adding_risky_choices.qmd)). Adds new identifying information by fixing $\boldsymbol{\psi}_r$ exogenously, breaking the β–δ multiplicative coupling for those observations.
2. **Tighten the Dirichlet concentration on δ.** Adds prior information rather than data; cheap but assumption-laden, and only as defensible as the substantive grounds for the tighter prior.
3. **Use hierarchical modeling across agents or cells** ([Reports 08–12](08_hierarchical_formulation.qmd)). Adds cross-agent structure and partial pooling; powerful when shared-utility or population-utility assumptions are defensible, but not a substitute for design-based identification.
The three routes are not mutually exclusive and can be combined.
:::
## Conclusion
Parameter recovery analysis for model m_0 reveals differential recovery across parameters:
1. **Well-recovered parameter**: α (sensitivity) can be recovered with good precision across the lognormal(0,1) prior range examined here.
2. **Slower-learning parameters**: β (feature weights) and δ (utility differences) show wider uncertainty; credible intervals do narrow relative to the prior support, but more slowly than for α.
3. **Sample size helps, but differentially**: in these runs, doubling M improves α recovery more than (β, δ) recovery, and adding *new* alternatives suggests a possible modest additional benefit over adding more problems over the same alternatives.
This pattern reflects the identification structure analyzed formally in [Report 5](05_adding_risky_choices.qmd): α is informed more directly by choice sensitivity, while β and δ interact multiplicatively in the expected utility, causing uncertainty in one to propagate to the other.
**Implications for applications:**
- If the primary research question concerns sensitivity (α) — how strongly choices respond to expected utility differences — model m_0 performs well even with modest sample sizes.
- If precise decomposition of expected utility into beliefs (β) and values (δ) is needed, additional information is required.
**Four routes for sharpening (β, δ):**
1. **More uncertain-choice data**: larger M (and perhaps R) eventually yields more precise (β, δ) estimates, though the rate of improvement is slower than for α.
2. **Risky-choice data** (model m_1, [Report 5](05_adding_risky_choices.qmd)): adds new identifying information by fixing $\boldsymbol{\psi}_r$ exogenously, breaking the β–δ multiplicative coupling for those observations.
3. **A more concentrated Dirichlet prior on δ**: imports prior information rather than data; cheap but assumption-laden, and only as defensible as the substantive grounds for the tighter prior. Not pursued empirically in this report series; flagged as future work.
4. **Hierarchical modeling across agents or cells** ([Reports 08–12](08_hierarchical_formulation.qmd)): borrows strength through shared or partially pooled structure when multiple related datasets are available. This can sharpen utility recovery under defensible cross-agent assumptions, but should be validated with prior predictive checks, recovery, and SBC.
The choice among these routes — or their combination — depends on practical considerations like study feasibility, the relative importance of precise utility estimation for the research question, the availability of related agents or experimental cells, and the strength of substantive prior information available.
```{python}
#| label: cleanup
#| include: false
# Clean up temporary directories
import shutil
for d in [output_dir, output_dir_m50, output_dir_m50_new_alts]:
if os.path.exists(d):
shutil.rmtree(d)
```