Simulation-Based Calibration

Foundational Report 6

Categories: foundations, validation, m_0, m_1

SBC validation of m_0 and m_1. Both models show marginal calibration in this run; the SBC qualifies, rather than independently confirms, the δ identification-improvement story from Reports 4 and 5.

Author: Jeff Helzner

Published: May 14, 2026

0.1 Introduction

In Report 4, we observed poor recovery of the utility increment parameters δ in model m_0, and in Report 5, we hypothesized that adding risky choices would improve identification. The parameter recovery experiments in Report 5 suggested improvement, but with only 20 iterations, the results may not be statistically robust.

Simulation-Based Calibration (SBC) provides a more principled approach to assess parameter identification. Rather than asking “can we recover specific true values?” (as in parameter recovery), SBC asks: “does the posterior correctly represent uncertainty about the parameters?” That is, if we repeatedly (i) draw parameters from the prior, (ii) simulate data given those parameters, and (iii) compute the posterior, then the posterior should, on average, be calibrated—assigning the correct probability to regions of parameter space. SBC operationalizes this idea by checking whether the true parameter value occupies a uniform rank position within the posterior samples; systematic departures from uniformity signal that the posterior is too wide, too narrow, or shifted relative to the data-generating process.

Note: The SBC Principle

If the inference algorithm is correct and the model is identified, then:

  1. Draw θ* from the prior: \(\theta^* \sim p(\theta)\)
  2. Simulate data: \(y \sim p(y | \theta^*)\)
  3. Compute posterior: \(p(\theta | y)\)
  4. Calculate the rank of θ* in the posterior samples

The distribution of thinned ranks should be uniform if the model is well-calibrated. (Thinning is necessary because HMC produces dependent posterior samples; by retaining only every \(t\)-th draw, we reduce autocorrelation and better approximate the independence assumption underlying the rank statistic.) Non-uniformity indicates problems with either the inference algorithm or parameter identification.

Following the Stan User’s Guide (§10.1), we set the number of post-thinned posterior draws \(L\) and the number of SBC iterations \(N\) to be equal, both 999, so that the rank statistic takes one of \(L + 1 = 1000\) possible values. We use a single MCMC chain to avoid any between-chain alignment issues and thin by a factor of 4 to reduce autocorrelation.
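
To make the four steps concrete, here is a minimal, self-contained sketch on a toy conjugate normal-normal model. This is illustrative only (it is not one of the project's Stan models, and all names are local to the example); because the conjugate posterior is exact and its draws are independent, no thinning is needed and the resulting ranks are uniform by construction.

Show code
import numpy as np

rng = np.random.default_rng(0)
N_sims, L_draws = 999, 999
ranks = np.empty(N_sims, dtype=int)

for i in range(N_sims):
    theta_star = rng.normal(0.0, 1.0)            # (i) draw from the prior N(0, 1)
    y = rng.normal(theta_star, 1.0, size=5)      # (ii) simulate data (noise sd = 1)
    post_var = 1.0 / (1.0 + len(y))              # (iii) exact conjugate posterior
    post_mean = post_var * y.sum()
    draws = rng.normal(post_mean, np.sqrt(post_var), size=L_draws)
    ranks[i] = int(np.sum(draws < theta_star))   # (iv) rank of theta* among draws

# The ranks should be approximately discrete-uniform on {0, ..., L_draws}
print(np.histogram(ranks, bins=10, range=(0, L_draws + 1))[0])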

0.2 SBC Methodology

0.2.1 Rank Statistics

For each parameter θ, we compute the rank of the true value θ* within the posterior samples \(\{\theta_1, \theta_2, \ldots, \theta_S\}\):

\[ \text{rank}(\theta^*) = \sum_{s=1}^{S} \mathbf{1}[\theta_s < \theta^*] \]

If the posterior is calibrated, this rank follows a discrete uniform distribution on {0, 1, …, S}.

0.2.2 Diagnostics

We use several complementary diagnostics to assess calibration. Each provides a different lens on the same underlying question—whether the posterior is well-calibrated—so that no single summary statistic drives our conclusions.

  1. Rank histograms: A visual check for uniformity. Each bin of the histogram counts how many SBC simulations produced a thinned rank in a given range. If the posterior is calibrated, all bins should contain roughly the same number of counts (i.e., the histogram should be approximately flat). Systematic patterns—such as a U-shape (underdispersed posteriors, i.e., posteriors too narrow) or an inverted-U (overdispersed posteriors, i.e., posteriors too wide)—indicate specific calibration failures.

  2. ECDF plots: The empirical cumulative distribution function of the normalized ranks is plotted against the CDF of a Uniform(0, 1) distribution (the diagonal line). Departures from the diagonal reveal the same calibration issues as the rank histogram, but can be easier to read when the number of SBC simulations is small. Each ECDF plot includes a 95% simultaneous confidence band derived from the Kolmogorov-Smirnov distribution: for \(n\) SBC simulations the half-width is \(\epsilon = \sqrt{\ln(2/0.05)/(2n)}\), so any ECDF that remains entirely within the band is consistent with uniformity at the 5% level.

  3. Chi-square goodness-of-fit tests: A formal test of whether the observed bin counts in the rank histogram are consistent with a uniform distribution. The test statistic \(\chi^2 = \sum_b (O_b - E_b)^2 / E_b\) is compared to a \(\chi^2\) distribution with \(B - 1\) degrees of freedom (where \(B\) is the number of bins). A small \(p\)-value (e.g., \(p < 0.05\)) suggests the ranks are not uniform. With 999 SBC simulations and 20 bins, the expected count per bin is approximately 50, which is well within the range where the chi-square approximation is reliable.

  4. Kolmogorov-Smirnov (KS) tests: A non-parametric test comparing the rank ECDF to the Uniform(0, 1) CDF. The KS statistic equals the maximum vertical distance between the two curves. Like the chi-square test, a small \(p\)-value suggests non-uniformity. The KS test does not require binning, but can behave conservatively with discrete rank data.
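
As a compact illustration of diagnostics 3 and 4, the following helper (a sketch of ours, not part of the project's SBC code) applies both tests to a vector of SBC ranks. Binning over the full rank range ensures every bin covers the same number of possible rank values.

Show code
import numpy as np
from scipy import stats

def uniformity_tests(ranks, max_rank, n_bins=20):
    """Chi-square and KS tests of rank uniformity for one parameter."""
    # Bin over {0, ..., max_rank} so all bins cover equally many values
    counts = np.histogram(ranks, bins=n_bins, range=(0, max_rank + 1))[0]
    chi2, p_chi2 = stats.chisquare(counts)  # default null: equal expected counts
    # KS compares normalized ranks to Uniform(0, 1); conservative for discrete ranks
    ks, p_ks = stats.kstest(np.asarray(ranks) / max_rank, 'uniform')
    return {"chi2": chi2, "p_chi2": p_chi2, "ks": ks, "p_ks": p_ks}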

Note: Interpreting Rank Histogram Shapes

Different departures from uniformity signal different calibration failures:

  • Flat (uniform): The posterior is well-calibrated—it correctly represents uncertainty about the parameter.
  • U-shaped (excess counts in extreme bins): The posterior is underdispersed—it is too narrow relative to the true uncertainty, so true values fall in the tails too often.
  • Inverted-U (excess counts in central bins): The posterior is overdispersed—it is too wide and fails to concentrate around the true value. In practice this is the pattern most often associated with weak identification, since samplers tend to over-disperse on diffuse, ridge-shaped posteriors (though exact inference on an uninformative posterior would still yield uniform ranks).
  • Left- or right-skewed: The posterior is systematically biased—shifted away from the true value in one direction.
Note: Thinning and Autocorrelation

Thinning by a factor of \(t = 4\) is intended to produce approximately independent posterior draws. If the effective sample size (ESS) per unthinned draw is much less than \(1/t\), thinning may be insufficient and the rank statistic could be affected by residual autocorrelation. We verify thinning adequacy by examining ESS from representative fits in the MCMC Diagnostics section below.
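
A small synthetic check of the 1/t rule, assuming ArviZ is available (az.ess accepts a (chain, draw) array): an AR(1) chain with lag-1 correlation ρ = 0.5 has ESS/draw ≈ (1 − ρ)/(1 + ρ) ≈ 0.33 before thinning, while thinning by 4 cuts the residual correlation to ρ⁴ ≈ 0.06 and pushes ESS/draw close to 1.

Show code
import numpy as np
import arviz as az

rng = np.random.default_rng(1)

# AR(1) chain with lag-1 correlation rho and unit stationary variance
rho, n = 0.5, 4000
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal(scale=np.sqrt(1 - rho**2))

ess_raw = az.ess(x[None, :])     # arviz expects a (chain, draw) array
ess_thin = az.ess(x[None, ::4])  # thin by t = 4
print(f"raw:     ESS/draw = {float(ess_raw) / n:.2f}")          # ~0.33
print(f"thinned: ESS/draw = {float(ess_thin) / (n // 4):.2f}")  # ~0.9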

0.2.3 Study Design

We use comparable but independently generated study designs for m_0 and m_1: the uncertain-choice dimensions match (same \(M\), \(K\), \(D\), \(R\), and alternative-count bounds), while m_1 adds a risky-choice block of the same nominal size (\(N = 25\), \(S = 15\)). The designs are not paired at the level of individual problems (unlike Report 5’s matched-pair recovery comparison).

Show code
# Study design configurations
config_m0 = {
    "M": 25,                    # Number of uncertain decision problems
    "K": 3,                     # Number of consequences
    "D": 5,                     # Feature dimensions
    "R": 15,                    # Distinct alternatives
    "min_alts_per_problem": 2,
    "max_alts_per_problem": 5,
    "feature_dist": "normal",
    "feature_params": {"loc": 0, "scale": 1}
}

config_m1 = {
    **config_m0,                # Same uncertain problem structure
    "N": 25,                    # Risky problems (matching M)
    "S": 15,                    # Risky alternatives (matching R)
}

print("Study Design Comparison:")
print(f"\nm_0 (Uncertain Only):")
print(f"  M = {config_m0['M']} decision problems")
print(f"  Total choices: ~{config_m0['M'] * 3.5:.0f}")

print(f"\nm_1 (Uncertain + Risky):")
print(f"  M = {config_m1['M']} uncertain + N = {config_m1['N']} risky")
print(f"  Total choices: ~{(config_m1['M'] + config_m1['N']) * 3.5:.0f}")
Study Design Comparison:

m_0 (Uncertain Only):
  M = 25 decision problems
  Total choices: ~88

m_1 (Uncertain + Risky):
  M = 25 uncertain + N = 25 risky
  Total choices: ~175
Note: Independent Study Designs

Unlike the matched-pair comparison in Report 5, which uses from_base_study() to ensure identical uncertain problems across models, SBC here generates one study design per model and reuses it across all SBC iterations for that model. Each SBC iteration draws a fresh parameter vector from the prior and simulates new choices conditional on the fixed design. SBC’s calibration test therefore averages over the prior and data-generating process for that fixed design, rather than over fresh designs at each iteration. The cross-model comparison reflects the models’ calibration under structurally comparable—but not paired—designs.

Show code
# SBC configuration following Stan User's Guide §10.1 recommendations.
# With L post-thinned draws per simulation, the rank statistic takes values
# in {0, 1, ..., L}, giving L + 1 possible values.  Setting L = N_sbc = 999
# means the expected rank histogram is discrete-uniform on 1000 values;
# with 20 bins we get ~50 counts per bin—well within the chi-square
# approximation's comfort zone—and the KS confidence band shrinks to
# ε ≈ 0.043, roughly 3× narrower than the previous n = 100 design.

n_sbc_sims = 999            # Number of SBC iterations (N)
n_mcmc_chains = 1           # Single chain (standard for SBC; see note below)
thin = 4                    # Thinning factor
n_mcmc_samples = thin * n_sbc_sims   # 3996 total draws → 999 post-thinned
max_rank = n_mcmc_samples // thin     # 999 (= L)
n_bins = 20                 # Histogram bins
expected_per_bin = n_sbc_sims / n_bins  # ~49.95

print("SBC Configuration:")
print(f"  Simulations (N):        {n_sbc_sims}")
print(f"  MCMC samples per sim:   {n_mcmc_samples}")
print(f"  Chains:                 {n_mcmc_chains}")
print(f"  Thinning factor:        {thin}")
print(f"  Post-thinned draws (L): {max_rank}")
print(f"  Possible rank values:   {max_rank + 1}")
print(f"  Histogram bins:         {n_bins}")
print(f"  Expected counts/bin:    {expected_per_bin:.1f}")
SBC Configuration:
  Simulations (N):        999
  MCMC samples per sim:   3996
  Chains:                 1
  Thinning factor:        4
  Post-thinned draws (L): 999
  Possible rank values:   1000
  Histogram bins:         20
  Expected counts/bin:    50.0

0.3 Running SBC for m_0

First, we run SBC for model m_0 (uncertain choices only) to establish a baseline:

Show code
from utils.study_design import StudyDesign
from analysis.sbc import SimulationBasedCalibration

# Create m_0 study design
study_m0 = StudyDesign(
    M=config_m0["M"],
    K=config_m0["K"],
    D=config_m0["D"],
    R=config_m0["R"],
    min_alts_per_problem=config_m0["min_alts_per_problem"],
    max_alts_per_problem=config_m0["max_alts_per_problem"],
    feature_dist=config_m0["feature_dist"],
    feature_params=config_m0["feature_params"],
    design_name="sbc_m0"
)
study_m0.generate()
Show code
# Create output directory (os and tempfile imported here for completeness;
# project_root is defined in the report's setup)
import os
import tempfile

output_dir_m0 = tempfile.mkdtemp(prefix="sbc_m0_")

# Run SBC for m_0
sbc_m0 = SimulationBasedCalibration(
    sbc_model_path=os.path.join(project_root, "models", "m_0_sbc.stan"),
    study_design=study_m0,
    output_dir=output_dir_m0,
    n_sbc_sims=n_sbc_sims,
    n_mcmc_samples=n_mcmc_samples,
    n_mcmc_chains=n_mcmc_chains,
    thin=thin
)

ranks_m0, true_params_m0 = sbc_m0.run()
m_0 SBC Complete:
  Simulations: 999
  Parameters: 18

0.4 Running SBC for m_1

Now we run SBC for model m_1 (uncertain + risky choices):

Show code
from utils.study_design_m1 import StudyDesignM1

# Create m_1 study design
study_m1 = StudyDesignM1(
    M=config_m1["M"],
    N=config_m1["N"],
    K=config_m1["K"],
    D=config_m1["D"],
    R=config_m1["R"],
    S=config_m1["S"],
    min_alts_per_problem=config_m1["min_alts_per_problem"],
    max_alts_per_problem=config_m1["max_alts_per_problem"],
    risky_probs="random",
    feature_dist=config_m1["feature_dist"],
    feature_params=config_m1["feature_params"],
    design_name="sbc_m1"
)
study_m1.generate()
Show code
# Create output directory  
output_dir_m1 = tempfile.mkdtemp(prefix="sbc_m1_")

# Run SBC for m_1
sbc_m1 = SimulationBasedCalibration(
    sbc_model_path=os.path.join(project_root, "models", "m_1_sbc.stan"),
    study_design=study_m1,
    output_dir=output_dir_m1,
    n_sbc_sims=n_sbc_sims,
    n_mcmc_samples=n_mcmc_samples,
    n_mcmc_chains=n_mcmc_chains,
    thin=thin
)

ranks_m1, true_params_m1 = sbc_m1.run()
m_1 SBC Complete:
  Simulations: 999
  Parameters: 18

0.5 MCMC Diagnostics

SBC relies on each individual MCMC fit producing valid posterior samples. We follow the standard SBC convention of running a single chain per simulation (Talts et al. 2018): SBC’s calibration test depends on each fit returning approximately exact draws from the posterior, and using one chain (rather than splitting a fixed compute budget across multiple chains) maximizes the post-thinned sample size \(L\) per simulation, which controls the rank-statistic resolution and the chi-square / KS power. The trade-off is that single-chain runs preclude the cross-chain \(\hat{R}\) diagnostic; we therefore rely on effective sample size (ESS) for single-chain mixing checks.

To keep storage manageable, the SBC infrastructure saves full posterior summaries only for the first 10 iterations of each model. The ESS values reported below are therefore a representative sampler-health check on those early fits, not an exhaustive sweep across all 999 simulations.

MCMC Diagnostics (from first 10 SBC iterations)
=======================================================

m_0:
  Minimum ESS across all params/sims:  1328
  Median ESS (across sims):            3522
  Post-thinning draws:                 999
  ESS/draw ratio:                      0.88

m_1:
  Minimum ESS across all params/sims:  2142
  Median ESS (across sims):            4475
  Post-thinning draws:                 999
  ESS/draw ratio:                      1.12

✓ ESS is adequate: thinning by 4 yields approximately independent draws for all parameters.
Note: Limitations of Single-Chain Diagnostics

With a single MCMC chain per SBC iteration, \(\hat{R}\) (which requires multiple chains) is unavailable. We rely instead on ESS to assess mixing. Divergent transition counts are monitored by CmdStan during fitting; if any iterations had problematic divergences, this would typically manifest as poor ESS or anomalous rank distributions. A more comprehensive diagnostic framework would log per-iteration divergence counts and energy Bayesian fraction of missing information (E-BFMI), which we recommend for future SBC analyses with more complex models.
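
A sketch of what such per-iteration logging might look like, assuming the fits come from cmdstanpy (whose CmdStanMCMC objects expose sampler state via method_variables()). The function name and the E-BFMI computation are ours, not part of the current SBC infrastructure.

Show code
import numpy as np

def log_sampler_health(fit):
    """Divergence count and E-BFMI for one single-chain CmdStanMCMC fit."""
    mv = fit.method_variables()               # sampler diagnostics per draw
    n_div = int(np.sum(mv["divergent__"]))
    energy = mv["energy__"].ravel()           # single chain, so flattening is safe
    # E-BFMI: mean squared energy difference divided by energy variance
    e_bfmi = np.sum(np.diff(energy) ** 2) / (len(energy) * np.var(energy))
    return {"divergences": n_div, "e_bfmi": float(e_bfmi)}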

Note: Simulated Data Quality

Each SBC iteration draws parameters from the prior and simulates a complete dataset. The first 10 simulated datasets and their generating parameters are saved for inspection. The prior specification from Report 3 was checked there for prior-predictive plausibility; we do not, in this report, produce a programmatic degeneracy-check table across the saved SBC simulations. Future versions should add a compact summary (e.g., minimum and maximum across-alternative choice-probability spread per saved simulation) so that simulated-data quality can be audited reproducibly from the rendered report.
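
A sketch of the proposed degeneracy check, under our assumption that each saved simulation stores a choice-probability matrix with one row per decision problem, NaN-padded where a problem has fewer than the maximum number of alternatives.

Show code
import numpy as np

def choice_prob_spread(probs):
    """Smallest and largest per-problem (max - min) choice-probability spread."""
    spread = np.nanmax(probs, axis=1) - np.nanmin(probs, axis=1)
    return float(spread.min()), float(spread.max())

# Example for one simulation: spreads near 0 flag near-uniform (uninformative)
# problems; spreads near 1 flag near-deterministic ones.
probs = np.array([[0.20, 0.50, 0.30, np.nan, np.nan],
                  [0.85, 0.15, np.nan, np.nan, np.nan]])
print(choice_prob_spread(probs))  # ~(0.3, 0.7): neither degenerate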

0.6 Comparing Rank Distributions

The key diagnostic is comparing rank distributions between m_0 and m_1. For well-calibrated parameters, ranks should be uniformly distributed.

0.6.1 α (Sensitivity Parameter)

Show code
K, D = config_m0['K'], config_m0['D']

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# m_0 alpha ranks (index 0)
ax = axes[0]
counts_m0, bins_m0, _ = ax.hist(ranks_m0[:, 0], bins=n_bins, range=(0, max_rank + 1),
                                 alpha=0.7, color='steelblue', edgecolor='white')
# Expected count per bin for uniform distribution
expected_count = len(ranks_m0) / n_bins
ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2, 
           label=f'Uniform (E={expected_count:.1f})')
ax.set_xlabel('Rank', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('m_0: α Rank Distribution', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

# m_1 alpha ranks (index 0)
ax = axes[1]
counts_m1, bins_m1, _ = ax.hist(ranks_m1[:, 0], bins=n_bins, range=(0, max_rank + 1),
                                 alpha=0.7, color='forestgreen', edgecolor='white')
ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2, 
           label=f'Uniform (E={expected_count:.1f})')
ax.set_xlabel('Rank', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('m_1: α Rank Distribution', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Chi-square tests: bin over the full rank range {0, ..., max_rank} so each
# of the 20 bins covers exactly 50 of the 1000 possible rank values
chi2_m0_alpha, p_m0_alpha = stats.chisquare(
    np.histogram(ranks_m0[:, 0], bins=n_bins, range=(0, max_rank + 1))[0])
chi2_m1_alpha, p_m1_alpha = stats.chisquare(
    np.histogram(ranks_m1[:, 0], bins=n_bins, range=(0, max_rank + 1))[0])
print(f"\nα Uniformity Tests:")
print(f"  m_0: χ² = {chi2_m0_alpha:.2f}, p = {p_m0_alpha:.3f}")
print(f"  m_1: χ² = {chi2_m1_alpha:.2f}, p = {p_m1_alpha:.3f}")
Figure 1: SBC rank distributions for α. With 999 SBC simulations the chi-square test is well-powered. See the printed test statistics below for the exact values in this run.

α Uniformity Tests:
  m_0: χ² = 19.76, p = 0.409
  m_1: χ² = 26.17, p = 0.126
Note: Interpreting the α results

Both models identify α well, since sensitivity influences all choice probabilities. With 999 SBC simulations the chi-square test has adequate power to detect moderate departures from uniformity, so the p-values can be taken at face value: a rejection at the 0.05 level is meaningful evidence of miscalibration. If m_0 shows a notably lower p-value for α than m_1, that could indicate a mild calibration issue stemming from the β–δ identification problem propagating to α estimates; the parameter recovery results in Report 4 are relevant context here.

0.6.2 δ (Utility Increment Parameters)

This is the critical comparison. In m_0, the β–δ interaction makes δ difficult to identify from uncertain choices alone (Report 4); in m_1, risky choices supply direct information about utilities that is expected to improve δ identification. Whether that structural advantage produces a detectable difference in marginal SBC ranks is the empirical question the histograms and tests below address:

Show code
# Delta parameters are after alpha (1) and beta (K*D)
delta_start_idx = 1 + K * D
K_minus_1 = K - 1

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Expected count per bin for uniform distribution
expected_count = len(ranks_m0) / n_bins

for k in range(K_minus_1):
    param_idx = delta_start_idx + k
    
    # m_0
    ax = axes[0, k]
    counts_m0_delta, _, _ = ax.hist(ranks_m0[:, param_idx], bins=n_bins, range=(0, max_rank + 1),
                                     alpha=0.7, color='steelblue', edgecolor='white')
    ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2, 
               label=f'Uniform (E={expected_count:.1f})')
    ax.set_xlabel('Rank', fontsize=11)
    ax.set_ylabel('Count', fontsize=11)
    ax.set_title(f'm_0: δ_{k+1} Rank Distribution', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # m_1
    ax = axes[1, k]
    counts_m1_delta, _, _ = ax.hist(ranks_m1[:, param_idx], bins=n_bins, range=(0, max_rank + 1),
                                     alpha=0.7, color='forestgreen', edgecolor='white')
    ax.axhline(y=expected_count, color='red', linestyle='--', linewidth=2, 
               label=f'Uniform (E={expected_count:.1f})')
    ax.set_xlabel('Rank', fontsize=11)
    ax.set_ylabel('Count', fontsize=11)
    ax.set_title(f'm_1: δ_{k+1} Rank Distribution', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Formal uniformity tests for each δ component; delta_chi2_results is reused
# by the summary table in the Summary Statistics section below.
delta_chi2_results = []
print("δ Parameter Uniformity Tests:")
print("-" * 50)
for k in range(K_minus_1):
    param_idx = delta_start_idx + k
    chi2_m0_d, p_m0_d = stats.chisquare(
        np.histogram(ranks_m0[:, param_idx], bins=n_bins, range=(0, max_rank + 1))[0])
    chi2_m1_d, p_m1_d = stats.chisquare(
        np.histogram(ranks_m1[:, param_idx], bins=n_bins, range=(0, max_rank + 1))[0])
    ks_m0_d, ks_p_m0_d = stats.kstest(ranks_m0[:, param_idx] / max_rank, 'uniform')
    ks_m1_d, ks_p_m1_d = stats.kstest(ranks_m1[:, param_idx] / max_rank, 'uniform')
    delta_chi2_results.append({'m0_chi2': chi2_m0_d, 'm0_p': p_m0_d,
                               'm1_chi2': chi2_m1_d, 'm1_p': p_m1_d})
    print(f"\nδ_{k+1}:")
    print(f"  m_0: χ² = {chi2_m0_d:.2f}, p = {p_m0_d:.3f} | "
          f"KS = {ks_m0_d:.3f}, p = {ks_p_m0_d:.3f}")
    print(f"  m_1: χ² = {chi2_m1_d:.2f}, p = {p_m1_d:.3f} | "
          f"KS = {ks_m1_d:.3f}, p = {ks_p_m1_d:.3f}")
Figure 2: SBC rank distributions for δ parameters. Compare the shape of the rank histograms across models: departures from uniformity indicate poor calibration. See the formal test results below for quantitative assessment.
δ Parameter Uniformity Tests:
--------------------------------------------------

δ_1:
  m_0: χ² = 11.63, p = 0.901 | KS = 0.023, p = 0.656
  m_1: χ² = 16.56, p = 0.620 | KS = 0.028, p = 0.405

δ_2:
  m_0: χ² = 11.63, p = 0.901 | KS = 0.023, p = 0.656
  m_1: χ² = 17.92, p = 0.528 | KS = 0.028, p = 0.405
Note: Interpreting the δ results

Calibration and identification are related but distinct concepts. Calibration refers to whether the posterior accurately represents uncertainty (uniform SBC ranks). Identification refers to whether the data are informative enough to pin down the parameter. Importantly, weak identification does not by itself produce non-uniform ranks: if inference is exact, a posterior that correctly reverts to the prior still yields uniform ranks. Non-uniformity arises only when the computed posterior deviates from the true posterior, and weakly identified, ridge-shaped posteriors are precisely where samplers tend to produce such deviations (often a central peak, indicating draws that are overdispersed relative to the true posterior).

When reading the test statistics, note the following caveats:

  • Chi-square p-values depend on the particular binning. With 999 simulations and 20 bins, expected counts per bin are approximately 50, so the chi-square approximation is reliable and the test has good power to detect moderate departures from uniformity.
  • KS test values should in principle differ between models. If they appear identical or very similar, check whether the rank distributions actually differ visually in the histograms and ECDF plots. The KS test can behave conservatively with discrete rank data (since it was designed for continuous distributions), which may reduce its sensitivity to real differences between models.
  • The rank histograms and ECDF plots are the primary diagnostics; the formal tests supplement but do not replace visual inspection.

0.6.3 ECDF Comparison

The Empirical Cumulative Distribution Function (ECDF) provides another view of calibration. For well-calibrated parameters, the ECDF should follow the diagonal:

Show code
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for k in range(K_minus_1):
    param_idx = delta_start_idx + k
    
    # Normalize ranks to [0, 1] using max_rank (effective samples after thinning)
    ranks_norm_m0 = np.sort(ranks_m0[:, param_idx]) / max_rank
    ranks_norm_m1 = np.sort(ranks_m1[:, param_idx]) / max_rank
    
    ecdf = np.arange(1, len(ranks_norm_m0) + 1) / len(ranks_norm_m0)
    
    ax = axes[k]
    ax.step(ranks_norm_m0, ecdf, where='post', label='m_0', color='steelblue', linewidth=2)
    ax.step(ranks_norm_m1, ecdf, where='post', label='m_1', color='forestgreen', linewidth=2)
    ax.plot([0, 1], [0, 1], 'r--', linewidth=2, label='Uniform')
    
    # Add 95% confidence band
    n = len(ecdf)
    epsilon = np.sqrt(np.log(2/0.05) / (2 * n))  # Kolmogorov-Smirnov band
    ax.fill_between([0, 1], [0-epsilon, 1-epsilon], [0+epsilon, 1+epsilon], 
                    alpha=0.2, color='red', label='95% CI')
    
    ax.set_xlabel('Normalized Rank', fontsize=11)
    ax.set_ylabel('ECDF', fontsize=11)
    ax.set_title(f'δ_{k+1} ECDF Comparison', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)

plt.tight_layout()
plt.show()
Figure 3: ECDF plots for δ parameters. The closer the curve to the diagonal, the better the calibration. The shaded band shows the 95% Kolmogorov-Smirnov confidence region for n=999 simulations. Compare the m_0 and m_1 curves to assess whether adding risky choices improves δ calibration.

The 95% confidence band width is \(\epsilon \approx 0.043\) for \(n=999\) simulations, roughly three times narrower than the \(\epsilon \approx 0.136\) that would obtain with only 100 iterations. This gives us sensitivity to detect relatively small departures from uniformity.
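
The quoted band widths can be verified directly from the formula:

Show code
import numpy as np

def ks_band(n, alpha=0.05):
    """Half-width of the simultaneous (1 - alpha) KS confidence band."""
    return np.sqrt(np.log(2 / alpha) / (2 * n))

print(f"n = 999: ε = {ks_band(999):.3f}")  # ≈ 0.043
print(f"n = 100: ε = {ks_band(100):.3f}")  # ≈ 0.136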

0.6.4 β (Feature Weight Parameters)

Finally, we examine β calibration. Because β and δ interact multiplicatively in the expected utility (see the identification analysis in Report 5), poor identification of δ in m_0 can propagate to β, potentially degrading its calibration as well. In m_1, if risky choices successfully pin down δ, the β parameters may also benefit from improved calibration.

Show code
# Collect beta chi-square p-values for all K*D parameters
beta_start_idx = 1
beta_end_idx = 1 + K * D

beta_pvals_m0 = []
beta_pvals_m1 = []

for idx in range(beta_start_idx, beta_end_idx):
    counts_m0 = np.histogram(ranks_m0[:, idx], bins=n_bins, range=(0, max_rank + 1))[0]
    counts_m1 = np.histogram(ranks_m1[:, idx], bins=n_bins, range=(0, max_rank + 1))[0]
    
    _, p_m0 = stats.chisquare(counts_m0)
    _, p_m1 = stats.chisquare(counts_m1)
    
    beta_pvals_m0.append(p_m0)
    beta_pvals_m1.append(p_m1)

fig, ax = plt.subplots(figsize=(10, 5))

x = np.arange(K * D)
width = 0.35

bars1 = ax.bar(x - width/2, beta_pvals_m0, width, label='m_0', color='steelblue', alpha=0.7)
bars2 = ax.bar(x + width/2, beta_pvals_m1, width, label='m_1', color='forestgreen', alpha=0.7)

ax.axhline(y=0.05, color='red', linestyle='--', linewidth=2, label='α = 0.05')
ax.set_xlabel('β Parameter Index', fontsize=11)
ax.set_ylabel('Chi-square p-value', fontsize=11)
ax.set_title('β Parameter Calibration (p-values)', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"\nβ Calibration Summary:")
print(f"  m_0: {np.sum(np.array(beta_pvals_m0) > 0.05)}/{K*D} parameters well-calibrated (p > 0.05)")
print(f"  m_1: {np.sum(np.array(beta_pvals_m1) > 0.05)}/{K*D} parameters well-calibrated (p > 0.05)")
Figure 4: Summary of β parameter SBC calibration. Because β and δ interact in the expected utility, the β–δ identification problem in m_0 may affect β calibration as well. Compare p-values across models to assess whether m_1’s improved δ identification also benefits β.

β Calibration Summary:
  m_0: 15/15 parameters well-calibrated (p > 0.05)
  m_1: 14/15 parameters well-calibrated (p > 0.05)
Note: Aggregated Presentation of β Results

With \(K \times D = 15\) feature-weight parameters, displaying individual rank histograms for each β would be impractical. The bar chart above summarizes calibration via per-parameter chi-square p-values, providing a compact overview. Individual rank histograms can be generated from the saved SBC output if detailed inspection of any particular β parameter is needed.
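
A hypothetical helper along those lines; the assumed rank-array layout (α first, then the K × D β block) matches the indexing used in the plots above.

Show code
import numpy as np
import matplotlib.pyplot as plt

def plot_beta_rank_hist(ranks, beta_index, n_bins=20, max_rank=999):
    """Rank histogram for one β parameter (index within the β block)."""
    col = 1 + beta_index                 # β columns start right after α
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.hist(ranks[:, col], bins=n_bins, range=(0, max_rank + 1),
            alpha=0.7, color='steelblue', edgecolor='white')
    ax.axhline(len(ranks) / n_bins, color='red', linestyle='--')
    ax.set_xlabel('Rank')
    ax.set_ylabel('Count')
    ax.set_title(f'β index {beta_index} rank distribution')
    return fig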

0.7 Summary Statistics

Show code
# Build summary table
summary_rows = []

# Alpha
counts_m0 = np.histogram(ranks_m0[:, 0], bins=n_bins, range=(0, max_rank + 1))[0]
counts_m1 = np.histogram(ranks_m1[:, 0], bins=n_bins, range=(0, max_rank + 1))[0]
chi2_m0, p_m0 = stats.chisquare(counts_m0)
chi2_m1, p_m1 = stats.chisquare(counts_m1)

summary_rows.append({
    'Parameter': 'α',
    'm_0 χ²': f'{chi2_m0:.2f}',
    'm_0 p-value': f'{p_m0:.3f}',
    'm_1 χ²': f'{chi2_m1:.2f}',
    'm_1 p-value': f'{p_m1:.3f}',
    'Improvement': '—' if p_m0 > 0.05 else ('✓' if p_m1 > 0.05 else '—')
})

# Beta (aggregated)
mean_p_m0 = np.mean(beta_pvals_m0)
mean_p_m1 = np.mean(beta_pvals_m1)
summary_rows.append({
    'Parameter': 'β (mean)',
    'm_0 χ²': '—',
    'm_0 p-value': f'{mean_p_m0:.3f}',
    'm_1 χ²': '—',
    'm_1 p-value': f'{mean_p_m1:.3f}',
    'Improvement': '—'
})

# Delta
for k in range(K_minus_1):
    result = delta_chi2_results[k]
    improvement = '✓' if (result['m0_p'] < 0.05 and result['m1_p'] > 0.05) else \
                  ('↑' if result['m1_p'] > result['m0_p'] else '—')
    summary_rows.append({
        'Parameter': f'δ_{k+1}',
        'm_0 χ²': f'{result["m0_chi2"]:.2f}',
        'm_0 p-value': f'{result["m0_p"]:.3f}',
        'm_1 χ²': f'{result["m1_chi2"]:.2f}',
        'm_1 p-value': f'{result["m1_p"]:.3f}',
        'Improvement': improvement
    })

summary_df = pd.DataFrame(summary_rows)
print(summary_df.to_string(index=False))
Table 1: SBC calibration comparison between m_0 and m_1. Higher p-values suggest better calibration (uniformity of rank distribution). With 999 simulations and ~50 expected counts per bin, the chi-square tests are well-powered.
Parameter m_0 χ² m_0 p-value m_1 χ² m_1 p-value Improvement
        α  19.76       0.409  26.17       0.126           —
 β (mean)      —       0.540      —       0.560           —
      δ_1  11.63       0.901  16.56       0.620           —
      δ_2  11.63       0.901  17.92       0.528           —
Multiple Testing Considerations
==================================================
  Parameters tested per model:         18
  Expected false rejections at α=0.05:  0.9
  m_0 rejections (p < 0.05):            0/18
  m_1 rejections (p < 0.05):            1/18

  m_0: Number of rejections is within the range expected
       by chance under the null (well-calibrated).
  m_1: Number of rejections is within the range expected
       by chance under the null (well-calibrated).
Note: Multiple Testing

With \(1 + K \times D + (K-1) = 18\) parameters tested per model (36 tests total across m_0 and m_1), the expected number of false rejections at the 5% level is approximately 0.9 per model. In this run the observed rejection counts are within that chance baseline for both models (see the printed counts above), so the marginal SBC tests do not indicate systematic miscalibration. Cross-model comparisons of systematic patterns (e.g., a parameter family failing in one model but passing in the other) would be more diagnostic than any individual test, but no such pattern is present here.
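
Under the idealized assumption that the 18 per-model tests are independent (they are not exactly, since all parameters share each fit), the null rejection count is Binomial(18, 0.05), which puts the observed counts of 0 and 1 squarely in the typical range:

Show code
from scipy import stats

n_tests, alpha = 18, 0.05
print(f"E[rejections]      = {n_tests * alpha:.1f}")                     # 0.9
print(f"P(>= 1 rejection)  = {stats.binom.sf(0, n_tests, alpha):.2f}")   # ~0.60
print(f"P(>= 3 rejections) = {stats.binom.sf(2, n_tests, alpha):.3f}")   # ~0.058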

0.8 Discussion

0.8.1 Interpretation of Results

The rank histograms, ECDF plots, and formal test statistics above directly assess posterior calibration for each parameter in m_0 and m_1. We now summarize and interpret the observed patterns.

Tip: Key Finding: Marginal Calibration in Both Models; Identification Story Qualified

In this 999-iteration SBC run, the marginal rank distributions are consistent with calibration for all assessed parameters in both models:

  • α is well-calibrated in both models. This is expected: sensitivity influences all choice probabilities and is therefore identifiable in both model structures.

  • δ is also marginally calibrated in both models: the chi-square and KS tests do not reject uniformity for either δ component in m_0 or m_1 (see the test statistics printed above). This run therefore does not independently reproduce the m_0 δ identification failure suggested by the parameter recovery analysis in Report 4. It does not contradict that analysis either: marginal SBC and parameter recovery answer different questions (see below).

  • β calibration is broadly similar across models, with most parameters passing the marginal chi-square test in both. The summary above gives the fraction passing; see the multiple-testing note for interpreting individual rejections.

The theoretical motivation for risky choices from the Anscombe-Aumann analysis in Report 5 and the recovery-based evidence in Reports 4 and 5 both stand on their own; this SBC run qualifies rather than independently confirms the identification-improvement story. See the discussion below for why marginal SBC may not surface the β–δ confound even when it is present.

Note: Distinguishing Identification from Inference Problems

SBC can in principle detect both inference failures (bugs or pathologies in the sampler) and identification problems (structural non-identifiability). One way to distinguish them is to combine SBC with MCMC diagnostics: if rank uniformity is rejected while ESS, divergences, and other sampler diagnostics look healthy, the issue is more likely identification than inference. In this run, the MCMC Diagnostics section shows adequate ESS in the first 10 saved fits for both models, and the marginal SBC tests do not reject uniformity—so we have no positive evidence here of either an inference or a marginal identification problem.

0.8.2 How SBC Complements Parameter Recovery

SBC and parameter recovery assess different aspects of model performance. Parameter recovery (Report 4) asks: “given specific true values, can the posterior concentrate around them?” SBC asks: “across the entire prior, does the posterior correctly represent uncertainty?” These are related but distinct questions:

  • Parameter recovery can succeed (good point estimates) even when the posterior is miscalibrated (e.g., credible intervals that are too narrow).
  • SBC can pass (uniform ranks) even when individual posteriors are wide and point estimates are poor—as long as the width correctly reflects the available information.
  • Marginal SBC can pass even when the joint posterior is poorly identified. A β–δ confound that manifests as a tilted joint posterior (e.g., a ridge along which β and δ trade off) can leave each marginal rank distribution close to uniform even though the joint structure is badly distorted.

The present results illustrate the third point. Report 4’s parameter-recovery analysis flagged weak δ identification in m_0 (wide intervals, modest coverage), but the marginal SBC tests for δ here are not rejected. This pattern is consistent with a confound that distorts the joint (β, δ) posterior more than its δ marginal—which is precisely what a multiplicative β⋅δ interaction would predict. A more targeted check would be a joint rank statistic (e.g., a projection-based or 2-D rank for (β, δ)); we list this under Future Work below.
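
A minimal sketch of such a projection-based rank (the function and the choice of projection direction are ours). Because any scalar function of the parameter vector inherits uniform SBC ranks under joint calibration, ranking along a direction aligned with the suspected β–δ ridge targets exactly the structure that marginal ranks can miss.

Show code
import numpy as np

def projection_rank(true_params, draws, direction):
    """Rank of a scalar projection of the true parameter vector among the
    same projection of the posterior draws.

    true_params : (P,) true vector for one SBC iteration (e.g., stacked β and δ)
    draws       : (L, P) post-thinned posterior draws from the same iteration
    direction   : (P,) projection direction, e.g., along the suspected β-δ ridge
    """
    return int(np.sum(draws @ direction < true_params @ direction))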

0.8.3 Why Non-Uniform Ranks Would Indicate Calibration Failure

When a parameter is poorly identified, the posterior remains diffuse. This alone does not break SBC (exact inference on an uninformative posterior still yields uniform ranks), but weakly identified, ridge-shaped posteriors are precisely where samplers tend to produce miscalibrated draws. When that happens, the failure would manifest as follows:

  1. The computed posterior is too wide relative to the true posterior (overdispersed draws)
  2. True values tend to have central ranks (ranks cluster around the median)
  3. The rank histogram shows a characteristic peak in the middle (the inverted-U pattern)

This pattern is the diagnostic signature the report was designed to detect. In the present 999-iteration run we do not observe it for δ in either model: consistent with marginal calibration, but, as noted above, not in itself ruling out a joint-distribution identification problem.

0.8.4 Why m_1 Is Expected to Achieve Better Identification

The theoretical case for m_1 does not depend on this SBC run. In m_1, risky choices provide direct information about utilities without confounding with subjective probabilities:

\[ \text{Risky EU: } \eta^{(r)}_s = \sum_{k=1}^K \pi_{sk} \cdot \upsilon_k \]

where π are the known objective probabilities. Structurally, this should help identification because:

  1. Risky choices constrain υ (and hence δ) directly, breaking the β–δ confound
  2. With δ better identified, uncertain choices more effectively constrain β
  3. Both choice types constrain α

Whether this structural advantage shows up in marginal SBC depends on how the confound manifests in the joint posterior; in the present run it does not produce a detectable marginal difference between m_0 and m_1.
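
The risky expected utility is just a matrix-vector product. A minimal numerical illustration (the values are arbitrary, not the model's priors):

Show code
import numpy as np

rng = np.random.default_rng(2)
S, K = 15, 3
pi = rng.dirichlet(np.ones(K), size=S)  # known objective probabilities; rows sum to 1
upsilon = np.array([0.0, 0.4, 1.0])     # illustrative utilities for the K consequences
eta = pi @ upsilon                      # η^(r), shape (S,): one EU per risky alternative
print(eta.round(3))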

0.9 Conclusion

Simulation-Based Calibration provides a principled framework for assessing posterior calibration that complements the parameter recovery analysis of Report 4. The SBC results in this 999-iteration run support three findings:

  1. Marginal calibration in both models. Rank distributions for α, β, and δ are consistent with uniformity in both m_0 and m_1 under the marginal chi-square and KS tests. The overall rejection counts are within the chance baseline for both models.

  2. The marginal SBC run does not independently reproduce m_0’s δ identification failure. The weak δ recovery documented in Report 4 and the theoretical motivation for risky choices in Report 5 both remain on the table, but this SBC analysis qualifies rather than confirms that story. A plausible reading is that a multiplicative β⋅δ confound distorts the joint (β, δ) posterior more than its δ marginal—so marginal SBC is not the right diagnostic for this particular identification issue.

  3. The structural argument for m_1 stands on its own. The Anscombe-Aumann decoupling in Report 5 gives a model-structural reason to expect better δ identification in m_1. The SBC results here neither confirm nor contradict that expectation at the level of marginal calibration.

With 999 SBC iterations, the marginal tests are well-powered: with \(\epsilon \approx 0.043\) and ≈50 expected counts per bin, the design has sensitivity to moderate marginal non-uniformity. The non-significant results should therefore be read as evidence against a detectable marginal calibration problem, not as a guarantee that the joint posterior is well-identified.

Note: On Statistical Power

With 999 SBC simulations and 20 histogram bins, the expected count per bin is approximately 50, well within the range where the chi-square approximation is reliable. The 95% KS confidence band narrows to \(\epsilon \approx 0.043\), roughly three times narrower than the \(\epsilon \approx 0.136\) that would obtain with only 100 iterations. The KS test can still behave conservatively with discrete rank data, so visual inspection of rank histograms and ECDF plots remains a useful complement to the formal tests.

Note: Limitations and Future Work

Several limitations of this analysis suggest concrete next steps:

  • Marginal-only calibration. This analysis validates marginal calibration for each parameter individually. Joint calibration—checking that parameter vectors (e.g., (β, δ) together) have jointly uniform ranks—is a more demanding criterion that we do not assess here. A joint or projection-based SBC (e.g., a 2-D rank for (β, δ), or rank statistics along directions of the prior-induced correlation) would be a more targeted test of the suspected β–δ confound and is the most natural follow-up.
  • MCMC diagnostics from saved fits only. Full posterior summaries are saved only for the first 10 SBC iterations per model. Future SBC runs should log per-iteration divergence counts, E-BFMI, and ESS across all 999 fits so that sampler health can be audited globally rather than from a representative sample.
  • Single, fixed study design per model. The SBC design is generated once and reused across iterations. An SBC variant that also randomizes the study design at each iteration would assess calibration averaged over designs as well as parameters; a design sweep along axes known to stress the β–δ confound (e.g., smaller \(M\), fewer alternatives per problem, or priors closer to Report 4’s recovery setup) would provide a complementary stress test.
  • No reproducible simulated-data audit table. As noted earlier, a compact degeneracy-check summary over the saved simulations would let readers reproduce the simulated-data quality check from the rendered report.

0.10 References

Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint arXiv:1804.06788.

Citation

BibTeX citation:
@online{helzner2026,
  author = {Helzner, Jeff},
  title = {Simulation-Based {Calibration}},
  date = {2026-05-14},
  url = {https://jeffhelzner.github.io/seu-sensitivity/foundations/06_sbc_validation.html},
  langid = {en}
}
For attribution, please cite this work as:
Helzner, Jeff. 2026. “Simulation-Based Calibration.” SEU Sensitivity Project, May 14. https://jeffhelzner.github.io/seu-sensitivity/foundations/06_sbc_validation.html.