---
title: "Hierarchical SBC Validation"
subtitle: "Foundational Report 12"
description: |
Simulation-Based Calibration of the h_m01 hierarchical model at the
alignment-study factorial scale (J=18, P=7). Rank-uniformity
diagnostics across all 29 model parameters confirm posterior
calibration under the tightened priors.
categories: [foundations, validation, h_m01]
execute:
cache: true
---
```{python}
#| label: setup
#| include: false
import sys
import os
import json
import tempfile
import warnings
warnings.filterwarnings('ignore')
sys.path.insert(0, os.path.join(os.getcwd(), '..'))
project_root = os.path.dirname(os.path.dirname(os.getcwd()))
sys.path.insert(0, project_root)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
np.random.seed(42)
```
## Introduction
[Report 11](11_hierarchical_parameter_recovery.qmd) showed that over 20 simulate-and-fit iterations the hierarchical `h_m01` model recovers its true parameters with near-nominal coverage and well-behaved HMC geometry. But 20 iterations provide limited statistical power for detecting miscalibration: at that sample size, an observed coverage of 0.85 versus a nominal 0.90 lies within roughly one Monte Carlo standard error.
Following the approach established for the flat models in [Report 6](06_sbc_validation.qmd), we now subject `h_m01` to the more rigorous test of **Simulation-Based Calibration**. SBC operationalises the question "does the posterior correctly represent uncertainty?" as a rank-uniformity check across many simulated datasets.
::: {.callout-note}
## The SBC Principle
If the inference algorithm is correct and each parameter is identified, then the rank of the true value $\theta^*$ within the thinned posterior draws $\{\theta_1, \ldots, \theta_L\}$ is uniformly distributed on $\{0, 1, \ldots, L\}$ across repeated simulations.
For each of $N$ simulations we:
1. Draw $\theta^* \sim p(\theta)$ from the tightened prior.
2. Simulate data $y \sim p(y \mid \theta^*)$.
3. Fit `h_m01_sbc.stan` (which uses the same data block as `h_m01.stan` but draws the true parameters in `transformed data` and exposes the rank as a generated quantity).
4. Extract the rank of $\theta^*$ for every parameter.
Systematic departures from uniformity — detected by $\chi^2$ bin-count tests, Kolmogorov–Smirnov (KS) tests on the normalised ranks, and visual ECDF-vs.-diagonal plots — signal miscalibration. A minimal Python sketch of the loop follows this callout.
:::
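To make the procedure concrete, here is a minimal sketch of the loop. The helpers `draw_prior`, `simulate_data`, and `fit_posterior` are hypothetical stand-ins for the model-specific machinery; in the actual workflow, steps 1 and 2 happen inside `h_m01_sbc.stan`'s `transformed data` block and the rank is emitted as a generated quantity.

```python
import numpy as np

def sbc_ranks(draw_prior, simulate_data, fit_posterior, n_sims=100, thin=3):
    """Minimal SBC loop: one rank per simulation and parameter.

    draw_prior, simulate_data and fit_posterior are hypothetical
    stand-ins for the machinery in analysis.hierarchical_sbc.
    """
    ranks = []
    for _ in range(n_sims):
        theta_star = draw_prior()          # step 1: theta* ~ p(theta)
        y = simulate_data(theta_star)      # step 2: y ~ p(y | theta*)
        draws = fit_posterior(y)           # step 3: array (n_draws, n_params)
        thinned = draws[::thin]            # keep every thin-th draw
        # step 4: rank of theta* among the L thinned draws, in {0, ..., L}
        ranks.append((thinned < theta_star).sum(axis=0))
    return np.array(ranks)
```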
## SBC Configuration
SBC is run live inside the report; Quarto caches the result so the
analysis re-executes only when this report changes.
```{python}
#| label: run-sbc
#| output: false
from utils.study_design_hierarchical import HierarchicalStudyDesign
from analysis.hierarchical_sbc import HierarchicalSBC
study = HierarchicalStudyDesign.from_factorial(
factors=[6, 3], reference_indices=[0, 0],
K=3, D=2, R=10, M_per_cell=20,
min_alts_per_problem=2, max_alts_per_problem=4,
feature_dist="normal", feature_params={"loc": 0, "scale": 1},
design_name="h_m01_sbc",
)
study.generate()
output_dir = tempfile.mkdtemp(prefix="h_m01_sbc_")
sbc = HierarchicalSBC(
study_design=study,
output_dir=output_dir,
n_sbc_sims=100,
n_mcmc_samples=1000,
n_mcmc_chains=1,
thin=3,
)
sbc.run()
def _img(path, width=720):
return Image(filename=os.path.join(output_dir, path), width=width)
```
```{python}
#| label: config
#| echo: false
info = json.load(open(os.path.join(output_dir, "config_info.json")))
print("SBC configuration:")
print(f" n_sbc_sims = {info['n_sbc_sims']}")
print(f" n_mcmc_samples = {info['n_mcmc_samples']}")
print(f" n_mcmc_chains = {info['n_mcmc_chains']}")
print(f" thin = {info['thin']}")
print(f" effective draws = {info['effective_sample_size']}")
print(f"\nDesign: J={info['J']}, K={info['K']}, P={info['P']}")
print(f"Parameters tracked: {len(info['param_names'])}")
print(f" {info['param_names']}")
```
We use the same `6 × 3` alignment factorial ($J = 18$, $P = 7$) as [Reports 10](10_hierarchical_prior_analysis.qmd) and [11](11_hierarchical_parameter_recovery.qmd). A single chain per simulation avoids between-chain alignment issues; thinning by a factor of 3 reduces the serial correlation that would otherwise distort the rank statistic. With 1 000 pre-thinning draws the post-thin sample is $L = 333$, so ranks take values in $\{0, 1, \ldots, 333\}$.
::: {.callout-tip}
## Why thin?
HMC draws are serially correlated, so their empirical rank is not exactly
uniform even under a correctly specified model. Thinning by a factor of $t$
reduces the serial correlation among the retained draws, bringing the rank
statistic closer to its nominal uniform distribution; under `h_m01` we found
$t = 3$ sufficient for the chain autocorrelation lengths implied by the
tightened priors. A toy illustration follows this callout.
:::
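As a toy illustration (synthetic, unrelated to the `h_m01` chains), thinning an AR(1) series with lag-one correlation $\rho$ by a factor of $t$ drives the retained-draw correlation towards $\rho^t$:

```python
import numpy as np

rng = np.random.default_rng(42)
rho, n = 0.7, 100_000

# simulate an AR(1) series with lag-1 autocorrelation rho
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0]
for i in range(1, n):
    x[i] = rho * x[i - 1] + np.sqrt(1 - rho**2) * eps[i]

def lag1_corr(v):
    return np.corrcoef(v[:-1], v[1:])[0, 1]

print(f"lag-1 correlation, unthinned: {lag1_corr(x):.3f}")       # ~= 0.70
print(f"lag-1 correlation, thin = 3:  {lag1_corr(x[::3]):.3f}")  # ~= 0.7^3
```

Real HMC chains have richer autocorrelation structure, so the reduction under `h_m01` is approximate rather than exact.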
## Diagnostics
For each of the 29 parameters we report the $\chi^2$ bin-count test (20 bins, 19 d.f.) and the KS test against the discrete uniform.
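Both tests can be reproduced from the raw rank vectors. The sketch below shows one way to compute them; the mid-rank normalisation for the KS test is an assumption, as the exact convention inside `HierarchicalSBC` is not shown here.

```python
import numpy as np
from scipy import stats

def uniformity_tests(ranks, L=333, n_bins=20):
    """chi^2 bin-count and KS tests for SBC ranks in {0, ..., L}."""
    ranks = np.asarray(ranks)
    # chi^2: bin the ranks into equal-width bins (20 bins -> 19 d.f.)
    counts, _ = np.histogram(ranks, bins=n_bins, range=(0, L + 1))
    expected = ranks.size / n_bins
    chi2_stat = ((counts - expected) ** 2 / expected).sum()
    chi2_p = stats.chi2.sf(chi2_stat, df=n_bins - 1)
    # KS: map ranks to (0, 1) and compare with Uniform(0, 1); the +0.5
    # mid-rank shift is an assumption, not taken from HierarchicalSBC
    u = (ranks + 0.5) / (L + 1)
    ks_stat, ks_p = stats.kstest(u, "uniform")
    return {"chi2": chi2_stat, "chi2_p": chi2_p,
            "ks": ks_stat, "ks_p": ks_p}
```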
```{python}
#| label: load-summary
#| echo: false
s = json.load(open(os.path.join(output_dir, "sbc_results", "sbc_summary.json")))
rows = []
for name in info["param_names"]:
c = s["chi2_tests"][name]
k = s["ks_tests"][name]
rows.append([name, round(c["chi2_statistic"], 2), round(c["p_value"], 4),
round(k["ks_statistic"], 3), round(k["p_value"], 4)])
df = pd.DataFrame(rows, columns=["param", "chi2", "chi2_p", "ks", "ks_p"])
print(df.to_string(index=False))
```
### Multiplicity correction
With 29 parameters, the Bonferroni threshold for a family-wise error rate of 0.05 is $\alpha^* = 0.05 / 29 \approx 0.0017$. We expect under $H_0$ (perfect calibration):
- 29 × 0.05 ≈ 1.45 uncorrected failures at $\alpha = 0.05$;
- 0 Bonferroni-corrected failures.
```{python}
#| label: multiplicity
#| echo: false
p_thresh = 0.05 / 29
def _report(tests, name):
flagged_uncorr = [(n, round(d["p_value"], 4)) for n, d in tests.items()
if d["p_value"] < 0.05]
flagged_bonf = [(n, round(d["p_value"], 4)) for n, d in tests.items()
if d["p_value"] < p_thresh]
print(f"{name}:")
print(f" p < 0.05 (expected ~1.45): {flagged_uncorr}")
print(f" p < Bonferroni ({p_thresh:.4f}): {flagged_bonf}")
_report(s["chi2_tests"], "chi2")
print()
_report(s["ks_tests"], "KS")
c_ps = [d["p_value"] for d in s["chi2_tests"].values()]
k_ps = [d["p_value"] for d in s["ks_tests"].values()]
print("\nQuantiles of p-values (a flat uniform would centre on 0.5):")
print(f" chi2 [q10, q25, q50, q75, q90] = "
f"{np.round(np.quantile(c_ps, [.1,.25,.5,.75,.9]), 3)}")
print(f" KS [q10, q25, q50, q75, q90] = "
f"{np.round(np.quantile(k_ps, [.1,.25,.5,.75,.9]), 3)}")
```
**Result**: 0 of 29 parameters fail the KS test at $\alpha = 0.05$; 3 of 29 fail the $\chi^2$ test uncorrected (close to the 1.45 expected by chance); **none survive Bonferroni correction**. The quantile sweep of the 29 p-values traces the Uniform(0, 1) CDF closely (medians near 0.5, 10th percentiles near 0.08–0.13), consistent with a globally well-calibrated model.
## Visual Diagnostics
### Regression coefficients
```{python}
#| label: fig-regression-ranks
#| fig-cap: "Rank histograms for $(\\gamma_0, \\boldsymbol{\\gamma}, \\sigma_\\text{cell})$. Flat distributions indicate calibration."
_img("sbc_results/regression_ranks.png")
```
```{python}
#| label: fig-regression-ecdf
#| fig-cap: "ECDF of normalised ranks versus the Uniform(0, 1) diagonal for the regression parameters. The shaded band is the 95 % simultaneous KS confidence envelope; ECDFs that stay inside the band are consistent with calibration at the 5 % level."
_img("sbc_results/regression_ecdf.png")
```
The regression hyperparameters $(\gamma_0, \gamma_1, \ldots, \gamma_7, \sigma_{\text{cell}})$ all sit well inside the simultaneous band. The marginal $\chi^2$ p-value for $\gamma_5$ ($p = 0.04$) reflects mild bin-count variation that the ECDF view treats as ordinary sampling noise.
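For reference, the half-width of a 95 % simultaneous KS band at $N = 100$ simulations can be computed as below; whether the plotting code uses this exact construction or a simulated envelope is not confirmed here.

```python
import numpy as np
from scipy import stats

n_sims = 100
# two-sided KS critical value: with probability 0.95 the ECDF of n
# uniform draws stays within +/- eps of the diagonal everywhere
eps = stats.kstwo.ppf(0.95, n_sims)   # ~= 0.134 for n = 100
grid = np.linspace(0.0, 1.0, 201)
lower = np.clip(grid - eps, 0.0, 1.0)
upper = np.clip(grid + eps, 0.0, 1.0)
print(f"95% simultaneous half-width: {eps:.3f}")
```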
### Cell-level sensitivities
```{python}
#| label: fig-alpha-ranks
#| fig-cap: "Rank histograms for the 18 cell-level $\\alpha_j$."
_img("sbc_results/alpha_ranks.png")
```
```{python}
#| label: fig-alpha-ecdf
#| fig-cap: "ECDF plots for cell-level $\\alpha$. All 18 cells stay inside the 95 % simultaneous band."
_img("sbc_results/alpha_ecdf.png")
```
The two cell-level parameters with the smallest uncorrected $\chi^2$ p-values ($\alpha_2$ at 0.016 and $\alpha_{12}$ at 0.047) show minor bin-count imbalance but no visual pattern suggesting systematic bias, mis-specified width, or shift. Under Bonferroni correction neither is flagged.
### Utility simplex
```{python}
#| label: fig-delta-ranks
#| fig-cap: "Rank histograms for the shared utility simplex $\\boldsymbol{\\delta}$."
_img("sbc_results/delta_ranks.png")
```
```{python}
#| label: fig-delta-ecdf
#| fig-cap: "ECDF plots for $\\boldsymbol{\\delta}$."
_img("sbc_results/delta_ecdf.png")
```
$\boldsymbol{\delta}$ is calibrated cleanly — a reassuring result given its central role in interpreting SEU-max rates.
## Interpretation
Taken together, the SBC diagnostics confirm that:
1. **The posterior is well-calibrated across all 29 parameters** of `h_m01` at the alignment-study factorial scale, *to the resolution that $N = 100$ SBC simulations affords*. With 29 parameters under Bonferroni correction, $N = 100$ sits at the lower end of conventionally adequate SBC sample sizes [@talts2018]: the test detects gross miscalibration but has limited power against subtle deviations, and a future hierarchical SBC at larger $N$ would tighten the conclusion.
2. **The 3 of 29 uncorrected $\chi^2$ flags are consistent with multiple-testing noise**: Bonferroni yields zero rejections, the observed p-value distribution tracks Uniform(0, 1), and the ECDF plots remain inside the simultaneous envelope for every parameter.
3. **The hierarchical model is ready for use on real alignment data**, subject to the $N = 100$ power qualification above. Combined with the recovery analysis ([Report 11](11_hierarchical_parameter_recovery.qmd)) and the healthy HMC geometry under the tightened priors, there are no remaining calibration concerns at $J = 18$, $P = 7$.
::: {.callout-note}
## What would miscalibration look like?
- **Posterior too narrow** ⇒ rank histogram piles up at the extremes (U-shape); ECDF leaves the simultaneous band at both ends.
- **Posterior too wide** ⇒ rank histogram piles up in the middle (inverted-U).
- **Posterior shifted** ⇒ monotone trend from low to high bins; ECDF sits systematically above or below the diagonal.
None of these patterns appear in the `h_m01` diagnostics: all ECDFs oscillate around the diagonal and all rank histograms are visually flat. A toy simulation of each failure mode follows this callout.
:::
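To make these failure modes concrete, each can be simulated with a toy Gaussian 'posterior' (synthetic, unrelated to the `h_m01` runs):

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_sims = 333, 100

def toy_ranks(scale=1.0, shift=0.0):
    """Rank of theta* ~ N(0, 1) among L toy posterior draws ~ N(shift, scale)."""
    theta = rng.normal(size=n_sims)
    draws = rng.normal(shift, scale, size=(n_sims, L))
    return (draws < theta[:, None]).sum(axis=1)

for label, kw in [("calibrated", {}),
                  ("too narrow", {"scale": 0.5}),   # U-shaped histogram
                  ("too wide",   {"scale": 2.0}),   # inverted-U
                  ("shifted",    {"shift": 0.5})]:  # monotone trend
    r = toy_ranks(**kw)
    outer = np.mean((r < L // 10) | (r > L - L // 10))
    print(f"{label:>10}: mean rank/L = {r.mean() / L:.2f}, "
          f"mass in outer bins = {outer:.2f}")
```

In the calibrated case $\theta^*$ and the draws share the same marginal, so the rank is uniform; the narrow, wide, and shifted variants reproduce the U-shape, inverted-U, and monotone patterns respectively.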
::: {.callout-note}
## Provenance and Reproducibility
This report runs `HierarchicalSBC` live on every build (Quarto caches the
result). The same analysis can be reproduced from the command line via
```bash
python scripts/run_hierarchical_sbc.py \
--config configs/h_m01_sbc_config.json
```
Configuration: 100 simulations × 1 chain × 1 000 draws, thin = 3 (333 effective draws), `adapt_delta=0.95`.
:::
```{python}
#| label: cleanup
#| include: false
import shutil
try:
shutil.rmtree(output_dir)
except Exception:
pass
```
## Summary
The hierarchical SBC completes the validation trilogy for `h_m01`:
| Validation step | Result | Report |
|---|---|---|
| Prior predictive is scientifically reasonable | α medians ≈ 12, overall P(SEU-max) ≈ 0.77 | [10](10_hierarchical_prior_analysis.qmd) |
| Parameters are identifiable and recovered | 0.01 % divergences, coverage near nominal | [11](11_hierarchical_parameter_recovery.qmd) |
| Posterior is calibrated | 0 / 29 Bonferroni failures (at $N = 100$ SBC sims) | 12 (this report) |
: Hierarchical validation summary at $J = 18$, $P = 7$. {#tbl-hm01-validation}
With all three checks passed, the hierarchical `h_m01` model can be fit to real alignment-study data with confidence that credible intervals, posterior probabilities, and contrasts between LLM / prompt effects will carry their nominal Bayesian meaning.