Probing the SEU Model: A Bayesian Workflow Tour

Probing the m_0 SEU sensitivity model with the modern Bayesian workflow: prior predictive checks, parameter recovery, simulation-based calibration, and posterior predictive checks.
Author: Jeff Helzner
Published: May 16, 2026

Foundations of SEU Sensitivity · Part 3 of 3
Part 1 · Part 2 · Part 3

Part 1 introduced the abstract sensitivity idea. Part 2 turned it into a Stan program by committing to a feature-to-probability map, an ordered-utility parameterization, and a set of priors. With that implementation in hand, the natural next question is whether the resulting posterior is something we should trust.

A posterior is just an answer the model gives to the data. It is only as good as the model that produced it. Asking “does this model behave the way a useful model should behave?” is the job of the modern Bayesian workflow, and the foundational reports run four checks on m_0 that together address that question:

  1. Prior predictive checks — does the model produce plausible decision behavior before seeing any data?
  2. Parameter recovery — given data the model itself generates, can the model recover the parameters that generated it?
  3. Simulation-based calibration (SBC) — do the model’s posteriors have the right calibration, not just the right point estimates?
  4. Posterior predictive checks (PPC) — does the fitted model reproduce features of the actual observed data?

Three of the four checks — prior predictive, recovery, and SBC — are pre-data diagnostics: they probe what the model is willing to say in advance and how it behaves on data the model itself generates. Only PPC compares the fitted model to the actual choices being analyzed. Each check interrogates a different aspect of the model, and failure on any of them changes what an eventual alpha estimate is allowed to mean.

Prior predictive: is the model plausible before data?

Before fitting any choices, we can ask what kind of decision behavior the model expects to see. Sampling parameters from the priors and simulating choices yields a prior predictive distribution over observable outcomes — the rate at which SEU-maximizing alternatives get chosen, the spread of expected utilities across problems, the apparent decisiveness of the simulated chooser.

The foundational prior-predictive report runs this check for m_0 on a small but realistic design: M = 25 problems with K = 3 consequences, D = 5 features, R = 15 distinct alternatives, and 2–5 alternatives per problem. Each prior draw produces a complete synthetic data set.
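
In outline, that simulation takes only a few dozen lines. The exact feature-to-probability map and ordered-utility parameterization are the ones fixed in Part 2; the sketch below substitutes plausible stand-ins (a softmax map from features to consequence probabilities, utilities built from positive increments), so its numbers will not reproduce the report's exactly. It is a minimal illustration of the recipe, not the reports' code.

```python
# Minimal numpy sketch of the prior predictive simulation for m_0.
# The feature-to-probability map and ordered-utility parameterization are
# illustrative stand-ins for the ones fixed in Part 2.
import numpy as np

rng = np.random.default_rng(0)
M, K, D, R = 25, 3, 5, 15              # design from the prior predictive report

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

X = rng.normal(size=(R, D))            # consequence features for the R alternatives

def seu_max_rate_one_draw():
    alpha = rng.lognormal(0.0, 1.0)                 # Lognormal(0, 1) prior on alpha
    beta = rng.normal(size=(D, K))                  # weakly informative prior (assumed form)
    delta = rng.exponential(size=K)                 # positive increments (assumed form)
    upsilon = np.cumsum(delta)                      # ordered utilities
    psi = softmax(X @ beta)                         # R x K: features -> consequence probabilities
    eta = psi @ upsilon                             # expected utility per alternative

    hits = 0
    for _ in range(M):
        alts = rng.choice(R, size=rng.integers(2, 6), replace=False)  # 2-5 alternatives
        p = softmax(alpha * eta[alts])              # softmax-over-expected-utility likelihood
        choice = rng.choice(len(alts), p=p)
        hits += int(choice == np.argmax(eta[alts])) # chose the SEU-maximizing alternative?
    return hits / M

rates = np.array([seu_max_rate_one_draw() for _ in range(500)])
print(rates.min().round(2), rates.mean().round(2), rates.max().round(2))
```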

The headline summary is that under the m_0 priors, the rate at which the SEU-maximizing alternative is chosen ranges from about 12% to 92% across simulations, with a mean near 43%. Compare that to the random-choice baseline of 1/3 ≈ 33% for problems with three alternatives on average.

Figure 1: Distribution of SEU-maximizer selection rates across prior draws. Each prior draw produces a complete synthetic data set; the histogram shows the fraction of problems in which the SEU-maximizing alternative is chosen, spanning roughly 12% to 92%. The vertical line marks the random-choice baseline.

What this range tells us is something about the prior, not about any real decision maker. It says the priors do not commit in advance to any particular regime: they allow draws where the simulated chooser looks nearly random (12%) and draws where the simulated chooser is highly sensitive (92%). The Lognormal(0, 1) prior on alpha, paired with the weakly informative priors on beta and delta, expresses genuine uncertainty across the regimes named in Part 1.

A few more things are worth noting. The mean (43%) sits modestly above the baseline rather than far above or below it. The simulations show no pathologies: no degenerate distributions over choices, no impossible utility orderings, no numerical breakdowns. And the breadth of the range is itself a diagnostic. A prior that produced an SEU-max rate concentrated in a narrow window — say, 35–45% — would be quietly assuming the decision maker is near-random; a prior concentrated in 85–95% would be quietly assuming high sensitivity. The default priors do neither.

A more substantive prior predictive check — and one worth flagging — would compare these prior predictions to any data we already have. If real choices in our domain regularly produce SEU-maximizer rates in the far upper tail of the prior predictive distribution, that is prior–data conflict, and it warns us that the default priors are mis-specified for the application. The Lognormal(0, 1) prior on alpha is where this most often shows up. In the initial temperature study, for example, observed choice data sit far above the regime the default prior emphasizes, which is exactly why that study replaces the default with Lognormal(3.0, 0.75).
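
Operationally, the comparison is a one-liner once the prior predictive rates are in hand. The sketch below continues the simulation above; the observed rate is purely illustrative, not a number from any real study.

```python
# Continues the prior predictive sketch: `rates` is the array of simulated
# SEU-maximizer rates. The observed rate below is an illustrative placeholder.
observed_rate = 0.80
tail_prob = (rates >= observed_rate).mean()
print(f"P(prior predictive rate >= observed) = {tail_prob:.3f}")
# A value near 0 flags prior-data conflict: the prior puts essentially no
# mass on behavior as decisive as what was actually observed.
```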

Parameter recovery: can the model find what it put in?

The next check is a kind of self-test. If we generate synthetic data from the model itself — sampling parameters from the priors, then sampling choices from the resulting likelihood — can the model recover the parameters that generated it?
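
In outline, each recovery iteration simulates a data set from the priors and likelihood, fits the model to it, and stores the posterior draws alongside the generating value. The sketch below assumes the m_0 Stan program from Part 2 is saved as m0.stan and that a hypothetical helper, simulate_m0_data, performs the generative steps from the prior predictive sketch and packages them into whatever data dictionary the program expects; both names are stand-ins, not the reports' actual code.

```python
# Schematic recovery loop: draw from the prior, simulate choices, fit with
# Stan, keep the posterior draws for alpha alongside the generating value.
# "m0.stan" and simulate_m0_data are hypothetical placeholders, not the
# actual program or simulator used in the foundational reports.
import numpy as np
from cmdstanpy import CmdStanModel

def simulate_m0_data(rng):
    """Draw (alpha, beta, delta) from the priors, simulate M problems of
    choices, and return (true_alpha, stan_data); the generative steps are
    those of the prior predictive sketch above. Body omitted here."""
    raise NotImplementedError

model = CmdStanModel(stan_file="m0.stan")    # compiled m_0 program (assumed path)
rng = np.random.default_rng(2)

true_alpha, alpha_draws = [], []
for _ in range(50):                          # 50 recovery iterations
    alpha, stan_data = simulate_m0_data(rng)
    fit = model.sample(data=stan_data, chains=4, iter_sampling=1000,
                       show_progress=False)
    true_alpha.append(alpha)
    alpha_draws.append(fit.stan_variable("alpha"))   # 1-D array of posterior draws
```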

The foundational recovery report runs this exercise systematically on m_0. The results are uneven across parameters, in a way that turns out to be informative.

alpha recovers well. Across 50 recovery iterations on the small-M design above, the 90% credible interval for alpha covered the true generating value 90% of the time, with a root-mean-square error of 1.17 on the scale implied by the Lognormal(0, 1) prior and a mean 90% credible-interval width of 3.25 (Report 4 §Sensitivity Parameter). Coverage matches the nominal level, the posterior mean tracks the truth closely, and the credible intervals are wide but honest.
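
For concreteness, those three summaries (coverage, RMSE, mean interval width) are computed roughly as follows from the quantities a recovery loop collects. Placeholder arrays stand in for actual fits so the snippet runs on its own.

```python
# Coverage, RMSE, and mean 90% interval width from a recovery run.
# Placeholder arrays stand in for the generating values and posterior draws
# that an actual recovery loop (like the sketch above) would collect.
import numpy as np

rng = np.random.default_rng(3)
n_iter, n_draws = 50, 4000
true_alpha = rng.lognormal(0.0, 1.0, size=n_iter)             # one generating value per iteration
center = true_alpha * rng.lognormal(0.0, 0.5, size=n_iter)    # noisy posterior location (placeholder)
alpha_draws = center[:, None] * rng.lognormal(0.0, 0.5, size=(n_iter, n_draws))

post_mean = alpha_draws.mean(axis=1)
lo, hi = np.quantile(alpha_draws, [0.05, 0.95], axis=1)       # central 90% interval per iteration

coverage = np.mean((true_alpha >= lo) & (true_alpha <= hi))   # fraction of intervals covering truth
rmse = np.sqrt(np.mean((post_mean - true_alpha) ** 2))
mean_width = np.mean(hi - lo)
print(coverage, rmse, mean_width)
```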

Figure 2: Recovery of the sensitivity parameter α across 50 simulated data sets. Left panel: true vs. estimated α with the identity line. Right panel: 90% credible intervals per iteration, colored by whether they contain the true value.

beta and delta recover less cleanly. Posteriors for individual entries of beta are wider than for alpha, and posterior means do not track the truth as tightly. The same is true, less dramatically, for delta.

There is a structural reason for this, worked out formally in Report 5: the decision-relevant signal at every problem flows through the composition alpha * eta_r = alpha * (psi_r^T upsilon), and beta and delta enter eta_r multiplicatively. Choice probabilities are invariant to a lot of joint movement in (alpha, beta, delta) that leaves this composition unchanged. The likelihood pins down the composition well, but disentangling the individual factors requires far more data. With M = 25 problems, the data simply do not contain enough information to localize beta and delta precisely on their own. Doubling the design to M = 50 shrinks alpha's RMSE to 0.79 and its mean CI width to 2.53, but leaves delta's RMSE essentially unchanged — a direct demonstration that more data from choices under uncertainty sharpens alpha much faster than it sharpens the utility-function parameters (Report 4 §Does Increasing Sample Size Help?).
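
The non-identifiability is easy to see numerically: multiplying alpha by a constant while dividing the expected utilities by the same constant leaves every choice probability unchanged. A minimal numpy illustration of that point:

```python
# The composition alpha * eta is all the choice probabilities see: scaling
# alpha up by c and the expected utilities eta down by c gives identical
# softmax probabilities, so the data cannot separate the two factors.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

eta = np.array([0.4, 1.1, 0.7])          # expected utilities of three alternatives
alpha, c = 2.0, 5.0

p1 = softmax(alpha * eta)
p2 = softmax((c * alpha) * (eta / c))    # very different (alpha, utilities), same product
print(np.allclose(p1, p2))               # True
```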

The design implication is that if the goal is to interpret entries of beta — for example, to claim that one feature has a stronger effect on beliefs than another — an M = 25 study will rarely support it. If the goal is to estimate alpha, the design is roughly adequate. Report 4 §Differential Recovery Across Parameters discusses how M and the number of alternatives per problem affect each parameter’s recovery, and a later series in this blog will treat the (beta, delta) identification problem in its own right.

SBC: are the posteriors calibrated, not just close?

Parameter recovery answers a point-estimate question: does the posterior land near the truth? It does not directly answer a calibration question: do my 90% intervals contain the true value 90% of the time? A model can have well-located point estimates and still be miscalibrated — typically by being overconfident, which is the failure mode that matters most when intervals will be interpreted as uncertainty.

Simulation-based calibration (Talts et al., 2018) is the standard way to check this. The recipe: simulate many (theta_true, y) pairs from the prior and likelihood, fit the model to each y to get a posterior over theta, and record the rank of theta_true within that posterior. If the model is calibrated, those ranks should be uniformly distributed.
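
The bookkeeping is worth seeing once. The sketch below runs the SBC recipe on a toy conjugate-normal model where the posterior is available in closed form; for m_0, the closed-form posterior draws would be replaced by an MCMC fit of the Stan program, but the rank computation is identical.

```python
# SBC bookkeeping on a toy conjugate-normal model, where the posterior is
# available in closed form. For m_0, the closed-form posterior draws would be
# replaced by an MCMC fit of the Stan program; the rank computation is the same.
import numpy as np

rng = np.random.default_rng(4)
n_sims, n_draws, n_obs, sigma = 1000, 200, 10, 1.0
ranks = np.empty(n_sims, dtype=int)

for s in range(n_sims):
    theta_true = rng.normal(0.0, 1.0)                    # prior: Normal(0, 1)
    y = rng.normal(theta_true, sigma, size=n_obs)        # likelihood: Normal(theta, sigma)
    post_var = 1.0 / (1.0 + n_obs / sigma**2)            # exact conjugate posterior
    post_mean = post_var * y.sum() / sigma**2
    draws = rng.normal(post_mean, np.sqrt(post_var), size=n_draws)
    ranks[s] = np.sum(draws < theta_true)                # rank of the truth among the draws

# Under calibration, ranks are uniform on {0, ..., n_draws}; bin and inspect.
hist, _ = np.histogram(ranks, bins=20, range=(0, n_draws + 1))
print(hist)
```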

The reader’s guide for SBC rank histograms is short:

  • Flat — calibration is good.
  • U-shaped — the model is overconfident: the true value falls in the tails too often, meaning intervals are too narrow.
  • Dome / inverted-U — the model is underconfident: intervals are too wide.
  • Ramp — the model is biased: posteriors systematically over- or under-shoot the truth.

For m_0, the SBC results align with the recovery picture. alpha shows broadly well-calibrated ranks, with mild departures consistent with the moderate posterior width seen in recovery. Some entries of beta show flatter, wider rank distributions, again reflecting the weak identification of individual entries from a small M.

The take-away is the same as for recovery, but with a sharper edge: SBC tells us that the uncertainty m_0 reports for alpha can be interpreted as genuine uncertainty, not as a placeholder for “the sampler converged but we have no idea what the interval means.”

Posterior predictive checks: does the fitted model match the data?

The first three checks all use synthetic data. PPC is the first check that uses the actual data being analyzed. Having fit the model to observed choices, we sample synthetic choices from the posterior predictive distribution and compare summary statistics of those replicated data sets to the same statistics computed on the observed data.

The m_0 Stan program emits three discrepancy statistics in generated quantities, evaluated on both the observed choices and posterior-replicated choices:

  • ll — model log-likelihood on the choices.
  • modal — the fraction of problems in which the alternative with highest posterior choice probability matches the observed choice.
  • prob — the mean posterior probability assigned to the actually-chosen alternative.

Each statistic gets a posterior predictive p-value: the fraction of posterior-replicated data sets whose statistic is at least as extreme as the observed value. A p-value near 0 or 1 is a flag that the model cannot reproduce that feature of the data; values comfortably in the interior of [0, 1] are reassuring.
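
Computationally, each p-value is a tail proportion over posterior draws. A small sketch, with placeholder arrays standing in for one of the statistics the generated quantities block emits:

```python
# Posterior predictive p-value for one discrepancy statistic: the fraction of
# posterior draws in which the statistic on replicated data is at least as
# large as the same statistic on the observed data. Placeholder arrays stand
# in for the per-draw `modal` values the generated quantities block emits.
import numpy as np

rng = np.random.default_rng(5)
modal_rep = rng.beta(8, 12, size=4000)      # statistic on replicated choices, per draw (placeholder)
modal_obs = rng.beta(9, 12, size=4000)      # statistic on observed choices, per draw (placeholder)

p_value = np.mean(modal_rep >= modal_obs)   # draw-wise comparison
print(f"posterior predictive p-value: {p_value:.2f}")
# Values near 0 or 1 flag a feature of the data the model cannot reproduce;
# values comfortably in the interior are reassuring.
```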

A worked example helps. The Posterior Predictive Checks section of the initial temperature study reports all three statistics across five sampling temperatures. Every p-value lands in roughly [0.30, 0.65]. The model is not implausibly close to the data (which would suggest overfitting) and not far away (which would suggest misspecification). It reproduces these three features adequately.

One caveat is worth stating plainly. Passing PPC on these three statistics is encouraging, but it is not sufficient by itself for trusting the fit. PPC tests only the discrepancies one chooses to compute. A model that reproduces likelihood, modal accuracy, and chosen-alternative probability could still fail to reproduce a discrepancy we did not check — for example, the within-problem distribution of choices when more than one alternative is plausible. PPC is one input to confidence in the fit, alongside the pre-data checks above, not a final verdict on it.

What the four checks tell us together

Prior predictive, recovery, SBC, and PPC ask four different questions:

  • Is the model plausible before data? Yes for m_0 — wide range of decision behavior, no pathologies.
  • Can the model recover its own parameters? alpha well; beta and delta only weakly at small M.
  • Are the resulting posteriors calibrated? Yes for alpha; the beta posteriors are honestly wide.
  • Does the fitted model match observed data on chosen statistics? In the temperature application, yes.

That is what it takes to use a sensitivity estimate as evidence rather than as a number. It is not enough that the sampler converged. It is not enough that the posterior interval looks reasonable. The model has to behave the right way across all four checks before the estimate it produces can be interpreted with confidence.

Where this leaves the broader program

The three posts in this series have concentrated on the m_0 model: a single decision maker, a single block of decision problems with consequence features, and a softmax-over-expected-utility likelihood. This is intentional. m_0 is the workhorse that everything else in the foundational program either generalizes or compares to.

There are two natural directions of generalization, both of which the foundational reports work out in detail.

The first is to enrich the model of utilities. The m_1 variant adds risky alternatives — lotteries whose probabilities are fixed by the experimenter rather than inferred from features — to the design. In the spirit of the Anscombe–Aumann construction, the presence of these objectively risky prospects breaks the multiplicative coupling between beta and delta for the risky portion of the data, which materially improves identification of the utility function. The mechanics are worked out in Report 5.

The second is to model multiple decision makers jointly. The h_m01 family of hierarchical models lets alpha (and other parameters) vary across decision makers while pooling information about the common structure. This is the natural framework for studying how sensitivity differs across, say, AI systems with different training, or across temperatures of the same system.

The next series in this blog moves from foundations into applications, beginning with the temperature study that already appeared in this post as an illustration of PPC. The workflow developed across these three posts — abstract model, concrete implementation, Bayesian probes — is what the applied work assumes and builds on.

Sources

This post draws on the foundational Prior Predictive Analysis, Parameter Recovery, and Simulation-Based Calibration reports. The PPC discussion uses the Initial Temperature Study as a worked example. The bridge to m_1 is developed in Report 5: Adding Risky Choices, and the hierarchical extension in Reports 08–12 of the SEU Sensitivity Foundations site.