Concrete Implementation: The m_0 Stan Model

Foundational Report 2


A detailed walkthrough of the m_0.stan implementation, showing how the abstract formulation from Report 1 is realized in a concrete Bayesian model with specific parameterizations for subjective probabilities and utilities.

Author: Jeff Helzner
Published: May 12, 2026

0.1 Introduction

The previous report established the theoretical foundations of the softmax choice model with sensitivity parameter \(\alpha\). We proved three fundamental properties and showed how they apply when values are subjective expected utilities.

This report bridges theory and practice by specifying an initial concrete implementation in Stan. We show precisely how the abstract formulation is instantiated, highlighting the specific choices made for:

  1. Subjective probability formation: How features of alternatives map to beliefs about consequences
  2. Utility parameterization: How we construct ordered utilities on the unit scale
  3. Prior distributions: The default priors and their rationale

Note: Reading This Report

This report is intended for readers who want to understand, modify, or extend the model. We present the full Stan code with detailed commentary on each block.

0.2 From Abstract to Concrete

The abstract formulation in Report 1 established notation for alternatives (\(\mathcal{R}\)), consequences (\(K\)), subjective probabilities (\(\boldsymbol{\psi}_r\)), utilities (\(\boldsymbol{\upsilon}\)), expected utilities (\(\eta_r\)), and the sensitivity parameter (\(\alpha\)). That formulation left several things unspecified:

  • Where do subjective probabilities come from? The abstract model takes \(\boldsymbol{\psi}_r\) as given
  • How are utilities parameterized? We required \(0 = \upsilon_1 \leq \cdots \leq \upsilon_K = 1\), but not how to achieve this
  • What priors should we use? The theoretical results are prior-free

The m_0 model provides a concrete answer to each of these questions.

0.3 Data Structure

The model operates on a specific data structure that supports flexible experimental designs.

0.3.1 Conceptual Overview

We assume:

  • A set of \(R\) distinct alternatives, each described by a \(D\)-dimensional feature vector
  • A collection of \(M\) decision problems, each presenting a subset of the alternatives
  • For each problem, an observed choice from among the available alternatives

This structure supports a wide range of experimental designs—from simple paired comparisons to complex multi-alternative choice sets.
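
For concreteness, here is a small hypothetical data set for a paired-comparison design, laid out as the kind of Python dictionary one might pass to a Stan interface such as CmdStanPy. All values are illustrative, not taken from the source.

# Hypothetical paired-comparison design: R = 3 alternatives described by
# D = 2 features, and M = 3 problems, each presenting one of the three pairs
data = {
    "M": 3,
    "K": 3,  # number of possible consequences
    "D": 2,
    "R": 3,
    "w": [[0.8, -0.3], [0.1, 0.5], [-0.4, 0.9]],  # feature vectors w[r]
    "I": [
        [1, 1, 0],  # problem 1 presents alternatives 1 and 2
        [1, 0, 1],  # problem 2 presents alternatives 1 and 3
        [0, 1, 1],  # problem 3 presents alternatives 2 and 3
    ],
    # y[m] indexes the chosen alternative among those available in problem m,
    # so each entry must lie in 1..N[m] (here N[m] = 2 for every problem)
    "y": [1, 2, 1],
}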

0.3.2 Stan Data Block

m_0.stan (data block)
data {
  int<lower=1> M;                        // Number of decision problems
  int<lower=2> K;                        // Number of possible consequences
  int<lower=1> D;                        // Feature dimensions
  int<lower=2> R;                        // Number of distinct alternatives
  array[R] vector[D] w;                  // Feature vectors for alternatives
  array[M,R] int<lower=0,upper=1> I;     // Indicator: I[m,r]=1 if r in problem m
  array[M] int<lower=1> y;               // Observed choices
}
Note: Notation (Data Structure)

| Symbol | Stan Variable | Description |
|---|---|---|
| \(M\) | M | Number of decision problems |
| \(D\) | D | Feature dimensions |
| \(\mathbf{w}_r \in \mathbb{R}^D\) | w[r] | Feature vector for alternative \(r\) |
| \(I_{m,r} \in \{0,1\}\) | I[m,r] | Indicator: 1 if alternative \(r\) available in problem \(m\) |
| \(y_m\) | y[m] | Observed choice in problem \(m\) |

The variables \(R\), \(K\), and \(\alpha\) retain their meanings from Report 1.

Note: On “Distinct” Alternatives

The variable R represents the number of distinct alternatives in the study design. “Distinct” here reflects a judgment by the study designer about how to individuate alternatives. Within the model, two alternatives with identical feature vectors \(\mathbf{w}_r\) will have the same subjective probability distribution over consequences, and thus the same expected utility. The model itself has no notion of alternative identity beyond the feature representation.

Behavioral assumption: When the same alternative \(r\) appears in multiple decision problems (i.e., \(I_{m_1,r} = I_{m_2,r} = 1\)), the model assumes the same subjective probabilities \(\boldsymbol{\psi}_r\) and expected utility \(\eta_r\) but allows different choices across problems via the stochastic softmax. This reflects a decision-maker who has stable beliefs but whose choices are probabilistic given those beliefs.

0.3.3 Transformed Data

The model preprocesses the data for computational efficiency:

m_0.stan (transformed data)
transformed data {
  array[M] int<lower=2> N;           // Alternatives per problem
  int total_alternatives = 0;
  
  for (m in 1:M) {
    N[m] = sum(I[m]);                // Count alternatives in problem m
    total_alternatives += N[m];
  }
  
  // Construct flattened feature array from w and I
  array[total_alternatives] vector[D] x;
  {
    int pos = 1;
    for (m in 1:M) {
      for (r in 1:R) {
        if (I[m, r] == 1) {
          x[pos] = w[r];
          pos += 1;
        }
      }
    }
  }
}

The key transformation is constructing x—a flattened array that lists the feature vectors for all alternatives across all problems, in order. This allows efficient vectorized computation in the transformed parameters block.

Note: Mapping x back to w

The transformed data x contains \(\sum_{m=1}^M N_m\) feature vectors—one for each alternative-in-problem combination. The mapping is: x is populated by iterating over problems \(m = 1, \ldots, M\) and, within each problem, over alternatives \(r\) where \(I_{m,r} = 1\). Thus x[pos] corresponds to w[r] for whichever alternative \(r\) is the pos-th entry in this nested iteration. Later code uses segment(eta, pos, N[m]) to extract the expected utilities for problem \(m\) from the flattened representation.
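
The same flattening logic can be sketched in NumPy. This is a minimal illustration with hypothetical inputs, mirroring the Stan loop above.

import numpy as np

# Hypothetical inputs matching the data block: R = 3, D = 2, M = 3
w = np.array([[0.8, -0.3], [0.1, 0.5], [-0.4, 0.9]])
I = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]])

N = I.sum(axis=1)  # alternatives per problem
# For each problem, stack the feature vectors of its available alternatives
x = np.vstack([w[I[m] == 1] for m in range(I.shape[0])])

print(N)        # [2 2 2]
print(x.shape)  # (6, 2): one row per alternative-in-problem
# Stan's segment(eta, pos, N[m]) corresponds to eta[pos-1 : pos-1 + N[m]] here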

0.4 Model Parameters

The m_0 model estimates three parameter groups:

m_0.stan (parameters)
parameters {
  real<lower=0> alpha;           // Sensitivity parameter
  matrix[K,D] beta;              // Feature-to-probability mapping
  simplex[K-1] delta;            // Utility increments
}

0.4.1 The Sensitivity Parameter (\(\alpha\))

The parameter alpha is constrained to be non-negative, matching the theoretical requirement \(\alpha \geq 0\). Its interpretation was established in Report 1: with utilities standardized to \([0,1]\), \(\alpha\) represents the log-odds change per unit of expected utility difference.
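
A small numeric check of this interpretation, with made-up values: for two alternatives, the log-odds of choosing the first equals \(\alpha\) times the expected-utility gap.

import numpy as np

alpha = 3.0
eta_a, eta_b = 0.7, 0.5  # expected utilities on [0, 1]

p = np.exp(alpha * np.array([eta_a, eta_b]))
p /= p.sum()  # softmax over the two alternatives

print(np.log(p[0] / p[1]))      # 0.6
print(alpha * (eta_a - eta_b))  # 0.6: log-odds equals alpha times the EU gap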

0.4.2 The Feature-to-Probability Mapping (\(\boldsymbol{\beta}\))

The \(K \times D\) matrix beta parameterizes how alternative features determine subjective probabilities over consequences. This is the key modeling assumption that connects observable features to latent beliefs.

0.4.3 Utility Increments (\(\boldsymbol{\delta}\))

The \((K-1)\)-simplex delta parameterizes the spacing of utilities on the unit interval. We explain this construction in detail below.

0.5 Subjective Probability Formation

A central modeling decision in m_0 is how subjective probabilities arise from alternative features.

0.5.1 The Mapping

For each alternative with feature vector \(\mathbf{w}_r \in \mathbb{R}^D\), we compute subjective probabilities via:

\[ \boldsymbol{\psi}_r = \text{softmax}(\boldsymbol{\beta} \mathbf{w}_r) \tag{1}\]

where \(\boldsymbol{\beta} \in \mathbb{R}^{K \times D}\) is a coefficient matrix.

Expanding the softmax:

\[ \psi_{r,k} = \frac{\exp\left(\sum_{d=1}^D \beta_{k,d} \cdot w_{r,d}\right)}{\sum_{j=1}^K \exp\left(\sum_{d=1}^D \beta_{j,d} \cdot w_{r,d}\right)} \tag{2}\]

0.5.2 Stan Implementation

m_0.stan (subjective probabilities)
transformed parameters {
  array[sum(N)] simplex[K] psi;    // Subjective probabilities
  
  // Calculate subjective probabilities via softmax
  for (i in 1:sum(N)) {
    psi[i] = softmax(beta * x[i]);
  }
  // ...
}

0.5.3 Interpretation

This parameterization embeds several assumptions:

  1. Linear-in-features: The log-odds of each consequence are linear functions of features
  2. Shared coefficient matrix: The same \(\boldsymbol{\beta}\) applies to all alternatives
  3. Softmax normalization: Probabilities sum to 1 and are strictly positive
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import softmax

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Example feature vector
w = np.array([0.8, -0.3])

# Example beta matrix (K=3, D=2)
beta = np.array([
    [1.2, -0.5],   # Consequence 1 coefficients
    [0.0,  0.8],   # Consequence 2 coefficients
    [-0.8, 0.3]    # Consequence 3 coefficients
])

# Compute log-odds and probabilities
log_odds = beta @ w
psi = softmax(log_odds)

# Plot 1: Feature vector
ax1 = axes[0]
ax1.bar(['w₁', 'w₂'], w, color=['steelblue', 'coral'])
ax1.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
ax1.set_ylabel('Feature Value')
ax1.set_title('Feature Vector w')
ax1.set_ylim(-1, 1)

# Plot 2: Log-odds (beta * w)
ax2 = axes[1]
ax2.bar(['k=1', 'k=2', 'k=3'], log_odds, color=['forestgreen', 'orange', 'purple'])
ax2.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
ax2.set_ylabel('Log-odds')
ax2.set_title('Linear Mapping βw')

# Plot 3: Probabilities (softmax)
ax3 = axes[2]
bars = ax3.bar(['ψ₁', 'ψ₂', 'ψ₃'], psi, color=['forestgreen', 'orange', 'purple'])
ax3.set_ylabel('Probability')
ax3.set_title('Subjective Probabilities ψ = softmax(βw)')
ax3.set_ylim(0, 1)

# Add probability labels
for bar, p in zip(bars, psi):
    ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
             f'{p:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()
Figure 1: Illustration of subjective probability formation. A 2D feature vector is mapped through the β matrix to produce a probability distribution over 3 consequences.

Tip: Alternative Specifications

The linear-softmax mapping is a reasonable starting point, but alternatives exist:

  • Neural network: Replace \(\boldsymbol{\beta} \mathbf{w}_r\) with a nonlinear function
  • Hierarchical: Allow \(\boldsymbol{\beta}\) to vary across individuals or problem types
  • Constrained: Impose structure on \(\boldsymbol{\beta}\) based on domain knowledge

These extensions would require modifying the Stan model but preserve the overall framework.

Note: On the β Prior Scale

The \(\mathcal{N}(0, 1)\) prior places 95% of coefficients within \(\pm 2\). In the softmax context, this corresponds to log-odds differences up to \(\pm 2\), or probability ratios up to \(e^2 \approx 7.4\) per unit feature change. This is moderately informative—it allows substantial feature effects while regularizing against extreme values. For applications where features are expected to have very strong effects on beliefs, a wider prior (e.g., \(\mathcal{N}(0, 2)\)) may be appropriate.

Important: Softmax Invariance and Parameter Identifiability

The softmax function satisfies \(\text{softmax}(\mathbf{v}) = \text{softmax}(\mathbf{v} + c \cdot \mathbf{1})\) for any scalar \(c\). This means that adding a constant to all rows of \(\boldsymbol{\beta}\) produces identical subjective probabilities \(\boldsymbol{\psi}\). Consequently, the \(K \times D\) matrix \(\boldsymbol{\beta}\) has only \((K-1) \times D\) effective degrees of freedom, not \(K \times D\).

Two common approaches address this redundancy:

  1. Reference category: Fix one row of \(\boldsymbol{\beta}\) to zero (analogous to multinomial logistic regression)
  2. Prior regularization: Place independent priors on all \(K \times D\) elements and accept the resulting ridge of constant posterior density

The m_0 model uses approach (2). The \(\mathcal{N}(0, 1)\) prior on all elements provides sufficient regularization for Stan’s HMC sampler to explore the posterior efficiently. However, when interpreting individual \(\beta_{k,d}\) coefficients, remember that only differences between rows are identified—the absolute level is arbitrary. For this reason, posterior summaries should focus on contrasts such as \(\beta_{k,d} - \beta_{j,d}\) or on the induced probabilities \(\boldsymbol{\psi}\).
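
This invariance is easy to verify numerically. The sketch below, using SciPy's softmax and arbitrary made-up values, checks both the generic shift property and the resulting row-shift redundancy in \(\boldsymbol{\beta}\).

import numpy as np
from scipy.special import softmax

# Shift property: adding a constant to every entry leaves softmax unchanged
v = np.array([0.3, -1.2, 2.0])
print(np.allclose(softmax(v), softmax(v + 5.0)))  # True

# Row-shift redundancy: adding the same row vector c to every row of beta
# shifts all logits by the constant c.w, so psi is unchanged
rng = np.random.default_rng(0)
beta = rng.normal(size=(3, 2))  # K = 3, D = 2
c = np.array([0.7, -0.4])
w = np.array([0.8, -0.3])
print(np.allclose(softmax(beta @ w), softmax((beta + c) @ w)))  # True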

0.6 Utility Parameterization

The abstract formulation required utilities to satisfy: \[ 0 = \upsilon_1 \leq \upsilon_2 \leq \cdots \leq \upsilon_K = 1 \]

The m_0 model achieves this through an incremental parameterization.

0.6.1 The Incremental Construction

Define \(\boldsymbol{\delta} = (\delta_1, \ldots, \delta_{K-1})\) as a \((K-1)\)-simplex, so \(\delta_k \geq 0\) and \(\sum_{k=1}^{K-1} \delta_k = 1\).

Then construct utilities as cumulative sums: \[ \upsilon_k = \sum_{j=1}^{k-1} \delta_j \tag{3}\]

with the convention that \(\upsilon_1 = 0\) (empty sum).

This construction guarantees:

  • \(\upsilon_1 = 0\) (the sum of zero terms)
  • \(\upsilon_K = \sum_{j=1}^{K-1} \delta_j = 1\) (the simplex sums to 1)
  • \(\upsilon_k \leq \upsilon_{k+1}\) (each \(\delta_k \geq 0\))

0.6.2 Stan Implementation

m_0.stan (utility construction)
transformed parameters {
  // ...
  ordered[K] upsilon;              // Ordered utilities
  
  // Construct ordered utilities from increments
  upsilon = cumulative_sum(append_row(0, delta));
  // ...
}

The append_row(0, delta) prepends a zero to the \(\delta\) vector, and cumulative_sum computes the running totals.
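
The same construction, and the three guarantees above, can be checked directly in NumPy (a sketch with an arbitrary \(\boldsymbol{\delta}\) for \(K = 4\)):

import numpy as np

delta = np.array([0.2, 0.5, 0.3])  # K = 4, so delta has K-1 = 3 increments
# Stan: cumulative_sum(append_row(0, delta))
upsilon = np.cumsum(np.concatenate([[0.0], delta]))

print(upsilon)  # [0.  0.2 0.7 1. ]
assert upsilon[0] == 0.0 and np.isclose(upsilon[-1], 1.0)
assert np.all(np.diff(upsilon) >= 0)  # ordered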

0.6.3 Visualizing the Construction

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Generate sample delta vectors from Dirichlet(1,1)
n_samples = 3
colors = ['steelblue', 'coral', 'forestgreen']

for i in range(n_samples):
    delta = np.random.dirichlet([1, 1])
    upsilon = np.cumsum(np.concatenate([[0], delta]))
    
    # Plot delta
    x_positions = [i * 2.5, i * 2.5 + 1]
    axes[0].bar(x_positions, delta, 
                alpha=0.7, color=colors[i], label=f'Sample {i+1}', width=0.8)
    
    # Plot upsilon
    axes[1].plot([1, 2, 3], upsilon, 'o-', color=colors[i], 
                 linewidth=2, markersize=8, label=f'Sample {i+1}')

# Set x-axis labels for delta plot
axes[0].set_xticks([0.5, 3, 5.5])
axes[0].set_xticklabels(['Sample 1\n(δ₁, δ₂)', 'Sample 2\n(δ₁, δ₂)', 'Sample 3\n(δ₁, δ₂)'])

axes[0].set_ylabel('Increment Value')
axes[0].set_title('Utility Increments δ ~ Dirichlet(1,1)')
axes[0].legend()

axes[1].set_xlabel('Consequence k')
axes[1].set_ylabel('Utility υₖ')
axes[1].set_title('Resulting Utility Vectors')
axes[1].set_xticks([1, 2, 3])
axes[1].set_ylim(-0.05, 1.05)
axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[1].axhline(y=1, color='gray', linestyle='--', alpha=0.5)
axes[1].legend()

plt.tight_layout()
plt.show()
Figure 2: The incremental utility construction. Left: Three sample δ vectors from a Dirichlet(1,1) prior. Right: The corresponding utility vectors, all satisfying υ₁=0 and υ₃=1.

0.6.4 Prior on Utilities

The default prior delta ~ dirichlet(rep_vector(1, K-1)) is the symmetric Dirichlet with concentration parameter 1. This induces a uniform distribution over the simplex—all valid \(\boldsymbol{\delta}\) configurations are equally likely a priori.

For \(K=3\) consequences, this means:

  • The middle utility \(\upsilon_2\) has prior mean 0.5 and prior standard deviation \(\approx 0.29\)
  • The utilities are uniformly distributed subject to the ordering constraint
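
These moments follow because \(\upsilon_2 = \delta_1 \sim \text{Beta}(1, 1)\) under the uniform Dirichlet, so its standard deviation is \(1/\sqrt{12} \approx 0.289\). A quick simulation confirms this:

import numpy as np

rng = np.random.default_rng(1)
delta = rng.dirichlet([1.0, 1.0], size=100_000)  # K = 3: delta has 2 increments
upsilon_2 = delta[:, 0]  # upsilon_2 = delta_1

print(upsilon_2.mean())  # ~0.50
print(upsilon_2.std())   # ~0.289 (= 1/sqrt(12))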

Note: On the Choice of Utility Prior

The uniform Dirichlet prior is relatively uninformative—it expresses no preference for how utilities should be spaced. In some applications, a more informative prior might be appropriate:

  • Dirichlet with larger concentration (e.g., all 5’s): Concentrates utilities more evenly
  • Domain-specific: If consequences have natural ordering with known approximate spacing

K-dependence: While Dirichlet(1) is uniform over the simplex, the marginal distribution of individual utilities \(\upsilon_k\) depends on \(K\). As \(K\) grows, the marginal distribution of each increment \(\delta_k\) concentrates around \(1/(K-1)\), causing interior utilities to cluster toward their expected spacing. For large \(K\), a different concentration parameter may better express prior uncertainty about utility spacing.

The prior predictive analysis in Report 3 examines the implications of this default choice, and sensitivity to prior specification is discussed there.

0.7 Expected Utility and Choice Probabilities

With subjective probabilities \(\boldsymbol{\psi}\) and utilities \(\boldsymbol{\upsilon}\) in hand, we compute expected utilities and choice probabilities exactly as in the abstract formulation.

0.7.1 Expected Utility Calculation

\[ \eta_r = \boldsymbol{\psi}_r^\top \boldsymbol{\upsilon} = \sum_{k=1}^K \psi_{r,k} \cdot \upsilon_k \]

m_0.stan (expected utilities)
transformed parameters {
  // ...
  vector[sum(N)] eta;              // Expected utilities
  
  // Calculate expected utility for each alternative
  for (i in 1:sum(N)) {
    eta[i] = dot_product(psi[i], upsilon);
  }
  // ...
}

0.7.2 Choice Probabilities

For each decision problem \(m\) with \(N_m\) available alternatives:

\[ \chi_{m,r} = \frac{\exp(\alpha \cdot \eta_r)}{\sum_{j \in \text{problem } m} \exp(\alpha \cdot \eta_j)} \]

m_0.stan (choice probabilities)
transformed parameters {
  // ...
  array[M] simplex[max(N)] chi;    // Choice probabilities (with padding)
  
  {
    int pos = 1;
    for (i in 1:M) {
      vector[N[i]] problem_eta = segment(eta, pos, N[i]);
      
      chi[i] = append_row(
        softmax(alpha * problem_eta),
        rep_vector(0, max(N) - N[i])  // Padding for ragged array
      );
      
      pos += N[i];
    }
  }
}
Note: On the Simplex Declaration with Padding

The chi array is declared as simplex[max(N)]; for problems with fewer than max(N) alternatives, the trailing entries are zero-padded. The padded vectors still satisfy the simplex constraint: the softmax over the available alternatives sums to 1, the appended zeros leave that total unchanged, and simplex elements are permitted to be exactly zero. Stan therefore validates these transformed parameters without complaint, and the declaration documents the probabilistic intent.

This design was chosen over Stan’s ragged-array idioms (using segment() throughout) for code clarity. The categorical distribution handles the padded vectors correctly because observations \(y_m\) never exceed \(N_m\) (this is validated by a bounds check in the full model file, not shown in the transformed data excerpt above). An alternative implementation could use local vectors without the simplex declaration, but the current approach makes the intent clearer.
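
Outside Stan, the same segment-and-pad logic looks as follows (a sketch with hypothetical \(\eta\) values and problem sizes):

import numpy as np
from scipy.special import softmax

alpha = 2.0
eta = np.array([0.7, 0.5, 0.6, 0.9, 0.3, 0.4, 0.8])  # flattened expected utilities
N = [2, 3, 2]                                        # alternatives per problem

chi = np.zeros((len(N), max(N)))
pos = 0
for m, n in enumerate(N):
    problem_eta = eta[pos:pos + n]        # Stan: segment(eta, pos, N[m])
    chi[m, :n] = softmax(alpha * problem_eta)
    pos += n                              # entries beyond N[m] stay zero (padding)

print(chi)              # each row sums to 1; trailing zeros are the padding
print(chi.sum(axis=1))  # [1. 1. 1.]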

Tip: log_softmax for Numerical Stability

For applications where \(\alpha\) may be very large, Stan’s log_softmax function avoids potential overflow. Combined with categorical_logit, this yields:

y[i] ~ categorical_logit(alpha * problem_eta);

This single statement replaces both the softmax computation and the categorical likelihood with a numerically stable combined operation. The m_0 model uses explicit softmax to expose the intermediate choice probabilities chi for posterior predictive checks and interpretability.

0.8 Prior Distributions

The model block specifies prior distributions for all parameters:

m_0.stan (model block)
model {
  // Priors
  alpha ~ lognormal(0, 1);               // Sensitivity
  to_vector(beta) ~ std_normal();        // Feature coefficients
  delta ~ dirichlet(rep_vector(1, K-1)); // Utility increments
  
  // Likelihood
  for (i in 1:M) {
    y[i] ~ categorical(chi[i]);
  }
}

0.8.1 Prior on \(\alpha\): Lognormal(0, 1)

The lognormal prior ensures \(\alpha > 0\) and places substantial mass on moderate sensitivity values.

Key properties of Lognormal(0, 1):

| Quantity | Value |
|---|---|
| Median | 1.00 |
| Mean | 1.65 |
| 25th percentile | 0.51 |
| 75th percentile | 1.96 |
| 95th percentile | 5.18 |
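
These quantiles can be reproduced with SciPy, where a Lognormal(\(\mu, \sigma\)) corresponds to lognorm with shape \(s = \sigma\) and scale \(e^{\mu}\):

import numpy as np
from scipy.stats import lognorm

dist = lognorm(s=1.0, scale=np.exp(0.0))  # Lognormal(0, 1)

print(dist.median())                 # 1.00
print(dist.mean())                   # 1.65 (= e**0.5)
print(dist.ppf([0.25, 0.75, 0.95]))  # approx. [0.51, 1.96, 5.18]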

0.8.2 Prior on \(\boldsymbol{\beta}\): Standard Normal

Each element of the \(K \times D\) matrix \(\boldsymbol{\beta}\) receives an independent \(\mathcal{N}(0, 1)\) prior. This is a weakly informative default that:

  • Centers coefficients at zero (no a priori direction of effect)
  • Allows moderately large coefficients (95% within \(\pm 2\))
  • Is computationally convenient

0.8.3 Prior on \(\boldsymbol{\delta}\): Symmetric Dirichlet

As discussed above, delta ~ dirichlet(rep_vector(1, K-1)) induces a uniform distribution over valid utility configurations.

0.9 Generated Quantities

The generated quantities block computes outputs for model checking and comparison:

m_0.stan (generated quantities)
generated quantities {
  // Log-likelihood for model comparison (LOO-CV)
  vector[M] log_lik;
  for (i in 1:M) {
    log_lik[i] = categorical_lpmf(y[i] | chi[i]);
  }
  
  // Posterior predictive samples
  array[M] int y_pred;
  for (i in 1:M) {
    y_pred[i] = categorical_rng(chi[i]);
  }
  
  // Posterior predictive check statistics
  // ... (log-likelihood discrepancy, modal accuracy, sum of chosen probabilities)
}

Quantities computed:

  • log_lik: Pointwise log-likelihood for each observation, used for approximate leave-one-out cross-validation (LOO-CV) via the loo package
  • y_pred: Posterior predictive samples—simulated choices under the fitted model
  • PPC statistics: Log-likelihood discrepancy, modal choice accuracy, and sum of chosen probabilities for calibration checking

These quantities are used in Report 6 for simulation-based calibration and posterior predictive checks.

0.10 Computational Considerations

For users adapting or extending the model, several computational aspects are worth noting.

0.10.1 Numerical Stability

For very large \(\alpha\), the softmax computation \(\text{softmax}(\alpha \cdot \boldsymbol{\eta})\) can overflow. Stan’s log_softmax function is numerically stable for such cases:

// Alternative: Use log_softmax for numerical stability
vector[N[i]] log_chi = log_softmax(alpha * problem_eta);
// Then use categorical_logit_lpmf in the model block

The categorical_logit distribution combines log-softmax with the categorical likelihood in a single stable computation. The current implementation uses standard softmax because typical posterior draws for \(\alpha\) remain in a numerically safe range.
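
Both the failure mode and the stable alternative are easy to demonstrate with an extreme, purely illustrative \(\alpha\):

import numpy as np
from scipy.special import log_softmax

alpha = 2000.0  # extreme sensitivity, chosen only to force overflow
eta = np.array([0.7, 0.5, 0.6])

naive = np.exp(alpha * eta)  # overflows: every entry becomes inf (RuntimeWarning)
print(naive / naive.sum())   # [nan nan nan]: inf / inf

print(np.exp(log_softmax(alpha * eta)))  # stable: [~1, ~2e-174, ~1e-87]; no overflow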

0.10.2 Scaling Properties

  • Memory: Linear in \(M \cdot K + R \cdot D + K \cdot D\) (data plus parameters)
  • Time per gradient: Dominated by the \(M\) loop over decision problems; each iteration involves softmax operations that scale as \(O(K)\) and \(O(N_m)\)
  • Overall: For typical problem sizes (\(M \sim 100\)–\(1000\), \(K \sim 3\)–\(10\), \(D \sim 2\)–\(10\)), fitting takes seconds to minutes

0.10.3 Vectorization

The current implementation uses loops for clarity. The softmax computations could potentially be vectorized for large \(M\), but the gains are modest because Stan’s autodiff is already efficient for simple loops. For very large datasets, consider working with sufficient statistics or subsampling approaches.

0.11 Summary

The m_0 model provides a complete Bayesian implementation of the SEU sensitivity framework:

| Component | Abstract | Concrete (m_0) |
|---|---|---|
| Beliefs | \(\boldsymbol{\psi}_r \in \Delta^{K-1}\) | \(\boldsymbol{\psi}_r = \text{softmax}(\boldsymbol{\beta} \mathbf{w}_r)\) |
| Utilities | \(0 = \upsilon_1 \leq \cdots \leq \upsilon_K = 1\) | \(\boldsymbol{\upsilon} = \text{cumsum}([0, \boldsymbol{\delta}])\) |
| Sensitivity | \(\alpha \geq 0\) | \(\alpha \sim \text{Lognormal}(0, 1)\) |
| Choice | \(\chi_m = \text{softmax}(\alpha \cdot \boldsymbol{\eta}_m)\) | Same, with categorical likelihood |

The model makes specific choices—linear feature mapping, symmetric Dirichlet prior on utilities—that could be modified for particular applications. The theoretical properties from Report 1 (monotonicity, perfect rationality limit, random choice limit) hold for any such modifications, provided the basic softmax choice structure is preserved.

The next report examines the prior predictive distribution implied by these choices, asking: what range of behaviors does the prior permit?



Citation

BibTeX citation:
@online{helzner2026,
  author = {Helzner, Jeff},
  title = {Concrete {Implementation:} {The} m\_0 {Stan} {Model}},
  date = {2026-05-12},
  url = {https://jeffhelzner.github.io/seu-sensitivity/foundations/02_concrete_implementation.html},
  langid = {en}
}
For attribution, please cite this work as:
Helzner, Jeff. 2026. “Concrete Implementation: The m_0 Stan Model.” SEU Sensitivity Project, May 12. https://jeffhelzner.github.io/seu-sensitivity/foundations/02_concrete_implementation.html.