---
title: "Concrete Implementation: The m_0 Stan Model"
subtitle: "Foundational Report 2"
description: |
  A detailed walkthrough of the m_0.stan implementation, showing how the abstract
  formulation from Report 1 is realized in a concrete Bayesian model with specific
  parameterizations for subjective probabilities and utilities.
categories: [foundations, implementation, m_0]
---
```{python}
#| label: setup
#| include: false
import sys
import os
sys.path.insert(0, os.path.join(os.getcwd(), '..'))
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import softmax
import pandas as pd
```
## Introduction
The [previous report](01_abstract_formulation.qmd) established the theoretical foundations of the softmax choice model with sensitivity parameter $\alpha$. We proved three fundamental properties and showed how they apply when values are subjective expected utilities.
This report bridges theory and practice by specifying an initial concrete implementation in Stan. We show precisely how the abstract formulation is instantiated, highlighting the specific choices made for:
1. **Subjective probability formation**: How features of alternatives map to beliefs about consequences
2. **Utility parameterization**: How we construct ordered utilities on the unit scale
3. **Prior distributions**: The default priors and their rationale
::: {.callout-note}
## Reading This Report
This report is intended for readers who want to understand, modify, or extend the model. We present the full Stan code with detailed commentary on each block.
:::
## From Abstract to Concrete
The abstract formulation in [Report 1](01_abstract_formulation.qmd) established notation for alternatives ($\mathcal{R}$), consequences ($K$), subjective probabilities ($\boldsymbol{\psi}_r$), utilities ($\boldsymbol{\upsilon}$), expected utilities ($\eta_r$), and the sensitivity parameter ($\alpha$). That formulation left several things unspecified:
- **Where do subjective probabilities come from?** The abstract model takes $\boldsymbol{\psi}_r$ as given
- **How are utilities parameterized?** We required $0 = \upsilon_1 \leq \cdots \leq \upsilon_K = 1$, but not how to achieve this
- **What priors should we use?** The theoretical results are prior-free
The m_0 model provides a concrete answer to each of these questions.
## Data Structure
The model operates on a specific data structure that supports flexible experimental designs.
### Conceptual Overview
We assume:
- A set of $R$ **distinct alternatives**, each described by a $D$-dimensional feature vector
- A collection of $M$ **decision problems**, each presenting a subset of the alternatives
- For each problem, an **observed choice** from among the available alternatives
This structure supports a wide range of experimental designs—from simple paired comparisons to complex multi-alternative choice sets.
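To make the structure concrete, here is a small synthetic example (all numbers hypothetical) with $R=4$ alternatives, $D=2$ features, and $M=3$ problems:

```{python}
import numpy as np

# Hypothetical design: R = 4 alternatives, D = 2 features, M = 3 problems
w = np.array([[0.8, -0.3],   # one feature vector per alternative
              [0.1,  0.5],
              [-0.4, 0.2],
              [0.9,  0.9]])
# Availability indicators: I[m, r] = 1 if alternative r is offered in problem m
I = np.array([[1, 1, 0, 0],   # problem 1: a paired comparison
              [1, 0, 1, 1],   # problem 2: three alternatives
              [0, 1, 1, 1]])  # problem 3: three alternatives
N = I.sum(axis=1)             # alternatives per problem: [2, 3, 3]
# Observed choices: y[m] indexes into problem m's available alternatives
y = np.array([1, 3, 2])
assert ((1 <= y) & (y <= N)).all()
```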
### Stan Data Block
```{.stan filename="m_0.stan (data block)"}
data {
  int<lower=1> M;                    // Number of decision problems
  int<lower=2> K;                    // Number of possible consequences
  int<lower=1> D;                    // Feature dimensions
  int<lower=2> R;                    // Number of distinct alternatives
  array[R] vector[D] w;              // Feature vectors for alternatives
  array[M,R] int<lower=0,upper=1> I; // Indicator: I[m,r]=1 if r in problem m
  array[M] int<lower=1> y;           // Observed choices
}
```
::: {.callout-note}
## Notation: Data Structure
| Symbol | Stan Variable | Description |
|--------|---------------|-------------|
| $M$ | `M` | Number of decision problems |
| $D$ | `D` | Feature dimensions |
| $\mathbf{w}_r \in \mathbb{R}^D$ | `w[r]` | Feature vector for alternative $r$ |
| $I_{m,r} \in \{0,1\}$ | `I[m,r]` | Indicator: 1 if alternative $r$ available in problem $m$ |
| $y_m$ | `y[m]` | Observed choice in problem $m$ |
The variables $R$, $K$, and $\alpha$ retain their meanings from Report 1.
:::
::: {.callout-note}
## On "Distinct" Alternatives
The variable `R` represents the number of distinct alternatives in the study design. "Distinct" here reflects a judgment by the study designer about how to individuate alternatives. Within the model, two alternatives with identical feature vectors $\mathbf{w}_r$ will have the same subjective probability distribution over consequences, and thus the same expected utility. The model itself has no notion of alternative identity beyond the feature representation.
**Behavioral assumption:** When the same alternative $r$ appears in multiple decision problems (i.e., $I_{m_1,r} = I_{m_2,r} = 1$), the model assumes the same subjective probabilities $\boldsymbol{\psi}_r$ and expected utility $\eta_r$ but allows different choices across problems via the stochastic softmax. This reflects a decision-maker who has stable beliefs but whose choices are probabilistic given those beliefs.
:::
### Transformed Data
The model preprocesses the data for computational efficiency:
```{.stan filename="m_0.stan (transformed data)"}
transformed data {
  array[M] int<lower=2> N; // Alternatives per problem
  int total_alternatives = 0;
  for (m in 1:M) {
    N[m] = sum(I[m]); // Count alternatives in problem m
    total_alternatives += N[m];
  }
  // Construct flattened feature array from w and I
  array[total_alternatives] vector[D] x;
  {
    int pos = 1;
    for (m in 1:M) {
      for (r in 1:R) {
        if (I[m, r] == 1) {
          x[pos] = w[r];
          pos += 1;
        }
      }
    }
  }
}
```
The key transformation is constructing `x`—a flattened array that lists the feature vectors for all alternatives across all problems, in order. This lets the transformed parameters block address every alternative-in-problem pair by a single flat index, with `segment` recovering each problem's slice.
::: {.callout-note}
## Mapping x back to w
The transformed data `x` contains $\sum_{m=1}^M N_m$ feature vectors—one for each alternative-in-problem combination. The mapping is: `x` is populated by iterating over problems $m = 1, \ldots, M$ and, within each problem, over alternatives $r$ where $I_{m,r} = 1$. Thus `x[pos]` corresponds to `w[r]` for whichever alternative $r$ is the `pos`-th entry in this nested iteration. Later code uses `segment(eta, pos, N[m])` to extract the expected utilities for problem $m$ from the flattened representation.
:::
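The flattening and the `segment` extraction can be mirrored in a few lines of Python. The arrays below are hypothetical stand-ins for `w` and `I`:

```{python}
import numpy as np

# Hypothetical stand-ins: M = 2 problems over R = 3 alternatives, D = 2 features
w = np.array([[0.8, -0.3], [0.1, 0.5], [-0.4, 0.2]])
I = np.array([[1, 1, 0],
              [1, 0, 1]])
N = I.sum(axis=1)

# Flatten: iterate problems, then available alternatives, mirroring the
# nested loop in the transformed data block
x = np.array([w[r]
              for m in range(I.shape[0])
              for r in range(I.shape[1]) if I[m, r] == 1])

# segment(eta, pos, N[m]) analogue: 0-based start offsets into the flat array
pos = np.concatenate([[0], np.cumsum(N)[:-1]])
segment_1 = x[pos[1]:pos[1] + N[1]]   # rows belonging to problem 2
assert x.shape == (N.sum(), 2)
assert np.allclose(segment_1, w[[0, 2]])
```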
## Model Parameters
The m_0 model estimates three parameter groups:
```{.stan filename="m_0.stan (parameters)"}
parameters {
  real<lower=0> alpha; // Sensitivity parameter
  matrix[K,D] beta;    // Feature-to-probability mapping
  simplex[K-1] delta;  // Utility increments
}
```
### The Sensitivity Parameter ($\alpha$)
The parameter `alpha` is constrained to be non-negative, matching the theoretical requirement $\alpha \geq 0$. Its interpretation was established in Report 1: with utilities standardized to $[0,1]$, $\alpha$ represents the log-odds change per unit of expected utility difference.
### The Feature-to-Probability Mapping ($\boldsymbol{\beta}$)
The $K \times D$ matrix `beta` parameterizes how alternative features determine subjective probabilities over consequences. This is the key modeling assumption that connects observable features to latent beliefs.
### Utility Increments ($\boldsymbol{\delta}$)
The $(K-1)$-simplex `delta` parameterizes the spacing of utilities on the unit interval. We explain this construction in detail below.
## Subjective Probability Formation
A central modeling decision in m_0 is how subjective probabilities arise from alternative features.
### The Mapping
For each alternative with feature vector $\mathbf{w}_r \in \mathbb{R}^D$, we compute subjective probabilities via:
$$
\boldsymbol{\psi}_r = \text{softmax}(\boldsymbol{\beta} \mathbf{w}_r)
$$ {#eq-psi-from-features}
where $\boldsymbol{\beta} \in \mathbb{R}^{K \times D}$ is a coefficient matrix.
Expanding the softmax:
$$
\psi_{r,k} = \frac{\exp\left(\sum_{d=1}^D \beta_{k,d} \cdot w_{r,d}\right)}{\sum_{j=1}^K \exp\left(\sum_{d=1}^D \beta_{j,d} \cdot w_{r,d}\right)}
$$ {#eq-psi-expanded}
### Stan Implementation
```{.stan filename="m_0.stan (subjective probabilities)"}
transformed parameters {
  array[sum(N)] simplex[K] psi; // Subjective probabilities
  // Calculate subjective probabilities via softmax
  for (i in 1:sum(N)) {
    psi[i] = softmax(beta * x[i]);
  }
  // ...
}
```
### Interpretation
This parameterization embeds several assumptions:
1. **Linear-in-features**: The log-odds of each consequence are linear functions of features
2. **Shared coefficient matrix**: The same $\boldsymbol{\beta}$ applies to all alternatives
3. **Softmax normalization**: Probabilities sum to 1 and are strictly positive
```{python}
#| label: fig-psi-from-features
#| fig-cap: "Illustration of subjective probability formation. A 2D feature vector is mapped through the β matrix to produce a probability distribution over 3 consequences."
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
# Example feature vector
w = np.array([0.8, -0.3])
# Example beta matrix (K=3, D=2)
beta = np.array([
    [1.2, -0.5],   # Consequence 1 coefficients
    [0.0, 0.8],    # Consequence 2 coefficients
    [-0.8, 0.3]    # Consequence 3 coefficients
])
# Compute log-odds and probabilities
log_odds = beta @ w
psi = softmax(log_odds)
# Plot 1: Feature vector
ax1 = axes[0]
ax1.bar(['w₁', 'w₂'], w, color=['steelblue', 'coral'])
ax1.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
ax1.set_ylabel('Feature Value')
ax1.set_title('Feature Vector w')
ax1.set_ylim(-1, 1)
# Plot 2: Log-odds (beta * w)
ax2 = axes[1]
ax2.bar(['k=1', 'k=2', 'k=3'], log_odds, color=['forestgreen', 'orange', 'purple'])
ax2.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
ax2.set_ylabel('Log-odds')
ax2.set_title('Linear Mapping βw')
# Plot 3: Probabilities (softmax)
ax3 = axes[2]
bars = ax3.bar(['ψ₁', 'ψ₂', 'ψ₃'], psi, color=['forestgreen', 'orange', 'purple'])
ax3.set_ylabel('Probability')
ax3.set_title('Subjective Probabilities ψ = softmax(βw)')
ax3.set_ylim(0, 1)
# Add probability labels
for bar, p in zip(bars, psi):
    ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
             f'{p:.2f}', ha='center', va='bottom', fontsize=10)
plt.tight_layout()
plt.show()
```
::: {.callout-tip}
## Alternative Specifications
The linear-softmax mapping is a reasonable starting point, but alternatives exist:
- **Neural network**: Replace $\boldsymbol{\beta} \mathbf{w}_r$ with a nonlinear function
- **Hierarchical**: Allow $\boldsymbol{\beta}$ to vary across individuals or problem types
- **Constrained**: Impose structure on $\boldsymbol{\beta}$ based on domain knowledge
These extensions would require modifying the Stan model but would preserve the overall framework.
:::
::: {.callout-note}
## On the β Prior Scale
The $\mathcal{N}(0, 1)$ prior places 95% of coefficients within $\pm 2$. In the softmax context, this corresponds to log-odds differences up to $\pm 2$, or probability ratios up to $e^2 \approx 7.4$ per unit feature change. This is moderately informative—it allows substantial feature effects while regularizing against extreme values. For applications where features are expected to have very strong effects on beliefs, a wider prior (e.g., $\mathcal{N}(0, 2)$) may be appropriate.
:::
::: {.callout-important}
## Softmax Invariance and Parameter Identifiability
The softmax function satisfies $\text{softmax}(\mathbf{v}) = \text{softmax}(\mathbf{v} + c \cdot \mathbf{1})$ for any scalar $c$. This means that adding a constant to all rows of $\boldsymbol{\beta}$ produces identical subjective probabilities $\boldsymbol{\psi}$. Consequently, the $K \times D$ matrix $\boldsymbol{\beta}$ has only $(K-1) \times D$ effective degrees of freedom, not $K \times D$.
Two common approaches address this redundancy:
1. **Reference category**: Fix one row of $\boldsymbol{\beta}$ to zero (analogous to multinomial logistic regression)
2. **Prior regularization**: Place independent priors on all $K \times D$ elements and accept the resulting ridge of constant posterior density
The m_0 model uses approach (2). The $\mathcal{N}(0, 1)$ prior on all elements provides sufficient regularization for Stan's HMC sampler to explore the posterior efficiently. However, when interpreting individual $\beta_{k,d}$ coefficients, remember that only *differences* between rows are identified—the absolute level is arbitrary. For this reason, posterior summaries should focus on contrasts such as $\beta_{k,d} - \beta_{j,d}$ or on the induced probabilities $\boldsymbol{\psi}$.
:::
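The invariance and its consequence for identifiability are easy to verify numerically (the $\boldsymbol{\beta}$, $\mathbf{w}$, and shift below are arbitrary):

```{python}
import numpy as np
from scipy.special import softmax

# Arbitrary beta, w, and shift c for illustration
rng = np.random.default_rng(0)
K, D = 3, 2
beta = rng.normal(size=(K, D))
w = rng.normal(size=D)
c = np.array([5.0, -2.0])

# Adding c to every row shifts all logits by the constant c @ w,
# which softmax ignores
psi_original = softmax(beta @ w)
psi_shifted = softmax((beta + c) @ w)
assert np.allclose(psi_original, psi_shifted)

# Row differences survive the shift unchanged: only contrasts are identified
assert np.allclose(beta[0] - beta[1], (beta + c)[0] - (beta + c)[1])
```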
## Utility Parameterization
The abstract formulation required utilities to satisfy:
$$
0 = \upsilon_1 \leq \upsilon_2 \leq \cdots \leq \upsilon_K = 1
$$
The m_0 model achieves this through an incremental parameterization.
### The Incremental Construction
Define $\boldsymbol{\delta} = (\delta_1, \ldots, \delta_{K-1})$ as a $(K-1)$-simplex, so $\delta_k \geq 0$ and $\sum_{k=1}^{K-1} \delta_k = 1$.
Then construct utilities as cumulative sums:
$$
\upsilon_k = \sum_{j=1}^{k-1} \delta_j
$$ {#eq-upsilon-construction}
with the convention that $\upsilon_1 = 0$ (empty sum).
This construction guarantees:
- $\upsilon_1 = 0$ (the sum of zero terms)
- $\upsilon_K = \sum_{j=1}^{K-1} \delta_j = 1$ (the simplex sums to 1)
- $\upsilon_k \leq \upsilon_{k+1}$ (each $\delta_k \geq 0$)
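A quick numeric check of the construction, using a hypothetical increment vector for $K=4$:

```{python}
import numpy as np

# Hypothetical increments for K = 4 consequences: delta_k >= 0, sums to 1
delta = np.array([0.2, 0.5, 0.3])
upsilon = np.cumsum(np.concatenate([[0.0], delta]))  # [0.0, 0.2, 0.7, 1.0]

assert upsilon[0] == 0.0 and np.isclose(upsilon[-1], 1.0)
assert (np.diff(upsilon) >= 0).all()  # ordering holds by construction
```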
### Stan Implementation
```{.stan filename="m_0.stan (utility construction)"}
transformed parameters {
  // ...
  ordered[K] upsilon; // Ordered utilities
  // Construct ordered utilities from increments
  upsilon = cumulative_sum(append_row(0, delta));
  // ...
}
```
The `append_row(0, delta)` prepends a zero to the $\delta$ vector, and `cumulative_sum` computes the running totals.
### Visualizing the Construction
```{python}
#| label: fig-utility-construction
#| fig-cap: "The incremental utility construction. Left: Three sample δ vectors from a Dirichlet(1,1) prior. Right: The corresponding utility vectors, all satisfying υ₁=0 and υ₃=1."
np.random.seed(42)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Generate sample delta vectors from Dirichlet(1,1)
n_samples = 3
colors = ['steelblue', 'coral', 'forestgreen']
for i in range(n_samples):
    delta = np.random.dirichlet([1, 1])
    upsilon = np.cumsum(np.concatenate([[0], delta]))
    # Plot delta
    x_positions = [i * 2.5, i * 2.5 + 1]
    axes[0].bar(x_positions, delta,
                alpha=0.7, color=colors[i], label=f'Sample {i+1}', width=0.8)
    # Plot upsilon
    axes[1].plot([1, 2, 3], upsilon, 'o-', color=colors[i],
                 linewidth=2, markersize=8, label=f'Sample {i+1}')
# Set x-axis labels for delta plot
axes[0].set_xticks([0.5, 3, 5.5])
axes[0].set_xticklabels(['Sample 1\n(δ₁, δ₂)', 'Sample 2\n(δ₁, δ₂)', 'Sample 3\n(δ₁, δ₂)'])
axes[0].set_ylabel('Increment Value')
axes[0].set_title('Utility Increments δ ~ Dirichlet(1,1)')
axes[0].legend()
axes[1].set_xlabel('Consequence k')
axes[1].set_ylabel('Utility υₖ')
axes[1].set_title('Resulting Utility Vectors')
axes[1].set_xticks([1, 2, 3])
axes[1].set_ylim(-0.05, 1.05)
axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[1].axhline(y=1, color='gray', linestyle='--', alpha=0.5)
axes[1].legend()
plt.tight_layout()
plt.show()
```
### Prior on Utilities
The default prior `delta ~ dirichlet(rep_vector(1, K-1))` is the symmetric Dirichlet with concentration parameter 1. This induces a **uniform distribution over the simplex**—all valid $\boldsymbol{\delta}$ configurations are equally likely a priori.
For $K=3$ consequences, this means:
- The middle utility $\upsilon_2$ has prior mean 0.5 and prior standard deviation $\approx 0.29$
- The utilities are uniformly distributed subject to the ordering constraint
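These prior statements can be verified by Monte Carlo (seed and sample size below are arbitrary):

```{python}
import numpy as np

# Monte Carlo check (arbitrary seed and sample size): for K = 3 the middle
# utility upsilon_2 = delta_1 is uniform on [0, 1] under Dirichlet(1, 1)
rng = np.random.default_rng(1)
delta = rng.dirichlet([1.0, 1.0], size=200_000)
upsilon_2 = delta[:, 0]

assert abs(upsilon_2.mean() - 0.5) < 0.01             # prior mean 0.5
assert abs(upsilon_2.std() - np.sqrt(1 / 12)) < 0.01  # prior sd ~ 0.289
```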
::: {.callout-note}
## On the Choice of Utility Prior
The uniform Dirichlet prior is relatively uninformative—it expresses no preference for how utilities should be spaced. In some applications, a more informative prior might be appropriate:
- **Dirichlet with larger concentration** (e.g., all 5's): Concentrates utilities more evenly
- **Domain-specific**: If consequences have natural ordering with known approximate spacing
**K-dependence:** While Dirichlet(1) is uniform over the simplex, the *marginal* distribution of individual utilities $\upsilon_k$ depends on $K$. As $K$ grows, the marginal distribution of each increment $\delta_k$ concentrates around $1/(K-1)$, causing interior utilities to cluster toward their expected spacing. For large $K$, a different concentration parameter may better express prior uncertainty about utility spacing.
The prior predictive analysis in [Report 3](03_prior_analysis.qmd) examines the implications of this default choice, and sensitivity to prior specification is discussed there.
:::
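The $K$-dependence can also be checked by simulation: under a symmetric Dirichlet(1) prior, the spread of a single increment $\delta_1$ shrinks as $K$ grows (the particular values of $K$ and the sample size below are arbitrary):

```{python}
import numpy as np

# Simulation check: under Dirichlet(1) on K-1 increments, each delta_k has
# marginal Beta(1, K-2), so its spread shrinks as K grows and increments
# concentrate near 1/(K-1)
rng = np.random.default_rng(2)
sds = []
for K in (3, 6, 11):
    delta = rng.dirichlet(np.ones(K - 1), size=100_000)
    assert abs(delta[:, 0].mean() - 1 / (K - 1)) < 0.005
    sds.append(delta[:, 0].std())

assert sds[0] > sds[1] > sds[2]  # spread decreases with K
```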
## Expected Utility and Choice Probabilities
With subjective probabilities $\boldsymbol{\psi}$ and utilities $\boldsymbol{\upsilon}$ in hand, we compute expected utilities and choice probabilities exactly as in the abstract formulation.
### Expected Utility Calculation
$$
\eta_r = \boldsymbol{\psi}_r^\top \boldsymbol{\upsilon} = \sum_{k=1}^K \psi_{r,k} \cdot \upsilon_k
$$
```{.stan filename="m_0.stan (expected utilities)"}
transformed parameters {
// ...
vector[sum(N)] eta; // Expected utilities
// Calculate expected utility for each alternative
for (i in 1:sum(N)) {
eta[i] = dot_product(psi[i], upsilon);
}
// ...
}
```
### Choice Probabilities
For each decision problem $m$ with $N_m$ available alternatives:
$$
\chi_{m,r} = \frac{\exp(\alpha \cdot \eta_r)}{\sum_{j \in \text{problem } m} \exp(\alpha \cdot \eta_j)}
$$
```{.stan filename="m_0.stan (choice probabilities)"}
transformed parameters {
  // ...
  array[M] simplex[max(N)] chi; // Choice probabilities (with padding)
  {
    int pos = 1;
    for (i in 1:M) {
      vector[N[i]] problem_eta = segment(eta, pos, N[i]);
      chi[i] = append_row(
        softmax(alpha * problem_eta),
        rep_vector(0, max(N) - N[i]) // Padding for ragged array
      );
      pos += N[i];
    }
  }
}
```
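A Python sketch of the same padded-softmax computation, using hypothetical `eta` and `N` values, shows why the padded vectors remain valid simplexes:

```{python}
import numpy as np
from scipy.special import softmax

# Hypothetical flattened expected utilities for M = 2 problems, N = [2, 3]
alpha = 4.0
eta = np.array([0.3, 0.7, 0.1, 0.5, 0.9])
N = np.array([2, 3])
max_N = N.max()

chi, pos = [], 0
for m in range(len(N)):
    probs = softmax(alpha * eta[pos:pos + N[m]])  # segment(eta, pos, N[m])
    chi.append(np.concatenate([probs, np.zeros(max_N - N[m])]))
    pos += N[m]
chi = np.array(chi)

# Each padded row is non-negative and sums to 1: still a valid simplex
assert np.allclose(chi.sum(axis=1), 1.0)
assert chi[0, 2] == 0.0  # padding slot in the smaller problem
```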
::: {.callout-note}
## On the Simplex Declaration with Padding
The `chi` array is declared as `simplex[max(N)]`, and the zero padding keeps the declaration valid: each `softmax` output sums to 1, the padding contributes 0, and Stan's simplex constraint requires only non-negative entries summing to 1 (zeros are allowed). Stan validates declared constraints on transformed parameters at run time, so this matters—a padding scheme that broke the unit sum, such as padding with a nonzero constant, would trigger rejections during sampling.
This design was chosen over Stan's ragged array features (using `segment()` throughout) for code clarity. The `categorical` distribution correctly handles the padded vectors because observations $y_m$ are validated to never exceed $N_m$ (see the bounds check in transformed data). An alternative implementation could use local vectors without the simplex declaration, but the current approach makes the probabilistic intent clearer.
:::
::: {.callout-tip}
## Alternative: log_softmax for Numerical Stability
For applications where $\alpha$ may be very large, Stan's `log_softmax` function avoids potential overflow. Combined with `categorical_logit`, this yields:
```stan
y[i] ~ categorical_logit(alpha * problem_eta);
```
This single statement replaces both the softmax computation and the categorical likelihood with a numerically stable combined operation. The m_0 model uses explicit softmax to expose the intermediate choice probabilities `chi` for posterior predictive checks and interpretability.
:::
## Prior Distributions
The model block specifies prior distributions for all parameters:
```{.stan filename="m_0.stan (model block)"}
model {
  // Priors
  alpha ~ lognormal(0, 1);               // Sensitivity
  to_vector(beta) ~ std_normal();        // Feature coefficients
  delta ~ dirichlet(rep_vector(1, K-1)); // Utility increments
  // Likelihood
  for (i in 1:M) {
    y[i] ~ categorical(chi[i]);
  }
}
```
### Prior on $\alpha$: Lognormal(0, 1)
The lognormal prior ensures $\alpha > 0$ and places substantial mass on moderate sensitivity values.
Key properties of Lognormal(0, 1):
| Quantity | Value |
|----------|-------|
| Median | 1.00 |
| Mean | 1.65 |
| 25th percentile | 0.51 |
| 75th percentile | 1.96 |
| 95th percentile | 5.18 |
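These values can be reproduced with `scipy.stats.lognorm` (note scipy's parameterization: `s` is $\sigma$ and `scale` is $e^\mu$):

```{python}
import numpy as np
from scipy.stats import lognorm

# Lognormal(0, 1) in scipy's parameterization: s = sigma = 1, scale = exp(mu) = 1
prior = lognorm(s=1.0, scale=1.0)

assert np.isclose(prior.median(), 1.00)
assert np.isclose(prior.mean(), 1.65, atol=0.005)
assert np.isclose(prior.ppf(0.25), 0.51, atol=0.005)
assert np.isclose(prior.ppf(0.75), 1.96, atol=0.005)
assert np.isclose(prior.ppf(0.95), 5.18, atol=0.005)
```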
### Prior on $\boldsymbol{\beta}$: Standard Normal
Each element of the $K \times D$ matrix $\boldsymbol{\beta}$ receives an independent $\mathcal{N}(0, 1)$ prior. This is a weakly informative default that:
- Centers coefficients at zero (no a priori direction of effect)
- Allows moderately large coefficients (95% within $\pm 2$)
- Is computationally convenient
### Prior on $\boldsymbol{\delta}$: Symmetric Dirichlet
As discussed above, `delta ~ dirichlet(rep_vector(1, K-1))` induces a uniform distribution over valid utility configurations.
## Generated Quantities
The `generated quantities` block computes outputs for model checking and comparison:
```{.stan filename="m_0.stan (generated quantities)"}
generated quantities {
  // Log-likelihood for model comparison (LOO-CV)
  vector[M] log_lik;
  for (i in 1:M) {
    log_lik[i] = categorical_lpmf(y[i] | chi[i]);
  }
  // Posterior predictive samples
  array[M] int y_pred;
  for (i in 1:M) {
    y_pred[i] = categorical_rng(chi[i]);
  }
  // Posterior predictive check statistics
  // ... (log-likelihood discrepancy, modal accuracy, sum of chosen probabilities)
}
```
**Quantities computed:**
- `log_lik`: Pointwise log-likelihood for each observation, used for approximate leave-one-out cross-validation (LOO-CV) via the `loo` package
- `y_pred`: Posterior predictive samples—simulated choices under the fitted model
- PPC statistics: Log-likelihood discrepancy, modal choice accuracy, and sum of chosen probabilities for calibration checking
These quantities are used in [Report 6](06_sbc_validation.qmd) for simulation-based calibration and posterior predictive checks.
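For intuition, here is a Python sketch of what `log_lik` and `y_pred` compute for a single posterior draw, with hypothetical `chi` and `y`:

```{python}
import numpy as np

# Hypothetical choice probabilities (one posterior draw) and observed choices
rng = np.random.default_rng(3)
chi = np.array([[0.7, 0.3, 0.0],   # problem 1: two alternatives plus padding
                [0.2, 0.5, 0.3]])  # problem 2: three alternatives
y = np.array([1, 2])               # 1-based observed choices

# Pointwise log-likelihood, one entry per problem (the input to LOO-CV)
log_lik = np.log(chi[np.arange(len(y)), y - 1])
# Posterior predictive draws: simulate a choice from each problem's chi
y_pred = np.array([rng.choice(3, p=row) + 1 for row in chi])

assert np.allclose(log_lik, np.log([0.7, 0.5]))
assert ((1 <= y_pred) & (y_pred <= 3)).all()
```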
## Computational Considerations
For users adapting or extending the model, several computational aspects are worth noting.
### Numerical Stability
For very large $\alpha$, the softmax computation $\text{softmax}(\alpha \cdot \boldsymbol{\eta})$ can overflow. Stan's `log_softmax` function is numerically stable for such cases:
```stan
// Alternative: Use log_softmax for numerical stability
vector[N[i]] log_chi = log_softmax(alpha * problem_eta);
// Then use categorical_logit_lpmf in the model block
```
The `categorical_logit` distribution combines log-softmax with the categorical likelihood in a single stable computation. The current implementation uses standard softmax because typical posterior draws for $\alpha$ remain in a numerically safe range.
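The overflow and its remedy can be demonstrated directly (the value of $\alpha$ below is chosen arbitrarily, large enough to force overflow in double precision):

```{python}
import numpy as np
from scipy.special import softmax, log_softmax

# alpha chosen (arbitrarily) large enough that exp(alpha * eta) overflows
alpha = 800.0
eta = np.array([0.1, 0.6, 0.9])

with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(alpha * eta) / np.exp(alpha * eta).sum()  # inf / inf -> nan
stable = softmax(alpha * eta)          # subtracts the max internally

assert np.isnan(naive).any()           # the naive form breaks down
assert np.isfinite(stable).all() and np.isclose(stable.sum(), 1.0)
assert np.isfinite(log_softmax(alpha * eta)).all()
```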
### Scaling Properties
- **Memory**: Linear in the problem size, dominated by the `psi` array ($\sum_m N_m$ simplexes of length $K$) plus the $R \cdot D$ feature data and $K \cdot D$ coefficients
- **Time per gradient**: Dominated by the $M$ loop over decision problems; each iteration involves softmax operations that scale as $O(K)$ and $O(N_m)$
- **Overall**: For typical problem sizes ($M \sim 100$--$1000$, $K \sim 3$--$10$, $D \sim 2$--$10$), fitting takes seconds to minutes
### Vectorization
The current implementation uses loops for clarity. The softmax computations could potentially be vectorized for large $M$, but the gains are modest because Stan's autodiff is already efficient for simple loops. For very large datasets, consider working with sufficient statistics or subsampling approaches.
## Summary
The m_0 model provides a complete Bayesian implementation of the SEU sensitivity framework:
| Component | Abstract | Concrete (m_0) |
|-----------|----------|----------------|
| Beliefs | $\boldsymbol{\psi}_r \in \Delta^{K-1}$ | $\boldsymbol{\psi}_r = \text{softmax}(\boldsymbol{\beta} \mathbf{w}_r)$ |
| Utilities | $0 = \upsilon_1 \leq \cdots \leq \upsilon_K = 1$ | $\boldsymbol{\upsilon} = \text{cumsum}([0, \boldsymbol{\delta}])$ |
| Sensitivity | $\alpha \geq 0$ | $\alpha \sim \text{Lognormal}(0, 1)$ |
| Choice | $\chi_m = \text{softmax}(\alpha \cdot \boldsymbol{\eta}_m)$ | Same, with categorical likelihood |
The model makes specific choices—linear feature mapping, symmetric Dirichlet prior on utilities—that could be modified for particular applications. The theoretical properties from Report 1 (monotonicity, perfect rationality limit, random choice limit) hold for any such modifications, provided the basic softmax choice structure is preserved.
The [next report](03_prior_analysis.qmd) examines the prior predictive distribution implied by these choices, asking: what range of behaviors does the prior permit?