Bridging from the insurance applications series to a new task family (Ellsberg-style urn gambles) and setting out the historical and interpretive weight that Ellsberg's example carries.
GPT-4o shows a clear, broad decline in alpha as temperature increases on Ellsberg gambles; Claude shows no such decline.
Reading the four cells of the 2x2 LLM-by-task design together: a dominant LLM main effect, a secondary task effect, and an uninformative interaction.
Before any temperature result can be read, the measurement device has to be set up: a defensible task, a calibrated prior, identifiable parameters, and adequacy checks.
The GPT-4o temperature finding, the Claude non-replication on the same task, and what the contrast licenses by way of conclusion.
Once SEU is used as a reference standard, the useful question is not only whether choices satisfy it exactly, but how strongly choices track differences in expected utility.
Three choices turn the abstract softmax-over-expected-utility model into something we can fit: how features map to subjective probabilities, how ordered utilities are parameterized, and which priors we place on the resulting parameters.
Four checks ask four different questions of the same model. Together they tell us whether an SEU sensitivity estimate is something we should be willing to interpret.
If you've put an AI agent in a decision-making seat, you need a way to tell whether it's actually deciding well.
Four requirements for measuring how well an AI agent decides under uncertainty.
Why adequacy checks determine whether a decision-quality measurement is evidence or noise.