What the GPT-4o Temperature Result Showed, and Why It Did Not Travel to Claude
Applying SEU Sensitivity to LLM Decisions · Part 1 of 2
Part 1 · Part 2
The foundations series developed the SEU Sensitivity framework as a graded way to ask how strongly a decision maker’s observed choices track subjective expected utility differences. The framework is deliberately narrower than a theory of intelligence or rationality. It estimates a single sensitivity parameter, alpha, under a stated reference standard, and asks whether that estimate is uncertain, interpretable, and adequate for the task at hand.
The natural next question is what happens when the framework is applied to real LLM choice data. This series describes the first two pilot studies in that programme. The present post sets up the initial GPT-4o temperature study: the question it asks, the experimental design, the prior calibration, and the model checking that has to be in place before any alpha estimate is allowed to be read substantively. The companion post turns to results — both for GPT-4o and for a follow-up Claude study on the same insurance task that did not reproduce the GPT-4o pattern.
These pilots are best read as probes of the measurement device — the m_0 family of models introduced in Part 2 of the foundations series — rather than as substantive claims about LLM cognition. Underlying technical detail lives in two reports: Temperature and SEU Sensitivity: Initial Results and Temperature and SEU Sensitivity: Claude × Insurance Study. The blog posts link into specific sections of those reports wherever the high-level summary abstracts away a choice that an interested reader might want to inspect.
Why this pilot, and why temperature
The first pilot used LLM sampling temperature as the experimental lever. The choice was driven primarily by methodological convenience: temperature is exposed by the provider’s API, it plausibly changes the stochasticity of the LLM’s outputs, and varying it gives a clean source of manipulation under which something should change if the framework is detecting anything at all. For a first empirical application of m_0-family models, an easy-to-manipulate factor was preferable to a domain-rich one whose interpretation would have competed for attention with the measurement question.
There was also an informal hypothesis in the background. In the softmax SEU sensitivity model, the parameter alpha governs how sharply choices concentrate on alternatives that the model ranks higher in subjective expected utility. In an LLM, sampling temperature affects the entropy of next-token selection. If consistency with rational choice (as measured by alpha) decreases as creativity (as measured by temperature) increases — the kind of trade-off that has informal currency in discussions of LLM behaviour — one would expect estimated alpha to decrease as temperature rises. This is a coherent directional hypothesis, but it is not mechanistically clean. Temperature can in principle affect intermediate reasoning, the language of intermediate claim assessments, and final answer selection at once, and the present design does not separate these channels. The pathway from token-sampling entropy to choice-level sensitivity is sketched more formally in the introduction to the initial study.
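To make the role of alpha concrete, here is a minimal sketch of a softmax choice rule, assuming choice probabilities proportional to exp(alpha × expected utility). The function name and the expected-utility values are hypothetical and chosen for illustration only; the actual m_0 likelihood is specified in the foundations series.

```python
import math

def softmax_choice_probs(expected_utilities, alpha):
    """Choice probabilities proportional to exp(alpha * EU) (assumed form)."""
    scaled = [alpha * eu for eu in expected_utilities]
    top = max(scaled)  # subtract the max so exp() cannot overflow
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical expected utilities for four alternatives (not study values).
example_eus = [0.20, 0.35, 0.60, 0.30]

for alpha in (0.0, 5.0, 20.0, 80.0):
    probs = softmax_choice_probs(example_eus, alpha)
    print(f"alpha={alpha:5.1f}  probs={[round(p, 3) for p in probs]}")
```

At alpha = 0 every alternative is equally likely; as alpha grows, probability mass concentrates on the alternative with the highest expected utility. The informal hypothesis is that raising temperature moves the fitted alpha toward the low end of this range.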
What the study primarily asks, then, is whether the measurement device picks up a structured change in choice behaviour when the easiest available lever is moved. Either answer is informative. A non-detection would constrain what the framework can do at this scale; a detection would establish that the workflow runs end-to-end on real LLM data and produce a comparative result that can be tested in a different setting.
Experimental design
The study used GPT-4o in an insurance claims triage task. In each decision problem, the model was presented with four insurance claims and asked which one to forward for investigation. The choice was structured as a two-stage prompting pipeline: each claim was first assessed individually by the LLM, producing a short natural-language evaluation, and the four assessments were then assembled into a choice prompt that asked the LLM to select exactly one claim to forward. The full study design (sample size rationale, task and conditions, prompts) is documented in §Experimental Design of the technical report.
The experimental factor was sampling temperature, set to five levels — 0.0, 0.3, 0.7, 1.0, 1.5 — each treated as an independent data collection. The study used 100 base decision problems, drawn from a pool of R = 30 distinct claims, with each problem presented three times under different position orderings; this counterbalancing yielded 300 choice observations per temperature condition and 1,500 in total. The position counterbalancing addressed a problem in an earlier pilot in which unparseable responses had effectively been mapped to a default position. In the revised design, unparseable responses are recorded as missing rather than silently converted into choices, and the final dataset had no missing observations at any temperature.
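The counterbalancing and missing-data handling described above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the random-shuffle scheme, the one-letter response format, and both function names are hypothetical. Only the counts (100 problems, 3 presentations, 5 temperatures) and the rule that unparseable responses become missing come from the text.

```python
import random

def counterbalanced_orderings(claim_ids, n_presentations, rng):
    """Return n distinct random orderings of the same claims (assumed scheme)."""
    seen = []
    while len(seen) < n_presentations:
        order = list(claim_ids)
        rng.shuffle(order)
        if order not in seen:
            seen.append(order)
    return seen

def parse_choice(response_text, ordering):
    """Map a one-letter position label (hypothetical format) to a claim ID."""
    labels = "ABCDEFGH"[: len(ordering)]
    token = response_text.strip().upper()
    if len(token) == 1 and token in labels:
        return ordering[labels.index(token)]
    return None  # recorded as missing, never coerced to a default position

rng = random.Random(0)
claims = ["C004", "C001", "C024", "C009"]
orderings = counterbalanced_orderings(claims, 3, rng)

n_base_problems, n_presentations, n_temperatures = 100, 3, 5
obs_per_condition = n_base_problems * n_presentations   # 300
obs_total = obs_per_condition * n_temperatures          # 1,500
print(orderings, obs_per_condition, obs_total)
```

Returning `None` for anything that is not a clean label keeps a parsing failure visible in the data instead of letting it masquerade as a choice of the first position.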
Alternatives entered the SEU Sensitivity model through a two-stage feature pipeline. The natural-language assessments were embedded with text-embedding-3-small and then projected to 32 dimensions via PCA fit on the pooled set of embeddings across all five temperature conditions. The pooled basis ensures that the conditions share a common coordinate system; it also implies, as the feature-construction note discusses, that any temperature-induced variation in the assessment text is absorbed into the features the choice model sees. The application therefore measures the behaviour of the whole assessment-and-choice pipeline under each temperature condition, not a temperature effect on the choice stage in isolation.
A worked example
A single decision problem makes the two-stage pipeline concrete. The report’s worked example traces problem P0001 at T = 0.0. Four claims were presented to GPT-4o:
| ID | Brief description |
|---|---|
| C004 | Emergency-room visit for a sports injury; documentation aligns with the reported injury. |
| C001 | Homeowner water-damage claim; photos appear inconsistent with the reported burst pipe. |
| C024 | $60,000 business-property theft claim; no forced entry, restricted key access, cameras non-functional. |
| C009 | Auto hail-damage claim; corroborated by weather data and consistent with typical hail damage. |
Each claim was first sent through an assessment prompt; the resulting GPT-4o text was embedded, projected through the pooled PCA basis, and used as the alternative’s feature vector in the choice model. The four assessments were then inserted into a choice prompt that asked the LLM to forward exactly one claim. Across the three position-counterbalanced presentations of P0001, GPT-4o selected C024 every time, even though the position of that claim in the prompt, and hence the position label the model had to output, changed across presentations. That position-invariant, content-tracking pattern is exactly what the counterbalancing design is intended to expose: a model whose choices were driven by position rather than content would not reproduce the same selection under reshuffling. The worked example also makes visible that the choice prompt sees the LLM’s own assessments, not the raw claim text, which is the reason the embedded assessments are what the SEU model uses as features.
Prior calibration
A model can produce a posterior whether or not its prior is well matched to the application. Calibrating the prior to the study design is part of what makes the eventual estimate interpretable.
The application fits a model variant called m_01, which is structurally identical to the foundational m_0 model from Part 2 of the foundations series but uses a more informative prior on alpha. The generic foundational prior, Lognormal(0, 1), was chosen for breadth in simulation work, including regimes in which the decision maker is nearly random. For LLM choice data on this task, that prior places too much mass in a region where choices would be approximately random, and it also places non-negligible mass on very large alpha values, which caused softmax overflow under the actual study design. The full motivation is in §Model and Prior Calibration.
The interpretive anchor used to choose a replacement prior is the SEU-maximizer selection rate: for each candidate prior, the report draws values of alpha, simulates choice data under the actual study design, and records the fraction of problems in which the simulated agent picks the alternative with the highest expected utility under the model. This summary turns an abstract change of prior into an observable claim about decision behaviour. A prior whose implied SEU-max rate sat near the random-choice baseline would be assuming a near-random decision maker in advance; a prior whose implied rate sat at the ceiling would be assuming near-perfect EU alignment in advance.
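The procedure just described (draw alpha from a candidate prior, simulate choices, record how often the expected-utility maximiser is picked) can be sketched in a few lines. This is a toy version under loud assumptions: the expected utilities are random numbers, not utilities implied by the study's features and priors, and the softmax form and function names are hypothetical. The rates it prints will therefore not match the report's figures.

```python
import math
import random

def softmax_probs(eus, alpha):
    scaled = [alpha * eu for eu in eus]
    top = max(scaled)  # stabilise exp() for large alpha
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def seu_max_rate(mu, sigma, problems, n_draws, rng):
    """Fraction of simulated choices that land on the EU-maximising alternative."""
    hits, total = 0, 0
    for _ in range(n_draws):
        alpha = rng.lognormvariate(mu, sigma)  # one prior draw of alpha
        for eus in problems:
            probs = softmax_probs(eus, alpha)
            choice = rng.choices(range(len(eus)), weights=probs)[0]
            hits += int(choice == eus.index(max(eus)))
            total += 1
    return hits / total

rng = random.Random(1)
# Toy stand-in: 50 problems, 4 alternatives, random expected utilities in [0, 1].
toy_problems = [[rng.random() for _ in range(4)] for _ in range(50)]

for mu, sigma in [(0.0, 1.0), (3.0, 0.75)]:
    rate = seu_max_rate(mu, sigma, toy_problems, n_draws=200, rng=rng)
    print(f"Lognormal({mu}, {sigma}): toy SEU-max rate = {rate:.2f}")
```

With four alternatives the random-choice baseline is 0.25, so a candidate prior can be judged by where its implied rate sits between that floor and the ceiling of 1.0.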
The grid search over twelve candidate lognormal hyperparameter pairs is summarised below.
Figure: summary of the prior grid search. The selected prior, Lognormal(3.0, 0.75), balances informativeness with sufficient coverage of the plausible parameter range.
The selected prior is Lognormal(3.0, 0.75). Its median sits near alpha = 20, its 90% interval spans roughly [5.5, 67], and it implies an SEU-max rate of approximately 78% — well above the random-choice baseline, consistent with the expectation that an LLM in this task will make reasonably EU-aligned choices, but still allowing substantial uncertainty about how sharp that alignment is. It also avoids the extreme upper-tail values that produced numerical issues under the foundational prior.
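The median and interval of a lognormal prior follow directly from its parameters, since a Lognormal(mu, sigma) quantile is exp(mu + sigma × z_p). The short check below assumes the quoted interval is a central (equal-tailed) 90% interval; the helper name is hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_quantile(mu, sigma, p):
    """Quantile of Lognormal(mu, sigma): exp(mu + sigma * z_p)."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))

mu, sigma = 3.0, 0.75
prior_median = lognormal_quantile(mu, sigma, 0.50)
prior_lo = lognormal_quantile(mu, sigma, 0.05)
prior_hi = lognormal_quantile(mu, sigma, 0.95)
print(f"median={prior_median:.1f}, central 90% interval=[{prior_lo:.1f}, {prior_hi:.1f}]")
```

The closed-form values agree with the quoted median and land close to the rounded interval endpoints given above.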
Validation focused on alpha
The next question is whether the model can recover the parameter of interest under the actual study design. The foundational reports validated m_0 at a smaller scale (M = 25, D = 5); the application study repeats parameter recovery and simulation-based calibration at the application’s actual scale (M = 300, K = 3, D = 32, R = 30). The full procedure is in §Model Validation.
For the temperature comparison, the primary inferential target is alpha, and recovery for alpha is comfortably fit for purpose under the application’s design. Because true alpha lies on a wide multiplicative scale under the Lognormal(3.0, 0.75) prior, the report anchors interpretation on relative metrics: across 20 recovery iterations, relative bias is within roughly ±10% of the mean true value, relative RMSE is well below 25%, and 90% credible-interval coverage is at nominal. Visually, the recovered values track the identity line tightly.
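The three relative metrics can be computed as in the sketch below. The definitions are assumptions consistent with the wording above (bias and RMSE scaled by the mean true value, coverage as the share of intervals containing the truth); the report's exact formulas may differ, and the numbers in the example are invented.

```python
import math

def recovery_metrics(true_values, estimates, intervals):
    """Relative bias, relative RMSE, and interval coverage (assumed definitions)."""
    n = len(true_values)
    mean_true = sum(true_values) / n
    errors = [est - tru for est, tru in zip(estimates, true_values)]
    rel_bias = (sum(errors) / n) / mean_true
    rel_rmse = math.sqrt(sum(e * e for e in errors) / n) / mean_true
    covered = sum(lo <= tru <= hi for tru, (lo, hi) in zip(true_values, intervals))
    return rel_bias, rel_rmse, covered / n

# Invented recovery results for three iterations (not study values).
true_alpha = [10.0, 20.0, 30.0]
post_mean = [11.0, 19.0, 33.0]
ci_90 = [(8.0, 12.0), (15.0, 25.0), (31.0, 40.0)]

rel_bias, rel_rmse, coverage = recovery_metrics(true_alpha, post_mean, ci_90)
print(f"relative bias={rel_bias:.3f}, relative RMSE={rel_rmse:.3f}, coverage={coverage:.2f}")
```

Scaling by the mean true value keeps the metrics comparable when true alpha ranges over an order of magnitude, which is the situation the Lognormal(3.0, 0.75) prior creates.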
The decomposition of expected utility into feature effects (beta) and utility increments (delta) carries the same partial-identification issue that the foundational recovery report documented for m_0: those components are harder to identify than alpha itself, and recovery of delta in particular is weaker. There is a structural reason for this, but it would be a detour for the blog series; readers who want to engage with it should consult the technical reports directly. For present purposes it is enough to note that the temperature comparison is a comparative claim about alpha across conditions that share the same prior, features, and design, and that the partial identification of the other components does not undermine that comparison. The construct-validity discussion in §Discussion lays out what alpha should and should not be read as in this design.
Simulation-based calibration complements parameter recovery by asking whether the posterior correctly represents uncertainty, not just whether point estimates are reasonable. Under the application’s actual design, the SBC rank distribution for alpha is consistent with uniformity, both visually and on the formal calibration tests. This is the result that licenses treating posterior credible intervals for alpha as honest statements about uncertainty when the temperature conditions are compared in the next post.
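The logic of the SBC rank check is easiest to see on a toy model where the exact posterior is known. The sketch below uses a conjugate normal model as a stand-in, which is an assumption made purely for illustration; it is not m_01, and in the real workflow the posterior draws come from the MCMC sampler.

```python
import random

def sbc_ranks(n_sims, n_obs, n_post_draws, rng):
    """Rank of the true parameter among posterior draws, one rank per simulation."""
    ranks = []
    for _ in range(n_sims):
        theta = rng.gauss(0.0, 1.0)                          # draw from the prior N(0, 1)
        ys = [rng.gauss(theta, 1.0) for _ in range(n_obs)]   # simulate data given theta
        post_var = 1.0 / (n_obs + 1.0)                       # exact conjugate posterior
        post_mean = post_var * sum(ys)
        draws = [rng.gauss(post_mean, post_var ** 0.5) for _ in range(n_post_draws)]
        ranks.append(sum(d < theta for d in draws))          # rank in 0..n_post_draws
    return ranks

def chi_square_uniformity(ranks, n_bins):
    """Chi-square statistic against a uniform distribution over integer ranks."""
    counts = [0] * n_bins
    for r in ranks:
        counts[r] += 1
    expected = len(ranks) / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)

rng = random.Random(0)
ranks = sbc_ranks(n_sims=1000, n_obs=10, n_post_draws=19, rng=rng)
stat = chi_square_uniformity(ranks, n_bins=20)
print(f"chi-square statistic on 20 rank bins: {stat:.1f}")
```

If the posterior were too narrow, the ranks would pile up at 0 and 19; if too wide, they would pile up in the middle. A flat histogram is what calibrated posteriors produce.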
MCMC diagnostics and posterior predictive checks
Two further checks live between validation and substantive interpretation: diagnostics for the sampler used to fit each per-temperature posterior, and posterior predictive checks for the fitted models.
MCMC diagnostics across the five temperature fits — R̂, effective sample sizes, divergent transitions — are clean. The 1–2 divergent transitions observed at the two highest temperature levels amount to less than 0.05% of post-warmup transitions, comfortably within acceptable bounds. The full diagnostic table is in §MCMC Diagnostics. Had the diagnostics looked otherwise — sustained R̂ inflation, many divergent transitions, or low effective sample size — the per-condition posteriors could not be trusted as samples from the intended target, and the comparison across conditions would have to be set aside until the sampling problem was resolved.
Posterior predictive checks ask whether the fitted model can reproduce features of the observed choice data. The report computes three complementary summaries — log-likelihood, modal choice frequency, and mean predicted probability of the chosen alternative — and reports the implied posterior predictive p-values at each temperature level. All fifteen p-values fall in the interval [0.3, 0.65], providing no evidence of systematic misfit at any temperature. The full table and a discussion of what these checks are and are not capable of supporting are in §Posterior Predictive Checks. Posterior predictive adequacy is not a proof that the model captures the internal mechanism by which temperature affects the LLM; it is the weaker, and still necessary, claim that the model’s choice-level predictions are compatible with the choices that were actually observed. If the checks had pointed the other way, the natural reading would have been serious model mis-specification, and the per-temperature alpha estimates would not have been a credible basis for the temperature comparison.
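A posterior predictive p-value is the share of replicated datasets whose summary statistic is at least as large as the observed one. The sketch below shows the calculation for one of the three summaries, modal choice frequency. It is a simplified illustration: the fitted probabilities are invented, the function names are hypothetical, and a full check would redraw the parameters from the posterior for every replicate.

```python
import random

def posterior_predictive_pvalue(observed_stat, replicated_stats):
    """Share of replicated statistics at least as large as the observed one."""
    exceed = sum(rep >= observed_stat for rep in replicated_stats)
    return exceed / len(replicated_stats)

def modal_choice_frequency(choices):
    """Relative frequency of the most commonly chosen position."""
    counts = {}
    for c in choices:
        counts[c] = counts.get(c, 0) + 1
    return max(counts.values()) / len(choices)

rng = random.Random(3)
toy_probs = [0.1, 0.2, 0.6, 0.1]  # invented fitted choice probabilities

def simulate_choices(n):
    return rng.choices(range(4), weights=toy_probs, k=n)

observed = modal_choice_frequency(simulate_choices(300))
replicated = [modal_choice_frequency(simulate_choices(300)) for _ in range(500)]
p_value = posterior_predictive_pvalue(observed, replicated)
print(f"posterior predictive p-value (toy): {p_value:.2f}")
```

Values near 0 or 1 would mean the observed statistic sits in the tail of what the fitted model predicts; values in the middle of the range, like those reported above, give no such signal.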
With the design specified, the prior calibrated, alpha shown to be identifiable at the application’s scale, the posterior shown to be calibrated under simulation, the sampler well-behaved, and the fitted models adequate to the observed choices, the comparison across temperature conditions is set up to be read as inference rather than as a number produced by fiat. Part 2 turns to that comparison — first for GPT-4o, then for the follow-up study on Claude — and to what the contrast between the two studies licenses by way of conclusion.
Sources
This post draws on Temperature and SEU Sensitivity: Initial Results, with background from the foundations series and the foundational reports cited there.
Applying SEU Sensitivity to LLM Decisions · Part 2 of 2
Part 1 set up the initial GPT-4o temperature study: an insurance claims triage task, five temperature conditions, a calibrated prior, parameter recovery for alpha at the application’s scale, and posterior predictive adequacy at every temperature. With that scaffolding in place, the per-condition alpha posteriors can be compared as inference rather than as raw numbers. This post takes up that comparison, then turns to a follow-up study that held the task fixed and changed the LLM, and closes with what the two studies — read together — license by way of conclusion.
What the GPT-4o study found
The headline pattern is the one a reader of Part 1 would expect to look for. Posterior medians for alpha were highest at temperature 0.0 and lowest at temperature 1.5, with the intermediate temperatures falling between those endpoints. The full posterior summaries are in §Posterior Summaries. The forest plot makes the per-condition uncertainty visible alongside the central tendency.
Figure: Forest plot of posterior alpha distributions across the five temperature conditions for GPT-4o. Points are posterior medians; thick bars span the 50% credible interval and thin bars the 90% credible interval.
Global slope
The aggregate question — whether alpha declines with temperature on the whole — is summarised by the posterior over a draw-wise linear slope Δalpha/ΔT. For each posterior draw, the report fits an ordinary least-squares line through the five alpha values and records the slope; the collection of slopes across draws is the posterior the report displays. (This is a derived quantity, not a regression model in its own right.) The result, in §Monotonicity Analysis, is a posterior median of approximately -31 with a 90% credible interval of [-66, -8] and P(slope < 0) ≈ 0.99.
Figure: Posterior distribution of the draw-wise slope Δalpha/ΔT for GPT-4o across temperatures 0.0–1.5. The 90% credible interval lies entirely below zero.
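The draw-wise slope described above is straightforward to compute once posterior draws are available. In the sketch below the draws are made-up lognormal stand-ins, not the study's posteriors, so the printed summary will not reproduce the reported values. Because the five fits are independent, pairing draw i from each condition is treated here as an arbitrary but valid pairing.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

temps = [0.0, 0.3, 0.7, 1.0, 1.5]
rng = random.Random(0)

# Made-up stand-ins for per-condition posterior draws of alpha.
hypothetical_log_centres = [4.3, 4.1, 4.1, 3.9, 3.7]
alpha_draws = [
    [rng.lognormvariate(centre, 0.25) for centre in hypothetical_log_centres]
    for _ in range(4000)
]

# One OLS slope per posterior draw; the collection is the derived posterior.
slopes = [ols_slope(temps, draw) for draw in alpha_draws]
cuts = statistics.quantiles(slopes, n=20)   # 19 cut points: 5%, 10%, ..., 95%
slope_median = statistics.median(slopes)
slope_lo, slope_hi = cuts[0], cuts[-1]
p_negative = sum(s < 0 for s in slopes) / len(slopes)
print(f"median={slope_median:.1f}, 90% CI=[{slope_lo:.1f}, {slope_hi:.1f}], P(slope<0)={p_negative:.2f}")
```

Computing the slope per draw and then summarising propagates the uncertainty in all five alpha posteriors into the slope, which a regression through the five posterior medians would not do.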
So at the global level the evidence for a negative relationship is strong. The local pattern is less tidy: T = 0.3 and T = 0.7 are nearly indistinguishable, and the probability that alpha is strictly decreasing across all five temperature levels is only about 0.12, driven almost entirely by the overlap between the two intermediate conditions. Collapsing T = 0.3 and T = 0.7 into a single intermediate level raises the probability of the resulting coarser ordering to about 0.38. The headline directional claim — alpha decreases with temperature on the whole — is much better supported than any fine-grained claim about every adjacent step.
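The ordering probabilities are simple counts over posterior draws. The sketch below computes the strict five-level ordering and one possible reading of the collapsed ordering, in which the two intermediate conditions must both lie between their neighbours but need not be ordered relative to each other. That reading is an assumption; the report may define the collapsed comparison differently, for instance by pooling the two conditions.

```python
def prob_strictly_decreasing(alpha_draws):
    """Share of draws in which alpha falls at every adjacent temperature step."""
    hits = sum(all(a > b for a, b in zip(draw, draw[1:])) for draw in alpha_draws)
    return hits / len(alpha_draws)

def prob_coarse_ordering(alpha_draws):
    """Share of draws ordered once the two middle conditions are treated as one level."""
    hits = 0
    for a0, a1, a2, a3, a4 in alpha_draws:
        hits += a0 > max(a1, a2) and min(a1, a2) > a3 > a4
    return hits / len(alpha_draws)

# Three invented draws: fully ordered, middle pair swapped, fully reversed.
toy_draws = [(5, 4, 3, 2, 1), (5, 3, 4, 2, 1), (1, 2, 3, 4, 5)]
p_strict = prob_strictly_decreasing(toy_draws)
p_coarse = prob_coarse_ordering(toy_draws)
print(p_strict, p_coarse)
```

The strict probability can never exceed the probability that any single adjacent pair is ordered correctly, so two nearly indistinguishable conditions cap it at roughly one half before the other steps are even considered.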
A design-induced correlation caveat
There is a subtler caveat about how confidently the conditions can be compared, set out in the Independence Caveat within the pairwise-comparisons section. The five temperature conditions in this study draw their decision problems from a single fixed pool of R = 30 insurance claims. Whatever is idiosyncratic about that particular pool — the embedding geometry that happens to fall out of those texts, the spread of expected utilities across alternatives, the typical gap between the best and second-best option — shifts every condition’s estimate in the same direction. Fitting m_01 independently at each temperature treats the five datasets as the only sources of evidence and so reports five posterior summaries that look statistically unrelated even though they share a common, unmodelled nuisance factor.
The practical consequence is that the pairwise comparisons across temperature conditions, which treat the per-temperature posteriors as if they were independent, slightly overstate how confidently we can resolve between-temperature differences. The same point applies, more mildly, to the global-slope summary above, which is also a function of independent per-condition fits. A hierarchical version of the model — for example one that writes log alpha(T) = γ₀ + γ₁ · T + η with a small condition-level random effect — would handle this on two fronts: it makes the common effect of the shared claim pool a single estimated quantity, so design-induced shifts cancel out of between-condition contrasts; and it borrows strength across conditions for the systematic effect of interest, producing a single calibrated uncertainty statement about the temperature–sensitivity relationship. As §Next Steps of the technical report notes, this hierarchical extension has already been developed; the underlying machinery is laid out in foundational reports 8 through 12, available from the SEU Sensitivity project page, and it will be the subject of a separate blog series. The qualitative direction of the headline result is not threatened by the caveat, but the fine-grained ordering across adjacent temperatures should be read as more provisional than independent fits make it look.
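A small numerical illustration of why a shared shift drops out of between-condition contrasts on the log scale. The sketch uses the regression form quoted above and adds an explicit `pool_shift` term to stand for the claim-pool effect; that term, the parameter values, and the function name are all hypothetical. In the hierarchical model itself the shared effect would be absorbed into the estimated intercept.

```python
import random

def simulate_log_alpha(temps, gamma0, gamma1, pool_shift, eta_sd, seed):
    """log alpha(T) = gamma0 + gamma1 * T + pool_shift + eta, one value per condition."""
    rng = random.Random(seed)  # same seed gives the same eta draws for every shift
    return [gamma0 + gamma1 * t + pool_shift + rng.gauss(0.0, eta_sd) for t in temps]

temps = [0.0, 0.3, 0.7, 1.0, 1.5]
contrasts = []
for pool_shift in (-0.4, 0.0, 0.4):
    log_alpha = simulate_log_alpha(temps, 4.2, -0.4, pool_shift, 0.05, seed=7)
    contrast = log_alpha[-1] - log_alpha[0]  # highest minus lowest temperature
    contrasts.append(contrast)
    print(f"pool_shift={pool_shift:+.1f}  level at T=0: {log_alpha[0]:.3f}  contrast: {contrast:.3f}")
```

The levels move with the shift while the contrast stays put, which is the sense in which a common pool effect matters for absolute alpha values but cancels from the temperature comparison once it is modelled as shared.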
A natural follow-up: hold the task, vary the LLM
A single pilot, however carefully checked, cannot tell us whether the pattern travels. Temperature was always a probe of the measurement device rather than a deep substantive commitment, and there are several reasons the relationship could plausibly turn out to be model-specific: provider APIs implement temperature in different decoding stacks; post-training regimes differ; the stage of the decision pipeline most affected by sampling randomness could differ. The cleanest way to put the framework’s portability to a first test is to hold the task fixed and change the LLM.
That is the design of the second pilot, Temperature and SEU Sensitivity: Claude × Insurance Study. The task is the same insurance triage task used in Part 1. The choice model is the same m_01 variant under the same Lognormal(3.0, 0.75) prior on alpha. The same 100 base problems, three position-counterbalanced presentations, and pooled-PCA feature pipeline are used. The only intentional change is the underlying LLM: Claude 3.5 Sonnet (Anthropic) in place of GPT-4o (OpenAI).
One design parameter does change for reasons outside the experimenter’s control. The Anthropic API supports temperature values in [0.0, 1.0], against OpenAI’s [0.0, 2.0]. The Claude study uses temperatures {0.0, 0.2, 0.5, 0.8, 1.0} rather than the GPT-4o study’s {0.0, 0.3, 0.7, 1.0, 1.5}. The narrower span reduces statistical power to detect a large temperature effect and complicates direct numerical comparisons of slope magnitudes; nothing about it threatens the qualitative reading.
What the Claude study found
The Claude alpha estimates do not exhibit a clear negative relationship with temperature. Posterior medians are roughly 74 at T = 0.0, drop to 55 at T = 0.2, rise back to 77 at T = 0.5 and 74 at T = 0.8, and fall again to 57 at T = 1.0. The posteriors overlap substantially, and no clear ordering by temperature emerges. The full summary table is in §Posterior Summaries of the Claude report.
The cross-study comparison plot puts the two patterns side by side under the same task and choice model.
Figure: Cross-study comparison. Left: GPT-4o shows declining alpha as temperature increases. Right: Claude on the same task shows a flat, non-monotonic pattern.
The global slope quantifies the contrast. For Claude, the posterior median slope is approximately -3.6, the 90% credible interval is [-54, 39], and P(slope < 0) ≈ 0.56 — barely above chance. The probability of strict monotonic decrease across all five Claude temperature levels is below 0.01. A formal cross-study comparison reported in §Comparison with Initial Temperature Study puts P(GPT-4o slope < Claude slope) ≈ 0.82. The temperature-sensitivity relationship observed for GPT-4o on this task did not reproduce when the same task was run with Claude.
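A probability such as P(GPT-4o slope < Claude slope) can be estimated by comparing draws from the two slope posteriors. The sketch below assumes the two studies' posteriors are independent and pairs draws at random. The slope draws are invented normal stand-ins, so the printed probability is not the reported one, and any rescaling the report applies to account for the different temperature ranges is not modelled here.

```python
import random

def prob_less(draws_a, draws_b, n_pairs, rng):
    """Monte Carlo estimate of P(A < B) for independent posterior draws."""
    hits = 0
    for _ in range(n_pairs):
        hits += rng.choice(draws_a) < rng.choice(draws_b)
    return hits / n_pairs

rng = random.Random(11)
# Invented stand-ins for two slope posteriors (not study values).
slopes_a = [rng.gauss(-30.0, 18.0) for _ in range(4000)]
slopes_b = [rng.gauss(-4.0, 28.0) for _ in range(4000)]

p_a_below_b = prob_less(slopes_a, slopes_b, n_pairs=20000, rng=rng)
print(f"P(slope A < slope B) is about {p_a_below_b:.2f}")
```

A wide posterior on either side pulls this probability toward 0.5, which is why a clearly negative slope in one study and a flat but uncertain slope in the other can still yield a comparison well short of certainty.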
It is tempting to read the up-and-down sequence of Claude posterior medians as a genuine non-monotonic response to temperature. The oscillation analysis in §Characterising the Oscillatory Pattern does not support that reading: no individual pairwise comparison reaches a conventionally notable level, and the pattern is consistent with posterior noise around a roughly flat function. The honest summary is not that Claude has a complex temperature-response curve on this task; it is that on this task Claude does not exhibit the negative temperature–sensitivity relationship that GPT-4o does.
The model still fit Claude’s data
The natural worry to rule out is that the absence of a temperature trend is an artefact of model failure on Claude’s data. The adequacy checks do not support that reading. Study-specific parameter recovery for alpha under the Claude design is fit for purpose on the relative metrics that anchor cross-study comparison; the differences in absolute recovery numbers between the two reports reflect the random seeds of two synthetic recoveries drawn from the same prior, not anything LLM-specific. The simulation-based calibration of alpha under this prior and likelihood was already established in the initial GPT-4o study and is inherited validly here; SBC tests the sampler under the prior and likelihood, both of which are identical across the two applications. Posterior predictive p-values across the five Claude conditions fall in [0.3, 0.7], indicating no systematic misfit. The detail is in §Model Validation and §Posterior Predictive Checks of the Claude report.
Sometimes a model fails to detect an effect because it is inadequate for the data; sometimes the model is adequate and the hypothesised effect is not present. A useful measurement framework should help tell those two cases apart. In the Claude study the model is adequate and the hypothesised effect is, at the resolution this design supports, not present.
Reading the contrast
The temperature parameter exposed by provider APIs is not, on this evidence, a portable behavioural instrument. The same nominal temperature need not imply the same effective sampling entropy across providers; it need not affect the same stage of a multi-stage decision pipeline in the same way across providers; and the LLM’s post-training regime can in principle regularise task behaviour so that surface expression varies under temperature without the selected alternative varying. The Claude report’s discussion in §Why Does the Effect Differ Across LLMs? lists these candidate explanations more fully. The present designs cannot adjudicate among them and do not try to.
The contrast also illustrates what the comparative reading of alpha is, and what it is not. The reading the two studies support is a within-design one: under a shared task, prior, feature pipeline, and choice model, GPT-4o’s choices on this task concentrate more sharply on the alternative ranked highest by the fitted utility at low temperature than at high temperature, and Claude’s choices on this task do not show that pattern. That is a comparative claim about alpha under a fixed measurement device. It is not a claim that the absolute level of alpha certifies one LLM as more rational than another in some context-free sense. The construct-validity layering — between within-model consistency, comparative claims under a shared design, and absolute claims about EU rationality — is laid out at greater length in the construct-validity discussion of the initial-study report.
The methodological value of the contrast is that the framework was able to detect a structured relationship when one was present and to refuse to support one when the data did not contain it. That is not a flashy finding, but it is the kind of finding that distinguishes an evaluation framework from a scoreboard.
What comes next
Two threads run forward from these two pilots.
The first is the hierarchical extension already mentioned in connection with the design-induced correlation caveat. The model in question — h_m01 — is the m_01 likelihood extended across experimental cells, with cell-specific alpha-values regressed on cell covariates and a shared scale parameter. It addresses the shared-pool nuisance directly and frames between-condition contrasts as regression effects on log alpha, giving a single calibrated uncertainty statement for the temperature–sensitivity relationship rather than five independent fits stitched together after the fact. The construction and validation of h_m01 are spelled out in foundational reports 8 through 12 on the SEU Sensitivity project page, and they will be the subject of a separate blog series.
The second is the natural substantive next step for the applications programme. The insurance triage task served well as a vehicle for the first methodological tests, but it is not the most informative setting for asking what the framework can tell us about an LLM’s relationship to a stated decision standard. A planned next applications series moves the underlying task to Ellsberg-style decision problems, where the structure of the decision setting makes the EU standard sharper and the comparison across LLMs and conditions more directly interpretable. The corresponding technical reports are GPT-4o × Ellsberg and Ellsberg study.
The value of these first two pilots is not that every result generalised. It is that the framework made the generalisation question measurable.
Sources
This post draws on Temperature and SEU Sensitivity: Initial Results and Temperature and SEU Sensitivity: Claude × Insurance Study, with background from the foundations series and the SEU Sensitivity project’s foundational reports.