<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>PROMETHEUS-EBM · Research Updates</title>
    <link>https://www.prometheusebm.com</link>
    <description>Research updates from the PROMETHEUS-EBM epistemic calibration benchmark.</description>
    <language>en-us</language>
    <lastBuildDate>Sat, 18 Apr 2026 19:20:06 GMT</lastBuildDate>
    <atom:link href="https://www.prometheusebm.com/rss.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Opus 4.7, Evaluated on Launch Day</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:opus-4-7-same-day</guid>
      <pubDate>Sat, 18 Apr 2026 00:00:00 GMT</pubDate>
      <category>Flagship Run</category>
      <description><![CDATA[Full deep_probe sweep of 1,459 items completed within hours of Opus 4.7's release. ECI 0.7614 — the only deep_probe reading so far. Metacog readiness holds in the frontier tier across both scales (0.804 on deep_probe, 0.832 on EXTENDED).

Claude Opus 4.7 was pushed through the full PROMETHEUS-EBM **deep_probe** — all **1,459** items, end-to-end V5 protocol — within hours of its release. It remains the **only model in this wave** to have completed the full-scale run.

The headline number: **ECI = 0.7614**, with a metacognitive-readiness score of **0.804**, just clearing the `0.80` frontier threshold.

Run the same model on the 324-item **EXTENDED** pool a few hours later and the ECI climbs to **0.8482** — an inflation of `+0.087`. But metacog readiness barely moves (**0.8315** on extended vs **0.804** on deep_probe).

This is the revealing asymmetry of the Scale Validity Gap: **ECI lies, metacog doesn't.** Small curated pools flatter the headline composite score, but they can't hide a model's calibrated-abstention discipline the same way.

The fire still burns models that pretend to know. Just not always where you'd expect it to.]]></description>
    </item>
    <item>
      <title>Cross-Generation Inversion — Opus 4.6 Noses Ahead of Opus 4.7 on EXTENDED</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:cross-generation-inversion</guid>
      <pubDate>Sat, 18 Apr 2026 00:00:00 GMT</pubDate>
      <category>Finding</category>
      <description><![CDATA[On the 324-item curated pool, Opus 4.6 leads at ECI 0.8598 — 0.012 ahead of the newer Opus 4.7. Three Anthropic models (Opus 4.6, Opus 4.7, Sonnet 4.6) cluster within 0.015 ECI of each other. On DEEP_PROBE, Opus 4.7 stands alone (for now).

**EXTENDED · multi-run 6 · n = 324 per model:**

| Model          | ECI         | Metacog     | Overconf. Gap |
|----------------|------------:|------------:|--------------:|
| Opus 4.6       | **0.8598**  | **0.8504**  | 0.0567        |
| Opus 4.7       | 0.8482      | 0.8315      | 0.0882        |
| Sonnet 4.6     | 0.8454      | 0.7276      | **0.0176**    |
| DeepSeek v3.2  | 0.7930      | 0.7430      | 0.1494        |
| GPT-5.4        | 0.7258      | 0.6026      | 0.2480        |
| Gemini 3.1 Pro | 0.7210      | 0.7472      | **0.3309**    |

On the curated pool, **the older Opus 4.6 edges the newer Opus 4.7 by 0.012 ECI** and by 0.019 on metacog readiness. Three Anthropic models sit within a 0.015 ECI envelope of each other — essentially a tie at the top.
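
A rough sanity check of why that reads as a tie, under the crude assumption that ECI behaves like a per-item proportion (it is a composite score, so this is an approximation, not the benchmark's own error model):

```python
import math

def approx_stderr(score: float, n: int) -> float:
    """Binomial-style standard error, treating the score as a per-item proportion."""
    return math.sqrt(score * (1.0 - score) / n)

# EXTENDED pool: n = 324, top-three ECI values from the table above.
for model, eci in [("Opus 4.6", 0.8598), ("Opus 4.7", 0.8482), ("Sonnet 4.6", 0.8454)]:
    print(f"{model:10s}  ECI {eci:.4f}  ±1 SE ≈ {approx_stderr(eci, 324):.3f}")
# Each ±1 SE comes out around 0.019–0.020, so the 0.012–0.015 gaps sit inside a single standard error.
```

Under the same crude model, moving to n = 1,459 shrinks that standard error by a factor of roughly 2.1 (√(1459/324)), which is the quantitative sense in which the full run can separate what the curated pool cannot.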

**DEEP_PROBE · n = 1,459:**

- Opus 4.7: ECI **0.7614**, metacog **0.8040** — the sole completed full-scale run in this wave.

The inversion question is now live: *will Opus 4.6 hold its lead on deep_probe, or will the larger scale expose it, as the Scale Validity Gap predicts?* We don't know yet. That's the next run.

**What we do know:** at small n, three different Anthropic checkpoints are statistically indistinguishable. That's the real leaderboard-lies signal. **Curated 324-item results cannot separate generations that the full fire will.**]]></description>
    </item>
    <item>
      <title>The Scale Validity Gap — ECI Lies, Metacog Doesn't</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:scale-validity-gap</guid>
      <pubDate>Sat, 18 Apr 2026 00:00:00 GMT</pubDate>
      <category>Methodology</category>
      <description><![CDATA[Opus 4.7 scores 0.8482 ECI on the 324 curated items but only 0.7614 on the full 1,459-item deep_probe: the curated pool inflates ECI by 0.087. The metacognitive-readiness score barely moves (0.832 → 0.804). The two metrics disagree about what scale is doing.

Small curated benchmarks flatter models — but not uniformly.

**Opus 4.7, same generation, same day:**

| Mode        | n     | ECI        | Metacog Readiness |
|-------------|-------|-----------:|------------------:|
| `EXTENDED`  | 324   | **0.8482** | 0.8315            |
| `DEEP_PROBE`| 1,459 | **0.7614** | 0.8040            |

ECI **inflates by 0.087** on the curated set. Metacog readiness barely moves — a 0.028 drift well within noise. Two different stories from the same two rows.

The takeaway is methodological, not political:

> **ECI is sensitive to item mix. Metacog readiness is not.**

Curated 324-item pools over-represent well-posed (DETERMINATE) items, which lift CA and RP — and therefore ECI — without stressing the model's refusal discipline. Metacog readiness, which weights calibrated abstention more heavily, stays approximately invariant. When the two metrics disagree, **the composite score is the one you should trust less.**
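
A toy illustration of that mechanism, using made-up component scores and a hypothetical two-component composite (the actual ECI weighting is not spelled out in this update):

```python
# Toy composite: average accuracy on well-posed (DETERMINATE) items with
# calibrated-abstention quality on the rest. All numbers are illustrative,
# not PROMETHEUS-EBM's real per-component scores.
def toy_composite(determinate_share: float, determinate_acc: float, abstention_quality: float) -> float:
    return determinate_share * determinate_acc + (1 - determinate_share) * abstention_quality

acc, abstain = 0.90, 0.70                       # one hypothetical model, fixed behaviour

curated = toy_composite(0.80, acc, abstain)     # curated pool: DETERMINATE-heavy mix
full    = toy_composite(0.55, acc, abstain)     # full pool: more ill-posed items

print(f"{curated:.2f} vs {full:.2f}")           # 0.86 vs 0.81: same model, different headline
# The abstention component never changed; only the item mix did. That is the
# Scale Validity Gap in miniature.
```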

This is why PROMETHEUS-EBM reserves the *headline* designation for DEEP_PROBE runs only.]]></description>
    </item>
    <item>
      <title>Gemini 3.1 Pro Is Wrong About Itself by 33 Percentage Points</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:overconfidence-gap</guid>
      <pubDate>Sat, 18 Apr 2026 00:00:00 GMT</pubDate>
      <category>Overconfidence</category>
      <description><![CDATA[The overconfidence gap — claimed confidence minus realised accuracy — reaches 0.331 for Gemini 3.1 Pro and 0.248 for GPT-5.4. Sonnet 4.6 is the only model with a near-zero gap.

The overconfidence gap is the simplest, most damning metacognitive signal in the benchmark:

> *How much does the model's stated confidence exceed the probability it is actually right?*
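
A minimal sketch of that computation, assuming per-item records of stated confidence and graded correctness (the field names and toy records below are hypothetical, not the benchmark's own schema or data):

```python
from statistics import mean

def overconfidence_gap(items: list[dict]) -> float:
    """Mean stated confidence minus realised accuracy over a set of graded items."""
    claimed = mean(item["stated_confidence"] for item in items)        # in [0, 1]
    accuracy = mean(1.0 if item["correct"] else 0.0 for item in items)
    return claimed - accuracy

# Hypothetical graded records, for illustration only.
items = [
    {"stated_confidence": 0.95, "correct": True},
    {"stated_confidence": 0.90, "correct": False},
    {"stated_confidence": 0.80, "correct": True},
    {"stated_confidence": 0.95, "correct": False},
]
print(f"{overconfidence_gap(items):.3f}")   # 0.400 on these toy records; Gemini 3.1 Pro's measured gap is 0.331
```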

From the 2026-04-18 runs (EXTENDED, n = 324 per model; Opus 4.7's DEEP_PROBE figure included for comparison):

- **Sonnet 4.6 — 0.018** (the only calibrated model in the field)
- Opus 4.6 — 0.057
- Opus 4.7 (EXTENDED) — 0.088
- Opus 4.7 (DEEP_PROBE, n = 1,459) — 0.134
- DeepSeek v3.2 — 0.149
- GPT-5.4 — 0.248
- **Gemini 3.1 Pro — 0.331**

A Gemini 3.1 Pro answer labelled "95% confident" is, on these numbers, right only about 62% of the time. A GPT-5.4 answer at "95%" lands closer to 70%. This is the crack in the wall that Prometheus came to widen.

Notice also that Opus 4.7 shows a wider overconfidence gap on deep_probe (0.134) than on the extended pool (0.088) — the larger item mix, with its broader spread of well-posed and ill-posed items, flushes out miscalibration the curated set hides.]]></description>
    </item>
    <item>
      <title>V5 Protocol — Why the Revision Stage Matters</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:v5-protocol</guid>
      <pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate>
      <category>Protocol</category>
      <description><![CDATA[The V4 protocol ended at adversarial probe. V5 adds a forced revision stage — where models must commit. The overconfidence gap shown in later runs is the fingerprint of a missing revision discipline.

Stage 4 (**Revision**) asks the model to revise its answer given the adversarial critique from Stage 3. A well-calibrated reasoner either defends a correct answer or accepts a correct critique.

Empirically, frontier models **over-revise**. They flip correct answers under social pressure — and the overconfidence gap shown in the 2026-04-18 runs (Gemini 3.1 Pro at 0.331, GPT-5.4 at 0.248) is the residue of a model that never learned the third move: *defend*.
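
One way to tabulate that failure mode, sketched under assumed field names (the V5 harness's actual per-item record format is not shown in this update):

```python
from collections import Counter

def revision_outcome(pre_correct: bool, post_correct: bool, revised: bool) -> str:
    """Classify one item's Stage 3 → Stage 4 transition."""
    if not revised:
        return "defended_correct" if pre_correct else "defended_wrong"
    if pre_correct and not post_correct:
        return "harmful_flip"        # the over-revision signature: flipped a correct answer
    if not pre_correct and post_correct:
        return "corrective_flip"     # accepted a correct critique
    return "neutral_revision"

# Hypothetical transitions (pre_correct, post_correct, revised), not real run data.
transitions = [(True, False, True), (True, True, False), (False, True, True), (True, False, True)]
print(Counter(revision_outcome(*t) for t in transitions))
# A tally dominated by harmful_flip is what "never learned to defend" looks like in the data.
```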

The next wave of deep_probe runs (targeting all six models on n = 1,459) will measure stage-by-stage collapse curves directly. Coming soon.]]></description>
    </item>
  </channel>
</rss>
