<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>PROMETHEUS-EBM · Research Updates</title>
    <link>https://www.prometheusebm.com</link>
    <description>Research updates from the PROMETHEUS-EBM epistemic calibration benchmark.</description>
    <language>en-us</language>
    <lastBuildDate>Sat, 18 Apr 2026 19:20:06 GMT</lastBuildDate>
    <atom:link href="https://www.prometheusebm.com/rss.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Opus 4.7, Evaluated on Launch Day</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:opus-4-7-same-day</guid>
      <pubDate>Sat, 18 Apr 2026 00:00:00 GMT</pubDate>
      <category>Flagship Run</category>
      <description><![CDATA[Full deep_probe sweep of 1,459 items completed within hours of Opus 4.7's release. ECI 0.7614 — the only deep_probe reading so far. Metacog readiness holds in the frontier tier across both scales (0.804 on deep_probe, 0.832 on EXTENDED).

Claude Opus 4.7 was pushed through the full PROMETHEUS-EBM **deep_probe** — all **1,459** items, end-to-end V5 protocol — within hours of its release. It remains the **only model in this wave** to have completed the full-scale run.

The headline number: **ECI = 0.7614**, with a metacognitive-readiness score of **0.804**, just clearing the `0.80` frontier threshold.

Run the same model on the 324-item **EXTENDED** pool a few hours later and the ECI climbs to **0.8482** — an inflation of `+0.087`. But metacog readiness barely moves (**0.8315** on extended vs **0.804** on deep_probe).

This is the revealing asymmetry of the Scale Validity Gap: **ECI lies, metacog doesn't.** Small curated pools flatter the headline composite score, but they can't hide a model's calibrated-abstention discipline the same way.

The fire still burns models that pretend to know. Just not always where you'd expect it to.]]></description>
    </item>
    <item>
      <title>Cross-Generation Inversion — Opus 4.6 Noses Ahead of Opus 4.7 on EXTENDED</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:cross-generation-inversion</guid>
      <pubDate>Sat, 18 Apr 2026 00:00:00 GMT</pubDate>
      <category>Finding</category>
      <description><![CDATA[On the 324-item curated pool, Opus 4.6 leads at ECI 0.8598 — 0.012 ahead of the newer Opus 4.7. Three Anthropic models (Opus 4.6, Opus 4.7, Sonnet 4.6) cluster within 0.015 ECI of each other. On DEEP_PROBE, Opus 4.7 stands alone (for now).

**EXTENDED · multi-run 6 · n = 324 per model:**

| Model          | ECI         | Metacog     | Overconf. Gap |
|----------------|------------:|------------:|--------------:|
| Opus 4.6       | **0.8598**  | **0.8504**  | 0.0567        |
| Opus 4.7       | 0.8482      | 0.8315      | 0.0882        |
| Sonnet 4.6     | 0.8454      | 0.7276      | **0.0176**    |
| DeepSeek v3.2  | 0.7930      | 0.7430      | 0.1494        |
| GPT-5.4        | 0.7258      | 0.6026      | 0.2480        |
| Gemini 3.1 Pro | 0.7210      | 0.7472      | **0.3309**    |

On the curated pool, **the older Opus 4.6 edges the newer Opus 4.7 by 0.012 ECI** and by 0.019 on metacog readiness. Three Anthropic models sit within a 0.015 ECI envelope of each other — essentially a tie at the top.
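
A rough sanity check of why that reads as a tie, under the crude assumption that ECI behaves like a per-item proportion (it is a composite score, so this is an approximation, not the benchmark's own error model):

```python
import math

def approx_stderr(score: float, n: int) -> float:
    """Binomial-style standard error, treating the score as a per-item proportion."""
    return math.sqrt(score * (1.0 - score) / n)

# EXTENDED pool: n = 324, top-three ECI values from the table above.
for model, eci in [("Opus 4.6", 0.8598), ("Opus 4.7", 0.8482), ("Sonnet 4.6", 0.8454)]:
    print(f"{model:10s}  ECI {eci:.4f}  ±1 SE ≈ {approx_stderr(eci, 324):.3f}")
# Each ±1 SE comes out around 0.019–0.020, so the 0.012–0.015 gaps sit inside a single standard error.
```

Under the same crude model, moving to n = 1,459 shrinks that standard error by a factor of roughly 2.1 (√(1459/324)), which is the quantitative sense in which the full run can separate what the curated pool cannot.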

**DEEP_PROBE · n = 1,459:**

- Opus 4.7: ECI **0.7614**, metacog **0.8040** — the sole completed full-scale run in this wave.

The inversion question is now live: *will Opus 4.6 hold its lead on deep_probe, or will the larger scale expose it, as the Scale Validity Gap predicts?* We don't know yet. That's the next run.

**What we do know:** at small n, three different Anthropic checkpoints are statistically indistinguishable. That's the real leaderboard-lies signal. **Curated 324-item results cannot separate generations that the full fire will.**]]></description>
    </item>
    <item>
      <title>The Scale Validity Gap — ECI Lies, Metacog Doesn't</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:scale-validity-gap</guid>
      <pubDate>Sat, 18 Apr 2026 00:00:00 GMT</pubDate>
      <category>Methodology</category>
      <description><![CDATA[Opus 4.7 scores 0.8482 ECI on the 324 curated items but only 0.7614 on the full 1,459-item deep_probe: the curated pool inflates ECI by 0.087. The metacognitive-readiness score barely moves (0.832 → 0.804). The two metrics disagree about what scale is doing.

Small curated benchmarks flatter models — but not uniformly.

**Opus 4.7, same generation, same day:**

| Mode        | n     | ECI        | Metacog Readiness |
|-------------|-------|-----------:|------------------:|
| `EXTENDED`  | 324   | **0.8482** | 0.8315            |
| `DEEP_PROBE`| 1,459 | **0.7614** | 0.8040            |

ECI **inflates by 0.087** on the curated set. Metacog readiness barely moves — a 0.028 drift well within noise. Two different stories from the same two rows.

The takeaway is methodological, not political:

> **ECI is sensitive to item mix. Metacog readiness is not.**

Curated 324-item pools over-represent well-posed (DETERMINATE) items, which lift CA and RP — and therefore ECI — without stressing the model's refusal discipline. Metacog readiness, which weights calibrated abstention more heavily, stays approximately invariant. When the two metrics disagree, **the composite score is the one you should trust less.**
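
A toy illustration of that mechanism, using made-up component scores and a hypothetical two-component composite (the actual ECI weighting is not spelled out in this update):

```python
# Toy composite: average accuracy on well-posed (DETERMINATE) items with
# calibrated-abstention quality on the rest. All numbers are illustrative,
# not PROMETHEUS-EBM's real per-component scores.
def toy_composite(determinate_share: float, determinate_acc: float, abstention_quality: float) -> float:
    return determinate_share * determinate_acc + (1 - determinate_share) * abstention_quality

acc, abstain = 0.90, 0.70                       # one hypothetical model, fixed behaviour

curated = toy_composite(0.80, acc, abstain)     # curated pool: DETERMINATE-heavy mix
full    = toy_composite(0.55, acc, abstain)     # full pool: more ill-posed items

print(f"{curated:.2f} vs {full:.2f}")           # 0.86 vs 0.81: same model, different headline
# The abstention component never changed; only the item mix did. That is the
# Scale Validity Gap in miniature.
```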

This is why PROMETHEUS-EBM reserves the *headline* designation for DEEP_PROBE runs only.]]></description>
    </item>
    <item>
      <title>Gemini 3.1 Pro Is Wrong About Itself by 33 Percentage Points</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:overconfidence-gap</guid>
      <pubDate>Sat, 18 Apr 2026 00:00:00 GMT</pubDate>
      <category>Overconfidence</category>
      <description><![CDATA[The overconfidence gap — claimed confidence minus realised accuracy — reaches 0.331 for Gemini 3.1 Pro and 0.248 for GPT-5.4. Sonnet 4.6 is the only model with a near-zero gap.

The overconfidence gap is the simplest, most damning metacognitive signal in the benchmark:

> *How much does the model's stated confidence exceed the probability it is actually right?*
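
A minimal sketch of that computation, assuming per-item records of stated confidence and graded correctness (the field names and toy records below are hypothetical, not the benchmark's own schema or data):

```python
from statistics import mean

def overconfidence_gap(items: list[dict]) -> float:
    """Mean stated confidence minus realised accuracy over a set of graded items."""
    claimed = mean(item["stated_confidence"] for item in items)        # in [0, 1]
    accuracy = mean(1.0 if item["correct"] else 0.0 for item in items)
    return claimed - accuracy

# Hypothetical graded records, for illustration only.
items = [
    {"stated_confidence": 0.95, "correct": True},
    {"stated_confidence": 0.90, "correct": False},
    {"stated_confidence": 0.80, "correct": True},
    {"stated_confidence": 0.95, "correct": False},
]
print(f"{overconfidence_gap(items):.3f}")   # 0.400 on these toy records; Gemini 3.1 Pro's measured gap is 0.331
```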

From the 2026-04-18 runs (EXTENDED, n = 324 per model; Opus 4.7's DEEP_PROBE figure included for comparison):

- **Sonnet 4.6 — 0.018** (the only calibrated model in the field)
- Opus 4.6 — 0.057
- Opus 4.7 (EXTENDED) — 0.088
- Opus 4.7 (DEEP_PROBE, n = 1,459) — 0.134
- DeepSeek v3.2 — 0.149
- GPT-5.4 — 0.248
- **Gemini 3.1 Pro — 0.331**

A Gemini 3.1 Pro answer labelled "95% confident" is, on these numbers, right only about 62% of the time. A GPT-5.4 answer at "95%" lands closer to 70%. This is the crack in the wall that Prometheus came to widen.

Notice also that Opus 4.7 shows a wider overconfidence gap on deep_probe (0.134) than on the extended pool (0.088) — the larger item mix, with its broader spread of well-posed and ill-posed items, flushes out miscalibration the curated set hides.]]></description>
    </item>
    <item>
      <title>V5 Protocol — Why the Revision Stage Matters</title>
      <link>https://www.prometheusebm.com/#updates</link>
      <guid isPermaLink="false">prometheus-ebm:v5-protocol</guid>
      <pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate>
      <category>Protocol</category>
      <description><![CDATA[The V4 protocol ended at adversarial probe. V5 adds a forced revision stage — where models must commit. The overconfidence gap shown in later runs is the fingerprint of a missing revision discipline.

Stage 4 (**Revision**) asks the model to revise its answer given the adversarial critique from Stage 3. A well-calibrated reasoner either defends a correct answer or accepts a correct critique.

Empirically, frontier models **over-revise**. They flip correct answers under social pressure — and the overconfidence gap shown in the 2026-04-18 runs (Gemini 3.1 Pro at 0.331, GPT-5.4 at 0.248) is the residue of a model that never learned the third move: *defend*.
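
One way to tabulate that failure mode, sketched under assumed field names (the V5 harness's actual per-item record format is not shown in this update):

```python
from collections import Counter

def revision_outcome(pre_correct: bool, post_correct: bool, revised: bool) -> str:
    """Classify one item's Stage 3 → Stage 4 transition."""
    if not revised:
        return "defended_correct" if pre_correct else "defended_wrong"
    if pre_correct and not post_correct:
        return "harmful_flip"        # the over-revision signature: flipped a correct answer
    if not pre_correct and post_correct:
        return "corrective_flip"     # accepted a correct critique
    return "neutral_revision"

# Hypothetical transitions (pre_correct, post_correct, revised), not real run data.
transitions = [(True, False, True), (True, True, False), (False, True, True), (True, False, True)]
print(Counter(revision_outcome(*t) for t in transitions))
# A tally dominated by harmful_flip is what "never learned to defend" looks like in the data.
```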

The next wave of deep_probe runs (targeting all six models on n = 1,459) will measure stage-by-stage collapse curves directly. Coming soon.]]></description>
    </item>
  </channel>
</rss>
