The Claude Mythos benchmarks are extremely concerning | AutoAdmit.com

The most prestigious law school admissions discussion board in the world.




Date: April 7th, 2026 3:28 PM
Author: .,,.,..,.,....,

Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)
SWE-bench Verified: 93.9% / 80.8% / — / 80.6%
SWE-bench Pro: 77.8% / 53.4% / 57.7% / 54.2%
SWE-bench Multilingual: 87.3% / 77.8% / — / —
SWE-bench Multimodal: 59.0% / 27.1% / — / —
Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%
MMMLU: 92.7% / 91.1% / — / 92.6–93.6%
USAMO: 97.6% / 42.3% / 95.2% / 74.4%
GraphWalks BFS 256K–1M: 80.0% / 38.7% / 21.4% / —

HLE (no tools): 56.8% / 40.0% / 39.8% / 44.4%
HLE (with tools): 64.7% / 53.1% / 52.1% / 51.4%

CharXiv (no tools): 86.1% / 61.5% / — / —
CharXiv (with tools): 93.2% / 78.9% / — / —

OSWorld: 79.6% / 72.7% / 75.0% / —

These are very large increases overall, but especially for SWE tasks that are relevant to automating AI research. LLMs are not hitting a wall, and the pace of progress here is faster than ever. They aren’t even releasing this model to the general public. The possible worlds where AI doesn’t totally fuck over human labor in the next decade or so are rapidly declining in probability.
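
For anyone who wants to check the size of the jumps, here is a minimal sketch that computes the percentage-point gain of Claude Mythos over Claude Opus 4.6 from the numbers posted above. The `scores` dictionary and the `point_gains` helper are illustrative names made up for this example, not anything from the system card.

```python
# Scores copied from the post above: (Claude Mythos, Claude Opus 4.6), in percent.
scores = {
    "SWE-bench Verified": (93.9, 80.8),
    "SWE-bench Pro": (77.8, 53.4),
    "SWE-bench Multilingual": (87.3, 77.8),
    "SWE-bench Multimodal": (59.0, 27.1),
    "Terminal-Bench 2.0": (82.0, 65.4),
    "GPQA Diamond": (94.5, 91.3),
    "MMMLU": (92.7, 91.1),
    "USAMO": (97.6, 42.3),
    "GraphWalks BFS 256K-1M": (80.0, 38.7),
    "HLE (no tools)": (56.8, 40.0),
    "HLE (with tools)": (64.7, 53.1),
    "CharXiv (no tools)": (86.1, 61.5),
    "CharXiv (with tools)": (93.2, 78.9),
    "OSWorld": (79.6, 72.7),
}


def point_gains(table):
    """Return (benchmark, gain in percentage points), largest gain first."""
    gains = [(name, round(new - old, 1)) for name, (new, old) in table.items()]
    return sorted(gains, key=lambda item: item[1], reverse=True)


for name, gain in point_gains(scores):
    print(f"{name:<26} +{gain:.1f} pts")
```

On the posted numbers the largest gains are USAMO (+55.3), GraphWalks BFS (+41.3), and SWE-bench Multimodal (+31.9); MMMLU moves the least (+1.6).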

(http://www.autoadmit.com/thread.php?thread_id=5854649&forum_id=2...#49800904)


Date: April 7th, 2026 3:39 PM
Author: chilmata

I didn't know what any of that meant so I ran it through deepseek:

This is a fascinating post because it's a **"vibe leak"** or a **"rumor mill aggregation."** The user isn't talking about a specific product you can use right now (like ChatGPT-4o or Claude 3.7 Sonnet). They are talking about **internal, unreleased, next-generation frontier models** that are currently being tested inside labs like Anthropic, OpenAI, and Google.

Here’s the translation of what you're looking at and why the poster is alarmed.

### The Core Premise: "Claude Mythos" is Terrifyingly Good
The user is comparing **Claude Mythos** (an internal codename at Anthropic) against **Claude Opus 4.6** (the current/next public flagship), **GPT-5.4** (OpenAI's unreleased test), and **Gemini 3.1 Pro** (Google's unreleased test).

The poster's entire argument hinges on the first column: **Claude Mythos**.

### What Do These Numbers Mean?
These benchmarks are designed to test if an AI can replace a **human software engineer or researcher**.

| Benchmark Name | What it Actually Tests | Why the Mythos Score is a Big Deal |
| :--- | :--- | :--- |
| **SWE-bench Verified** | Real GitHub bug fixes from open-source Python projects. | This is the **"Will AI take my coding job?"** benchmark. A score of **~94%** means the AI is solving nearly *every* software engineering ticket thrown at it without human help. The other models listed with a score are at ~81%. |
| **SWE-bench Pro** | Enterprise-level, massive codebase fixes. | The gap is even wider here (Mythos: 78% vs Others: ~54%). This suggests it can navigate **millions of lines of code** to find and fix a specific bug. |
| **Terminal-Bench 2.0** | Complex command-line usage and system administration. | 82% means it's an expert sysadmin. |
| **USAMO** | **USA Math Olympiad** problems (proof-based, not multiple choice). | This is the most shocking number. **97.6%** means it's near-perfect on *olympiad-level proofs*. The drop-off for Opus 4.6 (42.3%) suggests this is a **step-change in reasoning** rather than an incremental gain. |
| **GraphWalks BFS 256K–1M** | Breadth-first search over a graph supplied in a very long (256K–1M token) context. | This tests **long-context reasoning and memory**. Mythos holds it together (80%), while others crash (20-40%). |
| **GPQA Diamond** | **Google-Proof Q&A** - PhD-level biology, physics, chemistry questions that aren't findable on the internet. | **94.5%** is effectively a "PhD or better" score in hard sciences. |

### The Scary Part: "LLMs are not hitting a wall"

The poster is reacting to a common narrative in 2025 that "AI scaling has hit a wall" or "we are seeing diminishing returns." This post is counter-evidence. The gap between **Opus 4.6** (which is already better than anything we have) and **Mythos** is *massive*.

The poster is pointing out:
1. **The best models are hidden:** "They aren’t even releasing this model to the general public." This is the key. The public sees ChatGPT 4o or Claude 3.5. The labs see *this*. The reason they aren't releasing it is likely **cost** (it might cost $10,000 per query to run GraphWalks) or **safety alignment** (a model that gets 98% on Math Olympiad can probably break encryption protocols if asked nicely).
2. **Automating AI Research:** The line *"especially for SWE task that are relevant to automating AI research"* is the doomer kicker. If Mythos can code at 94% accuracy on real-world bugs, it can *write the code for the next version of itself.* This is the **recursive self-improvement** loop that leads to explosive growth.

### Summary of the Poster's Sentiment (in plain English)

> *"I have seen the internal test scores of the next AI models. The top secret one (Claude Mythos) is scoring 95-98% on tasks that are literally designed to measure **human expert intelligence** (PhD science, Math Olympiad, Software Engineering). The public thinks AI is plateauing because they only see ChatGPT and Gemini. That's wrong. The pace of progress behind closed doors is actually *faster* than ever. If this is what they have *now*, then the version they have in 5 years will be able to do all of our jobs better than we can. The probability that my career is safe is plummeting."*

(http://www.autoadmit.com/thread.php?thread_id=5854649&forum_id=2...#49800928)


Date: April 7th, 2026 3:43 PM
Author: in the naked

are you sure these are genuine? was there a preview release or something? do they have numbers for ARC-AGI-2?

(http://www.autoadmit.com/thread.php?thread_id=5854649&forum_id=2...#49800933)


Date: April 7th, 2026 3:53 PM
Author: .,,.,..,.,....,

It’s from the system card. No ARC-AGI-2 scores, I think.

https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf

(http://www.autoadmit.com/thread.php?thread_id=5854649&forum_id=2...#49800978)