Date: April 7th, 2026 3:39 PM
Author: chilmata
I didn't know what any of that meant so I ran it through deepseek:
This is a fascinating post because it's a **"vibe leak"** or a **"rumor mill aggregation."** The user isn't talking about a specific product you can use right now (like ChatGPT-4o or Claude 3.7 Sonnet). They are talking about **internal, unreleased, next-generation frontier models** that are currently being tested inside labs like Anthropic, OpenAI, and Google.
Here’s the translation of what you're looking at and why the poster is alarmed.
### The Core Premise: "Claude Mythos" is Terrifyingly Good
The user is comparing **Claude Mythos** (an internal codename at Anthropic) against **Claude Opus 4.6** (the current/next public flagship), **GPT-5.4** (OpenAI's unreleased test), and **Gemini 3.1 Pro** (Google's unreleased test).
The poster's entire argument hinges on the first column: **Claude Mythos**.
### What Do These Numbers Mean?
These benchmarks are designed to test if an AI can replace a **human software engineer or researcher**.
| Benchmark Name | What it Actually Tests | Why the "Mythos" Score is a Big Deal |
| :--- | :--- | :--- |
| **SWE-bench Verified** | Real GitHub bug fixes from open-source Python projects. | This is the **"Will AI take my coding job?"** benchmark. A score of **93.9%** means the AI is solving nearly *every* software engineering ticket thrown at it without human help. Public models are currently at ~50-60%. |
| **SWE-bench Pro** | Enterprise-level, massive codebase fixes. | The gap is even wider here (Mythos: 78% vs Others: ~54%). This suggests it can navigate **millions of lines of code** to find and fix a specific bug. |
| **Terminal-Bench 2.0** | Complex command-line usage and system administration. | 82% means it's an expert sysadmin. |
| **USAMO** | **USA Math Olympiad** problems (proof-based, not multiple choice). | This is the most shocking number. **97.6%** means it's effectively perfect at *national olympiad-level proofs*. The drop-off for Opus 4.6 (42.3%) suggests this is a **step-change in reasoning**, not just an incremental gain from more training. |
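| **GraphWalks BFS 1M** | Running a breadth-first search (BFS) over a graph supplied inside a very long prompt; the "1M" most likely refers to a context length of about 1,000,000 tokens rather than a node count. | This tests **long-context reasoning and memory**. Mythos holds it together (80%), while others crash (20-40%). |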
| **GPQA Diamond** | **Graduate-Level Google-Proof Q&A** - PhD-level biology, physics, and chemistry questions written so the answers are hard to find with a web search. | **94.5%** is effectively a "PhD or better" score in hard sciences. |
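
For anyone unsure what the "BFS" in the GraphWalks row refers to, here is a minimal Python sketch of a breadth-first search over an edge list. The toy graph, the function name `bfs_within_depth`, and the depth-limit framing are all assumptions made for illustration; this is not the benchmark's actual data or harness. The benchmark's difficulty presumably comes from the model having to track this kind of traversal itself over a huge prompt, not from the algorithm.

```python
from collections import deque


def bfs_within_depth(edges, start, max_depth):
    """Return the nodes reachable from `start` in at most `max_depth` hops.

    `edges` is a list of directed (source, destination) pairs. This is a
    hypothetical helper for illustration, not part of any benchmark.
    """
    # Build an adjacency list from the flat edge list.
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))

    seen.discard(start)  # report only the nodes that were discovered
    return seen


# Toy graph (made up): a -> b, a -> c, b -> d, c -> d, d -> e
toy_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("d", "e"), ("c", "d")]
print(sorted(bfs_within_depth(toy_edges, "a", 2)))  # ['b', 'c', 'd']
```

On five edges this is trivial for a program. The table's point is that a language model has to do the same bookkeeping over an edge list that fills a very long context window.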
### The Scary Part: "LLMs are not hitting a wall"
The poster is reacting to a common narrative in 2025 that "AI scaling has hit a wall" or "we are seeing diminishing returns." This post is counter-evidence. The gap between **Opus 4.6** (which is already better than anything we have) and **Mythos** is *massive*.
The poster is pointing out:
1. **The best models are hidden:** "They aren’t even releasing this model to the general public." This is the key. The public sees ChatGPT 4o or Claude 3.5. The labs see *this*. The reason they aren't releasing it is likely **cost** (it might cost $10,000 per query to run GraphWalks) or **safety alignment** (a model that gets 98% on Math Olympiad can probably break encryption protocols if asked nicely).
2. **Automating AI Research:** The line *"especially for SWE task that are relevant to automating AI research"* is the doomer kicker. If Mythos can code at 94% accuracy on real-world bugs, it can *write the code for the next version of itself.* This is the **recursive self-improvement** loop that leads to explosive growth.
### Summary of the Poster's Sentiment (in plain English)
> _"I have seen the internal test scores of the next AI models. The top secret one (Claude Mythos) is scoring 95-98% on tasks that are literally designed to measure **human expert intelligence** (PhD science, Math Olympiad, Software Engineering). The public thinks AI is plateauing because they only see ChatGPT and Gemini. That's wrong. The pace of progress behind closed doors is actually *faster* than ever. If this is what they have *now*, then the version they have in 5 years will be able to do all of our jobs better than we can. The probability that my career is safe is plummeting."_
(http://www.autoadmit.com/thread.php?thread_id=5854649&forum_id=2...#49800928)