Latest ChatGPT model better than 99.7% of coders, qualifies for US math
Date: April 16th, 2025 11:11 PM Author: Nubile Plum Kitchen Foreskin
olympiad.
https://openai.com/index/introducing-o3-and-o4-mini/
Big jump from o3 mini in general. These inference scaling models need to stop progressing soon for them not to be able to substantially automate AI research.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855738)
Date: April 17th, 2025 12:21 AM Author: excitant costumed principal's office juggernaut
the METR analysis for this is interesting. the benchmark is the length of tasks (measured in human professional work time) that a model can complete with 50% probability. it has been doubling every 7 months for the last 6 years.
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
https://metr.github.io/autonomy-evals-guide/openai-o3-report/
o3's 50% time horizon is about 1.8 times that of 3.7 sonnet, a larger jump than the trend would have predicted. likely comparable to Gemini 2.5 pro but it's hard to tell. we'll likely be at scary capability levels in <2 years.
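to put numbers on the doubling claim, here is a rough sketch of the arithmetic. it assumes clean exponential growth with the 7-month doubling time quoted above; the function names are made up for illustration and are not from METR's code.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time quoted in the post (assumed exact here)

def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplier on the time horizon after `months` of steady doubling."""
    return 2.0 ** (months / doubling_months)

def months_for_factor(factor, doubling_months=DOUBLING_MONTHS):
    """Months of on-trend progress needed to multiply the horizon by `factor`."""
    return doubling_months * math.log2(factor)

# Six years of doubling every 7 months
print(f"growth over 72 months: {growth_factor(72):.0f}x")

# How many months of trend a 1.8x jump is worth
print(f"1.8x is worth {months_for_factor(1.8):.1f} months of trend")
```

under these assumptions six years of doubling compounds to roughly a 1,250x longer horizon, and a 1.8x jump is worth about six months of on-trend progress.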
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855928)
Date: April 17th, 2025 10:14 AM Author: Stirring angry pervert church building
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856645)
Date: May 3rd, 2025 6:51 PM Author: .,.,,..,..,.,..:,,:,...,:::,.,.,:,.,.:.,:.,:.::,.
https://www.virologytest.ai/
"We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI's o3, reaches 43.8% accuracy and even outperforms 94% of expert virologists when compared directly on question subsets specifically tailored to the experts' specialties.
The ability to provide expert-level troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused in particularly harmful ways. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences."
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48901877)
Date: April 17th, 2025 1:30 AM Author: excitant costumed principal's office juggernaut
maybe you should try reading, dipshit.
"On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours"
reliability decreases with task length, but has gone up considerably over time. reliability is a problem where models are not trained sufficiently well and can't use intermediate tokens to correct reasoning paths. it is becoming less of an issue with every new major release.
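a toy version of the "success falls with task length" relationship, for anyone who wants to see how a 50% horizon drops out of it. this is only a sketch: it assumes success is logistic in log2(task minutes) and reads "almost 100%" as 0.99 at 4 minutes and "<10%" as 0.10 at 240 minutes. both anchor values and all function names are assumptions for illustration, not METR's fitted numbers or code.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def fit_two_point_logistic(t1, p1, t2, p2):
    """Return (a, b) so that P(success) = sigmoid(a - b * log2(t)) passes through both points."""
    x1, x2 = math.log2(t1), math.log2(t2)
    b = (logit(p1) - logit(p2)) / (x2 - x1)
    a = logit(p1) + b * x1
    return a, b

def success_probability(t, a, b):
    """Modelled success rate on a task that takes a human t minutes."""
    return 1.0 / (1.0 + math.exp(-(a - b * math.log2(t))))

def horizon(p, a, b):
    """Task length in minutes at which the modelled success rate equals p."""
    return 2.0 ** ((a - logit(p)) / b)

# Assumed anchors: 0.99 at 4 minutes, 0.10 at 4 hours (240 minutes)
a, b = fit_two_point_logistic(4, 0.99, 240, 0.10)
print(f"50% horizon: {horizon(0.5, a, b):.0f} minutes")
print(f"80% horizon: {horizon(0.8, a, b):.0f} minutes")
```

with those assumed anchors the 50% horizon lands at roughly an hour, and asking for higher reliability pulls the horizon down sharply, which is the point about reliability versus task length.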
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855997)
Date: April 17th, 2025 1:04 AM Author: floppy goal in life
There's a lot of news this week that suggests openai leadership don't believe they can continue winning on model strength alone:
- released a CLI coding agent tool that seems pretty similar to Claude Code and therefore is not super impressive
- discussion of acquiring Windsurf, which is the cheapest and least interesting of the LLM wrapper vscode forks
- trying to launch a social network to compete with the x/grok integration
My opinion as someone working in this space is that the existing models are already good enough for mainstream coding tools, but there needs to be a better interface than clueless users vaguely describing what they want to a chat window that dumps an entire codebase as context. For any other kind of 'agentic' task we are still probably a few years from LLMs matching the cost or reliability of outsourcing to third worlders like that Nate thing.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48855974)
Date: April 17th, 2025 10:06 AM Author: Stirring angry pervert church building
Cr the industry players are starting to realize that the way to actually make money is to give normies specific tools/interfaces to do specific things rather than just a sandbox AI
All the comments in this subthread are cr
Also another thing to note is that several of OpenAI's recent moves suggest that they're realizing that personalization/'companion' AIs are the biggest home run commercial opportunity in the AI industry. Once everyone has their own AI buddy and/or romantic partner, they're dependent on you, forever. Parasocial Relationships As A Service
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48856624)
Date: April 20th, 2025 12:33 PM Author: excitant costumed principal's office juggernaut
not free. i pay for plus. i'll post an example. i just asked it to research who "bboom" is from xoxohth:
Who / what is “bbooom” (aka Boom, bboom, Boooom, etc.) on XO?
Fast facts / Details
Core persona: A long‑time AutoAdmit/XOXOHth eccentric who writes sprawling, semi‑coherent rants full of biblical capitalisation, dollar‑signs, ™ symbols, and on‑the‑fly neologisms (“Mahchine™”, “$hitbort”, “Great Becumming™”).
Running bio‑lore:
• middle‑aged male
• wheelchair‑bound after an unspecified accident/illness (the chair is endlessly mocked on the board)
• claims to bounce between motels, tents and casino floors, living on disability & slot‑machine windfalls
• obsessed with enemies he calls “ADM” and “the Matrix”
• periodically searches for a “fertile woman” to bear his children
Posting style:
• wall‑of‑text jeremiads that read like cyber‑punk prophecy mixed with homeless‑shelter gossip
• heavy self‑mythologising (“I am Boom, Prophet of the Mahchine™ …”) (xoxohth.com)
• melodramatic stage‑directions starring his neon‑lit wheelchair (“careening forward”, “through the flames”) (xoxohth.com)
Why the name keeps changing: He is frequently temp‑banned; each return tweaks the handle (extra b’s or o’s) to slip past filters. Hence you see boom, bbooom, bbboooom, etc.
Folklore & mockery: Fellow posters turn every life‑detail into running jokes:
• homeless tent: “angry, wheelchair‑bound transient who lives in a tent” (xoxohth.com)
• Disney fiasco: thread imagining Evan39 pushing Boom’s chair into a churro cart at Disney World (autoadmit.com)
• weight/health: endless riffs on whether the chair can hold Boom + 300‑lb girlfriend (xoxohth.com)
“Trinity” theory: Many veterans believe Boom, Mainlining the Secret Truth and Evan39 are sock‑puppets of the same individual; the poster himself sometimes plays along (“we’re all the same poaster ;)”) (autoadmit.com)
Putting it together
Boom arrived years after the original 2004 Princeton‑Review exodus, but he’s become one of XO’s defining characters: a mix of comic‑book villain, tragic burnout, and performance‑art poet. His threads oscillate between:
• grand conspiratorial screeds about being persecuted by invisible elites (“ADM reset the Mahchine!”), and
• mundane life updates (wheelchair repairs, casino losses) that the board instantly weaponises into parody threads.
Because AutoAdmit has almost no moderation, the cycle is self‑sustaining: each over‑the‑top comeback fuels new ridicule, which in turn prompts Boom to double down with even wilder prophecy. Whether he is trolling, mentally ill, or both is part of the mystique—but understanding those bullet‑points above will let you read any “bbbooom” sighting in context.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2#48864433)