LessWrong February 26, 2026 at 03:53 PM

recent-ai-model-progress-feels-mostly-...

1 correction found

Claim

but which nevertheless has not done a publicized evaluation of either Claude 3.7 Sonnet, or DeepSeek, or o3-mini.

Correction

ARC Prize has published evaluations involving both DeepSeek and o3-mini, including ARC-AGI results for DeepSeek, so it is not true that it has not publicized evaluations of these models.

Full reasoning

The post claims that ARC Prize (the ARC-AGI prize) "has not done a publicized evaluation" of Claude 3.7 Sonnet, DeepSeek, or o3-mini. However, ARC Prize has published the following:

1) **A public ARC Prize evaluation that explicitly includes o3-mini (and DeepSeek)**: In its Feb 14, 2025 post introducing SnakeBench, ARC Prize writes that it tested 50 LLMs "from Haiku to o3-mini" and also highlights a match "by o3-mini and DeepSeek". This is a publicized evaluation by ARC Prize that includes **o3-mini** (and **DeepSeek**).

2) **A public ARC-AGI evaluation that includes DeepSeek**: In its Jun 5, 2025 post "We tested every major AI reasoning system. There is no clear winner.", ARC Prize publishes ARC-AGI-1 and ARC-AGI-2 results tables that include **DeepSeek R1** scores.

So, regardless of whether ARC Prize has published ARC-AGI results for Claude 3.7 Sonnet specifically, the statement that it has not publicized evaluations of **DeepSeek** or **o3-mini** is contradicted by ARC Prize's own published posts.

2 sources

ARC Prize Side Quest: SnakeBench (Published 14 Feb 2025)
ARC Prize: “We tested 50 LLMs — from Haiku to o3-mini — to see which were the best at battling snakes.” and “Top SnakeBench Match by o3-mini and DeepSeek.”
We tested every major AI reasoning system. There is no clear winner. (Published 05 Jun 2025)
ARC Prize publishes ARC-AGI score tables that include “DeepSeek R1” entries under ARC-AGI-1 and ARC-AGI-2 scores.

Model: OPENAI_GPT_5 Prompt: v1.11.0