LessWrong February 26, 2026 at 03:53 PM

www.lesswrong.com/posts/4mvphwx5pdsZLMmpY/recent-ai-model-progress-feels-mostly-...

1 correction found

1
Claim
but which nevertheless has not done a publicized evaluation of either Claude 3.7 Sonnet, or DeepSeek, or o3-mini.
Correction

ARC Prize has published evaluations that include both DeepSeek and o3-mini, as well as ARC-AGI results for DeepSeek R1, so it is not true that it has not publicized evaluations of these models.

Full reasoning
The post claims the ARC Prize (ARC-AGI prize) "has not done a publicized evaluation" of any of: Claude 3.7 Sonnet, DeepSeek, or o3-mini. However, ARC Prize has publicly published:
1) **A public ARC Prize evaluation that explicitly includes o3-mini (and DeepSeek)**: In its Feb 14, 2025 post introducing SnakeBench, ARC Prize writes that it tested 50 LLMs "from Haiku to o3-mini" and also highlights a match "by o3-mini and DeepSeek". This is a publicized evaluation by ARC Prize that includes **o3-mini** (and **DeepSeek**).
2) **A public ARC-AGI evaluation that includes DeepSeek**: In its Jun 5, 2025 post "We tested every major AI reasoning system. There is no clear winner.", ARC Prize publishes ARC-AGI-1 and ARC-AGI-2 results tables that include **DeepSeek R1** scores.
So, regardless of whether ARC Prize has (or hasn’t) published ARC-AGI results for Claude 3.7 Sonnet specifically, the statement that they have not publicized evaluations of **DeepSeek** or **o3-mini** is contradicted by ARC Prize’s own published posts.
2 sources
Model: OPENAI_GPT_5 Prompt: v1.11.0