All corrections
1
Claim
[BaxBench](https://arxiv.org/pdf/2502.11844), the ETH Zurich benchmark evaluating LLMs on backend code generation, found that 62% of solutions generated by even the best models are either functionally incorrect or contain security vulnerabilities.
Correction

BaxBench reports that the best model reaches 62% functional correctness, not that 62% of its solutions are incorrect or insecure. Combining the paper's headline figures implies that roughly 69% are either incorrect or exploitable, not 62%.

Full reasoning
The BaxBench paper’s abstract states that “even the best model, OpenAI o1, achieves a mere **62% on code correctness**,” and that “on average, we could successfully execute security exploits on around **half of the correct programs** generated by each LLM.” So 62% is the *functional correctness rate*, not the share of outputs that are “functionally incorrect or contain security vulnerabilities.”

If ~62% are functionally correct and ~half of those correct programs are exploitable, then approximately:

- 38% are functionally incorrect (100% - 62%), and
- ~31% are correct-but-exploitable (about half of the 62% correct).

That totals ~69% that are either incorrect or exploitable, **not 62%**. This mismatch is a straightforward arithmetic/interpretation error relative to the benchmark’s own summary statistics.
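The arithmetic behind the ~69% figure can be reproduced with a short Python sketch. It assumes, as the reasoning above does, that the cross-model average exploit rate ("around half of the correct programs") also applies to the best model; the paper's model-specific figure may differ, so treat the result as an estimate. The function and variable names are illustrative and do not come from BaxBench.

```python
def share_incorrect_or_exploitable(correct_rate: float,
                                   exploitable_share_of_correct: float) -> float:
    """Estimate the fraction of solutions that are incorrect or exploitable.

    correct_rate: fraction of solutions that are functionally correct.
    exploitable_share_of_correct: fraction of the correct solutions that
        could be successfully exploited (assumed, not model-specific).
    """
    incorrect = 1.0 - correct_rate
    correct_but_exploitable = correct_rate * exploitable_share_of_correct
    return incorrect + correct_but_exploitable


# Headline figures quoted from the abstract: 62% correct, ~half exploitable.
estimate = share_incorrect_or_exploitable(0.62, 0.5)
print(f"Incorrect or exploitable: {estimate:.0%}")
```

With the quoted inputs this gives 0.38 + 0.31 = 0.69, which is the ~69% used in the correction.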
Model: OPENAI_GPT_5 Prompt: v1.15.0