Substack March 3, 2026 at 04:58 PM

claude-code-security-a-reasoned-take?r=34lihb&utm_medium...

1 correction found

Claim

[BaxBench](https://arxiv.org/pdf/2502.11844), the ETH Zurich benchmark evaluating LLMs on backend code generation, found that 62% of solutions generated by even the best models are either functionally incorrect or contain security vulnerabilities.

Correction

BaxBench reports that the best model reaches 62% functional correctness, not that 62% of its solutions are incorrect or insecure. Combining the paper's two headline results implies that about 69% are either incorrect or exploitable, not 62%.

Full reasoning

The BaxBench paper’s abstract states that “even the best model, OpenAI o1, achieves a mere **62% on code correctness**,” and that “on average, we could successfully execute security exploits on around **half of the correct programs** generated by each LLM.” So 62% is the *functional correctness rate*, not the share of outputs that are “functionally incorrect or contain security vulnerabilities.”

If ~62% are functionally correct and ~half of those correct programs are exploitable, then approximately:

- 38% are functionally incorrect (100% - 62%), and
- ~31% are correct-but-exploitable (about half of the 62% correct)

That totals ~69% that are either incorrect or exploitable, **not 62%**. This mismatch is a straightforward arithmetic/interpretation error relative to the benchmark’s own summary statistics.
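
The arithmetic behind the ~69% figure can be reproduced with a minimal Python sketch. The function name `incorrect_or_exploitable` and its parameters are hypothetical, introduced only for illustration. The sketch also assumes that the abstract's "around half" can be taken as exactly 0.5 and that this cross-model average applies to the best model, so the result is an approximation rather than a number reported by the paper.

```python
def incorrect_or_exploitable(correct_rate: float, exploit_rate_among_correct: float) -> dict:
    """Split outputs into incorrect vs. correct-but-exploitable shares.

    correct_rate: fraction of generated programs that are functionally correct.
    exploit_rate_among_correct: fraction of the correct programs that were
        successfully exploited (assumed here, not measured per model).
    """
    incorrect = 1.0 - correct_rate
    correct_but_exploitable = correct_rate * exploit_rate_among_correct
    return {
        "incorrect": incorrect,
        "correct_but_exploitable": correct_but_exploitable,
        "incorrect_or_exploitable": incorrect + correct_but_exploitable,
    }


# 0.62 is the best-model correctness rate quoted from the abstract;
# 0.5 is an assumed stand-in for "around half of the correct programs".
shares = incorrect_or_exploitable(correct_rate=0.62, exploit_rate_among_correct=0.5)
for name, value in shares.items():
    print(f"{name}: {value:.0%}")
# incorrect: 38%
# correct_but_exploitable: 31%
# incorrect_or_exploitable: 69%
```

Because the exploit rate is only stated as "around half", the combined share should be read as roughly 69%, not as an exact figure.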

1 source

arXiv:2502.11844 — BaxBench: Can LLMs Generate Correct and Secure Backends?
Abstract: “...even the best model, OpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could successfully execute security exploits on around half of the correct programs generated by each LLM...”

Model: OPENAI_GPT_5 Prompt: v1.15.0