www.resilientcyber.io/p/claude-code-security-a-reasoned-take?r=34lihb&utm_medium...
1 correction found
1
Claim
[BaxBench](https://arxiv.org/pdf/2502.11844), the ETH Zurich benchmark evaluating LLMs on backend code generation, found that 62% of solutions generated by even the best models are either functionally incorrect or contain security vulnerabilities.
Correction
BaxBench reports that the best model reaches 62% functional correctness, not that 62% of its solutions are incorrect or insecure. Combining the paper's headline figures implies that about 69% are either incorrect or exploitable, not 62%.
Full reasoning
The BaxBench paper’s abstract states that “even the best model, OpenAI o1, achieves a mere **62% on code correctness**,” and that “on average, we could successfully execute security exploits on around **half of the correct programs** generated by each LLM.”
So 62% is the *functional correctness rate*, not the share of outputs that are “functionally incorrect or contain security vulnerabilities.” If ~62% are functionally correct and ~half of those correct programs are exploitable (the abstract gives this as an average across LLMs, so applying it to the best model is an approximation), then approximately:
- 38% are functionally incorrect (100% - 62%), and
- ~31% are correct-but-exploitable (about half of the 62% correct)
Together, ~69% of outputs are either incorrect or exploitable, **not 62%**.
This mismatch is a straightforward arithmetic/interpretation error relative to the benchmark’s own summary statistics.
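The arithmetic above can be reproduced with a minimal sketch. The two input figures come from the abstract quoted above. The function name `incorrect_or_exploitable` is hypothetical, and using the cross-model "around half" exploit rate for the best model is an assumption, so the result is a rough estimate and not a number reported by the paper.

```python
# Headline figures quoted from the BaxBench abstract.
CORRECT_RATE = 0.62               # best model's functional correctness
EXPLOITABLE_GIVEN_CORRECT = 0.5   # "around half" of correct programs; a cross-model
                                  # average, assumed here to hold for the best model


def incorrect_or_exploitable(correct_rate: float, exploitable_given_correct: float) -> float:
    """Share of outputs that are functionally incorrect or correct but exploitable."""
    incorrect = 1.0 - correct_rate
    correct_but_exploitable = correct_rate * exploitable_given_correct
    return incorrect + correct_but_exploitable


share = incorrect_or_exploitable(CORRECT_RATE, EXPLOITABLE_GIVEN_CORRECT)
print(f"incorrect:               {1.0 - CORRECT_RATE:.0%}")
print(f"correct but exploitable: {CORRECT_RATE * EXPLOITABLE_GIVEN_CORRECT:.0%}")
print(f"incorrect or exploitable: {share:.0%}")
```

Because the two categories are disjoint (a program is either incorrect, or correct and exploitable), their shares can simply be added.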
1 source
- arXiv:2502.11844 — BaxBench: Can LLMs Generate Correct and Secure Backends?
Abstract: “...even the best model, OpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could successfully execute security exploits on around half of the correct programs generated by each LLM...”