BridgeBench Claim That Claude Opus 4.6 Was ‘Nerfed’ Draws Criticism
BridgeBench reported that Claude Opus 4.6 fell from #2 to #10 on its hallucination leaderboard, with accuracy reported as dropping from 83.3% to 68.3%. Critics noted that the retest used a different task set and that scores on the overlapping tasks barely changed.
BridgeBench, the team behind a coding and hallucination benchmark, posted that Anthropic’s Claude Opus 4.6 fell from second to tenth place on its hallucination leaderboard after a retest. The post said accuracy had dropped from 83.3% to 68.3% and used the word “nerfed” to describe the change.
Computer scientist Paul Calcraft challenged the comparison, writing that the original 83.3% score came from six benchmark tasks while the retest covered 30. Calcraft posted that results on the six tasks common to both runs were 85.4% in the retest versus 87.6% previously, and that a single additional fabrication in one task accounted for much of the reported decline.
BridgeBench framed the change as a large surge in hallucinations. Critics pointed out two methodological issues: a change in the set of tasks used for comparison and the absence of repeated runs to estimate natural run-to-run variability. Large language models can produce different outputs across runs, so a single additional failure on a small sample can create a large percentage swing.
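The small-sample point can be illustrated with a short sketch. It assumes each scored item is an independent pass/fail check, which is a simplification, and the item counts (6, 30, 300) are hypothetical illustrations rather than BridgeBench's actual item counts, which the posts do not specify. The function names are likewise invented for this example.

```python
import math


def swing_from_one_failure(n_items: int) -> float:
    """Percentage-point drop in accuracy caused by one extra failure out of n_items."""
    return 100.0 / n_items


def binomial_standard_error(accuracy: float, n_items: int) -> float:
    """Standard error, in percentage points, of an accuracy estimated from
    n_items independent pass/fail checks (a simplifying assumption)."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n_items)


if __name__ == "__main__":
    # Hypothetical sample sizes, chosen only to show how the swing shrinks as n grows.
    for n in (6, 30, 300):
        swing = swing_from_one_failure(n)
        stderr = binomial_standard_error(0.85, n)
        print(f"n={n:>3}: one failure = {swing:.1f} pts, std. error at 85% = {stderr:.1f} pts")
```

Under these assumptions, a difference of a few points between two single runs on a small item set falls within the expected noise, which is why repeated runs are needed before attributing a change to the model itself.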
The debate follows user reports of perceived changes in Claude Opus 4.6 performance since its February 2026 release. Developers have reported shorter responses, weaker instruction-following and reduced reasoning depth during peak usage. Anthropic has introduced adaptive thinking controls that let the model adjust its reasoning effort; the company set the default effort level to medium to favor efficiency and added context compaction to reduce the chance of hitting context limits.
An independent usage analysis cited by critics examined more than 6,800 Claude Code sessions and found a roughly 67% decline in measured reasoning depth by late February. The same analysis reported a drop in the file-read ratio before edits, from 6.6 to 2.0, which the analysts said indicates fewer code review steps before proposing changes.
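For scale, the file-read ratio figures can be converted to a relative decline with simple arithmetic. The sketch below uses only the two values quoted above; the helper name is hypothetical, and the analysis itself may have computed its percentages differently.

```python
def relative_decline_percent(before: float, after: float) -> float:
    """Relative decline from `before` to `after`, expressed as a percentage of `before`."""
    return (before - after) / before * 100.0


# File-read ratio before edits, as quoted from the usage analysis: 6.6 -> 2.0.
decline = relative_decline_percent(6.6, 2.0)
print(f"File-read ratio decline: {decline:.1f}%")
```

By this calculation the file-read ratio fell by roughly 70%, a decline of similar size to the reported drop in reasoning depth.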
Based on the available test data, the two BridgeBench runs used different task sets and the scores on overlapping tasks were similar. BridgeBench’s public post and Calcraft’s response are part of a broader discussion about how benchmark methodology and product-level defaults can affect observed model behavior.
Anthropic had not issued a public response to the BridgeBench claim as of April 13, 2026.








