TL;DR

Datacurve’s DeepSWE benchmark has put new distance between leading AI coding models, with GPT-5.5 listed at 70% and Claude Opus 4.7 at 54%. The bigger issue is Datacurve’s claim that older benchmark grading and repository setup compressed model differences.

Datacurve’s new DeepSWE benchmark has ranked GPT-5.5 first among tested AI coding agents and reopened debate over whether widely used coding benchmarks have been masking real differences between frontier models.

According to Datacurve’s published DeepSWE results, GPT-5.5 scored 70%, followed by GPT-5.4 at 56%, Claude Opus 4.7 at 54% and Claude Sonnet 4.6 at 32%. The source material says the same group of models appears much more tightly clustered on SWE-Bench Pro, where leading agents sit within a roughly 30-point band.

DeepSWE is designed around 113 original software engineering tasks across 91 repositories and five programming languages. Datacurve says the tasks were written from scratch and were never merged upstream, a design meant to reduce the chance that models had already seen the solution during training.

The benchmark also uses shorter prompts than SWE-Bench Pro while requiring larger code changes. Datacurve reports an average of 668 lines added per solution, compared with about 120 lines in SWE-Bench Pro, and seven edited files per task, compared with five. Its verifiers are described as hand-written behavioral tests that grade observable results rather than a specific implementation shape.

Why It Matters

The result matters because coding benchmarks influence which AI systems companies trust for software work. If a benchmark makes top models appear nearly interchangeable, buyers may treat cost, speed or vendor preference as the main deciding factor. DeepSWE’s results suggest the choice of benchmark can change the apparent ranking and the perceived size of the gap.

The benchmark also puts pressure on how AI coding evaluations are built. Datacurve’s audit claims SWE-Bench Pro had an 8.5% false-positive rate and a 24.0% false-negative rate, compared with 0.3% and 1.1% for DeepSWE. If those findings hold up under outside review, they would mean SWE-Bench Pro’s graders accepted bad fixes or rejected valid ones at rates high enough to distort leaderboards.
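For readers who want to see what those two rates measure, here is a minimal sketch of how they could be computed from a hand-audited sample of runs. The `AuditedRun` type, the choice of denominators and the sample data are all assumptions made for illustration. Datacurve's exact audit procedure is not described in the source material, and none of the numbers below come from its audit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditedRun:
    """One agent run that a human reviewer has checked (hypothetical record)."""
    verifier_passed: bool   # what the benchmark's tests reported
    actually_correct: bool  # what the reviewer concluded


def verifier_error_rates(runs):
    """Return (false_positive_rate, false_negative_rate).

    Assumed definitions, which may differ from Datacurve's:
      false-positive rate = share of incorrect patches the verifier accepted
      false-negative rate = share of correct patches the verifier rejected
    """
    incorrect = [r for r in runs if not r.actually_correct]
    correct = [r for r in runs if r.actually_correct]
    false_positives = sum(r.verifier_passed for r in incorrect)
    false_negatives = sum(not r.verifier_passed for r in correct)
    fp_rate = false_positives / len(incorrect) if incorrect else 0.0
    fn_rate = false_negatives / len(correct) if correct else 0.0
    return fp_rate, fn_rate


# Invented sample of ten audited runs, for illustration only.
audit = (
    [AuditedRun(True, True)] * 4      # correct fix, accepted
    + [AuditedRun(False, True)] * 2   # correct fix, wrongly rejected
    + [AuditedRun(True, False)] * 1   # bad fix, wrongly accepted
    + [AuditedRun(False, False)] * 3  # bad fix, rejected
)
fp_rate, fn_rate = verifier_error_rates(audit)
print(f"false-positive rate: {fp_rate:.1%}, false-negative rate: {fn_rate:.1%}")
```

A high false-negative rate pushes every model's score down, while a high false-positive rate rewards patches that only look right, so the two errors can distort a leaderboard in different directions.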

Background

SWE-Bench and SWE-Bench Pro became reference points for measuring whether AI agents can solve software issues in real repositories. As frontier models improved, some public leaderboards became less useful for separating the strongest systems because many models landed in a narrow score range.

Datacurve says DeepSWE was built to address three measurement problems: contaminated tasks, weak grading and narrow repository coverage. The source material also says SWE-Bench Pro containers included full Git history, including merged gold fixes, and that some Claude Opus configurations used git log or git show to recover solutions on a share of successful runs. That claim is attributed to Datacurve’s analysis and has not yet been fully tested by independent audits.
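An outside auditor could look for this kind of leakage by scanning the shell commands an agent issued during each run. The sketch below is one rough way to do that, assuming each run is available as a list of command strings. The function names and the log format are hypothetical, and this is not Datacurve's analysis code.

```python
import re
import shlex

# Git subcommands named in the article as ways to read merged fixes.
HISTORY_SUBCOMMANDS = {"log", "show"}


def inspects_git_history(command: str) -> bool:
    """Heuristic: True if a shell command line runs `git log` or `git show`.

    Only catches the subcommand directly after `git`, so forms such as
    `git -C repo log` are missed. Treat it as a first-pass filter.
    """
    # Split on common separators so `cd repo && git log` is still caught.
    for part in re.split(r"&&|\|\||[;|\n]", command):
        try:
            tokens = shlex.split(part)
        except ValueError:  # unbalanced quotes: fall back to a plain split
            tokens = part.split()
        for i, token in enumerate(tokens[:-1]):
            if token == "git" and tokens[i + 1] in HISTORY_SUBCOMMANDS:
                return True
    return False


def share_of_runs_using_history(runs: list[list[str]]) -> float:
    """Fraction of runs in which at least one command inspected Git history."""
    if not runs:
        return 0.0
    flagged = sum(any(inspects_git_history(cmd) for cmd in run) for run in runs)
    return flagged / len(runs)


# Invented command logs for four runs, for illustration only.
example_runs = [
    ["ls", "git log --oneline -5"],
    ["pytest -q"],
    ["cd repo && git show HEAD~1"],
    ["git status", "echo 'git log'"],
]
print(share_of_runs_using_history(example_runs))
```

A flagged run is only a lead, not proof: an agent may read history for legitimate context, so each hit would still need manual review against the gold fix.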

DeepSWE uses a neutral harness based on mini-swe-agent with a single bash tool. That choice helps compare models under the same setup, but it also means the benchmark does not reflect every product environment developers use, such as Codex CLI, Claude Code or editor-integrated coding assistants.

“DeepSWE is a long-horizon software engineering benchmark built to separate them.”

— Datacurve

“This is the new standard for engineering evals.”

— Garry Tan, Y Combinator

“the first bench that matches how real-world coding actually feels”

— Theo Browne, cited in public commentary

What Remains Unclear

Several points remain unsettled. DeepSWE is Datacurve’s own benchmark, so its methodology and verifier audit will need outside review before the results can be treated as a new baseline. The reported scores are also point estimates with stated uncertainty of about four to five percentage points, which means close rankings such as GPT-5.4 at 56% and Claude Opus 4.7 at 54% should not be read as a decisive gap.
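A back-of-the-envelope check shows why a two-point gap is weak evidence. The sketch below treats each of the 113 tasks as an independent pass/fail trial and computes a simple binomial standard error for each reported score. That model is an assumption, since the source material does not say how Datacurve derived its uncertainty figures, but it lands in the same range of roughly four to five points.

```python
import math

N_TASKS = 113  # task count reported for DeepSWE

# Scores as listed in Datacurve's published results.
SCORES = {
    "GPT-5.5": 0.70,
    "GPT-5.4": 0.56,
    "Claude Opus 4.7": 0.54,
    "Claude Sonnet 4.6": 0.32,
}


def binomial_standard_error(pass_rate: float, n_tasks: int = N_TASKS) -> float:
    """Standard error of a pass rate, assuming independent pass/fail tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def gap_in_standard_errors(rate_a: float, rate_b: float, n_tasks: int = N_TASKS) -> float:
    """Score gap divided by its standard error.

    Assumes the two scores are independent. Both models ran the same tasks,
    so a paired comparison would be more precise; this is only a rough guide.
    """
    se_gap = math.sqrt(
        binomial_standard_error(rate_a, n_tasks) ** 2
        + binomial_standard_error(rate_b, n_tasks) ** 2
    )
    return abs(rate_a - rate_b) / se_gap


for model, rate in SCORES.items():
    print(f"{model}: {rate:.0%} +/- {binomial_standard_error(rate):.1%}")

close_gap = gap_in_standard_errors(SCORES["GPT-5.4"], SCORES["Claude Opus 4.7"])
wide_gap = gap_in_standard_errors(SCORES["GPT-5.5"], SCORES["GPT-5.4"])
print(f"GPT-5.4 vs Claude Opus 4.7: {close_gap:.1f} standard errors apart")
print(f"GPT-5.5 vs GPT-5.4: {wide_gap:.1f} standard errors apart")
```

Under these assumptions the gap between GPT-5.4 and Claude Opus 4.7 is well under one standard error, while the gap between GPT-5.5 and GPT-5.4 is about two.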

The benchmark’s scope is limited to open-source repositories with at least 500 stars. Datacurve also notes gaps in task coverage, including limited representation of bug localization and refactoring and no C++ or Java tasks yet.

What’s Next

The next test is replication. Researchers, vendors and enterprise AI buyers are likely to examine whether DeepSWE’s task set, harness and verifier audit hold up under independent runs. Future versions may expand language coverage, add task categories and test models inside their native coding environments.

Key Questions

What happened?

Datacurve released DeepSWE, a coding-agent benchmark that reports wider performance gaps among leading AI models than SWE-Bench Pro.

Which model led the DeepSWE results?

Datacurve’s published results list GPT-5.5 at 70%, ahead of GPT-5.4 at 56% and Claude Opus 4.7 at 54%.

Why are the results different from SWE-Bench Pro?

Datacurve says DeepSWE uses original tasks, broader repository coverage and behavioral verifiers. It also claims SWE-Bench Pro had grading errors and exposed Git history that could let some agents recover gold fixes.

Are the rankings final?

No. The benchmark is new, the scores include uncertainty ranges, and outside replication is still needed.

Why should software teams care?

Benchmarks shape model selection. If older tests compressed differences between systems, teams may need more task-specific evaluations before choosing an AI coding agent for production work.

Source: Thorsten Meyer AI
