NIST CAISI evaluates DeepSeek V4 Pro as most capable Chinese model but 8 months behind US frontier
Tags: AI · Research · Infrastructure

NIST's Center for AI Standards and Innovation (CAISI) published a formal evaluation of DeepSeek V4 Pro, finding it the most capable PRC-developed AI model tested to date but trailing leading US frontier models by approximately 8 months in aggregate capability.

Using Item Response Theory (IRT) Elo scoring across five domains (cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics), DeepSeek V4 Pro scored ~800 Elo (±28) versus GPT-5.5 at 1,260 and Claude Opus 4.6 at 999. On public benchmarks the gap narrows: DeepSeek scored 90% on GPQA-Diamond (one point behind Opus 4.6's 91%) and 74% on SWE-Bench Verified versus GPT-5.5's 81%. DeepSeek was also more cost-efficient than GPT-5.4 mini on 5 of 7 benchmarks.

The evaluation used non-public benchmarks, including CTF-Archive-Diamond (285 cybersecurity challenges) and PortBench. CAISI pre-committed to its benchmark suite before seeing results.
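For readers unfamiliar with IRT-based Elo scoring: the core idea is that each benchmark item gets a difficulty rating on the same scale as model ability, and a model's probability of solving an item follows a logistic curve in the rating gap. CAISI's exact implementation is not described here, so the sketch below is a generic, hypothetical illustration of that relationship (the `k` factor, item ratings, and attempt data are all made up for the example), not CAISI's method.

```python
def expected_score(model_elo: float, item_elo: float) -> float:
    """Elo logistic: probability the model solves an item whose
    difficulty is expressed as an Elo rating on the same scale."""
    return 1.0 / (1.0 + 10 ** ((item_elo - model_elo) / 400.0))

def update(model_elo: float, item_elo: float, solved: int, k: float = 16.0) -> float:
    """Standard Elo update after one attempt: move the rating toward
    the observed outcome, weighted by how surprising it was."""
    return model_elo + k * (solved - expected_score(model_elo, item_elo))

# Hypothetical run: a model attempts items of varying difficulty;
# 1 = solved, 0 = failed. Ratings converge as attempts accumulate.
rating = 1000.0
attempts = [(800, 1), (1200, 0), (900, 1), (1100, 1)]
for item_difficulty, solved in attempts:
    rating = update(rating, item_difficulty, solved)
```

Under this framing, a ~460-point Elo gap (800 vs. 1,260) implies the lower-rated model would be expected to solve a head-to-head item of matched difficulty far less often than the higher-rated one, which is why aggregate Elo can show a wide spread even when individual benchmark percentages look close.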