NIST CAISI Evaluates DeepSeek V4 Pro: Trails Leading US Models by 8 Months but Offers Superior Cost Efficiency
Tags: Research · AI · Models

NIST's Center for AI Standards and Innovation (CAISI) published its evaluation of DeepSeek V4 Pro on May 2, 2026 — the most authoritative US government benchmark of a Chinese frontier model to date. DeepSeek V4 Pro performs similarly to GPT-5 (released roughly 8 months earlier), trailing leading US models such as GPT-5.5 and Opus 4.6. It is, however, cheaper to run than GPT-5.4 mini on 5 of 7 benchmarks, with per-benchmark costs ranging from 53% lower to 41% higher.

On specific benchmarks, DeepSeek V4 Pro scored 74% on SWE-Bench Verified (vs. 81% for GPT-5.5), 90% on GPQA-Diamond (vs. 96%), and 97% on OTIS-AIME-2025 (vs. 100%). The IRT-estimated Elo for DeepSeek V4 Pro is 800 ± 28, compared to 1260 ± 28 for GPT-5.5. DeepSeek's self-reported evaluations claim parity with Opus 4.6 and GPT-5.4, but CAISI's non-public benchmarks show a meaningful gap. The evaluation covered cyber, software engineering, natural sciences, abstract reasoning, and mathematics domains.
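For readers unfamiliar with IRT-estimated ratings like the Elo figures above: item response theory fits a latent ability score from per-item pass/fail results, weighting hard items more than easy ones. The sketch below is purely illustrative — the Rasch (one-parameter) model, the made-up item difficulties, and the linear Elo mapping are all assumptions, not CAISI's actual methodology.

```python
import math

def p_correct(theta: float, b: float) -> float:
    # Rasch (1-parameter IRT) model: probability that a model with
    # ability theta solves an item of difficulty b.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties, iters=50):
    # Newton-Raphson maximum-likelihood estimate of theta from
    # 0/1 responses and per-item difficulties.
    theta = 0.0
    for _ in range(iters):
        grad = sum(r - p_correct(theta, b)
                   for r, b in zip(responses, difficulties))
        hess = -sum(p_correct(theta, b) * (1.0 - p_correct(theta, b))
                    for b in difficulties)
        theta -= grad / hess
    return theta

# Hypothetical benchmark items (difficulties) and one model's record.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
responses    = [1, 1, 1, 1, 0, 1, 0]

theta = estimate_ability(responses, difficulties)
# Illustrative linear map from the IRT ability scale to an Elo-like
# scale; the anchor (1000) and slope (400) are assumptions.
elo = 1000 + 400 * theta
```

The key property this illustrates is that two models with the same raw pass rate can receive different ability estimates if one passed harder items, which is why IRT-based Elo can diverge from simple benchmark percentages.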