Updates
Model
03/26/2025
DeepSeek V3 evaluated on all benchmarks!
We just evaluated DeepSeek V3 on all benchmarks!
- DeepSeek V3 is DeepSeek’s latest model. It boasts speeds of 60 tokens/second, is claimed to be 3x faster than V2, and reaches an average accuracy of 73.9% (4.2% better than previous versions).
- DeepSeek V3 performs comparably to, and slightly better than, Claude 3.7 Sonnet (71.7%).
- The model demonstrates strong legal capabilities, scoring particularly well on CaseLaw and LegalBench, though it scores lower on ContractLaw.
- It shows impressive academic versatility with top-tier performance on MGSM, Math500, and MedQA.
Benchmark
03/26/2025
New Multimodal Benchmark: MMMU Evaluates Visual Reasoning Across 30 Subjects
Today, we’re releasing results from the Massive Multi-discipline Multimodal Understanding benchmark (MMMU), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.
- o1 achieved the highest overall accuracy at 77.7%, surpassing the lowest-scoring human experts (76.2%).
- Claude 3.7 Sonnet (Thinking) delivers performance nearly identical to o1 at a more favorable price point.
- Even the best models remain well below the performance of the best human experts (88.6%), highlighting opportunities for further advancement.
Model
03/24/2025
Command A evaluated on all benchmarks!
We just evaluated Command A on all benchmarks!
- Command A is Cohere’s most efficient and performant model to date, specializing in agentic AI, multilingual tasks, and human evaluations for real-life use cases.
- On our proprietary benchmarks, Command A shows mixed performance, ranking 23rd out of 28 models on TaxEval but a respectable 10th out of 22 models on CorpFin.
- The model performs better on some academic benchmarks, scoring 78.7% on LegalBench (9th place) and 86.8% on MGSM (13th place).
- However, it struggles with AIME (13.3%, 12th place) and GPQA (29.3%, 18th place).
Latest Model Releases
Anthropic Claude 3.7 Sonnet (Thinking)
Release date: 2/19/2025
Anthropic Claude 3.7 Sonnet
Release date: 2/19/2025
OpenAI O3 Mini
Release date: 1/31/2025
DeepSeek R1
Release date: 1/20/2025