Updates
Model
04/15/2025
GPT 4.1, 4.1 Mini, and 4.1 Nano evaluated on all benchmarks!
We just evaluated GPT 4.1, GPT 4.1 Mini, and GPT 4.1 Nano on all benchmarks!
- GPT 4.1 delivers impressive results with a 75.5% average accuracy across benchmarks.
- Impressive performance on proprietary benchmarks! GPT 4.1 is now the leader on CorpFin (71.2%) and shows strong performance on CaseLaw (85.8%, 4/53) and MMLU Pro (80.5%, 6/33).
- GPT 4.1 Nano and GPT 4.1 Mini bring AI to time-sensitive applications with outstanding latencies of only 3.62s and 6.60s, respectively, while still achieving 59.1% and 75.1% average accuracy.
- Compact but capable! Despite its size, GPT 4.1 Mini performs admirably on Math500 (88.8%, 10/36) and MGSM (87.9%, 20/34).
- Size versus performance tradeoff: the smaller models do show lower performance on some complex tasks, with GPT 4.1 Nano ranking near the bottom on MMLU Pro (62.3%, 30/33) and MGSM (69.8%, 32/34).
Model
04/11/2025
Grok 3 Beta and Mini Beta (High and Low Reasoning) evaluated on all benchmarks!
We just evaluated Grok 3 Beta, Grok 3 Mini Fast Beta (High Reasoning), and Grok 3 Mini Fast Beta (Low Reasoning) on all benchmarks!
- Grok 3 Beta delivers impressive results with a 78.1% average accuracy across benchmarks and a snappy 15.52s latency.
- Dominates proprietary benchmarks! Grok 3 Beta ranks #1 on three key benchmarks: CorpFin (69.1%), CaseLaw (88.1%), and TaxEval (78.8%).
- Grok 3 Mini Fast Beta (High Reasoning) surprises with an even higher average accuracy of 81.6% despite being a smaller model.
- Mathematical prowess! Grok 3 Mini Fast Beta (High Reasoning) takes the #2 place on Math500 (94.2%) and the #3 place on AIME (85.0%).
Model
04/07/2025
Llama 4 Maverick and Llama 4 Scout evaluated on all benchmarks!
We just evaluated Llama 4 Maverick and Llama 4 Scout on all benchmarks!
- Llama 4 Scout achieves an average accuracy of 61.5% with a latency of 7.13 seconds, placing the model in a tie with Mistral Small 3.1 (03/2025) (61.5%) and just behind Cohere’s Command A (63.5%).
- Llama 4 Maverick sits at 67.0% accuracy with a latency of 7.72 seconds, ranking just behind Anthropic’s Claude 3.5 Sonnet (69.9%) and DeepSeek V3 (03/24/2025) (74.7%).
- Both models excel on public benchmarks, with Maverick achieving top rankings in MMMU (4/17), MGSM (4/28), GPQA (5/27), and MMLU Pro (5/27), while Scout delivers strong results in MMMU (10/17), MortgageTax (10/18), and AIME (11/26).
- However, these models show a significant gap between their impressive public benchmark performance and mediocre results on private benchmarks, particularly struggling with TaxEval (Maverick: 28/34, Scout: 32/34), Contract Law (Maverick: 37/54, Scout: 43/54), and MedQA (Maverick: 32/32, Scout: 30/32).
Latest Model Releases
Anthropic Claude 3.7 Sonnet (Thinking)
Release date: 02/24/2025
Anthropic Claude 3.7 Sonnet
Release date: 02/24/2025
OpenAI O3 Mini
Release date: 01/31/2025
DeepSeek R1
Release date: 01/20/2025