Updates
Model
03/13/2025
Jamba 1.6 Large and Mini Evaluated on All Benchmarks
We just evaluated Jamba 1.6 Large and Jamba 1.6 Mini models!
- Jamba 1.6 Large and Jamba 1.6 Mini are the latest versions of the open source Jamba models, developed by AI21 Labs.
- On our private benchmarks, Jamba 1.6 Large shows reasonable performance, placing 16th out of 27 models on TaxEval with 65.3% accuracy and beating GPT-4o Mini and Claude 3.5 Haiku.
- However, neither model is competitive on public benchmarks: they take the last two places on AIME and GPQA.
View Models Page
Benchmark
03/11/2025
Academic Benchmarks Released: GPQA, MMLU, AIME (2024 and 2025), Math 500, and MGSM
Today, we’ve released five new academic benchmarks on our site: three evaluating mathematical reasoning and two evaluating general question-answering.
Unlike the results that model providers report for these benchmarks, we applied a consistent methodology and prompt template across all models, ensuring an apples-to-apples comparison. You can find detailed information about our evaluation approach on each benchmark’s page:
View Benchmarks
Benchmark
03/05/2025
New Multimodal Mortgage Tax Benchmark Released
We just released a new benchmark in partnership with Vontive!
- The MortgageTax benchmark evaluates language models on extracting information from tax certificates.
- It tests multimodal capabilities with 1,258 document images, including both computer-written and handwritten content.
- The benchmark includes two key tasks: semantic extraction (identifying year, parcel number, county) and numerical extraction (calculating annualized amounts).
Claude 3.7 Sonnet leads the pack with 80.6% accuracy, and the rest of the top three are also Anthropic models.
View Benchmark
Latest Model Releases
Anthropic Claude 3.7 Sonnet (Thinking)
Release date: 2/19/2025
Anthropic Claude 3.7 Sonnet
Release date: 2/19/2025
OpenAI O3 Mini
Release date: 1/31/2025
DeepSeek R1
Release date: 1/20/2025