Updates
Model
08/29/2025
GLM 4.5 Evaluated!
There’s been speculation that open-source models from China have overtaken U.S. models, so we put another China-based model to the test. We found that Z.ai’s GLM 4.5 model still doesn’t beat the top U.S. open-source models.
That said, for an open-source model it delivers solid top-twenty results on AIME (#5/51), GPQA (#16/53), MMLU Pro (#15/51), LiveCodeBench (#15/53), and our own CaseLaw benchmark (#20/27).
Beyond those highlights, the rest of its performance is fairly standard. When compared directly to U.S. open-source peers, GLM 4.5 does perform better than some models, such as Llama 4 Maverick, but it is still outperformed by GPT OSS 120B across nearly every benchmark.
GLM 4.5 clearly has room for improvement. We look forward to seeing how open-source models continue to progress, but for now there is still a long way to go.
View GLM 4.5 Results
Benchmark
08/27/2025
Grok Code Evaluated on Coding Benchmarks!
We evaluated xAI’s Grok Code Fast on three of our coding benchmarks and found it to be much faster (and cheaper) for practical coding tasks, though generally less capable than xAI’s flagship model, Grok 4. Our findings are below:
- Grok Code Fast scores 62% on LiveCodeBench (LCB), placing it in the middle of the pack, comparable to models like Claude Sonnet 4 (Nonthinking), but at a tenth of the price.
- On IOI, Grok Code Fast scores 4.3%, placing it 8th out of 12. By contrast, Grok 4 scores 26.2% and places first overall!
- On SWE-bench, Grok Code Fast scores an impressive 57.6%, placing 4th, right behind Grok 4’s 58.6%, but with a latency of 264.68s compared to Grok 4’s 704.8s.
Grok Code Fast is a snappier (and cheaper) model optimized for coding. Our results show that while there is significant room for improvement relative to other frontier models, including xAI’s Grok 4, it performs competitively on practical coding tasks while offering real benefits in latency and cost.
View Grok Code Results
Benchmark
08/26/2025
GPT-5 Evaluated on SWE-Bench!
GPT 5 achieved the highest overall accuracy on SWE-Bench, attaining an impressive 68.8%!
The released results come from running the model with the following settings (a minimal API sketch follows the list):
- High reasoning
- Default verbosity
- The new Responses API endpoint
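For readers who want to set up a comparable run, here is a minimal sketch of how these settings map onto a call through the OpenAI Python SDK’s Responses API. This is not the benchmark harness itself: the prompt is a placeholder, and the model identifier and parameter names are assumptions based on the publicly documented SDK.

```python
# Minimal sketch (not the actual SWE-Bench harness) of calling GPT 5 with the
# settings listed above via the OpenAI Python SDK's Responses API.
# Assumptions: model id "gpt-5", reasoning/verbosity parameter names as in the
# public SDK docs, and a placeholder prompt standing in for a real SWE-Bench task.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5",                     # assumed model identifier
    reasoning={"effort": "high"},      # "High reasoning"
    text={"verbosity": "medium"},      # "Default verbosity"
    input="Resolve the failing test described in this repository issue ...",  # placeholder task prompt
)

print(response.output_text)  # the model's proposed patch / answer
```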
Evaluated across all four difficulty-based task categories and 500 benchmark instances, GPT 5 ranked first in every category except the “>4 hours” group, where it tied with three other models at a 33% completion rate on the most challenging tasks.
These results demonstrate that GPT 5 represents a significant advancement over previous OpenAI models.
View SWE-Bench Results
Latest Model Releases
Grok Code Fast
Release date: 8/25/2025
GPT 5 Nano
Release date: 8/7/2025
GPT 5 Mini
Release date: 8/7/2025
GPT 5
Release date: 8/7/2025