New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

08/29/2025

GLM 4.5 Evaluated!

There’s been speculation that open-source models from China have overtaken U.S. models, so we put another China-based model to the test. We found that Z.ai’s GLM 4.5 model still doesn’t beat the top U.S. open-source models.

That said, for an open-source model it delivers solid top-twenty results on AIME (#5/51), GPQA (#16/53), MMLU Pro (#15/51), LiveCodeBench (#15/53), and our own CaseLaw benchmark (#20/27).

Beyond those highlights, its performance is fairly standard. Compared directly to U.S. open-source peers, GLM 4.5 does perform better than some models, such as Llama 4 Maverick, but it is still outperformed by GPT OSS 120B across nearly every benchmark.

GLM 4.5 still has plenty of room for improvement. We look forward to seeing how open-source models continue to progress, but for now there is a long way to go.

View GLM 4.5 Results

08/27/2025

Grok Code Evaluated on Coding Benchmarks!

We evaluated xAI’s Grok Code Fast on three of our coding benchmarks and found it to be much faster (and cheaper) for practical coding tasks, but significantly worse overall than xAI’s flagship model Grok 4. Our findings are below:

Grok Code Fast is a snappier (and cheaper) model optimized for coding. While it leaves significant room for improvement relative to other frontier models, including Grok 4, it performs competitively on practical coding tasks while offering real benefits in latency and cost.

View Grok Code Results

08/26/2025

GPT 5 Evaluated on SWE-Bench!

GPT 5 achieved the highest overall accuracy on SWE-Bench, attaining an impressive 68.8%!

The released results come from running the model with the following settings (a hedged example request is sketched after the list):

  • High reasoning effort
  • Default verbosity
  • The new Responses API endpoint

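For context, a request with those settings might look like the minimal sketch below, using the OpenAI Python SDK's Responses API. The exact parameter values (reasoning effort, text verbosity) and the prompt are illustrative assumptions, not our evaluation harness.

```python
# Minimal sketch (assumed settings, not the exact Vals AI harness):
# GPT-5 called through the Responses API with high reasoning effort
# and default ("medium") verbosity.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "high"},    # "high reasoning" setting
    text={"verbosity": "medium"},    # default verbosity
    input="Repository task description goes here",  # hypothetical SWE-Bench-style prompt
)

print(response.output_text)  # convenience accessor for the text output
```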
Evaluated on 500 benchmark instances spanning all four difficulty-based task categories, GPT 5 ranked first in every category except the “>4 hours” group, where it was one of four models tied at a 33% completion rate on the most challenging tasks.

These results demonstrate that GPT 5 represents a significant advancement over previous OpenAI models.

View SWE-Bench Results

Latest Benchmarks

View All Benchmarks

Latest Model Releases

View All Models

Grok Code Fast

Release date: 8/25/2025

View Model

GPT 5 Nano

Release date: 8/7/2025

View Model

GPT 5 Mini

Release date: 8/7/2025

View Model

GPT 5

Release date: 8/7/2025

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.