Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. At Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.

Benchmark

06/13/2025

SWE-bench results released

  • Foundation models still fail to solve real-world coding problems despite notable progress, highlighting remaining room for improvement.

  • The models’ performance drops significantly on “harder” problems that take >1 hour to complete. Only Claude Sonnet 4 (Nonthinking), o3, and GPT-4.1 pass any of the >4-hour tasks (33% each).

  • Claude Sonnet 4 (Nonthinking) leads by a wide margin with 65.0% accuracy, and maintains both excellent cost efficiency at $1.24 per test and fast completion times (426.52s).

  • Tool usage patterns reveal models employ distinct strategies. o4 Mini brute-forces problems (~25k searches per task), while Claude Sonnet 4 (Nonthinking) employs a leaner, balanced mix (~9-10k default tool calls with far fewer searches).
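Per-model tool-usage profiles like the ones above could be produced by tallying tool invocations from a run log. A minimal sketch, assuming a hypothetical `(model, tool)` log format rather than Vals AI's actual schema:

```python
# Tally tool invocations per model from a (model, tool) event log.
# The log records and tool names below are illustrative stand-ins.
from collections import Counter

log = [
    ("o4 Mini", "search"), ("o4 Mini", "search"), ("o4 Mini", "edit"),
    ("Claude Sonnet 4", "edit"), ("Claude Sonnet 4", "bash"),
    ("Claude Sonnet 4", "search"),
]

profiles: dict[str, Counter] = {}
for model, tool in log:
    profiles.setdefault(model, Counter())[tool] += 1

for model, counts in profiles.items():
    print(model, dict(counts))
# o4 Mini {'search': 2, 'edit': 1}
# Claude Sonnet 4 {'edit': 1, 'bash': 1, 'search': 1}
```

Comparing counts per tool (searches vs. edits vs. shell commands) is what distinguishes a brute-force search strategy from a leaner, balanced one.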

Note that we run every model through the same evaluation harness to make direct comparisons between models, so the scores show relative performance, not each model’s best possible accuracy.
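The shared-harness idea can be sketched in a few lines (this is an illustrative toy, not Vals AI's actual code): every model answers the identical task list and is scored by the identical grading rule, so the resulting accuracies are directly comparable.

```python
# Minimal sketch of a uniform evaluation harness: same tasks, same
# grading function for every model, so scores measure relative performance.
from typing import Callable

def evaluate(models: dict[str, Callable[[str], str]],
             tasks: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Return (model, accuracy) pairs, best first, under one grading rule."""
    scores = []
    for name, ask in models.items():
        correct = sum(ask(prompt).strip() == answer for prompt, answer in tasks)
        scores.append((name, correct / len(tasks)))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Toy stand-ins for model endpoints:
tasks = [("2+2", "4"), ("3*3", "9")]
models = {
    "model_a": lambda p: {"2+2": "4", "3*3": "9"}[p],
    "model_b": lambda p: "4",
}
print(evaluate(models, tasks))  # [('model_a', 1.0), ('model_b', 0.5)]
```

Because prompting and grading are held fixed, a model tuned with its own bespoke scaffold might score higher elsewhere; the harness trades that for apples-to-apples comparison.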

View Benchmarks Page

Benchmark

06/09/2025

LiveCodeBench: Models Struggle with Hard Competitive Programming Problems

Our results for LiveCodeBench are now live!

View Benchmark

Model

05/30/2025

Claude Opus 4 (Nonthinking) evaluated on our benchmarks; Sonnet 4 evaluated on the Finance Agent benchmark (FAB).

We’ve released our evaluation of Claude Opus 4 (Nonthinking) across our benchmarks!

We found:

  • Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle-of-the-road performance across most other benchmarks.
  • Compared to its predecessor (Opus 3), Opus 4 ranked higher on CaseLaw (#22 vs. #24/62) and LegalBench (#8 vs. #32/67) but scored notably lower on ContractLaw (#16 vs. #2/69).
  • Opus 4 is expensive, with an output cost of $75.00/M tokens, 5x as much as Sonnet 4; against o3 it costs 1.5x on input and nearly 2x on output ($15/$75 vs. $10/$40).
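As a quick arithmetic check on the pricing comparison in the last bullet (list prices per 1M tokens as quoted there):

```python
# Price ratios from the quoted per-1M-token list prices.
opus_in, opus_out = 15.00, 75.00   # Claude Opus 4 ($15 in / $75 out)
o3_in, o3_out = 10.00, 40.00       # o3 ($10 in / $40 out)

print(opus_in / o3_in)    # 1.5   (input: 1.5x o3)
print(opus_out / o3_out)  # 1.875 (output: nearly 2x o3)
print(opus_out / 5)       # 15.0  (implied Sonnet 4 output price, from the 5x figure)
```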

We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark (the last remaining benchmark for this model). They performed nearly identically to Claude 3.7 Sonnet (Nonthinking).

View Model Page

Latest Benchmarks

View All Benchmarks

Latest Model Releases

View All Models

Claude Sonnet 4 (Nonthinking)

Release date: 5/22/2025

View Model
Claude Opus 4 (Nonthinking)

Release date: 5/22/2025

View Model
Claude Sonnet 4 (Thinking)

Release date: 5/22/2025

View Model
Claude Opus 4 (Thinking)

Release date: 5/22/2025

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.