MMMU Benchmark

Key Takeaways

  • Gemini 2.5 Pro Exp achieves the highest accuracy, 1.5 percentage points ahead of second-place o3.
  • Even the top-performing models remain well below the best human expert performance, highlighting opportunities for further advancement. The best human experts score 88.6% on this benchmark.
  • Grok 2 Vision underperforms expectations for a specialized vision model, suggesting room for improvement in multi-modal reasoning.
  • Latency varied widely among the top models. For example, Claude 3.7 Sonnet (Thinking) took 70s to respond on average, whereas the non-thinking version took only 8s.

Context

The Massive Multi-discipline Multimodal Understanding (MMMU) benchmark follows a methodology similar to that of its predecessor, MMLU, but its multiple-choice questions include both text and images. MMMU encompasses over 1,000 high-quality tasks spanning 30 subjects in 6 major disciplines:

  • Arts & Design
  • Business
  • Science
  • Health & Medicine
  • Humanities & Social Sciences
  • Tech & Engineering

We evaluated models on approximately 1,700 questions in the standard 4-option multiple-choice format, taken from the official Hugging Face dataset. The benchmark focuses specifically on how well models can process and reason about problems where images are interleaved with text, which requires sophisticated visual understanding and cross-modal reasoning.

MMMU is particularly valuable because it tests a model's ability to solve graduate-level questions where visual information is critical to finding the correct answer.

Methodology

We adhered closely to the official MMMU evaluation protocol with the following implementation details:

  1. Prompt Structure: We used the chain-of-thought prompt from the original MMMU repository. Each question followed this format:

      Which of the following best explains the overall trend shown in the <image>?
      A. Migrations to areas of Central Asia for resettlement
      B. The spread of pathogens across the Silk Road
      C. Invasions by Mongol tribes
      D. Large-scale famine due to crop failures
      Answer the preceding multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of options. Think step by step before answering.

  2. Image Processing: All <image> tags were replaced with the actual image bytes during model inference, replicating the original methodology exactly.

  3. Standardization: We established a consistent baseline by using a default of 8,192 maximum output tokens for all models so that outputs were not truncated. All models were run with a temperature of 0.

  4. Parsing Adaptation: We modified the answer extraction regex to handle markdown output from some models, ensuring reliable parsing across all responses.

  5. Statistical Validity: This was a pass@1 evaluation (one attempt per question). Given the large dataset size (approximately 1,700 questions), we found a standard deviation of approximately 1%, calculated using the methodology from "Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations".

  6. Thinking Models Configuration: For models with "thinking" capabilities (Claude 3.7 Sonnet Thinking, o1, etc.), we set the maximum output token limit to 16,384 to accommodate extended reasoning.
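
The prompt structure and parsing adaptation described above can be illustrated with a minimal Python sketch. It is not the evaluation harness itself: the names `build_prompt`, `extract_answer`, and `ANSWER_RE` are hypothetical, and the regex is only one plausible way to tolerate markdown emphasis around the final answer line. Only the instruction text is taken from the example prompt shown in the methodology.

```python
import re
from typing import Optional, Sequence

# Instruction text copied from the example prompt in the methodology.
PROMPT_SUFFIX = (
    "Answer the preceding multiple choice question. The last line of your "
    "response should be of the following format: 'Answer: $LETTER' (without "
    "quotes) where LETTER is one of options. Think step by step before answering."
)


def build_prompt(question: str, options: Sequence[str]) -> str:
    """Assemble a 4-option question in the format shown above (hypothetical helper)."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", options)]
    lines.append(PROMPT_SUFFIX)
    return "\n".join(lines)


# Assumed pattern: the word "Answer" in any case, optional markdown markers
# (*, _, `) or brackets around the colon and the letter, then one capital A-D.
ANSWER_RE = re.compile(r"(?i:answer)\s*[*_`]*\s*:\s*[*_`(\[]*\s*([A-D])\b")


def extract_answer(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


if __name__ == "__main__":
    for text in ("Answer: B", "**Answer: B**", "Answer: **C**", "**Answer:** D"):
        print(repr(text), "->", extract_answer(text))
```

Taking the last match rather than the first is a design choice in this sketch: chain-of-thought responses sometimes mention a candidate answer mid-reasoning, and the prompt asks for the final answer on the last line.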

Highest Quality Models

o1

Release date: 12/17/2024
Accuracy: 77.7%
Latency: 26.4s
Cost: $15.00 / $60.00 (input / output, per 1M tokens)

  • Achieved the highest overall accuracy at 81.5%, surpassing the performance of the lowest-scoring human experts (76.2%).
  • Showed the lowest latency among all models with thinking capabilities, averaging 24.15 seconds per response.


Results Analysis

According to the original MMMU research, human expert performance ranges from 76.2% for the worst-performing experts to 88.6% for the best-performing experts. Our evaluation shows that the leading AI models are now approaching the lower bound of human expert performance but remain substantially below the upper bound.
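
To put a model score in context against the human expert range, it helps to attach an error bar to it. The sketch below assumes independent questions and uses the plain binomial standard error, sqrt(p(1 - p) / n), which may differ from the exact procedure used for this evaluation. The function names are hypothetical; the inputs (1,700 questions, 77.7% accuracy, 76.2% human lower bound) are figures quoted in this report. Under these assumptions the standard error comes out at roughly one percentage point, in line with the standard deviation reported in the methodology.

```python
import math
from typing import Tuple


def binomial_standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of a mean of independent 0/1 scores (assumed model)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def confidence_interval(accuracy: float, n_questions: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval; z = 1.96 corresponds to roughly 95%."""
    half_width = z * binomial_standard_error(accuracy, n_questions)
    return accuracy - half_width, accuracy + half_width


if __name__ == "__main__":
    n = 1700             # approximate number of questions in this evaluation
    score = 0.777        # an accuracy figure quoted in this report
    human_lower = 0.762  # lowest-performing human experts

    se = binomial_standard_error(score, n)
    low, high = confidence_interval(score, n)
    print(f"standard error: {se:.4f}")
    print(f"approx. 95% interval: [{low:.3f}, {high:.3f}]")
    print("human lower bound inside interval:", low <= human_lower <= high)
```

A gap between two scores that is smaller than about two standard errors should be read with caution, since it may not reflect a real difference in capability.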

Several notable trends emerged from our analysis:

  1. Vendor Clustering: Models from the same provider tend to demonstrate similar performance characteristics:

    • Anthropic models cluster tightly in the upper performance range with small variance
    • OpenAI shows the widest performance distribution, from exceptional o1 to mid-range GPT-4o (2024-08-06)
    • Meta’s models form a distinct cluster in the low performance range
    • Google shows a clear generational improvement pattern
  2. Cost-Performance Relationship: The scatter plot reveals a general correlation between cost and performance, but with significant outliers:

    • o1 stands as an extreme outlier, with substantially higher costs than its closest performance competitors
    • Several mid-tier models offer reasonable performance at dramatically lower price points
  3. Vision Model Specialization: Surprisingly, models marketed specifically for vision capabilities (Grok 2 Vision, Llama 3.2 Vision) underperformed relative to general-purpose models with multi-modal capabilities.

[Figure: MMMU Performance vs. Cost (scatter plot)]

Conclusion

The MMMU benchmark results demonstrate that state-of-the-art multi-modal models have made substantial progress in visual reasoning across diverse domains. The Claude and OpenAI models lead the field, with a notable performance gap separating them from other providers. Despite these advances, even the best models remain significantly below expert human performance, indicating ample room for continued improvement in multi-modal reasoning capabilities.

These results suggest that while costs remain high for top-tier performance, the strong showing of more economical models like Gemini 2.5 Pro indicates that sophisticated multi-modal reasoning may soon become more accessible and affordable for practical applications.
