
Kimi K2 Instruct

Release Date: 7/11/2025
Avg. Accuracy: 66.0%
Latency: 215.33s

Performance by Benchmark

Benchmark        Accuracy    Rank
CorpFin          62.2%       17 / 44
TaxEval          74.1%       26 / 61
Math500          94.2%       10 / 55
AIME             62.7%       19 / 51
MGSM             90.9%       22 / 54
LegalBench       80.5%       17 / 76
MedQA            84.0%       32 / 57
GPQA             71.5%       17 / 53
LiveCodeBench    70.4%       12 / 53
IOI              1.3%        11 / 12
SWE-bench        34.2%       11 / 15
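
The 66.0% headline figure appears to be the unweighted mean of the eleven per-benchmark scores. A quick check (scores copied from the table above; the script itself is ours, not part of the benchmark harness):

```python
# Unweighted mean of the per-benchmark accuracies reported above.
scores = {
    "CorpFin": 62.2, "TaxEval": 74.1, "Math500": 94.2, "AIME": 62.7,
    "MGSM": 90.9, "LegalBench": 80.5, "MedQA": 84.0, "GPQA": 71.5,
    "LiveCodeBench": 70.4, "IOI": 1.3, "SWE-bench": 34.2,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 1))  # 66.0 — matches the headline average
```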

The table mixes academic benchmarks with proprietary benchmarks (contact us to get access to the proprietary sets).

Cost Analysis

Input Cost: $1.00 / M tokens
Output Cost: $3.00 / M tokens
Input Cost (per char): $0.40 / M chars
Output Cost (per char): $1.35 / M chars
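
The per-character rates imply roughly 2.5 characters per token at these prices ($1.00 per M tokens vs. $0.40 per M chars). A minimal sketch for estimating request cost from the per-token rates above (the helper function is ours, for illustration only):

```python
# Per-token rates for Kimi K2 Instruct from the cost table above.
INPUT_COST_PER_M_TOKENS = 1.00   # USD per million input tokens
OUTPUT_COST_PER_M_TOKENS = 3.00  # USD per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens * INPUT_COST_PER_M_TOKENS
            + output_tokens * OUTPUT_COST_PER_M_TOKENS) / 1_000_000

# Example: a 50k-token prompt with a 2k-token completion.
print(round(estimate_cost(50_000, 2_000), 4))  # 0.056
```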

Overview

Kimi K2 Instruct is a mixture-of-experts model from Moonshot AI with 1 trillion total parameters (32B activated per token), released in July 2025. It succeeds Moonshot AI's earlier Kimi models.

Key Specifications

  • Context Window: 128,000 tokens
  • Model Size: 1T total parameters (32B activated, mixture-of-experts)
  • Max Output Tokens: 32,768
  • Cost: $1.00 input/$3.00 output per million tokens (Together.ai)
  • Cost: $0.60 input/$2.50 output per million tokens (Fireworks.ai)

Performance Highlights

Math Excellence: Achieves 94.2% accuracy on Math500, making it one of the top-performing open-source models for mathematical reasoning tasks

Cost-Effective: At $1.00/$3.00 per million tokens, Kimi offers competitive pricing compared to similar models like DeepSeek-R1 ($3.00/$7.00)

Strong on Public Benchmarks: Does well on public benchmarks like Math500, though struggles more on proprietary legal and financial tasks

Low Cost: Low per-token pricing makes long, token-heavy tasks more affordable

Important Notes

  • Noticeable difficulty making tool calls; it often formats them incorrectly
  • At temperature 0, Kimi K2 Instruct responses contained clear signs of errors, often repeating the same sentences within a single response
  • These issues were consistent across both Together.ai and Fireworks.ai

Example output from a matplotlib task, where the model repeats itself multiple times within a single response:

Let me look more carefully at the issue.
The issue is that when interactive=True, the axes limits incorrectly include 0.
The issue mentions that this worked correctly in Matplotlib 3.4 with span_stays.
Let me look at the Rectangle creation and see if we need to make it invisible initially or position it differently.
Let me look at the Rectangle creation and see if we need to change its position or visibility.
Let me look at the issue more carefully.
The issue is that when interactive=True, the axes limits incorrectly include 0.
The issue mentions that this worked correctly in Matplotlib 3.4 with span_stays.
Let me look at the Rectangle creation and see if we need to make it invisible initially or position it differently.
Let me look at the issue more carefully.
.....
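
Repetition like the above is easy to flag programmatically. A minimal sketch of a duplicate-line check (our own heuristic, not part of any benchmark harness):

```python
from collections import Counter

def repeated_lines(response: str, threshold: int = 2) -> dict:
    """Return non-empty lines that appear at least `threshold` times."""
    counts = Counter(
        line.strip() for line in response.splitlines() if line.strip()
    )
    return {line: n for line, n in counts.items() if n >= threshold}

sample = (
    "Let me look at the issue more carefully.\n"
    "The issue is that when interactive=True, the axes limits incorrectly include 0.\n"
    "Let me look at the issue more carefully.\n"
)
print(repeated_lines(sample))  # {'Let me look at the issue more carefully.': 2}
```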

Kimi K2 Instruct attempting to make two tool calls inside the same tool-call field:

choices=[
  Choices(
    finish_reason='tool_calls',
    index=0,
    message=Message(
      content="Let's modify the URLValidator __call__ method to include the value in the params. Let's look at the specific places where ValidationError is raised:",
      role='assistant',
      tool_calls=[
        ChatCompletionMessageToolCall(
          index=0,
          name=None,
          function=Function(
            arguments='{"search": "raise ValidationError(self.message, code=self.code)", "replace": "raise ValidationError(self.message, code=self.code, params={\'value\': value})", "replace_all": true}',
            name='edit'
          ),
          id='functions.edit:0',
          type='function'
        )
      ],
      function_call=None,
      provider_specific_fields=None
    )
  )
]
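
Malformed calls like this can be caught before dispatch by checking that each tool call's `arguments` string is a single well-formed JSON object. A minimal sketch (the function name is ours, not from any provider SDK):

```python
import json
from typing import Optional

def parse_tool_arguments(arguments: str) -> Optional[dict]:
    """Parse a tool call's `arguments` string; return None if it is not
    a single well-formed JSON object (e.g. two calls mashed together)."""
    try:
        parsed = json.loads(arguments)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

print(parse_tool_arguments('{"search": "foo", "replace": "bar"}'))
print(parse_tool_arguments('{"a": 1}{"b": 2}'))  # None: two objects concatenated
```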