Key Takeaways
- Claude Sonnet 4.5 (Thinking) is the current top performer on Finance Agent, reaching a top score of 55.3%.
- In the last six months, we’ve seen significant improvement on the benchmark: the capability of LLMs to take on financial tasks is increasing.
- Many of the models struggled with tool use in general (especially the information retrieval tool), leading to inaccurate answers.
- On average, models performed best on the simple quantitative and qualitative retrieval tasks. These tasks are easy but time-intensive for finance analysts.
Background
The frontier of applied AI is agents – systems that independently direct their own processes to maintain control over how they accomplish tasks on behalf of users [1] [2]. As such, foundation model labs have invested heavily in developing agents that can handle complex tasks [3], making them prime candidates for delivering significant ROI in specialized industries [1] [4].
Finance is one of the most lucrative applications of agents [5], where AI has the potential to drive significant efficiency gains by performing tasks that mirror those of an entry-level financial analyst [6]. Yet the field lacks a framework for evaluating these agents [7], highlighting the need for standardized methods to measure agentic ability on domain-specific tasks.
We have created a benchmark that tests the ability of agents to perform tasks expected of an entry-level financial analyst. In collaboration with Stanford researchers, a Global Systemically Important Bank, and industry experts, we created a dataset of 537 questions that evaluate skills such as simple retrieval, market research, and projections.
Results
The Pareto cost curve above illustrates the relationship between model accuracy and cost per session across different models. As model performance improves, these cost considerations will likely become more important.
Some challenging queries cost more than $5 for o3 or Claude Opus 4.1 (Thinking). Performance can often be improved by allowing the model more tool calls and more tokens, but this comes at a cost.
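For readers reproducing the plot, a Pareto frontier over (cost, accuracy) points can be computed with a simple sweep. The model points below are illustrative placeholders, not the benchmark's actual figures.

```python
def pareto_frontier(points):
    """Return the (cost, accuracy) points not dominated by any other:
    a point is dominated if some other point is cheaper (or equal)
    while achieving at least as high accuracy."""
    frontier = []
    for cost, acc in sorted(points):  # sweep in ascending cost
        # keep a point only if it improves on the best accuracy so far
        if not frontier or acc > frontier[-1][1]:
            frontier.append((cost, acc))
    return frontier

# Illustrative (cost per session in $, accuracy %) pairs -- not real results.
models = [(0.5, 30.0), (1.2, 41.0), (1.0, 38.0), (3.0, 40.0), (4.5, 55.3)]
print(pareto_frontier(models))
# [(0.5, 30.0), (1.0, 38.0), (1.2, 41.0), (4.5, 55.3)]
```

Note that the point costing $3.00 is excluded: a cheaper model already achieves higher accuracy, so it sits below the frontier.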
The tool call analysis reveals how many steps the models take to produce an answer. Interestingly, o3 and Claude Sonnet 4.5 (Thinking), the two best performers on this benchmark, also register relatively high numbers of tool calls. This suggests that their comparatively strong performance on financial analyst tasks is partly due to leveraging the tools provided.
The spider graph detailing the number of tool calls performed by each model further supports the notion that a higher number of tool calls generally leads to higher accuracy. Again, we see several of the top models excelling by making many tool calls, especially for retrieving information (retrieve_information), parsing filings (parse_html_page), and Google web search, which leads to top benchmark performance.
Model Output Examples
Q
What is the Total Number of Common Stock Shares Repurchased by Netflix (NASDAQ: NFLX) in Q4 2024?
A
The total number of common stock shares repurchased by Netflix (NASDAQ: NFLX) in Q4 2024 was 1,165,827 shares.
This breaks down by month as follows:
- October 2024: 519,883 shares at an average price of $724.15 per share
- November 2024: 457,732 shares at an average price of $792.49 per share
- December 2024: 188,212 shares at an average price of $913.13 per share
For additional context, Netflix repurchased a total of 9,861,935 shares during the full year 2024 for approximately $6,211 million (excluding the 1% excise tax). As of December 31, 2024, Netflix had $17.1 billion remaining available for future repurchases under its stock repurchase authorization programs.
{ "sources": [ { "url": "https://www.sec.gov/Archives/edgar/data/1065280/000106528025000044/nflx-20241231.htm", "name": "Netflix Inc. Form 10-K for Fiscal Year Ended December 31, 2024" } ] }
CORRECT
The right answer we were looking for was “1,165,827 Shares repurchased by Netflix in Q4 2024”. To answer this question, the finance expert did the following: go to EDGAR, search Netflix filings by ticker, identify the 10-K for 2024, search for Company Purchases of Equity Securities, and calculate the Total Number of Shares Purchased from Oct 1, 2024 to Dec 31, 2024.
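The monthly figures reported in the model's answer can be verified with a quick sum:

```python
# Monthly Q4 2024 repurchases as reported in the answer above,
# sourced from Netflix's 10-K.
monthly_shares = {
    "October 2024": 519_883,
    "November 2024": 457_732,
    "December 2024": 188_212,
}
total = sum(monthly_shares.values())
print(total)  # 1165827 -- matches the expected answer of 1,165,827 shares
```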
GPT 4o (2024-08-06) did not retrieve the correct document and did not retry to find the correct one; as a result, it failed to answer the question. However, the latest frontier models listed above were able to get it right.
Trajectory Comparison
The visualizations below illustrate the sequence and types of tool calls made by two models for the example question above. Each node represents a step in the model’s reasoning process, with colored nodes indicating specific tool calls. These trajectories reveal how models approach research tasks, with more thorough tool usage often correlating with better performance.
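Extracting such a trajectory from an agent run can be sketched as follows. The message-log schema here is a hypothetical simplification for illustration, not the harness's actual log format.

```python
def tool_call_sequence(messages):
    """Extract the ordered list of tool names from an agent message log.
    Assumes a hypothetical schema where each message is a dict carrying
    a 'tool' key when that step issued a tool call."""
    return [m["tool"] for m in messages if m.get("tool")]

# Hypothetical trajectory mirroring the example question above:
log = [
    {"role": "assistant", "tool": "edgar_search"},
    {"role": "assistant", "tool": "parse_html_page"},
    {"role": "assistant", "tool": "retrieve_information"},
    {"role": "assistant", "tool": None},  # final answer, no tool call
]
print(tool_call_sequence(log))
# ['edgar_search', 'parse_html_page', 'retrieve_information']
```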
Claude Sonnet 4.5 (Thinking) trajectory:
We see that Claude Sonnet 4.5 (Thinking)’s trajectory follows what we’d expect intuitively: an initial edgar_search (corresponding to going to EDGAR and searching for Netflix filings by ticker), followed by parse_html_page, and finally retrieve_information.
Gemini 2.5 Pro Preview trajectory:
In the case of Gemini 2.5 Pro Preview, the model follows roughly the same pattern as Claude Sonnet 4.5 (Thinking), except that it is also able to recover from a failed tool call!
Methodology
The finance industry comprises a wide array of tasks, but through consultation with experts at banks and hedge funds, we identified one core task shared across nearly all financial analyst workflows: performing research on the SEC filings of public companies. This task, while time-consuming, is foundational to activities such as equity research, credit analysis, and investment due diligence. We collaborated with industry experts to define a question taxonomy and to write and review the 537 benchmark questions.
The AI agents were evaluated in an environment where they had access to tools sufficient to produce an accurate response. This included an EDGAR search interface via the SEC_API, Google search, a document parser (ParseHTML) for loading and chunking large filings, and a retrieval tool (RetrieveInformation) that enabled targeted questioning over extracted text. The human experts did not make use of any additional tools when writing and answering their questions. See the full harness here.
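The tool set above can be sketched as a simple registry plus dispatcher. The tool names follow the article, but the signatures, return values, and stub bodies here are assumptions for illustration, not the open-source harness's actual API.

```python
# Minimal sketch of a tool registry and dispatch loop. Function bodies
# are stubs; the real harness defines the actual implementations.
def edgar_search(ticker: str, query: str) -> list[str]:
    """Search SEC EDGAR for a company's filings (stub)."""
    return [f"https://www.sec.gov/cgi-bin/browse-edgar?ticker={ticker}"]

def parse_html_page(url: str) -> list[str]:
    """Load a filing and split it into text chunks (stub)."""
    return [f"chunk of {url}"]

def retrieve_information(chunks: list[str], question: str) -> str:
    """Answer a targeted question over extracted text (stub)."""
    return f"answer to {question!r} from {len(chunks)} chunks"

TOOLS = {
    "edgar_search": edgar_search,
    "parse_html_page": parse_html_page,
    "retrieve_information": retrieve_information,
}

def dispatch(name, **kwargs):
    """Route a model-issued tool call to its implementation."""
    return TOOLS[name](**kwargs)

urls = dispatch("edgar_search", ticker="NFLX", query="10-K 2024")
chunks = dispatch("parse_html_page", url=urls[0])
print(dispatch("retrieve_information", chunks=chunks,
               question="Q4 2024 shares repurchased"))
```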
Our primary evaluation metric was final answer accuracy (see the GAIA benchmark). We also recorded latency, tool utilization patterns, and associated computational cost to provide a fuller picture of agent efficiency and practical viability. Together, these components form a rigorous and domain-specific evaluation framework for agentic performance in finance, advancing the field’s ability to measure and rely on AI in high-stakes settings.
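Aggregating per-question results into these metrics is straightforward to sketch; the record fields below are illustrative, not the harness's actual schema.

```python
def summarize(results):
    """Aggregate per-question records into accuracy, average cost,
    and average tool-call count. Field names are illustrative."""
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "avg_cost_usd": sum(r["cost_usd"] for r in results) / n,
        "avg_tool_calls": sum(r["tool_calls"] for r in results) / n,
    }

# Hypothetical per-question records, not real benchmark data.
results = [
    {"correct": True, "cost_usd": 1.20, "tool_calls": 6},
    {"correct": False, "cost_usd": 0.80, "tool_calls": 3},
    {"correct": True, "cost_usd": 2.00, "tool_calls": 9},
]
print(summarize(results))
```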
The code behind this harness is open source. Dive in and explore it yourself on this repo!
Dataset
The dataset is divided into three parts: Public Validation (50 open-source samples), Private Validation (150 samples available for license), and Test (337 samples).
- The Public Validation set is fully open and can be accessed here.
- The Private Validation set is available for license. Interested parties are encouraged to contact us directly for access.
- The Test set will remain permanently private. All results reported on this page are based solely on the Test set to prevent potential future overfitting.
The dataset splits were sampled to preserve the distribution of question types and performance characteristics. We observed a strong correlation in performance across the validation sets and the Test set, supporting the reliability of these splits.
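A stratified split of this kind can be sketched as follows. The field names, split sizes, and rounding scheme are illustrative assumptions, not necessarily how our splits were produced.

```python
import random
from collections import defaultdict

def stratified_split(questions, sizes=(50, 150, 337), seed=0):
    """Split questions into three sets while preserving the proportion
    of each question type. Assumes each question is a dict with a
    'type' field (an illustrative schema)."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for q in questions:
        by_type[q["type"]].append(q)
    total = sum(sizes)
    splits = [[], [], []]
    for qs in by_type.values():
        rng.shuffle(qs)
        start = 0
        for i, size in enumerate(sizes):
            take = round(len(qs) * size / total)
            splits[i].extend(qs[start:start + take])
            start += take
    return splits
```

Because per-type allocations are rounded independently, the split sizes may be off by a question or two for some type distributions; a production version would reconcile the remainders.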
Question Taxonomy
Quantitative Retrieval (easy)
Direct extraction of numerical information from one or more documents without any post-retrieval calculation or manipulation.
What was the quarterly revenue of Salesforce (NYSE:CRM) for the quarter ended December 31, 2024?
Qualitative Retrieval (easy)
Direct quotation or summarization of non-numerical information from one or more documents.
Describe the product offerings and business model of Microsoft (NASDAQ:MSFT).
Numerical Reasoning (easy)
Calculations or aggregation of key numbers to produce an answer.
What % of Amazon's revenue was derived from AWS in each year, and what is the 3-year CAGR from 2021-2024?
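The CAGR part of a question like this reduces to a one-line formula. The revenue figures in the example call are hypothetical, purely for illustration, not Amazon's actual AWS numbers.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate over `years` periods:
    (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1

# Hypothetical segment revenues (in $B) for illustration only.
growth = cagr(62.0, 100.0, 3)
print(f"{growth:.1%}")
```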
Complex Retrieval (medium)
Numerical or non-numerical retrieval or content summarization requiring synthesis of information from multiple documents.
Please briefly summarize the most recent capital raise conducted by Viking Therapeutics (NASDAQ:VKTX).
Adjustments (medium)
Quantitative and qualitative analysis of reporting context bridging GAAP and Non-GAAP Financial Metrics.
What is Lemonade Insurance’s Adjusted EBITDA for the year ended December 31, 2024?
Beat or Miss (medium)
Comparison of forward management guidance versus actuals, synthesized by reconciling sequential quarterly reporting documents.
How did Lam Research’s revenue compare to management projections (at midpoint) on a quarterly basis in 2024? Format as % BEAT or MISS. Use guidance provided on a quarterly basis.
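The beat/miss computation itself reduces to a simple percentage against the guidance midpoint. The figures below are hypothetical, not Lam Research's actual results.

```python
def beat_or_miss(actual: float, guidance_midpoint: float) -> str:
    """Percent by which actual revenue beat or missed the guidance
    midpoint, formatted as the question requests."""
    pct = (actual - guidance_midpoint) / guidance_midpoint
    label = "BEAT" if pct >= 0 else "MISS"
    return f"{abs(pct):.1%} {label}"

# Hypothetical quarterly figures (in $M) for illustration only.
print(beat_or_miss(3_800, 3_700))  # 2.7% BEAT
```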
Trends (hard)
Analyze patterns within a single company’s reporting structure or calculate and contextualize evolving performance, key metrics or business composition.
In which geographic region has Airbnb (NASDAQ: ABNB) experienced the most revenue growth from 2022 to 2024?
Financial Modeling (hard)
Complex numerical reasoning calculations which require additional financial expertise to define and evaluate.
How much M&A firepower does Amazon have as of FY2024 end, including balance sheet cash, non-restricted cash and other short-term investments, and up to 2x GAAP EBITDA leverage? Round to the nearest billion.
Market Analysis (hard)
Advanced analysis of one or more companies using various documents, requiring normalization of comparison metrics, or complex reasoning and usage of causality to contextualize drivers of business changes or competition dynamics.
Compare the quarterly revenue growth of FAANG companies between 2022-2024.
Acknowledgements
Thanks to the following people for their support: Shirley Wu, Alfston Thomas, Andrew Schettino, Kathy Ye, Kyle Jung, Matthew Friday, Michael Xia, and Nicholas Crawley-Brown.