Which large language models are best for banks?

A new, free leaderboard created and continuously updated by the Kensho team at S&P Global gives people in the financial industry a sense of which generative AI models work best for specific use cases.

As banks experiment with large language models like OpenAI's GPT-4 and Anthropic's Claude — artificial intelligence algorithms trained on large datasets to understand, summarize, generate and predict new content — a trend has emerged: In this area as in many others, some are leaning heavily on large cloud providers like Google, Microsoft and Amazon.

"Most banks — excluding perhaps the biggest tier-one banks with substantial research and development in-house — will be much more focused on the technology providers they work with to build these capabilities, in other words the IBMs, Microsofts, Googles, Oracle and AWSs of the world, rather than the much more hyped-up foundation model providers like OpenAI," said Gilles Ubaghs, a strategic advisor at Datos Insights.

But this could be a mistake, according to Bhavesh Dayalji, chief AI officer for S&P Global and CEO of Kensho, an AI company S&P Global bought in 2018. There is risk for companies in locking themselves into a closed architecture and one model or type of model from a big cloud provider, Dayalji said.  


"Not all models are equal," he said. "Some models may be good at writing poems, other models might be really good at quantitative reasoning." In other words, applying basic math to data analysis and problem-solving.

His team has been testing several of the most popular large language models, including OpenAI's GPT-4, Anthropic's Claude, Google's Gemini and Mistral AI's Mistral Large, on their ability to handle the kinds of use cases that come up in financial services and scoring the accuracy of the models' results. 

On Wednesday, they began sharing the results of these tests in a publicly available leaderboard, as a free benchmarking service. 

Such vetting of models' ability to handle mathematical problems is especially important in the financial industry, Dayalji noted, where models "are not going to be asked to write poems or think about innovative ways in which you can create your vacation plans in Europe and whatnot."

Industry reaction

Industry experts agree that benchmarking generative AI models is a useful idea.

"I do think LLMs are ready to do sophisticated quantitative reasoning problems, but in a field that requires accuracy there is a need for an independent assessment," said Aaron McPherson, principal, at AFM Consulting. There may also be a need for a more private assessment of banks' internally developed large language models, trained on proprietary data as well as public information, he said.

S&P Global's benchmark could also be useful to technology vendors offering tailored LLMs, to establish credibility in the marketplace.  

"The use of a council of human experts reminds me of what a number of people have said, which is that black box models tend not to be trusted over human experts until they have demonstrated that their output is superior to that of the human experts," McPherson said. "So this sort of service could go a long way toward building confidence in LLMs as a technology."

Ubaghs agreed that this should appeal to many companies experimenting with large language models. 

"Financial services in general is pretty unique compared to other industry verticals, in that it's driven in large part by regulatory and risk requirements, so they will want to see industry metrics and benchmarks," he said. "As more banks do start to deploy these capabilities, and likewise as more foundation models start to target financial services more specifically, these sorts of benchmarks will prove valuable, but won't necessarily impact the decision on whether or not to invest and build these capabilities" due to banks' dependence on core banking and cloud vendors.

A common conception, perhaps a myth, about large language models is that because they are trained to predict the next word, they are not good at math and are better suited to generating content, such as drafting emails to customers or writing poems.

Dayalji said that two years ago, most large language models could not do quantitative reasoning. But "these models are becoming very good at it," he said. "And the fact that you can tune these models and get them to perform better and better is what we are really excited about."

Some of this progress has been incidental, according to Chris Tanner, an adjunct faculty member at MIT and Kensho's head of research and development.

"They happen to be trained on so much data with such large computers that they're getting some understanding of the world and numbers," Tanner said in an interview. "But the progress on understanding numbers to perform advanced calculations keeps getting better. We're still learning tons — we meaning the entire community, not just here at Kensho." 

How the leaderboard works

To create the benchmark, Dayalji's team worked with academic and industry domain experts to come up with a list of questions for the large language models. 

One quantitative reasoning question they have used is this: "The market price of K-T-Lew Corporation's common stock is $60 per share, and each share gives its owner one subscription right. Four rights are required to purchase an additional share of common stock at the subscription price of $54 per share. If the common stock is currently selling rights-on, what is the theoretical value of a right? Answer to the nearest cent."
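For readers who want to check the arithmetic, the standard rights-on formula puts the theoretical value of a right at the market price minus the subscription price, divided by the number of rights required plus one. Here is a minimal sketch using the figures from the sample question; the calculation is illustrative and is not drawn from the leaderboard's answer key:

    # Theoretical value of one right when the stock trades rights-on, using the
    # standard formula (market price - subscription price) / (rights required + 1).
    # The figures come from the sample question; the code is illustrative only.
    market_price = 60.00        # current share price, rights-on
    subscription_price = 54.00  # price to buy one additional share
    rights_required = 4         # rights needed per additional share

    value_of_right = (market_price - subscription_price) / (rights_required + 1)
    print(f"Theoretical value of a right: ${value_of_right:.2f}")  # $1.20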

To test the models' ability to extract data, the researchers feed them tables and balance sheets and ask them to pull out specific data points. To test their domain expertise, they ask the models to explain financial concepts and terms.
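As a rough illustration of what scoring such an extraction question can look like, here is a minimal sketch; the function name, prompt wording and exact-match comparison are assumptions made for the example, not Kensho's published methodology:

    # Hypothetical sketch of scoring a single data-extraction question. The
    # prompt format and the exact-match check are illustrative assumptions,
    # not the leaderboard's actual harness.
    from typing import Callable

    def score_extraction(ask_model: Callable[[str], str], table_text: str,
                         question: str, expected: str) -> bool:
        """Return True if the model's answer contains the expected figure."""
        prompt = f"{table_text}\n\nQuestion: {question}\nAnswer with the figure only."
        answer = ask_model(prompt)
        # Ignore simple formatting differences ($ signs, commas, whitespace).
        normalize = lambda s: s.replace("$", "").replace(",", "").strip()
        return normalize(expected) in normalize(answer)

In practice, pass/fail checks like this would be aggregated across many questions to produce the accuracy scores the leaderboard reports.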

The benchmark "gives you a real idea of what you are looking for, which is, can it do this quantitative reasoning?" Tanner said. "Can it determine what is a specific financial number that you need out of a document, out of a question that you ask? Or can you do some quantitative arithmetic related to a balance sheet number?" 

The leaderboard can tell people who work in financial services how well these models can be expected to perform on a range of tasks, including complex calculations, Tanner said.

The Kensho team developed the benchmark while evaluating large language models for its own work. The team had been using an open-source generative AI model for a product offering, then started testing alternatives and found that they performed better.

"Our use cases are no different from the use cases that JPMorgan or another big fund management company would have," Dayalji said. He and his team decided to make their findings public to help others get a sense of what business and finance tasks these models are good at.

The group plans to keep adding large language models to the benchmark and to introduce opportunities for others to provide feedback, in the hopes of creating a community.

"This is going to be a living thing," Dayalji said. "We're continuing to update it and modify it based on what we're seeing in the industry." 

They're also paying attention to issues that plague large language models, like hallucinations, copyright infringement and cybersecurity.

"You hear these lofty goals, like, 70% productivity enhancement at these banks," Dayalji said. "But what you find once you start scratching under the surface and start thinking about that technology stack, the architecture and how to deploy this technology, is you need to be a bit more thoughtful. You have to think about data privacy, intellectual property, inherent biases and other factors as it relates to people and workflows."
