
Benchmarking AI solutions for FEC challenges

Executive Summary

We tested several large language models (LLMs, a form of artificial intelligence) on a Financial Economic Crime task: correctly identifying risky persons and organizations in text. We found that such models can generate useful results. However, their results lag behind those of models optimized for this specific task.

The best performing model in this test (BERT, optimized for named entity recognition) correctly identified and classified 45% more persons and organizations of interest than ChatGPT (228 vs. 157 points), while costing at least one order of magnitude (10x) less processing time and power.

For this reason, companies facing AML, Sanctions, and KYC challenges should think critically about whether LLM implementations in production processes are the best fit, or whether task-optimized solutions may deliver better risk identification at lower operating costs.

Introduction

Since the launch of OpenAI’s ChatGPT in November of 2022, the emerging artificial intelligence (AI) space of Large Language Models (LLMs) has gathered a lot of attention from businesses as well as consumers and regulators. Businesses and public institutions are conducting numerous experiments to determine how LLMs can be implemented to capture operational efficiencies or automate repetitive tasks.

In the space of Financial Economic Crime (FEC), Anti-Money Laundering (AML), Anti-Bribery and Corruption (ABAC) and Know-Your-Customer (KYC), firms are experimenting with various use cases, such as:

  • First line operations document generation
  • Fact and information extraction from text
  • Summarizing text on demand

In this article, we will take a look at how AI models may perform in the context of financial economic crime challenges.

Challenges to using AI in the FEC space

Most popular LLMs are a form of specific AI, with the primary task of interpreting and generating text. At the same time, these LLMs are trained on a wide range of general knowledge. This means they are intended to be fit to classify all types of text, and to generate answers in any format desired by end users.

This leads to a problem: for most specific business tasks, LLMs are actually overpowered. Impressively, GPT-4 from OpenAI is rumoured to feature over one trillion parameters1 by which to interpret inputs and generate outputs. Competing models, e.g. Meta’s Llama 2, feature billions of parameters to generate a useful answer to a user’s query.

The impressive size and accompanying range of possibilities of these models come at a price: running them is very costly. Given their size, local creation, optimization (re-training), and implementation require massive computing capacity (e.g. running for long periods of time on H100 GPUs – at EUR 30.000 a piece), or an ongoing subscription to an AI provider of choice, at a semi-predictable cost for every interaction2.

Applying AI in the FEC domain

The spaces of AML, ABAC, and KYC are relatively regulation-driven. This means that the use cases for LLMs in this domain are actually relatively narrow. In this article, we will explore the added value LLMs may offer in this domain by testing them in a specific use case, namely named entity recognition (NER) on input text. A named entity is a person, organization, location, or other element in a text which refers to one of its key subjects.
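As an illustration, NER output for a short sentence might look like the sketch below. The sentence and all entity names in it are invented for this example.

```python
# A minimal illustration of named entity recognition (NER) output.
# The sentence and the entity names are invented for this example.
sentence = "John Doe wired funds to Acme Holdings Ltd via a bank in Cyprus."

# A NER model would return (entity text, entity type) pairs such as:
entities = [
    ("John Doe", "PERSON"),
    ("Acme Holdings Ltd", "ORGANIZATION"),
    ("Cyprus", "LOCATION"),
]

for text, label in entities:
    print(f"{text} -> {label}")
```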

The field of NER is applicable to various processes in FEC, AML, ABAC and KYC, such as:

  • Client or counterparty “bad press checking”
  • Sanctions and transaction screening and filtering
  • Processing client information on organizational structures and ownership

Additionally, NER is a field where various incumbent models can be tested and pitted against LLMs for comparison, such as knowledge-based pattern approaches, statistics-based models (e.g. SpaCy), and earlier transformer-based models (e.g. Google’s BERT).

With this article, we will benchmark the NER outputs of these models to determine how LLMs perform for FEC challenges.

Approach

Our input information comprises a corpus of articles from open sources on sanctions evasion and money laundering3. Based on human review, the articles contain a combined 145 person and organization names which are interesting from an AML and Sanctions perspective, in a combined text length of 9.432 words (with a token count estimate of 12.228).

The models will be scored with a simple two-step method:

  1. For each person or organization correctly identified, one point is given to the model.
  2. Then, for each of those identified as the correct type (either person or organization), another point is given. Thus, a total of 290 points can be gained by the models.
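The two-step scoring method above can be sketched as a small function. The entity names and the `gold`/`predicted` structures below are illustrative, not our actual dataset.

```python
def score_model(gold, predicted):
    """Score NER output: one point per correctly identified entity,
    plus one point if its type (person/organization) is also correct."""
    score = 0
    for name, gold_type in gold.items():
        if name in predicted:
            score += 1                        # entity identified
            if predicted[name] == gold_type:
                score += 1                    # type also correct
    return score

# Illustrative example (invented names):
gold = {"John Doe": "PERSON", "Acme Holdings": "ORG"}
predicted = {"John Doe": "PERSON", "Acme Holdings": "PERSON"}
print(score_model(gold, predicted))  # 3: both entities found, one type correct
```

With 145 gold entities, the maximum achievable score is 2 × 145 = 290 points, matching the scale used in the results below.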

We are choosing not to deduct points for faulty classification or identification of incorrect entities, since in a Sanctions and AML context we generally prefer seeing some false-positive results over missing some real risk signals (false negatives).

We set up a (Python) pipeline where each article is cleaned of non-readable characters, stop words, and noise to ensure the tested algorithms function optimally.
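A minimal sketch of such a cleaning step is shown below; the stop-word list is heavily truncated for illustration, and the exact regular expressions are assumptions rather than our production code.

```python
import re

# Truncated stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def clean_text(raw: str) -> str:
    """Strip non-readable characters and stop words before NER."""
    # Keep letters, digits, basic punctuation, and whitespace.
    text = re.sub(r"[^\w\s.,;:'\-]", " ", raw)
    # Collapse any repeated whitespace introduced by the substitution.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop stop words (case-insensitive), keeping original casing elsewhere.
    tokens = [t for t in text.split(" ") if t.lower() not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("The accounts   of Acme Ltd. were frozen."))
```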

Outcomes

SpaCy

  • The SpaCy Python library offers various text interpretation models for inclusion in production settings. It also offers various pre-trained models for recognizing entities. We found the results to vary only to a limited extent between these variants, and include results for two of them (the small and large models).
  • Point results: 190 points (66% of 290) were achieved by the small model, and 193 points (67% of 290) by the large model.
  • Runtime results4: It took 4,16 seconds to execute the small model variant, and 4,47 seconds to execute the large model.
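As a sketch, running the two pre-trained spaCy variants might look like the following. It assumes spaCy plus the `en_core_web_sm` and `en_core_web_lg` models are installed; the deduplication helper and the example sentence are ours.

```python
def dedupe_entities(ents):
    """Collapse repeated entity mentions, keeping the first label seen."""
    seen = {}
    for text, label in ents:
        seen.setdefault(text, label)
    return seen

try:
    # Assumes spaCy and the en_core_web_sm / en_core_web_lg models are installed.
    import spacy

    for model_name in ("en_core_web_sm", "en_core_web_lg"):
        nlp = spacy.load(model_name)
        doc = nlp("John Doe wired funds to Acme Holdings Ltd.")  # invented text
        ents = [(ent.text, ent.label_) for ent in doc.ents]
        print(model_name, dedupe_entities(ents))
except (ImportError, OSError):
    print("spaCy or its models are not installed; skipping the live run")
```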

BERT

  • BERT is a language model family developed originally by Google in 2018. For our purposes, we are using the “bert-base-NER” variant, which is trained to identify named entities in text.
  • The BERT architecture does not allow inputs of the length our articles require (it caps input at 512 tokens). Text is therefore processed in smaller chunks, leading to a significant increase in processing time and a requirement to de-duplicate the results afterwards.
  • Point results: 228 points (79% of 290) were attributed for BERT’s results.
  • Runtime results: 52,57 seconds were required to run the analysis, which can be attributed to both the model’s complexity and the number of runs needed given the short input limit of the BERT model.
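The chunking workaround can be sketched as follows. The 512-token limit is BERT’s; the overlap size and the stand-in token list are illustrative. Overlapping chunks are what make the de-duplication step afterwards necessary, since entities near a boundary are seen twice.

```python
def chunk_tokens(tokens, max_len=512, overlap=50):
    """Split a token list into overlapping chunks so entities spanning
    a chunk boundary are still seen whole in the next chunk."""
    chunks = []
    step = max_len - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(1200)]  # stand-in for a tokenized article
chunks = chunk_tokens(tokens)
print(len(chunks))  # 3 chunks for this 1200-token example
```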

ChatGPT

  • The ground-breaking LLM that set the current AI wave in motion was OpenAI’s ChatGPT (based on their GPT-3.5 implementation). For over a year, it has given users free access to ask any query, learning from whatever feedback users have provided thus far.
  • Depending on the question asked and the format requested for the answer, ChatGPT returned varying results in our experiment. The result below is the very best one, obtained after a round of prompt engineering (which is what we nowadays call “trial-and-error your way to the best answer”). It remains interesting to see how sensitive LLMs are to user prompts, e.g. even to the swapping of single words in a question.
  • Point result: 157 points (54% of 290). We see ChatGPT perform better on shorter articles than on longer ones, yet in all cases it underperforms the BERT model.
  • Runtime result: Since ChatGPT runs on the Microsoft Azure cloud, we use a calculation to approximate the runtime such a model might have had on the machine used for the SpaCy and BERT scripts5. From this calculation, we estimate that at least 235,81 seconds would be required to generate the persons and organizations in its output.
Figure: Benchmarking results for named entity recognition in a financial crime challenge
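The runtime approximation for ChatGPT (detailed in footnote 5) boils down to a simple lower-bound calculation:

```python
# Lower-bound runtime estimate for ChatGPT on our benchmark machine.
# Figures are taken from the article; the tokens-per-second rate comes
# from the largest model we could run locally (Llama 2, 13B parameters).
output_tokens = 731       # tokens needed to list the 157 identified entities
tokens_per_second = 3.1   # local generation speed of the Llama 2 13B model
min_runtime = output_tokens / tokens_per_second
print(f"{min_runtime:.2f} seconds")  # ≈ 235.81 seconds
```

This is a lower bound: it ignores the time needed to process the 12.228 input tokens, and ChatGPT itself is far larger than the 13B-parameter reference model.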

A note on the use of other LLMs
Results from various other LLM solutions aside from OpenAI’s GPT are not included in this article. This does not mean they were not researched, but that no representative results were obtained from these models despite various retesting efforts. The models tested include Meta’s Llama 2 models (7B and 13B variants) and Google’s Bard AI Experimental model. Our interpretation of this result is that these models would require extensive retraining to make them useful for this type of (FEC and NER) challenge.

Conclusions

There are two conclusions we can draw from the outcomes of our analysis:

  1. In an AML context, task-specific models can outperform general models. This is evidenced by the decent performance of the SpaCy models on the task we set them, and the leading performance of the BERT model.
  2. Added intelligence in text processing models comes at a high cost. The BERT model, while significantly better at the task it was trained for than the SpaCy models, is one order of magnitude (10x) more costly to operate. Moreover, the general nature of the ChatGPT model means it delivers a decent answer to the entity recognition task in AML, but at a cost two orders of magnitude (100x) higher than the SpaCy models.

In summary, given the processing power LLMs require, the trial-and-error nature of prompt engineering, and our results showing that false-negative conclusions are very feasible with LLM outputs, companies facing AML, Sanctions, and KYC challenges should think critically about whether LLM usage in production processes is the best fit, or whether tailored AI solutions may provide better results.

Interested to talk about which AI solutions would fit your FEC challenges best?

Footnotes

  1. According to unconfirmed non-official sources, e.g.: The Decoder (2023) ↩︎
  2. Costs for LLMs tend to be charged per 1000 input tokens, with one token generally constituting three-quarters of an average English word. Other languages tend to be more token-expensive ↩︎
  3. Example article: ICIJ Cyprus Confidential Investigation (2023) ↩︎
  4. The runtime is calculated on a normal commercial laptop. This type of Python code runs on the laptop’s CPU, in this case an AMD Ryzen 7 6850U using 8 CPU cores at 4,7 GHz ↩︎
  5. A minimum of 731 output tokens are used to generate the 157 entities ChatGPT identified (entity name followed by classification). With the most complex model we ran on our machine (Meta’s Llama 2 model, the 13-billion-parameter variant), we generate 3,1 output tokens per second. We realize ChatGPT is much larger than this model, but we can be confident in stating that generating the required output would take at least 235,81 seconds (731 / 3,1). Add to that the processing of 12.228 input tokens (for ChatGPT to “read” the 9.432 words in the articles), and we can safely conclude that ChatGPT requires at least one order of magnitude more processing power than the BERT model to generate these outputs on equivalent machines. ↩︎