AI for KYC Compliance – Three Use Cases

‘Know Your Customer’ (KYC) compliance refers to a legally mandated, globally developed set of guidelines that ensures banks, financial institutions (FIs), and related enterprises perform due diligence on potential customers. The purpose is to verify customer identity and to assess the legitimacy and risk involved in developing and maintaining a business relationship.

By verifying customer identities according to the strict rules-based system of the Bank Secrecy Act of 1970, KYC compliance rules reduce instances of fraud and money laundering. Businesses comply with KYC to meet local business standards, mitigate the risk of fines and punishments, and avoid reputation damage.

Failing to comply with KYC and anti-money laundering (AML) rules carries stiff penalties. Lengthy audits, multi-million-dollar fines, and bans on conducting business in specific countries or regions are all possible consequences.

To assist business leaders with navigating the hazard-filled, turbulent waters of KYC compliance, this article will examine the commonly-held compliance difficulties of banks, FIs, and other enterprises. In an attempt to provide some insight into potential solutions, we present several use cases–including the business problems, AI used, workflow changes, and business outcomes of said AI.

These use cases include: 

  • Automated cross-referencing and application processing: Reducing processing time and costs in expediting KYC customer data verification.
  • 10-K data extraction: Accelerating labeling workflows to shorten customer onboarding and reduce labor costs and time spent.
  • Identifying customer relationships: Analyzing transactions between banking customers to decrease the time and cost to perform a KYC review.

The Business Costs of Know Your Customer Compliance

As the true business costs of KYC are diverse and extensive, it is important for enterprise leaders to understand both the monetary and opportunity costs of traditional, mostly manual, KYC compliance processes. The business costs of KYC compliance can be classified into direct and indirect costs; the latter being the result of inefficient processes due to antiquated technologies.

Per a report on IT operations from financial services research and advisory firm Celent, FIs spent approximately $37.1 billion on AML-KYC compliance functions. Elsewhere in Celent research, KYC compliance is described as among the riskiest and most inefficient of all banking operations due to a lack of quality data and automation.

The primary challenge of KYC compliance for most banks usually consists of one of the following dilemmas:

  1. Combining outdated technology with disparate internal and external data sources. 
  2. The absence or severe lack of high-quality data and automation. 

The results are similar: data that are difficult–if not impossible–to locate, integrate, analyze or use. The pain is especially acute during a client’s “KYC check.”

A typical corporate KYC check process involves gathering, identifying, and validating various company or individual data. These data include client ID, wealth/net worth, funding sources, corporate subsidiaries, and more. Verifiers must cross-reference these data across sources to ensure customer truthfulness and information accuracy. 

The absence of high-quality data and automation means:

  • (a) Increased security vulnerability 
  • (b) Cumbersome, time-consuming assessments of corporate subsidiaries, shareholders, and structures. 

The byproduct of (a) is that skilled fraudsters can take advantage of vulnerabilities by constructing networks of front firms and corporate structures. The result is growing complexity, with the fraudster often ultimately avoiding detection.

The byproduct of (b) shows up on the balance sheet. The volume of work that KYC/AML compliance mandates often translates to high expenses. The additional labor costs and technology spend are particularly significant.

Besides the significant expenses, opportunity costs for banks and FIs exist. These include:

  • Lost customers
  • Reduced productivity
  • Low-value-adding work
  • Stunted business growth. 

Lost customers are of particular concern. Enterprises operating in competitive environments may lose patience and take their business elsewhere. As such, vendors that offer KYC automation solutions often list “customer satisfaction” as a key benefit of their respective solutions.

AI and machine learning can augment or automate KYC compliance processes, possibly reducing some of the aforementioned direct and indirect costs.

We begin by discussing how one fintech uses a combination of machine learning components to integrate data silos, extract form data, and cross-verify data across various internal and external data stores.

Use Case #1: Automated Cross-Referencing and Application Processing

Datametica is a software company that offers automation, cloud, machine learning, and data warehouse migration solutions. 

According to a Datametica case study, the client is a fintech firm that issues KYC acknowledgment letters. The firm’s clients transfer KYC-required customer information and assorted documentation, which the firm then cross-references against its in-house data. Clients eagerly await the acknowledgment letter, as the business transaction cannot proceed without it. There is a lot at stake for both the receiving and the issuing firms.

Before the solution was implemented, the client’s workflow involved several manual processes, including the manual acquisition of the application and of the identity and proof-of-address documentation.

In the Datametica webinar below, presenters explain the company’s views on the essentials for a KYC compliance solution based on available technological capabilities. The relevant section begins at the 34:31 mark and lasts approximately one minute:

The manual process carried over to identifying, matching, and verifying application details across internal data stores and sources before issuing the acknowledgment letter. The case study report describes the process as time-consuming and costly.

Complicating the problem was the inability to scale and meet customer requirements. Datametica claims that, due to an influx of KYC requests and the difficulty of scaling manual processes, the fintech firm could not satisfy the terms and conditions in its service-level agreements (SLAs).

On the extraction and processing side, the case study report claims the fintech firm’s leaders were also concerned about the potential for human error in several manual functions, including:

  • Processing backend data
  • Data extraction
  • Correlating customer data across sources and documents (e.g., application data versus some KYC document or database).

The case study further indicates the presence of a data bottleneck, making it difficult for the fintech firm to accommodate different application types from a single distribution point. A key influencing variable in this bottleneck was the 150+ data providers, each with their own KYC applications, documents, and supporting formats. 

To overcome these challenges, Datametica states that they automated the end-to-end reception and verification of KYC applications and associated data via a machine learning model equipped with deep learning capabilities. The company claims that the solution can extract data from any assortment of digital KYC applications and forms using a single CVL client.

The case study reports integrating the following solutions, inputs, and outputs:

  • An OCR deep learning image processing model using a custom computer vision and OCR codebase: to extract applicant information from printed forms and KYC documents
  • An integrated data pipeline: a central data repository for easier cross-referencing of application information against KYC documents and databases
  • An image processing pipeline: to retrain tagged supporting documents
  • Validation and classification model: to identify new data points in KYC forms and verify against the client’s metadata.
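To make the cross-referencing step concrete, here is a minimal sketch of how extracted application fields might be checked against a reference record. It is not Datametica's implementation: the field names, the sample records, the `cross_reference` helper, and the 0.9 similarity threshold are all assumptions chosen for illustration, and the fuzzy comparison uses Python's standard `difflib` rather than a trained model.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not a mismatch."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cross_reference(application: dict, reference: dict, threshold: float = 0.9) -> dict:
    """Compare each application field with the reference record.

    Returns a mapping of field name -> "match", "mismatch",
    or "missing_in_reference". The threshold is an assumed value.
    """
    report = {}
    for field, value in application.items():
        if field not in reference:
            report[field] = "missing_in_reference"
        elif similarity(value, reference[field]) >= threshold:
            report[field] = "match"
        else:
            report[field] = "mismatch"
    return report


# Hypothetical data: fields extracted from an application form (e.g., via OCR)
application = {
    "name": "Jane  Doe",
    "address": "12 High Street, Leeds",
    "id_number": "AB123456",
    "date_of_birth": "1990-04-01",
}
# Hypothetical data: the record held in the central repository
reference = {
    "name": "jane doe",
    "address": "48 Mill Lane, York",
    "id_number": "AB123456",
}

report = cross_reference(application, reference)
for field, status in report.items():
    print(f"{field}: {status}")
```

In a setup like this, only fields flagged as mismatched or missing would need to go to a human reviewer, which is one plausible way the reduction in manual checks described in the case study could arise.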

The case study does not reveal the specifics of before and after workflow changes. However, we may reasonably infer the following workflow modifications (assuming the information and reported results within the case study are accurate):

  • Potentially significantly faster and easier access to data sources, data extraction, and cross-verification
  • Potentially significantly less manual tagging and labeling of data
  • Potentially a significantly higher automation-to-human throughput ratio

Datametica reports the fintech client was able to achieve the following results using their solution:

  • 75% reduction in operational costs from reduced manual processes, model implementation, and automated file classification
  • 66% faster KYC application processing
  • 85% accuracy in the automated verification process
  • Easier scalability with less effort

Use Case #2: 10-K Data Extraction for Customer Verification

Snorkel AI is a software company that produces solutions focused on accelerating AI applications for its clients via a patented automated data labeling method. The company has coined the term “programmatic labeling” to describe this method.

Snorkel’s client was reputedly a top-3 US bank, though no more details are given. 

Before implementing Snorkel’s solution, the bank manually extracted data from 10-K forms. The length of 10-K reports — up to 300 pages — made the manual mining of these data time-consuming and onerous. The bank reported that this method lengthened the onboarding process, costing time and money.

Manual extraction and labeling of training data is often a slow process that requires a large team of data scientists and domain experts. Labor costs and time consumption are two of the more common complaints of business leaders here.

Snorkel AI offers a platform called Snorkel Flow, which the company claims can help businesses accelerate labeling using machine learning. 

The platform uses what Snorkel dubs “programmatic labeling,” defined as “noisy, programmatic rules and heuristics that assign labels to unlabeled training data.” These attributes describe weak supervision machine learning, which appears to be at the center of the Snorkel Flow value proposition. 

The Snorkel Flow value proposition. (Source: Snorkel AI)

To understand some of the content within this case study, it is necessary to quickly define supervised machine learning and why it is sometimes not an ideal solution from a business perspective.

In short, supervised machine learning learns a mapping from input data to outputs using manually labeled examples. For the enterprise, this labeling process is both expensive and slow, requiring a team of data scientists and, often, domain experts.

The banking client reported the following quantitative operational problems with its KYC functions:

  • Labor costs: 300-500 KYC analysts necessary to manually extract data
  • Time spend vs. volume: 30-90 min spent manually reviewing a single 10-K report, with 10,000+ reports analyzed every year

An automated extraction solution centered around Snorkel Flow was constructed. To meet client requirements, Snorkel worked with the bank to custom-build its solution.

A key reason why Snorkel recommended the Flow solution was the programmatic labeling ability of the software. Snorkel states that programmatic labeling improves on traditional methods by using labeling functions, which enable large-scale labeling instead of one-by-one tagging and thereby expedite the process.

A screenshot of Snorkel AI’s programmatic labeling user interface. (Source: Snorkel AI)

The end-user workflow appears to be as follows:

  • Data integration: The client integrates the platform with its data stores using APIs.
  • Writing labeling functions: Users create labeling functions in this phase to represent different weak supervision sources, such as patterns, heuristics, outside knowledge bases, and other organizational resources.
  • Modeling relationships: User-provided labeling functions are combined and weighted by a generative model that estimates their accuracies and correlations.
  • Model training: The model is trained using a set of probabilistic labels generated by the software.
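The labeling-function idea in the workflow above can be illustrated with a short, plain-Python sketch. This is not the Snorkel Flow API: the function names, label names, keyword rules, and sample sentences are hypothetical, and a simple majority vote stands in for the generative model that, per the description above, estimates the accuracy and correlation of each labeling function.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when its rule does not apply


def lf_total_assets(sentence: str):
    """Heuristic: sentences mentioning 'total assets' likely report that figure."""
    return "TOTAL_ASSETS" if "total assets" in sentence.lower() else ABSTAIN


def lf_officer_title(sentence: str):
    """Heuristic: executive titles suggest a key senior manager is named."""
    titles = ("chief executive officer", "chief financial officer")
    return "SENIOR_MANAGER" if any(t in sentence.lower() for t in titles) else ABSTAIN


def lf_business_description(sentence: str):
    """Heuristic: self-descriptive phrases suggest the nature of the business."""
    cues = ("we provide", "our business", "the company operates")
    return "NATURE_OF_BUSINESS" if any(c in sentence.lower() for c in cues) else ABSTAIN


LABELING_FUNCTIONS = [lf_total_assets, lf_officer_title, lf_business_description]


def weak_label(sentence: str):
    """Combine labeling-function votes by majority vote (a stand-in for a
    learned label model). Returns ABSTAIN when no function fires.
    Ties are not handled specially in this sketch."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical sentences of the kind found in a 10-K report
sentences = [
    "Total assets were $4.2 billion as of December 31.",
    "Jane Doe has served as our Chief Executive Officer since 2019.",
    "We provide cloud software to retailers.",
    "See Item 7 for further discussion.",
]

labels = [weak_label(s) for s in sentences]
for sentence, label in zip(sentences, labels):
    print(label, "<-", sentence)
```

The point of the design is that each rule is written once and applied to every sentence in every report, so the labels produced are noisy but cheap; a downstream model is then trained on them instead of on hand-tagged examples.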

The case study does not provide specifics on which method was used to train the model. However, we can make some safe assumptions given the bank’s size. 

The model was likely custom-trained (not using one of the five model frameworks or AutoML), given its status as a “top-3” bank. We may also deduce that its asset holdings, intellectual property, and data science resources are significant enough to demand a more resource-intensive, technically-rigorous solution.

Concerning the input and output data, we know from a Snorkel-sponsored webinar that the input data consisted of a dataset of unstructured, multi-format 10-K reports. The software extracts this unstructured data using programmatic labeling. The output is a database comprising the key attributes of the customer. The above-cited webinar reports the following output data:

  • Company name
  • Nature of business
  • Key senior managers
  • Total assets
  • Other attributes (15-20)

The business outcomes as reported by Snorkel:

  • 89+% model accuracy
  • 10,000 labor hours saved per year, equivalent to $500,000 
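As a quick sanity check, the reported figures can be related to one another with simple arithmetic. The sketch below uses only numbers quoted in this use case; treating "10,000+" reports as exactly 10,000 is an assumption that makes the workload range a lower bound.

```python
# Figures quoted in the case study
REPORTS_PER_YEAR = 10_000          # "10,000+ reports", so a lower bound
MINUTES_PER_REPORT = (30, 90)      # manual review time per 10-K report
HOURS_SAVED_PER_YEAR = 10_000      # reported outcome
DOLLAR_VALUE_OF_SAVINGS = 500_000  # reported outcome

# Implied annual manual workload, in hours
low_hours, high_hours = (REPORTS_PER_YEAR * m / 60 for m in MINUTES_PER_REPORT)

# Implied cost of one analyst hour behind the reported savings
implied_hourly_cost = DOLLAR_VALUE_OF_SAVINGS / HOURS_SAVED_PER_YEAR

print(f"Manual workload: {low_hours:,.0f} to {high_hours:,.0f} hours per year")
print(f"Implied cost per labor hour: ${implied_hourly_cost:,.2f}")
```

On these numbers, the 10,000 hours saved falls inside the implied 5,000 to 15,000 hours of manual review per year, and the dollar figure corresponds to roughly $50 per labor hour.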

Use Case #3: Identification and Tracking of Beneficial Owners

Quantexa is a London-based software company that produces decision intelligence software for banks and other enterprises.

The company produces a solution called Contextual Decision Intelligence (CDI) that it claims enables businesses to improve decision-making by mapping and displaying contextual relationships between data using machine learning.

ABN AMRO is a Dutch multinational bank with a presence in 15 countries. The company reports a 2021 net profit of EUR 1.2 billion on revenues of EUR 8.47 billion.

A Product Owner at ABN AMRO, Paul Westrate, discussed the use case in a video call with an analyst from Celent.

In the call, Westrate discusses the business reasons behind the partnership with Quantexa. He lists the time-consuming nature of financial crime investigations, high operational costs, and evolving compliance requirements as the three main drivers behind the partnership and the solution the bank sought.

More specifically, ABN AMRO leaders sought an automated solution that could:

  • Reduce the labor required in manual data gathering and analysis
  • Reduce time spent on discerning legitimate and non-legitimate suspicious activities through automation
  • Combine internal and external data sources, group companies into hierarchies, and gain insight into their relationships.

Quantexa lists several components of the platform, including the core platform, underlying platform capabilities, and the underlying technology (see below). All of these may provide insight into what the workflow may look like for the bank’s end-user.

Visualization of the components and capabilities of Quantexa’s Contextual Decision Intelligence Platform. (Source: Celent)

The bank’s end-user first connects their machine to the platform via server or cloud and selects their internal and external data via an API. Among the data points within internal sources are:

  • Customer/Company data
  • Account information
  • Transaction details
  • Alerts and cases

Among the data points within external data are:

  • Company structures
  • Ultimate Beneficial Owner (UBO) data
  • Enrichment
  • Watchlists

Following data integration, the entity resolution engine creates a single view of the integrated data. An existing data schema is then used to infer, configure, parse, and standardize potential linking attributes.

Network generation then links entities (i.e. customers/companies) into networks that may demonstrate some connection. 
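The two steps just described, entity resolution followed by network generation, can be sketched in a few lines. This is an illustration only and not Quantexa's method: the records, the transactions, and the name-based `entity_key` rule are hypothetical, and real entity resolution would weigh many more attributes than a normalized name.

```python
from collections import defaultdict

# Hypothetical customer records drawn from different source systems
records = [
    {"id": "r1", "name": "Acme Holdings B.V."},
    {"id": "r2", "name": "ACME HOLDINGS BV"},
    {"id": "r3", "name": "Blue Tulip Trading"},
    {"id": "r4", "name": "Northwind Logistics"},
    {"id": "r5", "name": "Solo Consulting"},
]
# Hypothetical transactions between record ids
transactions = [("r1", "r3"), ("r2", "r4")]


def entity_key(name: str) -> str:
    """Crude linking attribute: lowercase, letters and digits only."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def resolve_entities(records: list) -> dict:
    """Map each record id to a resolved entity key (the 'single view')."""
    return {r["id"]: entity_key(r["name"]) for r in records}


def build_networks(entity_of: dict, transactions: list) -> list:
    """Link entities that transact with each other and return the
    connected components of the resulting graph as sets of entity keys."""
    graph = defaultdict(set)
    for entity in entity_of.values():
        graph[entity]  # register every entity, even with no transactions
    for a, b in transactions:
        ea, eb = entity_of[a], entity_of[b]
        if ea != eb:
            graph[ea].add(eb)
            graph[eb].add(ea)

    seen, networks = set(), []
    for start in list(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        networks.append(component)
    return networks


entity_of = resolve_entities(records)
networks = build_networks(entity_of, transactions)
for network in networks:
    print(sorted(network))
```

Because records r1 and r2 resolve to the same entity, the two counterparties end up in one network even though they never transact with each other directly. Surfacing indirect connections of this kind is the purpose of resolving entities before generating networks.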

The output is a GUI of identified networks and highlighted risk areas for investigators. The display includes the most relevant connections, entities, and data links between ABN AMRO customers. These data may include party and counterparty names, relationships, and transactions.

From this output, the end-user can then prepare analytic models and perform data exploration and visualization. This output reportedly helped the bank to understand, recognize, and counteract risks and threats and potentially enable more informed, accurate, and consistent investigations and decision-making.

Unfortunately, there are no publicly reported quantitative benefits realized by ABN AMRO. However, the following qualitative results were reported by a case study and the above-cited webinar, respectively.

According to a case study published by financial research and consulting firm Celent, Quantexa’s CDI platform enhances KYC/CDD practices by:

  • Pinpointing and tracking disclosed and undisclosed beneficial owners and their associations
  • Promoting effective customer risk evaluations
  • Streamlining customer due diligence processes

Mr. Westrate also listed a couple of other benefits realized from the solution in the above-cited webinar:

  • Reduction in time spent gathering and understanding data and information
  • Improvement in the overall client experience

  1. “Acknowledgement Letter Definition.” Law Insider.
  2. “Celent Case Study: Automating KYC Investigations with ABN AMRO.” Quantexa, 15 Mar. 2022.
  3. Datametica. “How a Finance Company Saved 75% Cost by Automating KYC Process Using Machine Learning Model: Datametica Case Study.” Datametica, 25 Apr. 2022.
  4. “Programmatic Labeling.” Snorkel AI, 13 Sept. 2022.
  5. Ray, Arin. “ABN AMRO: KYC Investigations.” Celent, 15 Mar. 2022.
  6. “Snorkel Flow AI Application Development Platform.” Snorkel AI, 9 Dec. 2022.
  7. “Understanding the Steps of a ‘Know Your Customer’ Process.” Dow Jones Professional, 23 Sept. 2022.