Unlock Hidden KYC Connections Using Graph Analytics
Complying with AML/KYC requirements is no longer a mere bureaucratic formality for the sole purpose of obtaining an operating license from the relevant competent authorities. Nowadays, regulators expect a demonstrable risk-based AML approach regardless of business size or nature, as long as it involves an asset of intrinsic value. Therefore, AML/KYC compliance has become an essential step for any startup or established business in the FinTech area. Blockchain and crypto-based startups targeting Initial Coin Offerings (ICOs) are not an exception here.
The AML/KYC market is awash with products purporting to offer simple, straightforward KYC checks. However, they often neglect vital information about the closest relatives or business partners of a sanctioned or politically exposed person. How does it sound, ethically and legally, if your ICO token is purchased by the spouse, the son, or the business associate of a well-known terrorist? We can agree that such a short-sighted screening approach is not only inadequate but also would not survive the scrutiny of any regulator, especially when this type of data is more often than not available in the public domain.
For a handful of screening requests, a skilled team of analysts would be able to paint an accurate picture of such complex relationships and associations. However, this approach does not scale to tens, let alone hundreds or thousands, of screening requests per day. In this article, we will explore how such a comprehensive screening process can be automated and made more ‘intelligent’. To implement a reliable solution, we need a tech stack that is able to manage our sanctions lists, ‘understand’ the associative information therein, and quickly search highly connected datasets.
Graph Model
Suppose we have a list of sanctioned entities that includes a good mix of individuals and companies. Draw a circle around each of the entities at hand. Then draw lines between entities to denote the apparent relationships between them, e.g. Person A is the owner of Company X. Let’s also add direction to these lines to show efferent (i.e. outbound) relationships. Et voilà, we have a directed graph structure! Now, we are able to explore/traverse paths and ultimately expose any hidden or indirect connections.
However, in today’s world of complex geopolitics, we cannot overlook the fact that certain countries are perceived to be riskier than others. To construct a more realistic model, let’s introduce countries as well as any applicable relationships to our graph.
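The model described above can be sketched in a few lines of plain Python, independent of any graph database. The names `Node`, `Edge` and `DirectedGraph`, the labels, and the sample entities are illustrative assumptions rather than a fixed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    key: str    # unique identifier, e.g. 'A'
    label: str  # 'Person', 'Organisation' or 'Country' (assumed labels)
    name: str


@dataclass(frozen=True)
class Edge:
    source: str    # key of the node the relationship starts from
    relation: str  # e.g. 'OWNER', 'INCORPORATED'
    target: str    # key of the node the relationship points to


class DirectedGraph:
    """A tiny in-memory directed graph with labelled relationships."""

    def __init__(self):
        self.nodes = {}
        self.outbound = defaultdict(list)

    def add_node(self, node):
        self.nodes[node.key] = node

    def add_edge(self, source, relation, target):
        # Both endpoints must exist before they can be connected.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("unknown node key")
        self.outbound[source].append(Edge(source, relation, target))

    def relationships_of(self, key):
        """Return (relation, target name) pairs leaving the given node."""
        return [(e.relation, self.nodes[e.target].name) for e in self.outbound[key]]


graph = DirectedGraph()
graph.add_node(Node("A", "Person", "Person A"))
graph.add_node(Node("X", "Organisation", "Company X"))
graph.add_node(Node("GBR", "Country", "United Kingdom"))
graph.add_edge("A", "OWNER", "X")
graph.add_edge("X", "INCORPORATED", "GBR")

print(graph.relationships_of("A"))
```

Storing only outbound edges per node keeps the sketch short; a real store would typically index both directions so that inbound relationships can be followed just as cheaply.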
Technology Stack
To build a reliable graph-based solution, we require a tech stack that enables us to hold and query millions of nodes with ease. Indeed, an adequate sanctions/watchlist dataset could easily consist of millions of entities and their respective associations. Once this dataset is converted to a graph and supporting reference data, such as countries, is added, the graph could easily swell to tens, if not hundreds, of millions of elements.
A quick internet search for state-of-the-art graph databases will reveal that there are not many readily available enterprise-grade solutions that are able to handle such volumes. We compared four of the leading and commonly used solutions, namely DataStax, Neo4j, OrientDB, and Titan. Below, we present a summary of our findings:
Feature | DataStax | Neo4j | OrientDB | JanusGraph/Titan |
---|---|---|---|---|
ACID compliant | No | Yes | Yes | Yes |
Query Language | Gremlin | Gremlin, Cypher | Gremlin, Extended SQL | Gremlin |
Java API | Yes | Yes | Yes | Yes |
ORM/OGM Support | No | Yes | No | No |
Spring Data support | No | Yes | No | No |
Clustering Support | Yes | Yes | Yes | depends |
IaaS Offerings | AWS Marketplace | GrapheneDB, Azure Marketplace, AWS Marketplace | Azure Marketplace, AWS Marketplace | AWS Marketplace |
Open Source | No | Yes | Yes | Yes |
Community | Small | Huge | Large | Medium |
DB-Engines ranking | 3 | 1 | 4 | 15 |
Apache TinkerPop | Yes | Yes | Yes | Yes |
Some of our key takeaways are:
- DataStax has a proven track record among large enterprises.
- Neo4j is somewhat of a newcomer. However, it already has a good presence on various platforms, including Docker containers. It is also available on leading Infrastructure as a Service (IaaS) platforms such as Microsoft’s Azure Cloud Marketplace and GrapheneDB’s managed service.
- Neo4j is well supported by popular frameworks such as the Spring Framework, Grails, Django, Node.js and so forth.
- OrientDB and Neo4j increasingly offer similar capabilities and performance. However, Neo4j’s intuitive Cypher query language made it our choice.
Proof of Concept
Let’s see how our proposed graph model can be implemented using Neo4j. First, let’s create a couple of watchlist entities using the following Cypher statements:
CREATE (n1:WatchlistOrganisationNode { pk: 'FBP', entityName: 'FooBar Petroleum' })
CREATE (n2:WatchlistPersonNode { pk: 'JD', entityName: 'John Doe' })
CREATE (n3:WatchlistPersonNode { pk: 'JR', entityName: 'Jane Roe' })
CREATE (n4:WatchlistOrganisationNode { pk: 'BB', entityName: 'Bazz Bank' })
CREATE (n5:WatchlistPersonNode { pk: 'RD', entityName: 'Rachel Doe' })
Let’s then add relationships to the newly-created nodes:
MATCH
(a:WatchlistPersonNode{pk: 'JD'}),(b:WatchlistPersonNode{pk: 'JR'}),(c:WatchlistOrganisationNode{pk: 'FBP'}),
(d:WatchlistOrganisationNode{pk: 'BB'}),(e:WatchlistPersonNode{pk: 'RD'})
CREATE
(b)-[r1:DAUGHTER]->(a),(a)-[r2:OWNER]->(c),(d)-[r3:SHAREHOLDER]->(c),(e)-[r4:SISTER]->(a)
RETURN r1, r2, r3, r4
In a very similar fashion, let’s complement our graph with a couple of country relationships (for brevity, we assume that some countries are already present in the graph with their respective names and ISO codes):
MATCH (p1:WatchlistPersonNode{pk: 'JD'}),
(p2:WatchlistPersonNode{pk: 'JR'}),
(o1:WatchlistOrganisationNode{pk: 'FBP'}),
(c1:CountryNode{iso3Code: 'IRN'}),
(c2:CountryNode{iso3Code: 'GBR'})
CREATE (p1)-[r1:BORN_IN]->(c1), (o1)-[r2:INCORPORATED]->(c2), (p2)-[r3:CITIZENSHIP]->(c2)
RETURN r1, r2, r3
Let’s check out how the complete graph looks so far:
MATCH p = (n)-[]-() WHERE n.pk IN ['FBP', 'JD', 'JR', 'BB', 'RD'] RETURN p
We can clearly see that we have been successful at replicating our target graph model. Let’s now see how we can go about screening an individual and identify any relevant connections:
MATCH p = (n:WatchlistPersonNode)-[*1..2]->() WHERE n.entityName =~ '.*Roe'
RETURN p
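Now, it is quite clear that there is a sanctioned individual named “Jane Roe” whose father, John Doe, was born in Iran: following the DAUGHTER relationship and then BORN_IN takes us from her to the country in two hops. She would therefore require Enhanced Due Diligence.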
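To make the semantics of the variable-length pattern `[*1..2]` concrete, the following plain-Python sketch enumerates the same outbound paths over the toy data created by the Cypher statements above. The function names `outbound_paths` and `render` and the triple representation are illustrative assumptions; this is not how Neo4j evaluates the query internally.

```python
# Toy data mirroring the Cypher statements above: (source, relation, target).
EDGES = [
    ("JR", "DAUGHTER", "JD"),
    ("JD", "OWNER", "FBP"),
    ("BB", "SHAREHOLDER", "FBP"),
    ("RD", "SISTER", "JD"),
    ("JD", "BORN_IN", "IRN"),
    ("FBP", "INCORPORATED", "GBR"),
    ("JR", "CITIZENSHIP", "GBR"),
]


def outbound_paths(edges, start, max_hops):
    """Enumerate every outbound path of 1..max_hops relationships from start.

    Each path is a list of (source, relation, target) triples. A relationship
    is never used twice within the same path.
    """
    paths = []
    frontier = [[]]  # partial paths; the empty path sits at the start node
    for _ in range(max_hops):
        next_frontier = []
        for path in frontier:
            tail = path[-1][2] if path else start
            for edge in edges:
                if edge[0] == tail and edge not in path:
                    next_frontier.append(path + [edge])
        paths.extend(next_frontier)
        frontier = next_frontier
    return paths


def render(path):
    """Format a path as 'JR -[DAUGHTER]-> JD -[BORN_IN]-> IRN'."""
    return path[0][0] + "".join(f" -[{rel}]-> {dst}" for _, rel, dst in path)


for path in outbound_paths(EDGES, "JR", max_hops=2):
    print(render(path))
```

Among the printed paths is the two-hop route from Jane Roe through John Doe to Iran, which is the connection a name-only check would miss.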
Real Data
Needless to say, the aforementioned example is purely for illustrative purposes. However, it is not far off from a real-life scenario. With a comprehensive dataset, such as SwiftDil’s extensive database, simple queries can unlock the most obscure and opaque relationships and connections:
Uncover Closest Associates
MATCH p=(:WatchlistOrganisationNode{pk:"1048394"})-[:ASSOCIATE_OF*1..2]->(:WatchlistOrganisationNode)
WITH nodes(p) AS nodes
UNWIND nodes AS n
MATCH cp=(n)-->(c:CountryNode)
RETURN cp
Highlight Country Connections
MATCH (c1:CountryNode{iso3Code: 'GBR'}),(c2:CountryNode{iso3Code: 'SYR'}),
p = shortestPath((c1)-[*]-(c2))
WITH p
WHERE length(p) > 1
RETURN p
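The idea behind the shortest-path query can be illustrated with a breadth-first search that ignores relationship direction, much like the undirected pattern `(c1)-[*]-(c2)`. The sketch below is plain Python over hypothetical data: the link between Bazz Bank and Syria is invented purely for illustration, and `shortest_path` is an assumed helper name, not a Neo4j API.

```python
from collections import deque

EDGES = [
    ("FBP", "INCORPORATED", "GBR"),
    ("BB", "SHAREHOLDER", "FBP"),
    ("BB", "INCORPORATED", "SYR"),  # hypothetical link, for illustration only
    ("JR", "CITIZENSHIP", "GBR"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search that ignores relationship direction.

    Returns the node keys along one shortest path, or None if the two
    nodes are not connected.
    """
    neighbours = {}
    for source, _, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)

    previous = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in sorted(neighbours.get(current, ())):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None


path = shortest_path(EDGES, "GBR", "SYR")
# Keep only indirect connections, i.e. paths more than one relationship long.
if path is not None and len(path) - 1 > 1:
    print(" - ".join(path))
```

Breadth-first search visits nodes in order of distance from the start, so the first time the goal is reached the path found is guaranteed to be one of the shortest.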
Conclusion
In this article, we have covered simple yet very powerful graph concepts and explained how they can be applied to KYC screening. These concepts can be evolved, and the graph model can be further improved based on business needs. For instance, a scorecard could be built on top of Neo4j to improve the accuracy of matches and widen the scope of risk indicators taken into account. An advanced scorecard may also leverage Machine Learning and advanced text/phonetic search algorithms. At SwiftDil, we have employed state-of-the-art techniques to implement a powerful scorecard around Neo4j, which has yielded unparalleled matching rates.
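As a rough illustration of the text-matching side of such a scorecard, the sketch below scores a screened name against watchlist names using Python’s standard `difflib` module. It is only an assumed starting point: the function names, the 0.85 threshold and the sample names are hypothetical, and it does not represent SwiftDil’s scorecard.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a 0..1 similarity score between two names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_matches(query, names, threshold=0.85):
    """Return (name, score) pairs scoring at least threshold, best first."""
    scored = [(name, name_similarity(query, name)) for name in names]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical watchlist names taken from the toy example.
watchlist = ["John Doe", "Jane Roe", "Rachel Doe"]
for name, score in best_matches("Jon Doe", watchlist):
    print(f"{name}: {score:.2f}")
```

A misspelt query such as "Jon Doe" still surfaces "John Doe", whereas the exact regular-expression match used earlier would miss it; a production scorecard would likely combine several such signals, including phonetic encodings.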