What is Fuzzy Matching?
Fuzzy matching, traditionally used for name matching when undertaking customer screening, is a technique that identifies approximate matches rather than exact matches. It is particularly useful when dealing with data that may have inconsistencies, such as typographical errors, different spelling variations, or missing characters. By allowing for a certain degree of variation, fuzzy matching helps to find similar entries that are not identical.
How Fuzzy Matching Works
Fuzzy matching algorithms compare strings and score how similar they are against predefined criteria, such as the number of characters that must be changed, added, or removed to turn one string into another. A common example is the Levenshtein distance, which counts the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. Analysts then set a threshold that decides how close two strings must be to count as a match.
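To make the idea concrete, here is a minimal sketch of Levenshtein distance with a match threshold. The function names and the default `max_distance` of 2 are illustrative assumptions, not a recommended setting or a description of any particular product.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def is_match(a: str, b: str, max_distance: int = 2) -> bool:
    """Hypothetical threshold check: case-insensitive, at most max_distance edits."""
    return levenshtein(a.casefold(), b.casefold()) <= max_distance
```

With these definitions, `levenshtein("Jon", "John")` is 1, so `is_match("Jon", "John")` is true under the assumed threshold.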
Challenges with Fuzzy Matching for Name Matching
Despite its usefulness in many scenarios, fuzzy matching presents significant challenges when it comes to name matching. The primary issue is the high number of false positives it generates. When the matching threshold is too permissive (for example, a low minimum similarity score or a high permitted edit distance), many unrelated names are flagged as potential matches, leaving analysts to sift through large volumes of irrelevant results. The problem grows with the size of the dataset.
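The effect of a permissive threshold can be sketched with Python's standard-library `difflib.SequenceMatcher`. The watchlist entries and the two threshold values below are invented for illustration only.

```python
from difflib import SequenceMatcher

# Invented watchlist entries, for illustration only.
WATCHLIST = ["Maria Lopez", "Mario Lopes", "Marta Lopez", "Mark Lopez", "Daria Popov"]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def flagged(query: str, min_similarity: float) -> list[str]:
    """Watchlist entries whose similarity to the query meets the threshold."""
    return [name for name in WATCHLIST if similarity(query, name) >= min_similarity]


strict = flagged("Maria Lopez", min_similarity=0.95)  # only near-identical names
loose = flagged("Maria Lopez", min_similarity=0.60)   # many more candidates to review
```

Every name flagged under the strict setting is also flagged under the loose one, but the loose setting adds different people whose names merely look similar.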
Fuzzy matching also struggles with aliases and nicknames, such as “Ted” for “Edward” or “Maggie” for “Margaret.” These variations are not always phonetically or character-wise similar, making them difficult to detect using traditional fuzzy matching methods.
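A short sketch shows why character similarity misses nicknames and how a lookup table can catch them. The `NICKNAMES` table here is a tiny hand-written assumption; a real alias list would be far larger and curated.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-written alias table for illustration.
NICKNAMES = {
    "ted": {"edward", "theodore"},
    "maggie": {"margaret"},
}


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def known_forms(name: str) -> set[str]:
    """The name itself plus any formal names it is a known nickname for."""
    key = name.casefold()
    return {key} | NICKNAMES.get(key, set())


def alias_match(a: str, b: str) -> bool:
    """True when the two names share at least one known form."""
    return bool(known_forms(a) & known_forms(b))
```

Here `char_similarity("Ted", "Edward")` is well below any sensible threshold, while `alias_match("Ted", "Edward")` succeeds because the relationship is stored rather than inferred from the spelling.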
Cultural Limitations of Standard Algorithms
Standard fuzzy matching algorithms do not account for the likelihood of different variants or the cultural significance of certain name elements. For example, certain typographical errors are more likely than others, and accents on letters may be frequently missed.
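Missed accents are one case that can be handled before any comparison takes place. The sketch below uses Python's standard `unicodedata` module to strip combining marks; the function name is an assumption, and this is only one possible normalisation step.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This maps "José" to "Jose" and "Müller" to "Muller". It does not know that "Müller" is also commonly written "Mueller", and letters such as "Ł" that have no decomposed form pass through unchanged, which is where language-specific rules become necessary.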
Furthermore, some names do not have a standard English spelling, leading to multiple variations and further complicating the matching process. For example, the name “Mohammad” has numerous spellings because written Arabic usually omits short vowels, so anyone romanising the name has to choose which vowels to supply. Similarly, languages that share the Latin script transcribe Cyrillic names differently: the German “Wladimir Putin” and the English “Vladimir Putin” refer to the same person.
The importance of effective transliteration, transcription and translation in customer name screening is also raised by The Wolfsberg Group in their 2022 Negative News Screening FAQs, which provide guidance on tackling different languages and scripts in the context of adverse media screening.
What is Ripjar’s Approach to Name Matching?
Unlike traditional fuzzy matching, our name variants approach is designed to minimise false positives and maximise recall, ensuring accurate and efficient name matching.
Rather than relying only on traditional fuzzy matching methods, we take a much more comprehensive, in-depth approach. Our advanced technology encompasses over 25 different techniques which work together to identify the most relevant and accurate matches based on different name variants. This can then be tuned to suit individual organisations.
Name Variants Database
We undertake name screening in over 400 languages, scripts and dialects, and maintain an extensive database of over 1 million name variants, encompassing different spellings, translations, and truncations of names.
Our approach involves using multiple matching techniques simultaneously rather than relying on a single fuzzy matching algorithm. We consider a variety of factors, including character-based algorithms, phonetic matching, and real-life name variant patterns. By addressing all potential variations, we ensure that our system captures a comprehensive range of name possibilities.
For instance, while other systems might rely solely on the Levenshtein distance to measure character changes, we incorporate subtraction variants, spelling corrections, and database variants created from observed name representations in different countries. We also apply region-specific rules based on the origin of names, enhancing our ability to match names accurately across diverse datasets.
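The general idea of running several techniques side by side can be sketched as follows. This is a toy illustration under stated assumptions, not Ripjar's implementation: the variant groups are invented, the "phonetic" key is a crude consonant skeleton rather than an established algorithm, and `difflib` stands in for a character-based measure.

```python
from difflib import SequenceMatcher

# Invented variant groups standing in for a curated variants database.
VARIANT_GROUPS = [
    {"vladimir", "wladimir"},
    {"mohammad", "muhammad", "mohammed", "mohamed"},
]


def phonetic_key(name: str) -> str:
    """Crude consonant skeleton: keep the first letter, drop later vowels, collapse repeats."""
    name = name.casefold()
    key = name[:1]
    for ch in name[1:]:
        if ch in "aeiouy":
            continue
        if ch != key[-1]:
            key += ch
    return key


def techniques_fired(a: str, b: str, min_similarity: float = 0.85) -> list[str]:
    """Names of the techniques that consider the two names a match."""
    a, b = a.casefold(), b.casefold()
    fired = []
    if a == b:
        fired.append("exact")
    if any(a in group and b in group for group in VARIANT_GROUPS):
        fired.append("variant")
    if phonetic_key(a) == phonetic_key(b):
        fired.append("phonetic")
    if SequenceMatcher(None, a, b).ratio() >= min_similarity:
        fired.append("character")
    return fired
```

Returning the list of techniques that fired, rather than a single yes or no, lets a downstream step decide how much weight each kind of evidence deserves.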
Additional data points, such as date of birth, location, or other identifiers, are then used to help identify and discard mismatches.
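A minimal sketch of this filtering step is shown below. The `Record` fields and the rule "discard only on a definite conflict" are assumptions chosen for illustration; missing data is treated as inconclusive rather than as a mismatch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    name: str
    date_of_birth: Optional[date] = None
    country: Optional[str] = None


def conflicts(customer: Record, candidate: Record) -> bool:
    """True only when both records hold a value for a field and the values differ."""
    for field in ("date_of_birth", "country"):
        a, b = getattr(customer, field), getattr(candidate, field)
        if a is not None and b is not None and a != b:
            return True
    return False


def filter_candidates(customer: Record, candidates: list[Record]) -> list[Record]:
    """Keep name matches that secondary identifiers do not contradict."""
    return [c for c in candidates if not conflicts(customer, c)]
```

A candidate with the same name but a different date of birth is dropped, while a candidate with no recorded date of birth is kept for review.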
Risk-Based Approach and Custom Tuning
We understand that there is no one-size-fits-all solution to name matching: the right configuration depends on the specific use case and the associated risk tolerance. That’s why we enable a risk-based approach, offering multiple matching strategies out of the box. These can then be further refined to suit different scenarios and client needs, with 25+ name variant techniques to choose from.
Our Operational Data Science team works closely with customers to fine-tune these strategies, balancing the need for high recall with the minimisation of false positives. This collaboration ensures that the matching process is as efficient and accurate as possible. We can also tailor the matching strategies for smaller subsets of client data, or based on different types of risks, such as sanctions or adverse media, providing users with control over the balance between recall and the amount of manual review work required.
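As a rough illustration of per-risk tuning, the sketch below keeps a separate strategy for each risk type. The `Strategy` class, the risk-type keys, and the threshold values are all invented for this example and do not reflect actual product settings.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Strategy:
    min_similarity: float  # lower values favour recall, higher values cut review work


# Invented values: a looser setting for sanctions, a tighter one for adverse media.
STRATEGIES = {
    "sanctions": Strategy(min_similarity=0.85),
    "adverse_media": Strategy(min_similarity=0.95),
}


def is_candidate(a: str, b: str, risk_type: str) -> bool:
    """Apply the threshold configured for the given risk type."""
    strategy = STRATEGIES[risk_type]
    score = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
    return score >= strategy.min_similarity
```

Under these assumed settings, "Katherine" and "Catherine" are returned for sanctions screening but not for adverse media, which shows how one pair of names can be treated differently depending on risk appetite.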
Data-Optimised Name Matching and Linguistic Matching
We leverage real data to optimise our matching processes, including analysing the frequencies of name occurrences. By understanding how often certain names appear on watchlists or in media, we can make informed decisions about which name versions to include or exclude, enhancing the overall performance of our matching engine.
Specifically for adverse media matching, we are also able to leverage the rarity of a name in a particular region to reduce the false positives arising from returning too many matches on common names.
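One way to picture the use of rarity is a simple log-scaled score computed from name frequencies per region. The counts, the smoothing, and the `min_rarity` cut-off below are invented assumptions, shown only to illustrate the principle.

```python
import math

# Invented frequency counts per region, for illustration only.
NAME_COUNTS = {
    "GB": {"john smith": 5000, "zebedee quill": 2},
}


def rarity(name: str, region: str) -> float:
    """Higher for rarer names; uses add-one smoothing so unseen names are allowed.

    A region with no data yields 0.0, i.e. no evidence that the name is rare.
    """
    counts = NAME_COUNTS.get(region, {})
    total = sum(counts.values()) + len(counts) + 1
    seen = counts.get(name.casefold(), 0) + 1
    return math.log(total / seen)


def keep_name_only_hit(name: str, region: str, min_rarity: float = 1.0) -> bool:
    """Hypothetical rule: return a name-only media hit only if the name is rare enough."""
    return rarity(name, region) >= min_rarity
```

With this rule, a name-only hit on a very common name is suppressed unless other evidence supports it, while a hit on a rare name is still returned.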
In addition, we use linguistic techniques to understand the origin and structure of names, which allows us to identify and manage name variations more effectively. Once we have characterised the likely origin of a name, a rule-based matching system adjusts how each part of the name is matched, because certain variations and name structures are more prevalent in some cultures and languages than in others. For example, we handle declensions in different languages, recognising them as legitimate variants rather than typographical errors. This level of linguistic sophistication enables us to accurately match names across different cultures and languages.
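A single origin-specific rule can be sketched as follows. It assumes the surname has already been identified as Russian and covers only one pattern (surnames in -ov, -ev, or -in that take a final -a in the feminine or genitive form); it is an illustration, not a description of the rules in our system.

```python
import re

# Matches a trailing "a" after the common Russian surname endings -ov, -ev, -in.
_RUSSIAN_A_ENDING = re.compile(r"(ov|ev|in)a$")


def normalise_russian_surname(surname: str) -> str:
    """Map forms such as 'Ivanova' or 'Putina' back to the base surname."""
    return _RUSSIAN_A_ENDING.sub(r"\1", surname.casefold())
```

"Ivanov" and "Ivanova" both normalise to "ivanov", so they compare as equal without any edit-distance allowance. Applied blindly to a name of another origin, such as "Christina", the same rule would produce a wrong result, which is why identifying the likely origin comes first.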
Our name variants approach also includes translations of corporate names in various languages, and their likely manipulations. For example, a corporate name in Chinese might have multiple international translations, and we include all plausible versions to ensure comprehensive coverage.
Conclusion
Fuzzy matching alone, while useful for finding approximate matches, falls short in the context of name matching due to its high rate of false positives and inability to account for cultural and linguistic nuances. Maintaining a comprehensive database of name variants and employing multiple matching techniques can provide more accurate and efficient results. Understanding the likely mistakes and variations specific to names is crucial for effective name matching, making it a complex task that goes beyond the capabilities of simple fuzzy matching algorithms.
Ripjar’s name matching system stands out from traditional fuzzy matching approaches by combining data science, linguistic techniques, and customised tuning. Our comprehensive name variants database supports accurate and efficient name matching for a wide range of use cases and makes a risk-based approach possible. By understanding the cultural and structural nuances of names, we provide a solution that meets the complex needs of global screening at scale and consistently ranks at the top in name matching tests.