
2nd March 2021

Taming the Flood: Filtering the News with Artificial Intelligence

After the Nile, Amazon and Yangtze, the Mississippi ranks as one of the longest and most evocative rivers in the world. At the heart of the American continent in both location and stature, the 2,320 miles of the Mississippi drain a colossal catchment area of over a million square miles, collecting water from 31 US states, some 40% of the entire USA, before discharging half a million cubic feet of water every second into the Gulf of Mexico. Its story is just as vast as its dimensions: the steamboat era of the 1830s, perhaps best captured by Mark Twain in The Adventures of Huckleberry Finn, cemented the river’s place in popular culture and imagination.

But it is the river’s economic value that has continually driven its significance. Even the sedentary pace of the 19th century paddle steamers was enough to develop the river as a key route for goods and passengers alike. Today the river carries more than 60% of US agricultural output and is a critical shipping route for oil and coal throughout the country. Its economic value is now estimated at over $400BN, and it supports millions of jobs, both directly on the river and indirectly in the businesses that have come to rely on it.

It was not always like this. The meandering flow created by nature – the “Father of Waters” as the native Anishinabe people called it – did not lend itself well to use as a shipping route. The 19th and early 20th centuries saw substantial attempts to shape, chisel and file off inconvenient curves, bends and other features less conducive to commerce: canals were dug to bypass impassable rapids, dams and levees were built to control flooding, and dredging meant the river could accommodate bigger and heavier ships.

These early attempts would prove disastrous in the long run. Throughout 1926 the rain had been unusually heavy, saturating the basin that fed the river for months. The rain, however, did not stop. The spring of 1927 brought catastrophe. The river broke out of its banks in 145 places and inundated 27,000 square miles to depths of more than 30 feet (9m). The flood left an estimated 750,000 people without food or shelter and created a humanitarian disaster whose effects can still be seen today.

Those interventions – the attempts to tame the river – had not only failed but spectacularly backfired. Straight waterways channel water more efficiently. Deeper water flows more quickly and in greater volume. Levees and dams restrict the natural flow and increase the elevation of the water line. All of these combined to worsen the resultant flooding.

After the floods, the US Army Corps of Engineers was tasked with building the largest system of flood defences in the world. Taking over from the previous local and piecemeal approaches to flood defence, the Corps took a holistic approach, looking at the entire region. Starting in 1928, in one of the largest public civil engineering projects ever devised, dozens of canals, locks and levees would be built to ensure the river could stay navigable for commerce, while allowing nature to take its course whenever the rains came. While subsequent floods have continued to drive refinements to the system, nothing on the scale of 1927 has ever happened again.

The New Flood

While the Army engineers were ushering in a revolution in how the great Mississippi was managed, the 1920s also saw a different revolution in America. As a wave of prosperity surged through the middle classes after WW1, new forms of entertainment in the form of tabloids, magazines, radio and motion pictures all captured the public appetite for modernity. These new forms of media ushered in profound cultural changes as new ideas could now freely proliferate, changing the very fabric of society. Time Magazine, Reader’s Digest, Amos ’n’ Andy, Charlie Chaplin; the 1920s set out the template for what would become billion-dollar media empires, movie studios and the liberal flow of knowledge and ideas.

For over 70 years these forms of media grew in popularity and influence, until a new flood occurred. This was not a flood of water, but of information. The internet created an entirely new way that media could be created and consumed. In 1996 the internet passed 100,000 websites and by 2000 there were more than 20 million; today there are well over a billion. Social media, which empowers anyone to become a publisher of information, has created data orders of magnitude larger still, with hundreds of millions of posts broadcast every day on the top platforms.

Reducing the Noise 

This blog has already talked about the value of intelligence gathering from open-source media (see Deciphering Risk from Adverse Media Reports) and the ways in which small pieces of information can reveal intelligence valuable to national security and countering financial crime. Technology has optimised the flow of information to create the most value for people sharing the latest news, but – just like the waters of the Mississippi – it has also created the pre-conditions for a flood.

When on-boarding clients within a financial services context, the flood becomes particularly tricky to navigate.

Using traditional search engines, an analyst may be looking for news articles: something in a client’s history that betrays a risk, criminal activity or other potential regulatory issue. Typing that client’s name into a search engine can return a huge number of results, easily totalling many hundreds if not thousands of pages. Perhaps an interesting article detailing their involvement in a bribery scandal appears on page 27 of those results. How many pages is an analyst able and willing to scroll through to spot the risk? Perhaps the client shares a name with a famous sports or pop star, and it is difficult to ascertain whether the client appears in the results at all. Perhaps the results return an article that does mention the prospective client’s name, but in the context of a family matter or an unrelated business announcement.

This deluge of poor quality and irrelevant data often means that analysts have neither the time nor the patience to wade through it and find any potential risks. This has typically meant that so-called adverse media or negative news searches are time consuming and rarely result in actionable intelligence as part of the on-boarding process.

Rudimentary approaches to solving this problem are widespread in the financial services industry. For instance, adding the word “bribery” after a client’s name, “Acme Inc.”, may hone the results somewhat. Adding other words like “corruption”, “crime” or “human trafficking” might expand the potential for discovering risk, but this remains a crude and extremely limited attempt to shape and file off the rough corners of the mountain of data returned.
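The limitation of this approach is easy to see in a few lines of code. The sketch below (with an invented term list and made-up articles) mimics the crude keyword search described above: an article that describes the same risk using words outside the list is simply never found.

```python
# Crude keyword-based adverse media screening, as commonly used today.
# The risk-term list and articles here are illustrative, not real data.
RISK_TERMS = ["bribery", "corruption", "crime", "human trafficking"]

def keyword_screen(client_name, articles):
    """Return articles mentioning both the client and any listed risk term."""
    hits = []
    for text in articles:
        lowered = text.lower()
        if client_name.lower() in lowered and any(t in lowered for t in RISK_TERMS):
            hits.append(text)
    return hits

articles = [
    "Acme Inc. announces record quarterly earnings.",
    "Regulators probe Acme Inc. over alleged bribery of officials.",
    "Acme Inc. executives deny any kickbacks or improper payments.",
]
# Only the explicit "bribery" article is caught; the "kickbacks"
# article describes the same risk but is missed entirely.
print(keyword_screen("Acme Inc.", articles))
```

However long the term list grows, any phrasing outside it slips through, which is exactly the gap the next section addresses.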

Simply put, these keyword-based searches could never hope to match the range of nuance and variety in the language of news articles, webpages and other open-source data that might inform the intelligence picture around a client’s behaviour. Furthermore, limiting such searches to a single language such as English reduces the likelihood of finding articles in any of the many languages that make up the global media landscape.

NLP: A Next-Generation Approach to Reducing Data Floods

Just as the new generations of levees and dams built on the Mississippi throughout the 20th century ushered in an entirely new, holistic approach to managing America’s greatest waterway, new technology is now doing the same for managing the flood of information inundating compliance analysts conducting adverse media checks.

Natural Language Processing (NLP) is an application of machine intelligence that can automatically read and understand unstructured text such as news articles, documents, or any other written material. Like many other techniques within the field of AI, it works by training an algorithm to recognise what a news article is about at a conceptual level, allowing the machine to understand what ‘bribery and corruption’ looks like, in multiple languages and from multiple news sources. In fact, to train a classifier for adverse news, we typically look for 10,000 examples to teach the system to spot future, similar articles about that topic. Once trained, these algorithms can instantaneously spot the features of an article about bribery and corruption, in dozens of languages, across millions of news articles.
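To make the idea concrete, here is a deliberately tiny sketch of topic classification: a from-scratch Naive Bayes model trained on a handful of invented headlines. A production system uses far richer models and, as noted above, around 10,000 labelled examples per topic; the data and labels below are purely illustrative.

```python
# Toy illustration of NLP topic classification: a minimal Naive Bayes
# text classifier built from scratch. All examples are invented.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (text, label). Returns (word counts, doc totals, vocab)."""
    counts = {}          # label -> Counter of word occurrences
    totals = Counter()   # label -> number of training documents
    for text, label in examples:
        counts.setdefault(label, Counter()).update(tokenize(text))
        totals[label] += 1
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, vocab

def predict(model, text):
    counts, totals, vocab = model
    n_docs = sum(totals.values())
    best, best_score = None, -math.inf
    for label, words in counts.items():
        # log prior + Laplace-smoothed log likelihood of each token
        score = math.log(totals[label] / n_docs)
        n_words = sum(words.values())
        for tok in tokenize(text):
            score += math.log((words[tok] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

examples = [
    ("executives charged with bribery of officials", "bribery"),
    ("minister accepts bribes for contracts", "bribery"),
    ("company fined over corruption scandal", "bribery"),
    ("local team wins championship final", "other"),
    ("new phone model launched this week", "other"),
    ("quarterly earnings beat expectations", "other"),
]
model = train(examples)
print(predict(model, "officials arrested in bribery probe"))  # → bribery
```

Even this toy model classifies an unseen headline by the concepts its words evoke rather than by an exact keyword match, which is the core of the advantage over the search queries described earlier.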

Applying this within the context of adverse media searching means NLP has the following benefits:

  1. News articles are automatically grouped into topics of interest – over 3 million news articles are published every day by news outlets all over the world. Properly trained NLP can group all of these into different categories before a human has even arrived at their desk. Articles about predicate crimes such as human trafficking, embezzlement, or narcotics can all be pre-selected by the system for searches and alerting, without the need for elaborate sets of keywords and query terms.
  2. Irrelevant or useless articles are automatically removed from review – the overwhelming majority of news articles published every day are not relevant to making informed decisions about client risk. Specifically, we estimate that of the millions of articles published every day, less than 0.8% have any value for on-boarding or periodic review. NLP removes over 99% of the noise from adverse media searches, meaning the results returned are more relevant and more likely to be of value.
  3. Duplication is reduced and similar news is grouped into stories – similar articles are published every day by multiple outlets; some are even syndicated, word-for-word copies that flood the news desks of multiple publications. NLP cuts through this noise, automatically discounting identical articles and grouping similar articles about the same story into an easy-to-review collection. This further reduces the analyst burden.
  4. News articles can be grouped not just by keywords, but by phrases and complex combinations of linguistic patterns – going beyond simple keyword monitoring, NLP looks more deeply at the structure and patterns within news articles. This means groupings of topics are more accurate, resulting in better quality searches and data.
  5. Articles in other languages are automatically searched for, identified, and translated – NLP topic classification can be done in any language, provided there is a sufficient source of ground truth on which to base its decisions. Training these topic classifiers in over 15 global languages means that risk is identified no matter where the media article has been published.
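One of the steps above, grouping near-duplicate articles into a single story, can be sketched very simply. The fragment below uses Jaccard similarity over word sets to cluster syndicated copies together; the threshold and articles are invented for illustration, and a real system would use more sophisticated similarity measures.

```python
# Illustrative sketch of story grouping: cluster articles whose word
# sets overlap heavily (Jaccard similarity). Data and threshold invented.
def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)

def group_stories(articles, threshold=0.6):
    """Greedily group articles similar to the first member of a group."""
    groups = []  # each group is a list of (article index, token set)
    for i, text in enumerate(articles):
        tokens = set(text.lower().split())
        for group in groups:
            if jaccard(tokens, group[0][1]) >= threshold:
                group.append((i, tokens))
                break
        else:
            groups.append([(i, tokens)])
    return [[i for i, _ in g] for g in groups]

articles = [
    "Acme Inc fined over bribery of officials",
    "Acme Inc fined over bribery of foreign officials",  # syndicated copy
    "Markets rally as tech stocks surge",
]
print(group_stories(articles))  # → [[0, 1], [2]]
```

The two near-identical headlines collapse into one story for review, while the unrelated article stands alone, which is how the analyst burden shrinks in practice.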

The Ripjar difference 

It is impressive to imagine the engineers who designed the monumental programme of flood defences after the great flood of 1927, doing it all without any advanced modelling and simulation software, let alone a pocket calculator. All they had was their ingenuity and a slide rule. Yet their engineering was able to tame ‘Big Muddy’ and make it safe for those living and working on it, while respecting the natural processes that put it there in the first place.

We have combined state-of-the-art NLP with Entity Resolution (see What is Entity Resolution?) to give intelligence analysts within financial institutions a world-leading adverse news capability. The colossal river of information that flows digitally around the world is measured not in gallons, but in terabytes. This torrent can be harnessed for many different purposes, including the effective screening of financial services clients for risk, and NLP is a critical tool in reducing its volume, helping institutions fight financial crime more efficiently and effectively.

If you’d like to learn more about our adverse media or negative news screening capabilities, please read the whitepaper here, or contact us for a demonstration. 
