Building a sanctions and adverse-media screening pipeline

The sources that matter

The screening corpus has three layers.

Government-published sanctions lists. Most of the major lists are published in machine-readable formats with regular updates. The lists from the major sanctioning jurisdictions cover the majority of the entities anyone running international trade needs to be alert to. They are free, authoritative, and updated on schedules that are public.

Politically-exposed-persons databases. A separate category of risk, with overlapping but distinct sources. The PEP layer matters because counterparties that are not sanctioned but are connected to politically exposed individuals carry a different risk profile, and the workflows that follow a hit are different.

Adverse media. The hardest layer to systematise. The signal is in news articles, regulatory enforcement actions, civil judgements, and investigative reporting. The corpus is large, the quality is variable, and the matching is fuzzy. We use a combination of structured sources (regulatory enforcement registries, court records where they are open) and unstructured sources (news feeds, with the source ranked by reliability).

Building the screening corpus is largely a one-time setup. Once it is in place, the ongoing cost is storage and the periodic refresh of each source.

The matching problem

Naive exact-string name matching fails in both directions: it flags unrelated parties and misses real ones. People share names. Companies share names. Transliteration produces variation. Aliases and former names are common in sanctions lists. The matching strategy is the part of the pipeline that determines whether the screening is useful or noisy.

Our approach is layered.

Exact match on identifiers where they are available — passport numbers, registered company numbers, beneficial-ownership identifiers. When an exact match fires, the confidence is high and the workflow proceeds to detailed review.
Phonetic and transliteration-aware name matching as the primary fallback. Several algorithms run in parallel, and the union of their candidates is surfaced for ranking.
Contextual disambiguation using available metadata: date of birth, country of registration, sector, beneficial owners. A name match that is consistent on the metadata is a high-confidence hit. A name match where the metadata is inconsistent is likely a false positive and is flagged accordingly.
Adjacency screening. The party itself may not be sanctioned but a director, beneficial owner, or known associate may be. The adjacency layer pulls one degree of corporate connection and re-runs the screening on each.
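
The four layers above can be sketched in a few lines. This is a minimal illustration, not our production matcher: the `Party` record, the function names, and the 0.85 threshold are all assumptions made for the example, and `difflib.SequenceMatcher` over diacritic-stripped, token-sorted names stands in for the phonetic and transliteration-aware algorithms a real pipeline would run in parallel.

```python
from __future__ import annotations

import unicodedata
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Party:
    """Hypothetical record shape; real list entries carry far more fields."""
    name: str
    identifiers: set[str] = field(default_factory=set)  # passport, company numbers
    aliases: list[str] = field(default_factory=list)
    country: str | None = None
    birth_year: int | None = None


def normalise(name: str) -> str:
    """Strip diacritics, case-fold, and sort tokens so word order does not matter."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(sorted(stripped.casefold().split()))


def name_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; a simple stand-in for phonetic matchers."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def metadata_consistent(subject: Party, listed: Party) -> bool:
    """Inconsistent only when both sides hold a value and the values differ."""
    pairs = [(subject.country, listed.country),
             (subject.birth_year, listed.birth_year)]
    return all(x is None or y is None or x == y for x, y in pairs)


def screen(subject: Party, listed: Party, threshold: float = 0.85) -> str:
    # Layer 1: exact match on any shared identifier.
    if subject.identifiers & listed.identifiers:
        return "identifier_match"
    # Layer 2: best name score across the listed name and its aliases.
    best = max(name_score(subject.name, n) for n in [listed.name, *listed.aliases])
    if best < threshold:
        return "no_match"
    # Layer 3: metadata separates high-confidence hits from likely false positives.
    if metadata_consistent(subject, listed):
        return "high_confidence"
    return "likely_false_positive"


def screen_with_adjacency(subject: Party, associates: list[Party],
                          listed: Party) -> dict[str, str]:
    """Layer 4: re-run the same screening on one degree of connected parties."""
    return {p.name: screen(p, listed) for p in [subject, *associates]}
```

In this sketch a name hit with conflicting metadata is still returned, labelled as a likely false positive, rather than silently dropped. That keeps the decision about what to suppress with the workflow, not the matcher.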

Tuning the tolerance for false positives

Sanctions screening always returns false positives. The question is what the right tolerance is. Tune the matching too tight and you miss real hits. Tune it too loose and the operator drowns in noise and starts ignoring the alerts.

The right tolerance depends on the risk profile of the use case. For a counterparty-onboarding workflow where every match is reviewed by a trained analyst, looser matching is fine — the false positives are filtered downstream. For a high-volume payment-screening workflow where most hits go through automated processing, tighter matching is necessary because the noise tolerance is much lower.

We expose the tuning explicitly. Each workflow declares its risk profile, the matching parameters are set accordingly, and the false-positive rate is monitored and reviewed quarterly. The discipline is that the tolerance is a parameter, not a default. Owning the parameter is the difference between a screening pipeline that serves the business and one that imposes the vendor's choices on the business.
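
One way to make the tolerance a declared parameter is sketched below. The profile names, the threshold values, and the `max_false_positive_rate` trigger are hypothetical numbers chosen for illustration; the point is only that a workflow without a declared profile fails loudly instead of inheriting a default.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MatchingProfile:
    name: str
    name_threshold: float           # minimum similarity needed to raise a candidate
    max_false_positive_rate: float  # above this, the profile is flagged for review


# Illustrative values only: looser matching where an analyst reviews every hit,
# tighter matching where most hits are processed automatically.
PROFILES = {
    "counterparty_onboarding": MatchingProfile("counterparty_onboarding", 0.80, 0.60),
    "payment_screening": MatchingProfile("payment_screening", 0.92, 0.20),
}


def profile_for(workflow: str) -> MatchingProfile:
    """Refuse to screen a workflow that has not declared its risk profile."""
    try:
        return PROFILES[workflow]
    except KeyError:
        raise ValueError(f"workflow {workflow!r} has not declared a risk profile") from None


def observed_false_positive_rate(dispositions: list[bool]) -> float:
    """dispositions: True where a reviewer confirmed the hit, False where dismissed."""
    if not dispositions:
        return 0.0
    return dispositions.count(False) / len(dispositions)


def needs_review(profile: MatchingProfile, dispositions: list[bool]) -> bool:
    """True when the observed rate exceeds what the profile says it tolerates."""
    return observed_false_positive_rate(dispositions) > profile.max_false_positive_rate
```

The quarterly review then has something concrete to look at: the reviewer dispositions for the period, compared against the tolerance the workflow itself declared.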

Adverse media without becoming a junk filter

Adverse media is the hardest layer because the corpus is unstructured and the signal-to-noise ratio is low. The naive approach — full-text search for the counterparty name across a news corpus — produces volumes of irrelevant matches and misses the meaningful ones.

The approach that has worked for us is two-stage. First, a coarse retrieval against the news corpus, indexed by the counterparty name and any known aliases. Second, a model-based filter that reads each candidate article and judges whether the article is genuinely about the counterparty (not a coincidental name match) and whether the content is genuinely adverse (not a routine business mention).

The model is given an explicit definition of what counts as adverse — regulatory action, criminal proceedings, fraud allegation, sanctions designation in any jurisdiction, civil judgement of significant size, investigative reporting alleging misconduct — and is asked to return a structured judgement with a justification.

The output is a much shorter list of candidates than the raw retrieval, with explanations the human reviewer can act on quickly. The reviewer makes the final call. The model has done the filtering, not the deciding.
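
The two stages and the structured judgement can be sketched as follows. This is an illustration under stated assumptions: the `Judgement` shape, the category names, and the `judge` callable are hypothetical, substring search stands in for an indexed retrieval, and no particular model or API is implied. The `judge` argument is wherever the model call would go.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Callable


class AdverseCategory(Enum):
    """Mirrors the explicit definition of adverse content given to the model."""
    REGULATORY_ACTION = "regulatory_action"
    CRIMINAL_PROCEEDINGS = "criminal_proceedings"
    FRAUD_ALLEGATION = "fraud_allegation"
    SANCTIONS_DESIGNATION = "sanctions_designation"
    CIVIL_JUDGEMENT = "civil_judgement"
    INVESTIGATIVE_REPORTING = "investigative_reporting"


@dataclass
class Judgement:
    about_counterparty: bool          # False for a coincidental name match
    category: AdverseCategory | None  # None for a routine business mention
    justification: str                # shown to the human reviewer


def retrieve(corpus: list[str], names: list[str]) -> list[str]:
    """Stage 1: coarse retrieval on the counterparty name and known aliases."""
    lowered = [n.casefold() for n in names]
    return [a for a in corpus if any(n in a.casefold() for n in lowered)]


def filter_candidates(
    articles: list[str], judge: Callable[[str], Judgement]
) -> list[tuple[str, Judgement]]:
    """Stage 2: keep articles judged both on-target and adverse."""
    kept = []
    for article in articles:
        judgement = judge(article)
        if judgement.about_counterparty and judgement.category is not None:
            kept.append((article, judgement))
    return kept
```

Each surviving article travels with its justification, so the reviewer sees why it was kept. Nothing in the sketch marks a counterparty as cleared or rejected; it only shortens the list.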

What the pipeline does not do

The screening pipeline is one input into a compliance decision, not the decision itself. It does not establish that a counterparty is acceptable to transact with. It does not satisfy any specific regulatory obligation in any jurisdiction without being integrated into a broader compliance programme. It does not replace legal counsel on questions of jurisdiction, applicable sanctions regimes, or licence requirements.

The pipeline produces a structured report with explicit findings and explicit confidence levels. The compliance officer reads the report, layers in jurisdictional knowledge and judgement, and decides. The pipeline accelerates the screening; the human owns the decision. That boundary is important, and any system that blurs it creates a bigger problem than the one it claims to solve.

The takeaway

Sanctions and adverse-media screening is one of the layers where building beats buying for an operator that takes the work seriously. The off-the-shelf tools are reasonable starting points; they are not durable answers because they impose the vendor's matching choices, the vendor's source choices, and the vendor's pricing curve on a capability that can be built and operated for less, with more transparency, on hardware you control.

The architecture is mature. The sources are available. The matching is engineering. The judgement is the compliance officer's, and that does not change. What does change is that the operator owns the substrate the judgement runs on top of.

Working on this?

For operators evaluating sovereign-infrastructure architecture for a business of meaningful scale, we run a quarterly cohort of stack-design engagements.

