What the model can actually see
A modern vision-capable model, given a high-resolution scan of a document, can do several things reliably.
- Read printed text, including in low-quality scans and in multiple languages, as well as handwritten endorsements where the handwriting is reasonably clear.
- Identify the structural elements of the document — letterhead, body, signature blocks, stamps, watermarks, security features — and report on whether each is present and consistent with the document type.
- Cross-reference fields within the document for internal consistency: dates, amounts, party names, reference numbers.
- Compare two documents side by side and report on consistencies and inconsistencies between them.
- Flag artefacts of digital alteration where the alteration is sufficiently visible — pixel-level inconsistencies around edited regions, font mismatches in inserted text, copy-paste artefacts from another document.
None of these individually requires more sophistication than a careful human reviewer would bring. The change is that doing all of them in parallel on every document in a stack now costs almost nothing.
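As an illustration of the internal-consistency check in the list above, here is a minimal sketch that assumes the model's reading of a document has already been parsed into fields. The `ExtractedInvoice` schema, its field names, and the sample values are hypothetical and chosen only for the example; they are not a description of any particular extraction format.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class ExtractedInvoice:
    # Hypothetical schema: a real extraction would carry many more fields.
    issue_date: date
    due_date: date
    line_totals: list[Decimal]
    stated_total: Decimal
    reference_numbers: list[str]  # every place the reference appears on the page


def internal_inconsistencies(doc: ExtractedInvoice) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    if doc.due_date < doc.issue_date:
        findings.append("due date precedes issue date")
    if sum(doc.line_totals, Decimal("0")) != doc.stated_total:
        findings.append("line totals do not sum to stated total")
    if len(set(doc.reference_numbers)) > 1:
        findings.append("reference number differs between occurrences")
    return findings


# Illustrative values only.
example = ExtractedInvoice(
    issue_date=date(2024, 3, 1),
    due_date=date(2024, 2, 15),
    line_totals=[Decimal("1200.00"), Decimal("350.50")],
    stated_total=Decimal("1550.50"),
    reference_numbers=["INV-0042", "INV-0042"],
)
print(internal_inconsistencies(example))  # ['due date precedes issue date']
```

Amounts are held as `Decimal` rather than `float` so that the sum comparison is exact.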
What the model cannot see, or sees badly
The honest list of failure modes is as important as the capability list.
Sophisticated forgeries pass. A document forged by someone who knows what they are doing, with access to genuine letterhead, genuine stamps, and the right printing equipment, can be visually indistinguishable from a real one. The model has nothing to flag because there is nothing visually wrong. Verification of such a document depends on cross-referencing with the issuing authority, which is a different problem.
Genuine documents in unfamiliar formats. A real document type the model did not see during training can confuse it, producing false positives where it reads legitimate features as anomalies. The fix is to provide reference examples of the format in the prompt, which adds operator overhead.
Substantive content versus cosmetic accuracy. The model can confirm the document looks right; it cannot confirm the underlying claim is true. A bill of lading that looks impeccable can still describe cargo that does not exist. Visual verification is necessary, not sufficient.
The discipline is to use the vision model for the visual layer of verification and to know exactly where the visual layer stops being useful.
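The fix for unfamiliar formats described above, supplying reference examples in the prompt, might be assembled along these lines. This is a sketch only: `ExaminationRequest` and its fields are hypothetical names, the instruction wording is illustrative, and the structure is not tied to any particular model provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class ExaminationRequest:
    # Hypothetical container for one examination; not a vendor API type.
    declared_type: str
    target_image: bytes
    reference_images: list[bytes] = field(default_factory=list)

    def instructions(self) -> str:
        """Build the text instructions that accompany the attached images."""
        lines = [
            f"The submitter declared this document to be: {self.declared_type}.",
            "Check structural elements, internal consistency, "
            "and visible alteration artefacts.",
        ]
        if self.reference_images:
            # Tell the model how to use the known-good examples.
            lines.append(
                f"{len(self.reference_images)} genuine example(s) of this format "
                "are attached; treat features they share with the target as "
                "expected, not anomalous."
            )
        return "\n".join(lines)


request = ExaminationRequest(
    declared_type="certificate of origin",
    target_image=b"<target scan bytes>",
    reference_images=[b"<reference scan 1>", b"<reference scan 2>"],
)
print(request.instructions())
```

The operator overhead shows up here as the need to maintain a vetted set of reference images per document format.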
The verification pipeline we run
Our document-verification pipeline has several stages, each with a clear job.
- Ingestion and normalisation. Documents arrive in mixed formats. They are converted to high-resolution images, indexed, and tagged with the type the submitter declared.
- Visual examination. The vision model receives each document with a structured prompt that tells it the expected document type and asks it to verify the structural elements, the internal consistency, and any visible alteration artefacts. The output is a structured report with specific findings.
- Cross-document consistency. When a stack of documents is submitted together (commercial invoice, packing list, bill of lading, certificate of origin), the model is asked to compare every pair and flag inconsistencies.
- Independent reference checks. Where the document references an issuing authority, a counterparty, or a vessel, those references are checked against independent sources. This is a database problem rather than a vision problem: the vision step extracts the references and the database step verifies them.
- Human review. A trained reviewer reads the model's report, examines the documents directly, and makes the final judgement. The model accelerates the review; it does not replace it.
The pipeline is fast enough that we can examine an entire trade document set in minutes rather than hours, and thorough enough that the human reviewer is operating from a structured report rather than a blank page.
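The cross-document stage above asks the model to compare every pair. As a hedged illustration of what "every pair" means, the sketch below enumerates the pairs deterministically over fields that are assumed to have been extracted already. The function name, field names, and sample values are hypothetical, and this is a complement to the model's comparison, not a description of it.

```python
from itertools import combinations


def cross_document_mismatches(
    documents: dict[str, dict[str, str]],
) -> list[tuple[str, str, str]]:
    """Compare every pair of documents on the fields they have in common.

    Returns one (field, doc_a, doc_b) tuple per disagreement.
    """
    mismatches = []
    for (name_a, fields_a), (name_b, fields_b) in combinations(documents.items(), 2):
        # Only fields present in both documents can disagree.
        for field_name in sorted(fields_a.keys() & fields_b.keys()):
            if fields_a[field_name] != fields_b[field_name]:
                mismatches.append((field_name, name_a, name_b))
    return mismatches


# Illustrative values only.
stack = {
    "commercial_invoice": {"consignee": "Acme Ltd", "gross_weight_kg": "1200"},
    "packing_list": {"consignee": "Acme Ltd", "gross_weight_kg": "1250"},
    "bill_of_lading": {"consignee": "Acme Ltd", "gross_weight_kg": "1200"},
}
for finding in cross_document_mismatches(stack):
    print(finding)
# ('gross_weight_kg', 'commercial_invoice', 'packing_list')
# ('gross_weight_kg', 'packing_list', 'bill_of_lading')
```

A stack of n documents yields n(n-1)/2 pairs, so a four-document trade set involves six comparisons.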
Where the cheap verification changes incentives
The structural shift is not that verification becomes more accurate. It becomes accessible. When examining a document set costs a few pounds rather than a few thousand, the economics of fraud change in two ways.
First, verification can happen earlier in the chain. A counterparty can run their own verification on a document set the moment it arrives, before any commercial commitment. The traditional pattern of verifying only at the last possible stage falls away. Bad documents are filtered out earlier, and the surrounding ecosystem learns about patterns of fraud that would previously have remained hidden inside individual transactions.
Second, verification becomes routine on legitimate trade. If everyone is running a structured visual examination on every document, the legitimate operators benefit from the auditable verification trail and the bad actors have less room to operate. Producing a forgery becomes a worse bet because the visual layer is genuinely scrutinised rather than waved through.
The honest assessment is that this is not a complete solution to trade fraud. It is a meaningful improvement to one specific layer. The other layers — counterparty verification, beneficial ownership tracing, sanctions screening — remain necessary, and the gains compound when all of them operate together.
What we tell clients about the limits
The most important conversation in this work is the one about what verification can and cannot establish. A client who believes the visual examination is the whole answer is in a worse position than one who has done no examination at all: false confidence invites commitments that an absence of confidence would have held back.
The framing we use: the visual examination establishes that the documents appear consistent with what they claim to be. It does not establish that the underlying transaction is real, that the cargo exists, that the issuing authority actually issued these documents, or that the parties are who they say they are. Each of those is a separate verification question, addressed by separate methods, and a complete trade-verification operation runs all of them in parallel.
The visual examination is the cheapest layer. The cheapest layer being well-instrumented does not mean the more expensive layers can be skipped. The honest version of the work names this clearly.
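One way to keep that distinction visible in a report is to record each verification question separately and refuse to summarise as "passed" while any layer is outstanding. The sketch below is an assumption about how that could be modelled; the layer names paraphrase the questions listed above, and `Status`, `REQUIRED_LAYERS`, and `overall_verdict` are hypothetical names.

```python
from enum import Enum


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    NOT_RUN = "not run"


# Hypothetical layer names, paraphrasing the separate verification questions.
REQUIRED_LAYERS = (
    "visual_examination",
    "issuer_confirmation",
    "counterparty_identity",
    "cargo_existence",
)


def overall_verdict(results: dict[str, Status]) -> str:
    """Summarise all layers; a visual pass alone never yields a clean verdict."""
    failed = [layer for layer in REQUIRED_LAYERS if results.get(layer) is Status.FAILED]
    missing = [
        layer
        for layer in REQUIRED_LAYERS
        if results.get(layer, Status.NOT_RUN) is Status.NOT_RUN
    ]
    if failed:
        return "failed: " + ", ".join(failed)
    if missing:
        return "incomplete: " + ", ".join(missing)
    return "all layers passed"


print(overall_verdict({"visual_examination": Status.PASSED}))
# incomplete: issuer_confirmation, counterparty_identity, cargo_existence
```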
The takeaway
Vision-capable models are a real upgrade to one specific layer of trade-document verification. The economics they enable change the place verification can happen in the chain and the rate at which legitimate operators can scrutinise the documents they receive. They do not change the fundamental shape of the problem; they make one of its layers cheap.
For an operator building a verification capability today, the visual layer is now table stakes. Building it is straightforward; integrating it into the broader chain of checks is the harder problem and the one that determines whether the resulting capability is useful or merely impressive.