
Why OCR Fails on Real Invoices (And What ML Extraction Does Instead)

January 14, 2025 · 9 min read

We get asked this question constantly: "We already have OCR — can't we just regex our way to structured fields?" After processing a few million invoices through various extraction approaches, we have a direct answer. OCR is not the bottleneck. Understanding is.

The distinction matters because it changes what you build. An ops team that frames its invoice problem as an OCR problem will spend money on better scanning hardware, higher DPI, and preprocessing pipelines. An ops team that treats it as an extraction and comprehension problem will make different decisions and get much further, much faster.

What OCR Actually Does

Optical character recognition converts image pixels into text characters. That's the complete job description. Given a scan of a page, OCR outputs a string — or more precisely, a sequence of detected text tokens with approximate bounding-box coordinates.

When OCR works well, you get something like:

Invoice No. 10847
Vendor: Glenfield Supply Co.
Date: 12/09/2024
Total Due: $14,872.50

That looks useful. And for a perfectly structured, high-contrast invoice from a single vendor who has been sending the same layout for years — it might be. The problem is that description fits maybe 15% of the invoices a mid-size enterprise actually receives.

The other 85%? They have rotated scans. Faxed copies that have been photocopied twice. Vendor letterhead that bleeds into the table area. Line items that span two rows. Currency symbols from three countries in the same document. Tax fields labeled "GST" on one invoice and "VAT" or "Sales Tax" on others from the same supplier, depending on which country the invoice originated from.

OCR reads pixels and outputs characters. It has no model of what a vendor ID is, where total due should be relative to a subtotal, or what the relationship is between a line-item quantity and a unit price in a multi-row table.
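To make that concrete, here is a minimal sketch of what OCR output looks like to downstream code. The `OcrToken` class, its field names, and the coordinates are hypothetical, not any particular OCR engine's schema. The point is what's missing: nothing in the structure says which token is the invoice number or the total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrToken:
    """One detected text token (hypothetical schema for illustration)."""
    text: str
    x: float       # left edge of the bounding box, in page units
    y: float       # top edge of the bounding box, in page units
    width: float
    height: float


def to_plain_text(tokens: list[OcrToken], line_tolerance: float = 5.0) -> str:
    """Flatten tokens into reading-order text, discarding the geometry."""
    ordered = sorted(tokens, key=lambda t: (t.y, t.x))
    lines: list[list[OcrToken]] = []
    for token in ordered:
        # Tokens whose top edges are within the tolerance share a line.
        if lines and abs(token.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(token)
        else:
            lines.append([token])
    return "\n".join(
        " ".join(t.text for t in sorted(line, key=lambda t: t.x))
        for line in lines
    )


# Made-up tokens, deliberately out of reading order.
tokens = [
    OcrToken("Total", 400, 700, 40, 12),
    OcrToken("Due:", 445, 700, 30, 12),
    OcrToken("$14,872.50", 480, 701, 70, 12),
    OcrToken("Invoice", 50, 40, 50, 12),
    OcrToken("No.", 105, 40, 25, 12),
    OcrToken("10847", 135, 41, 40, 12),
]

print(to_plain_text(tokens))
```

Everything after this step, deciding that "10847" is an identifier and "$14,872.50" is an amount owed, is interpretation that the OCR layer does not attempt.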

The Regex Trap

The natural follow-up to raw OCR is regex and rules: match a dollar sign followed by digits for amounts, match date patterns, anchor field names with known label text. This works for controlled vendor sets where you own the format. It falls apart at scale.

Here's what we saw when we inherited a mid-size distributor's existing OCR+regex pipeline. They had 300 active vendors. Their engineering team had written rules for the top 50 by volume, covering about 70% of invoices. Invoices from the remaining 250 vendors either fell through to manual data entry or produced systematically wrong extractions that no one caught until reconciliation.

The specific failure modes:

Label variation: "Invoice #", "Inv. No.", "Bill Number", "Reference". OCR produces the text, but rules keyed to one label don't generalize across synonyms.
Positional shift: Same vendor, new invoice template — the "Total Due" field moved from bottom-right to top-right. Every regex anchored to position breaks silently.
Table boundary collapse: Line items in a multi-row table extract as a single blob, making quantity × price matching impossible.
Silent errors: The worst failure mode. OCR+regex extracts a value — just the wrong one. $1,487.25 gets pulled instead of $14,872.50 because the subtotal and total were adjacent and the rule matched the wrong number.
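Label variation and silent errors are easy to reproduce. The sketch below uses two hypothetical rules (`INVOICE_NO_RULE`, `TOTAL_RULE`) of the kind such a pipeline tends to accumulate. They are illustrative, not the distributor's actual rules, and the sample invoice text is made up.

```python
import re
from typing import Optional

# Hypothetical rules, written against one vendor's layout.
INVOICE_NO_RULE = re.compile(r"Invoice No\.\s*(\d+)")
# "First dollar amount after the word Total": looks reasonable, fails silently.
TOTAL_RULE = re.compile(r"Total[^$]*\$([\d,]+\.\d{2})")


def extract(text: str) -> dict[str, Optional[str]]:
    """Apply the rules; a field is None when its rule finds nothing."""
    invoice_no = INVOICE_NO_RULE.search(text)
    total = TOTAL_RULE.search(text)
    return {
        "invoice_no": invoice_no.group(1) if invoice_no else None,
        "total": total.group(1) if total else None,
    }


# The layout the rules were written for.
known_layout = (
    "Invoice No. 10847\n"
    "Subtotal: $13,385.25\n"
    "Tax: $1,487.25\n"
    "Total Due: $14,872.50"
)

# Same data, different labels (made-up variant template).
new_layout = (
    "Inv. No. 10847\n"
    "Subtotal: $13,385.25\n"
    "Total Tax: $1,487.25\n"
    "Total Due: $14,872.50"
)

print(extract(known_layout))
print(extract(new_layout))
```

On the known layout both fields come back correct. On the variant, the invoice number comes back as `None`, which is at least a visible failure. The total comes back as 1,487.25, a well-formed dollar amount that nothing downstream has any reason to question.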

What ML Extraction Does Differently

ML-based document extraction doesn't treat OCR text as its only input; it uses OCR as one signal among several. The document is processed with spatial awareness: where text appears on the page, how text blocks relate to each other, what the visual structure of tables and sections looks like.

The model's job is not "find the string that matches 'Invoice No.'" — it's "identify the field in this document that represents the unique identifier for this transaction." That's a semantic task, not a pattern match.

Concretely, what this changes:

Label generalization: The model understands that "Inv. No.", "Invoice Number", and "Bill Ref." are the same field type, because it has learned the semantic context of invoice identification across thousands of document variants.
Spatial reasoning: A "Total Due" value is extracted based on its relationship to subtotal, tax, and line-item values, not on a fixed position on the page. If a vendor redesigns their template, extraction continues working.
Table decomposition: Line items are segmented as structured rows, not as flat text. Quantity, unit price, description, and line total are extracted as discrete fields from the same table row.
Confidence scoring: Every extracted field comes with a confidence score. Low-confidence extractions go to exception review rather than silently propagating wrong values into your ERP.
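Confidence-based routing, the last item above, can be pictured roughly as follows. The `ExtractedField` structure, the 0.90 threshold, and the sample scores are assumptions for illustration only; in practice thresholds are usually tuned per field type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    """One extracted field (hypothetical structure for illustration)."""
    name: str
    value: str
    confidence: float  # model score in [0, 1]


REVIEW_THRESHOLD = 0.90  # assumed cutoff, not a recommended value


def route(
    fields: list[ExtractedField], threshold: float = REVIEW_THRESHOLD
) -> tuple[list[ExtractedField], list[ExtractedField]]:
    """Split fields into (accepted, needs_review) by confidence."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Made-up extraction result for a single invoice.
fields = [
    ExtractedField("invoice_no", "10847", 0.99),
    ExtractedField("vendor", "Glenfield Supply Co.", 0.96),
    ExtractedField("total_due", "14,872.50", 0.71),
]

accepted, needs_review = route(fields)
print("to ERP:", [f.name for f in accepted])
print("to exception review:", [f.name for f in needs_review])
```

The design choice that matters is that a low score holds the field back for a person to check, rather than letting a best guess flow into the ERP.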

A Concrete Failure Example

One of our early customers — a regional wholesale distributor in the Gulf Coast area — was running their invoice intake through a commercial OCR product with custom rules on top. Their average document mix included invoices from about 180 active vendors, roughly 40% of which used non-standard layouts (small vendors, international suppliers, handwritten addenda on typed invoices).

Their measured error rate at the field level: 4.8%. Their detected error rate: 1.2%. The gap between those two numbers, 3.6 percentage points of field errors that nobody flagged, is where the real cost lives. Errors that don't get caught until reconciliation or accounts payable close create rework cycles that cost significantly more than the original data entry.

After switching to ML-based extraction with field-level confidence scoring, their detected-and-routed exceptions went up initially (the system was catching errors the old system was silently passing). Total errors in the ERP — the number that actually matters — dropped from 4.8% to under 1%.

What ML Extraction Doesn't Solve

We want to be direct about the limits here. ML extraction does not eliminate exceptions — it replaces silent errors with visible ones. That's a significant improvement, but it means your exception routing workflow actually matters. If you swap OCR+regex for ML extraction but leave the "exceptions go into a shared inbox" process unchanged, you'll surface more problems without necessarily resolving them faster.

The other thing ML extraction doesn't solve: images that are genuinely unreadable. Contrast too low, resolution too degraded, or physical document damage (torn corners, staple holes through key fields) will produce low confidence scores and route to exceptions. No extraction system can read what the scan didn't capture. The OCR quality floor still matters — we just don't pretend it's sufficient on its own.

We're not saying OCR is bad. We're saying OCR is a character-recognition tool, not a document-understanding system. The distinction matters when you're designing a pipeline that processes 10,000+ invoices monthly and needs to route exceptions correctly, not just extract strings.

Practical Implications for Your Evaluation

When you're evaluating document extraction vendors — including us — the question to ask is not "what is your OCR accuracy?" The right questions are:

What is your field-level extraction accuracy on invoices you haven't seen before — new vendor templates, not templates you trained on?
What is your false-confidence rate — how often does the system extract an incorrect value with a high confidence score?
How does table decomposition work for multi-line items, and how are joined/split rows handled?
What happens to low-confidence extractions? Are they routed somewhere actionable or do they silently pass through?
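If you want to measure the first two questions yourself on a hand-verified sample, a minimal sketch looks like this. The `EvalRecord` structure and the 0.90 threshold are assumptions, and "false-confidence rate" is defined here as the share of high-confidence extractions that are wrong. A vendor may define it differently, so ask which denominator they use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One extracted field paired with its hand-verified value."""
    predicted: str
    expected: str
    confidence: float


def field_accuracy(records: list[EvalRecord]) -> float:
    """Fraction of fields whose extracted value matches ground truth."""
    if not records:
        raise ValueError("need at least one record")
    return sum(r.predicted == r.expected for r in records) / len(records)


def false_confidence_rate(
    records: list[EvalRecord], threshold: float = 0.90
) -> float:
    """Fraction of high-confidence extractions that are wrong."""
    confident = [r for r in records if r.confidence >= threshold]
    if not confident:
        return 0.0
    return sum(r.predicted != r.expected for r in confident) / len(confident)


# Made-up evaluation sample from unseen vendor templates.
records = [
    EvalRecord("10847", "10847", 0.99),
    EvalRecord("14,872.50", "14,872.50", 0.97),
    EvalRecord("1,487.25", "14,872.50", 0.95),   # wrong and confident
    EvalRecord("12/09/2024", "12/09/2024", 0.62),
    EvalRecord("Glenfield Supply", "Glenfield Supply Co.", 0.40),
]

print(f"field accuracy: {field_accuracy(records):.2f}")
print(f"false-confidence rate: {false_confidence_rate(records):.2f}")
```

The two numbers answer different questions. Accuracy tells you how much cleanup to expect overall; the false-confidence rate tells you how much of that cleanup will bypass exception review and surface later.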

Those questions separate document extraction systems from fancy OCR wrappers. We'll answer each of them directly for Fieldiq if you want to run the evaluation — bring a batch of your actual invoices, including the difficult ones from your small-vendor tail.

Published by the Fieldiq team
