Buyer's Guide

How Document AI Accuracy Is Actually Measured (And How to Evaluate Vendor Claims)

July 30, 2025 · 10 min read

Every document AI vendor claims high accuracy. "99% accuracy" appears in our marketing. It appears in our competitors' marketing. It appears in pitch decks from vendors who, when you run an actual evaluation on your documents, produce output with error rates that are considerably higher. The number is real — but the number being compared is not always the same number.

This guide is written for ops and finance leaders evaluating document extraction vendors. It explains the distinct accuracy metrics in play, how vendors can legitimately claim high numbers while delivering poor real-world performance, and what to ask — and test — to get a meaningful comparison.

The Three Accuracy Metrics That Vendors Conflate

1. Character-Level OCR Accuracy

This is the percentage of individual characters that are correctly recognized from the image. A document with 2,000 characters where 10 are misread has 99.5% character-level accuracy. This is the number that makes OCR vendors look good — and it's genuinely high for clean documents. Tesseract, Google Vision, and other modern OCR engines achieve 98-99.9% character accuracy on high-quality, high-contrast document scans.

Why it matters less than it seems: a single character-level error can turn "INV-10847" into, for example, "INV-10347". An invoice number with one wrong digit is entirely wrong for ERP matching purposes, so high character accuracy doesn't mean the extracted value is usable.
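To make the gap concrete, here is a minimal Python sketch. It assumes a 2,000-character page with substitution errors only, and the misread digit is made up for illustration; real OCR benchmarks usually score with edit distance so they can handle inserted and dropped characters too.

```python
def char_accuracy(truth: str, ocr: str) -> float:
    """Share of character positions where the OCR output matches the truth.

    Assumes equal-length strings (substitution errors only).
    """
    if len(truth) != len(ocr):
        raise ValueError("this sketch only handles equal-length strings")
    matches = sum(t == o for t, o in zip(truth, ocr))
    return matches / len(truth)


invoice_truth = "INV-10847"
invoice_ocr = "INV-10347"  # hypothetical misread: "8" recognized as "3"

# Pad both to a 2,000-character "page" where everything else is read correctly.
filler = "x" * (2000 - len(invoice_truth))
page_truth = invoice_truth + filler
page_ocr = invoice_ocr + filler

print(f"character accuracy:   {char_accuracy(page_truth, page_ocr):.2%}")  # 99.95%
print(f"invoice number match: {invoice_truth == invoice_ocr}")             # False
```

The page scores 99.95% at the character level while the one field an ERP match depends on is unusable.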

2. Field-Level Extraction Accuracy

This is the percentage of individual fields extracted correctly — the value for invoice_number matches the ground truth, the value for total_due matches, etc. A document extraction system with 99% field-level accuracy on a 20-field invoice has an average of 0.2 field errors per document.
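It can help to translate that per-field figure into a per-document one. The sketch below assumes field errors are independent, which is a simplification (in practice errors tend to cluster on bad scans), so treat the result as a rough estimate rather than a prediction.

```python
field_accuracy = 0.99
fields_per_doc = 20

# Average number of wrong fields on each invoice.
expected_errors = fields_per_doc * (1 - field_accuracy)

# Probability that an invoice has at least one wrong field,
# ASSUMING errors are independent across fields.
p_doc_has_error = 1 - field_accuracy ** fields_per_doc

print(f"expected field errors per document: {expected_errors:.2f}")  # 0.20
print(f"documents with at least one error:  {p_doc_has_error:.1%}")  # 18.2%
```

Under that assumption, 99% field-level accuracy still leaves roughly one invoice in five with something to correct.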

This is the metric that matters for AP operations, and it's the one we report. But even within this definition there are important sub-questions:

Weighted or unweighted? An unweighted average counts vendor_address the same as total_due. A system with 100% accuracy on total_due and 80% accuracy on vendor_address should be viewed very differently from one with the reverse.
Test set distribution? Field accuracy on a curated test set of clean, high-quality documents from major vendors looks very different from field accuracy on a real production document mix that includes faxed copies, small vendor invoices, and documents with partial handwriting.
Exact match vs. normalized match? "$14,872.50" and "14872.5" and "14,872.50 USD" are the same value. Depending on how a vendor scores accuracy, all three might count as correct or only one might.
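A short Python sketch shows how much two of these scoring choices can move the number. The normalization rule (US number formatting) and the weights are assumptions for illustration, not a description of how any particular vendor scores.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_amount(raw: str) -> Optional[Decimal]:
    """Reduce an amount string to a Decimal, assuming US number formatting."""
    cleaned = re.sub(r"[^0-9.\-]", "", raw)  # drop "$", ",", "USD", spaces
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # not parseable as a number


def weighted_accuracy(accuracy: dict, weights: dict) -> float:
    """Average per-field accuracy using business-impact weights."""
    total_weight = sum(weights[field] for field in accuracy)
    return sum(accuracy[field] * weights[field] for field in accuracy) / total_weight


# Exact vs. normalized match against a ground truth of 14872.50
variants = ["$14,872.50", "14872.5", "14,872.50 USD"]
exact_matches = [v == "14872.50" for v in variants]
normalized_matches = [normalize_amount(v) == Decimal("14872.50") for v in variants]
print(exact_matches)       # [False, False, False]
print(normalized_matches)  # [True, True, True]

# Weighted vs. unweighted: hypothetical weights, total_due counts 5x
weights = {"total_due": 5, "vendor_address": 1}
system_a = {"total_due": 1.00, "vendor_address": 0.80}
system_b = {"total_due": 0.80, "vendor_address": 1.00}
for name, acc in (("A", system_a), ("B", system_b)):
    unweighted = sum(acc.values()) / len(acc)
    print(name, f"unweighted={unweighted:.1%}", f"weighted={weighted_accuracy(acc, weights):.1%}")
```

Both systems score 90.0% unweighted, but with these weights system A comes out near 96.7% and system B near 83.3%.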

3. Straight-Through Processing Rate

Straight-through processing (STP) rate is the percentage of documents that complete extraction and pass into the ERP without any human review. This is the operational metric — the percentage of your daily invoice volume that your team never has to touch.

STP rate depends on both field accuracy and exception threshold configuration. A vendor with 97% field accuracy and aggressive exception flagging might have a lower STP rate than a vendor with 94% field accuracy and lenient thresholds, because the second vendor is silently passing more errors through. A high STP rate achieved by lowering the exception threshold is not a win; it's a different kind of error rate.
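The trade-off is easy to see with a toy example. The sketch below assumes each document gets a single confidence score and goes straight through when that score meets the threshold; the ten documents and their scores are made up.

```python
# (confidence, extraction_is_correct) for each document; illustrative numbers only
docs = [
    (0.99, True), (0.98, True), (0.97, True), (0.95, True), (0.93, False),
    (0.92, True), (0.90, True), (0.85, False), (0.80, True), (0.60, False),
]


def stp_report(docs, threshold):
    """Return (STP rate, silent error rate) for a given confidence threshold.

    Documents at or above the threshold skip human review. A silent error
    is a document that skipped review but was extracted incorrectly.
    """
    passed = [ok for conf, ok in docs if conf >= threshold]
    stp_rate = len(passed) / len(docs)
    silent_error_rate = passed.count(False) / len(docs)
    return stp_rate, silent_error_rate


for threshold in (0.95, 0.90, 0.80):
    stp, silent = stp_report(docs, threshold)
    print(f"threshold={threshold:.2f}  STP={stp:.0%}  silent errors={silent:.0%}")
```

On this data, lowering the threshold from 0.95 to 0.80 raises STP from 40% to 90%, and the share of documents that reach the ERP with an undetected error goes from 0% to 20%.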

The right STP target depends on your error tolerance. For a 10,000 invoice/month AP operation with an average invoice value of $8,000, a 1% error rate that passes through undetected affects about 100 invoices a month (10,000 × 1%), which at $8,000 each is roughly $800,000/month of invoice value exposed to error. Your STP rate and your exception review capacity are a deliberate trade-off, not just a benchmark number.

The Test Set Problem: What "99% Accuracy" Usually Means

Most vendor accuracy claims are measured on the vendor's own test set. That test set is usually drawn from documents that closely resemble the model's training data, from clean document types, and from the vendor's strongest-performing document categories. It is not your document mix.

This is not necessarily deceptive — it's the standard way ML systems are benchmarked internally. But it means vendor-reported accuracy numbers are upper bounds, not production estimates. The gap between benchmark accuracy and real-world accuracy depends on how different your documents are from the vendor's training distribution.

The distribution factors that matter most:

Vendor diversity: A test set from 20 large vendors will understate error rates on your 300-vendor tail with diverse layout formats
Document quality distribution: Curated test sets rarely include the 10-15% of your volume that arrives as degraded scans, faxed copies, or photographed documents from phones
Language and currency mix: Vendors often benchmark on English-language, US-format invoices. If you have European or Latin American supplier invoices, accuracy typically drops
Document age: Template drift — vendors redesigning their invoice layouts — is not represented in static test sets

How to Run a Meaningful Vendor Evaluation

A meaningful accuracy evaluation uses your documents, not the vendor's samples. Here's the structure we recommend:

Step 1: Build a representative test set. Pull 200-500 invoices from the last 6 months — not a random sample, but a stratified one: include your top-20 vendors by volume plus a representative sample of your long-tail vendors. Include your hardest documents (faxed, low-contrast, handwritten line-item additions). Manually annotate the correct field values for these documents. This is 2-4 days of internal work and is the most valuable investment you'll make in the evaluation.

Step 2: Run each vendor on your test set blind. Provide the same document set to each vendor. Don't tell them which documents you consider hardest or which field accuracy you're most concerned about. Ask for field-level extraction output for each document.

Step 3: Score against your annotated ground truth. Calculate field-level accuracy for each of your critical fields separately — don't let a high-accuracy easy field (like invoice date) mask poor accuracy on a harder field (like multi-line item tables). Weight the critical fields by their financial consequence.

Step 4: Evaluate exception handling, not just accuracy. Look at what each vendor flagged as low-confidence. Did they catch the difficult documents you know are in your set? Did they flag false positives on documents that are actually clean? The calibration of their confidence scores matters as much as the accuracy number.
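Steps 1, 3, and 4 above can be sketched in a few lines of Python. Everything here is illustrative: the function names, the dictionary layout, and the choice to sample a fixed number of invoices per stratum are assumptions, and the sample values are made up. Step 3 uses exact string comparison for brevity; in practice you would normalize values first.

```python
import random
from collections import defaultdict


def stratified_sample(invoices, key, per_stratum, seed=0):
    """Step 1: draw up to `per_stratum` invoices from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for invoice in invoices:
        strata[key(invoice)].append(invoice)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


def per_field_accuracy(ground_truth, predictions):
    """Step 3: accuracy per field; a missing prediction counts as an error."""
    correct, total = defaultdict(int), defaultdict(int)
    for doc_id, truth in ground_truth.items():
        predicted = predictions.get(doc_id, {})
        for field, value in truth.items():
            total[field] += 1
            correct[field] += predicted.get(field) == value
    return {field: correct[field] / total[field] for field in total}


def flag_quality(doc_has_error, doc_was_flagged):
    """Step 4: compare low-confidence flags with where the errors really are."""
    ids = list(doc_has_error)
    return {
        "caught": sum(doc_has_error[d] and doc_was_flagged[d] for d in ids),
        "missed": sum(doc_has_error[d] and not doc_was_flagged[d] for d in ids),
        "false_alarms": sum(not doc_has_error[d] and doc_was_flagged[d] for d in ids),
    }


invoices = [{"id": i, "tier": "top20" if i < 80 else "long_tail"} for i in range(100)]
sample = stratified_sample(invoices, key=lambda inv: inv["tier"], per_stratum=10)

ground_truth = {
    "doc1": {"invoice_number": "INV-10847", "total_due": "14872.50"},
    "doc2": {"invoice_number": "INV-20031", "total_due": "310.00"},
    "doc3": {"invoice_number": "INV-30555", "total_due": "99.99"},
}
predictions = {
    "doc1": {"invoice_number": "INV-10847", "total_due": "14872.50"},
    "doc2": {"invoice_number": "INV-20031", "total_due": "810.00"},  # wrong amount
    "doc3": {"invoice_number": "INV-30555"},                         # total_due missing
}
accuracy = per_field_accuracy(ground_truth, predictions)
flags = flag_quality(
    doc_has_error={"doc1": False, "doc2": True, "doc3": True},
    doc_was_flagged={"doc1": True, "doc2": False, "doc3": True},
)
print(len(sample), accuracy, flags)
```

Reporting invoice_number and total_due separately keeps the perfect score on one from hiding the poor score on the other, and the "missed" count is the number to watch: those are errors the vendor passed through with no flag.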

Where Fieldiq Stands on This

We claim 99% field-level accuracy on invoices. That number comes from our production performance across a mixed document distribution, not from a curated test set. It is the accuracy we measure on documents after they've been processed, with flagged exceptions kept in the denominator and counted against us rather than excluded.

We also want to be direct: our 99% is not uniform across all fields. Our accuracy on total_due and invoice_number is higher than our accuracy on line_item_description on low-quality scans. Our accuracy on clean, digital-origin PDFs is higher than on faxed documents with 200 DPI resolution. We can give you field-by-field accuracy breakdowns if you bring us your document distribution.

We're not saying our competitors are misleading you when they claim high accuracy numbers — some of those numbers are real within their measurement framework. We're saying the right question for any vendor, including us, is: what is your accuracy on documents that look like mine, measured in a way that counts the right errors? That question has a specific answer, and any vendor who won't give it to you on your own test set is giving you a number you can't rely on.

A Quick Reference: Questions to Ask Every Vendor

What document distribution was your accuracy benchmark measured on?
Is your accuracy metric field-level or form-level? Exact match or normalized?
What is your false-confidence rate — errors that passed with high confidence scores?
Will you run your system on 200 of our documents and report per-field accuracy against our annotated ground truth?
What exception rate can I expect in the first 30 days vs. steady state?
How does accuracy change on degraded scan quality documents?

If a vendor won't run the evaluation on your documents, or won't give you per-field accuracy breakdowns, treat their headline number as unverified. The evaluation costs both parties time — but it's the only way to know what you're actually buying.

Published by the Fieldiq team
