Purchase Order Extraction: Field Coverage, Edge Cases, and Multi-Format Handling

9 min read

Purchase orders are structurally simpler than invoices on paper. There's a buyer, a seller, a list of items, and a price. In practice, POs are some of the most format-variable documents we process. The enterprise procurement world spans custom ERP-generated PDFs, EDI-printed outputs, scanned paper forms from 2003, and portal-exported spreadsheets that someone converted to PDF before sending. All of them contain the same core data. Extracting it reliably across all of them is the actual engineering problem.

This post walks through how Fieldiq handles PO extraction: what fields we extract, which formats create edge cases, and what we do when a PO doesn't fit a clean template. It's not a sales overview — it's what we actually had to solve to get PO extraction to 99% field-level accuracy across a genuinely diverse format corpus.

The Core PO Field Set

Every enterprise PO contains a header block, one or more line items, and a footer block. The challenge isn't finding those three zones — it's that the same semantic field can appear in any of them depending on the layout, and the column structure of the line-item table varies substantially across formats.

Header fields we extract on every PO:

  • PO number (required — this is the join key to your ERP)
  • Issue date and required delivery date
  • Buyer entity name and address
  • Vendor entity name and vendor ID (when present)
  • Ship-to address (often different from billing address)
  • Payment terms (Net 30, Net 45, etc.)
  • Currency code
  • Approval authority or authorized signatory, when present

Per-line-item fields:

  • Line number or item sequence
  • Item description (often a long-text field)
  • Part number or SKU (buyer-assigned and/or vendor-assigned)
  • Unit of measure
  • Quantity ordered
  • Unit price
  • Line total
  • GL account code (present on ERP-generated POs, absent on manual formats)
  • Delivery date per line (when the PO has line-level delivery windows)

Footer fields: subtotal, applicable taxes, shipping charges, total PO value, and any special terms or amendment notes appended to the document.
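
As a rough illustration, the field set above could be modeled as two record types. This is a minimal sketch, not Fieldiq's actual output schema: the class names, attribute names, and the choice of strings for dates are all assumptions made for the example.

```python
from dataclasses import asdict, dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    line_number: int
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal
    unit_of_measure: Optional[str] = None
    buyer_part_number: Optional[str] = None
    vendor_part_number: Optional[str] = None
    gl_account_code: Optional[str] = None  # usually absent on manual formats
    delivery_date: Optional[str] = None    # only when the PO has line-level dates


@dataclass
class PurchaseOrder:
    po_number: str                         # required: the join key to the ERP
    issue_date: Optional[str] = None
    required_delivery_date: Optional[str] = None
    buyer_name: Optional[str] = None
    buyer_address: Optional[str] = None
    vendor_name: Optional[str] = None
    vendor_id: Optional[str] = None
    ship_to_address: Optional[str] = None
    payment_terms: Optional[str] = None
    currency: Optional[str] = None
    authorized_signatory: Optional[str] = None
    line_items: List[LineItem] = field(default_factory=list)
    subtotal: Optional[Decimal] = None
    tax: Optional[Decimal] = None
    shipping: Optional[Decimal] = None
    total: Optional[Decimal] = None
    special_terms: Optional[str] = None


# Example values are made up for illustration.
po = PurchaseOrder(
    po_number="4500012345",
    currency="USD",
    payment_terms="Net 30",
    line_items=[
        LineItem(1, "Hex bolt M8", Decimal("10"), Decimal("0.50"), Decimal("5.00"))
    ],
    total=Decimal("5.00"),
)
record = asdict(po)  # nested dict: header fields plus a list of line-item dicts
```

Only the PO number is mandatory here; everything else defaults to None so that a sparse manual PO and a dense ERP-generated PO fit the same shape.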

The Variadic Line-Item Problem

The hardest extraction problem in POs isn't the header — header fields are usually findable with a decent spatial model. The hard part is line-item tables, specifically because procurement teams in different industries use different column schemas, and those schemas change over time as ERP configurations evolve.

We've seen PO line-item tables with as few as 4 columns and as many as 18. We've seen tables where the "description" spans two rows per line item. We've seen tables that split across two pages with headers repeated — and tables that split across two pages with headers not repeated (which means you have to infer the column mapping from context on page 2).
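
One simple way to handle a continuation page without headers is to reuse the first page's column roles when the table shape still agrees. The sketch below is an assumption about how that check might look, not a description of Fieldiq's model: the function names are hypothetical, and "shape" is reduced to column count plus whether each column is numeric.

```python
import re
from typing import List, Optional

_NUMBER = re.compile(r"^[\d,]+(\.\d+)?$")


def is_numeric_column(rows: List[List[str]], col: int) -> bool:
    """True when every non-empty cell in the column looks like a number."""
    cells = [row[col].strip() for row in rows if row[col].strip()]
    return bool(cells) and all(_NUMBER.match(cell) for cell in cells)


def carry_over_roles(
    roles: List[str],
    first_page_rows: List[List[str]],
    continuation_rows: List[List[str]],
) -> Optional[List[str]]:
    """Reuse page-1 column roles on a headerless page 2, or return None."""
    if any(len(row) != len(roles) for row in continuation_rows):
        return None  # column count changed: do not guess
    for col in range(len(roles)):
        if is_numeric_column(first_page_rows, col) != is_numeric_column(
            continuation_rows, col
        ):
            return None  # value types disagree: do not guess
    return list(roles)


roles = ["description", "quantity", "unit_price", "line_total"]
page_1 = [["Hex bolt M8", "10", "0.50", "5.00"]]
page_2 = [["Washer M8", "200", "0.02", "4.00"]]
print(carry_over_roles(roles, page_1, page_2))
```

Returning None rather than a best guess keeps the failure visible, so a page that does not line up can be sent to review instead of being silently mis-mapped.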

Template-based extraction fails hard here. A template rule that says "column 3 is quantity" works until a procurement team adds a GL code column between description and quantity in their next ERP update. Then every future PO from that buyer routes to the exception queue until someone updates the template.

Our approach uses a table-structure classification step before extraction. The model first identifies the semantic role of each column — not based on position but based on the combination of header text, value types in each cell, and the relationship between adjacent columns. Once the semantic schema is mapped, extraction proceeds against the schema rather than against fixed positions. Adding a GL code column doesn't break the model because quantity is identified as "quantity" by what it is, not where it sits.
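
To make the idea concrete, here is a deliberately small, rule-based stand-in for that classification step. The production system is described above as a model; this sketch only illustrates the principle of combining header text with cell value types. The keyword lists, the 0.8 threshold, and all function names are assumptions.

```python
import re
from typing import Dict, List

# Hypothetical keyword hints, checked in order so specific roles come first.
HEADER_HINTS = [
    ("unit_price", ("unit price", "price", "rate")),
    ("line_total", ("amount", "total", "extended")),
    ("quantity", ("qty", "quantity")),
    ("gl_code", ("gl", "account")),
    ("description", ("description", "desc")),
]
NUMERIC_ROLES = {"unit_price", "line_total", "quantity"}
_NUMBER = re.compile(r"^[\d,]+(\.\d+)?$")


def numeric_fraction(values: List[str]) -> float:
    """Share of non-empty cells that look like plain numbers."""
    cells = [v.strip() for v in values if v.strip()]
    if not cells:
        return 0.0
    return sum(bool(_NUMBER.match(c)) for c in cells) / len(cells)


def classify_column(header: str, values: List[str]) -> str:
    """Assign a semantic role from header text, confirmed by value types."""
    lowered = header.lower()
    fraction = numeric_fraction(values)
    for role, hints in HEADER_HINTS:
        if any(hint in lowered for hint in hints):
            # Header text alone is not trusted: numeric roles need numeric cells.
            if role in NUMERIC_ROLES and fraction < 0.8:
                continue
            return role
    return "unknown"


def map_schema(table: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each column header to its role, independent of column position."""
    return {header: classify_column(header, values) for header, values in table.items()}


before = {
    "Description": ["Hex bolt M8", "Washer M8"],
    "Qty": ["10", "200"],
    "Unit Price": ["0.50", "0.02"],
    "Amount": ["5.00", "4.00"],
}
# Same table after a GL column is inserted between description and quantity.
after = {
    "Description": before["Description"],
    "GL Account": ["6100", "6100"],
    "Qty": before["Qty"],
    "Unit Price": before["Unit Price"],
    "Amount": before["Amount"],
}
print(map_schema(after))
```

In both tables the "Qty" column resolves to the quantity role even though its position shifts from second to third, which is the property a fixed-position template lacks.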

Multi-Currency POs and Cross-Border Procurement

Procurement teams buying from international vendors see multi-currency POs regularly. The extraction challenge isn't just identifying the currency code — it's correctly handling documents where the currency symbol appears in the line items but not in the header, or where the PO shows both the foreign currency unit price and the USD equivalent in separate columns.

Consider a manufacturing company with plants in Texas and Mexico. Their procurement team issues POs in USD, MXN, and occasionally EUR for European equipment vendors. Each of those PO formats was generated by the same ERP but shows different currency handling: USD POs omit the currency symbol because USD is assumed, MXN POs show it explicitly, and EUR POs sometimes show both EUR and USD-equivalent totals when the buyer's internal policy requires USD reporting.

We extract currency code as a distinct field, normalize it to ISO 4217 format (USD, MXN, EUR rather than $, $, €, since a bare $ can mean either dollars or pesos), and tag the currency on the PO header total and on per-line-item amounts when the two differ. The output schema gives your ERP team a clean join regardless of what the source document looked like.
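
A normalization step along these lines might look like the following sketch. It is illustrative only: the symbol table covers just the three currencies from the example, and falling back to a header-level currency to resolve a bare "$" is an assumption, not a documented Fieldiq rule.

```python
from typing import Optional

ISO_CODES = {"USD", "MXN", "EUR"}

# A symbol can map to more than one ISO 4217 code; "$" is the classic case.
SYMBOL_CANDIDATES = {
    "$": ("USD", "MXN"),
    "€": ("EUR",),
}


def normalize_currency(raw: str, header_currency: Optional[str] = None) -> Optional[str]:
    """Return an ISO 4217 code, or None when the input stays ambiguous."""
    token = raw.strip()
    if token.upper() in ISO_CODES:
        return token.upper()
    candidates = SYMBOL_CANDIDATES.get(token)
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # Ambiguous symbol: only resolve it when the header names one candidate.
    if header_currency in candidates:
        return header_currency
    return None


print(normalize_currency("€"))
print(normalize_currency("$", header_currency="MXN"))
print(normalize_currency("$"))
```

The last call returns None on purpose: a "$" with no other currency evidence is better treated as a review flag than silently assumed to be USD.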

Amendment POs and Version Chains

A frequent source of AP reconciliation problems is amendment POs — revised versions of an original purchase order that modify quantities, pricing, or delivery terms. These are often processed as separate documents, and without version tracking, your ERP ends up with the original PO and the amendment as independent records rather than a linked chain.

Fieldiq extracts the amendment indicator when present: "PO Amendment #2," "Revised PO," "Change Order 3" — these all signal that the document references a prior PO number and should be treated as a delta, not a new record. We extract both the current document's PO number and the referenced original PO number when available, giving your integration layer the data it needs to link versions.
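
For intuition, the three indicator phrasings quoted above can be caught with a pattern match. This is a simplified sketch rather than the extraction model itself: the regular expressions, the "Original PO" / "Reference PO" label wording, and the result keys are all assumptions.

```python
import re
from typing import Dict, Optional

# Matches "PO Amendment #2", "Revised PO", "Change Order 3" (sequence optional).
AMENDMENT_PATTERN = re.compile(
    r"\b(?:PO\s+Amendment|Change\s+Order|Revised\s+PO)\b"
    r"(?:[ \t]*#?[ \t]*(\d{1,3})\b)?",
    re.IGNORECASE,
)
# Hypothetical label for the referenced original PO number.
REFERENCE_PATTERN = re.compile(
    r"\b(?:Original|Ref(?:erence)?)\s+PO\s*(?:No\.?|#|Number)?\s*:?\s*([A-Z0-9-]+)",
    re.IGNORECASE,
)


def detect_amendment(text: str) -> Dict[str, Optional[object]]:
    """Flag amendment documents and pull out the linkage fields if present."""
    result: Dict[str, Optional[object]] = {
        "is_amendment": False,
        "sequence": None,
        "original_po_number": None,
    }
    match = AMENDMENT_PATTERN.search(text)
    if not match:
        return result
    result["is_amendment"] = True
    if match.group(1):
        result["sequence"] = int(match.group(1))
    reference = REFERENCE_PATTERN.search(text)
    if reference:
        result["original_po_number"] = reference.group(1)
    return result


sample = "PO Amendment #2\nPO Number: 4500012345-A2\nOriginal PO: 4500012345"
print(detect_amendment(sample))
```

The sequence number is capped at three digits so that a long PO number printed right after "Revised PO" is not mistaken for a revision count.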

We're not saying amendment detection solves your version control problem — that's ultimately an ERP data model and integration question. What extraction can do is flag the document type correctly and provide the linkage fields, so your integration layer has the right inputs.

When Exception Flagging Kicks In

PO extraction exceptions fall into three buckets: confidence threshold failures (the model's per-field confidence score drops below your set threshold), structural anomalies (the document doesn't have a recognizable table structure), and validation failures (extracted values fail your rule engine — like a line total that doesn't match quantity × unit price).
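
The three buckets can be pictured as a small tagging step over the extraction result. The sketch below is a hypothetical shape for that step: the enum values, the ExceptionFlag record, and the input arguments are assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ExceptionReason(Enum):
    LOW_CONFIDENCE = "confidence_below_threshold"
    STRUCTURAL_ANOMALY = "no_recognizable_table"
    VALIDATION_FAILURE = "validation_rule_failed"


@dataclass
class ExceptionFlag:
    field_name: str
    reason: ExceptionReason
    detail: str


def collect_exceptions(
    confidences: Dict[str, float],
    threshold: float,
    table_found: bool,
    validation_errors: Dict[str, str],
) -> List[ExceptionFlag]:
    """Gather flags from all three buckets for one document."""
    flags: List[ExceptionFlag] = []
    for name, score in confidences.items():
        if score < threshold:
            detail = f"confidence {score:.2f} below threshold {threshold:.2f}"
            flags.append(ExceptionFlag(name, ExceptionReason.LOW_CONFIDENCE, detail))
    if not table_found:
        flags.append(
            ExceptionFlag(
                "line_items",
                ExceptionReason.STRUCTURAL_ANOMALY,
                "no line-item table detected",
            )
        )
    for name, message in validation_errors.items():
        flags.append(ExceptionFlag(name, ExceptionReason.VALIDATION_FAILURE, message))
    return flags


flags = collect_exceptions(
    confidences={"po_number": 0.99, "payment_terms": 0.62},
    threshold=0.90,
    table_found=True,
    validation_errors={"line_items[0].line_total": "does not equal quantity x unit price"},
)
for flag in flags:
    print(flag.field_name, flag.reason.value, flag.detail)
```

Each flag carries the field name and a human-readable reason, which is the information a reviewer needs to see in the exception queue.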

For most PO workflows, the validation rules are the high-value catches. A PO where the extracted line total differs from the computed total by more than a rounding difference is a document worth human review: either the extraction missed a column, or the source document has a calculation error that your AP team needs to catch before paying a matching invoice.
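
The line-total rule is easy to express directly. A minimal sketch, assuming amounts are parsed into Decimal and that one cent is an acceptable rounding tolerance (both are assumptions; your rule engine may use a different tolerance):

```python
from decimal import Decimal


def line_total_mismatch(
    quantity: Decimal,
    unit_price: Decimal,
    line_total: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """True when the stated total differs from quantity x unit price
    by more than the rounding tolerance."""
    expected = quantity * unit_price
    return abs(expected - line_total) > tolerance


# 3 x 19.99 = 59.97, so the first line passes and the second is flagged.
print(line_total_mismatch(Decimal("3"), Decimal("19.99"), Decimal("59.97")))
print(line_total_mismatch(Decimal("3"), Decimal("19.99"), Decimal("59.79")))
```

Decimal is used instead of float so that the comparison is not thrown off by binary floating-point error on currency amounts.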

Exception routing for POs follows the same queue configuration as invoices: exceptions go to a named reviewer, show the specific field and reason for the flag, and allow the reviewer to correct and approve in one step. After correction, the corrected extraction is used for downstream processing, and the correction event is logged for audit purposes.

Output to Your Procurement System

PO extraction output maps to a standardized JSON schema that mirrors the structure of common ERP PO objects: a header object, a line-items array, and a footer object. For SAP customers, we map to the PO header and item tables (EKKO/EKPO). For Oracle NetSuite, we map to the PurchaseOrder and PurchaseOrderItem objects. For Coupa, we map to the orders/line-items endpoint format.

Custom field mappings are configured per integration — if your ERP uses non-standard field names or your procurement team has added custom fields to the PO schema, those map cleanly as long as they're present in the source document. Setup for a new integration typically takes 2-3 hours with your ERP admin for field mapping review, not weeks of template training.
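
A per-integration field mapping can be thought of as a rename table applied to the extracted record. The sketch below is illustrative: the source-side field names are hypothetical, the pairing of each one with an SAP EKKO/EKPO column is our assumption for the example, and ZZ_COST_CENTER is an invented custom field.

```python
from typing import Any, Dict

# Assumed pairing of extraction fields with standard SAP EKKO header columns.
DEFAULT_HEADER_MAP = {
    "po_number": "EBELN",
    "issue_date": "BEDAT",
    "vendor_id": "LIFNR",
    "currency": "WAERS",
}
# Assumed pairing with standard SAP EKPO item columns.
DEFAULT_ITEM_MAP = {
    "line_number": "EBELP",
    "quantity": "MENGE",
    "unit_of_measure": "MEINS",
    "unit_price": "NETPR",
}


def apply_mapping(record: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename mapped fields; fields without a mapping are left out."""
    return {
        target: record[source]
        for source, target in mapping.items()
        if source in record
    }


# A customer-specific override adds one custom field (hypothetical name).
custom_header_map = {**DEFAULT_HEADER_MAP, "cost_center": "ZZ_COST_CENTER"}

header = {
    "po_number": "4500012345",
    "currency": "USD",
    "cost_center": "CC-114",
    "buyer_name": "Acme Manufacturing",
}
mapped = apply_mapping(header, custom_header_map)
print(mapped)
```

Keeping the mapping as plain data means a field-mapping review with an ERP admin comes down to editing a table rather than retraining anything.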

Published by the Fieldiq team
