Invoice Data Recognition

I would like to build a model capable of taking the data from an invoice and detecting different tags to automatically build an accounting chart. The tags would be the name of the provider, product tags and their respective price tags, the total amount… Any recommendations on how you would approach this model? I know the most natural would be some kind of NER model, but I would like to know if any specifics would be helpful for this problem.

Thank you!


While there are plenty of good existing OCR models, you shouldn’t expect a single model to work well on its own when dealing with extremely messy invoices. It’s better to use them in combination.

How heavy the OCR model or other models need to be depends on just how messy the invoices are…


Build it as a document understanding pipeline, not as a plain NER model.

That is the main recommendation.

Invoices are visually structured documents. They contain header fields, totals blocks, and line-item tables. That is why the mature invoice systems from AWS, Google, and Azure all treat invoices as a mix of summary fields and line items, not as one flat sequence-labeling problem. AWS Textract returns SummaryFields and LineItemGroups. Google’s Invoice Parser extracts both header and line-item fields. Azure’s invoice model also extracts key fields plus line items in structured JSON. (AWS Document)

Why plain NER is not enough

Classic NER assumes text is mostly a sequence. In invoices, meaning depends heavily on position and grouping:

  • the same number can be a unit price, tax amount, subtotal, or total
  • a product description may wrap across multiple lines
  • values belonging to one row may be far apart in reading order but aligned visually
  • totals often appear multiple times in different boxes

That is why document models such as LayoutLMv3 use both text and image/layout information, and why DocILE evaluates Key Information Localization and Extraction separately from Line Item Recognition. Line-item recognition exists as a separate task because finding fields is easier than grouping them into the correct item rows. (Hugging Face)

The right way to think about the problem

Your real goal is not only “tag supplier, products, prices, total.”

Your real goal is:

  1. read the invoice correctly
  2. recover its structure
  3. normalize the values
  4. validate the math
  5. map the result to accounting categories

So I would split the system into two major outputs:

Output A. Structured invoice extraction

This produces:

  • supplier name
  • supplier tax ID
  • invoice number
  • invoice date
  • due date
  • currency
  • subtotal
  • tax
  • total
  • line items

Output B. Accounting decision

This uses the structured output to predict:

  • GL account
  • expense category
  • tax code
  • cost center
  • approval or exception flags

That separation is important. Extraction answers “what is on the invoice.” Accounting classification answers “how finance should code it.” Google’s Document AI flow reflects this distinction too: you can use a pretrained invoice parser and then uptrain it with your own fields and data when the generic parser is not enough. (Google Cloud Documentation)

The architecture I would recommend

1. Ingestion and document triage

First decide what kind of file you have:

  • born-digital PDF with selectable text
  • scanned PDF
  • photo or image
  • multi-page mixed document

For born-digital PDFs, extract the text layer and coordinates first. For scans or photos, run OCR. A hybrid setup is better than forcing every document through image OCR, because clean PDF text is usually more accurate than OCR. OCR stacks such as docTR are useful here because they return localized word predictions, not just a plain string. (mindee.github.io)
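As a sketch, the per-page routing decision can be as simple as checking how much selectable text a page carries. The 50-character threshold and the PyMuPDF usage in the comment are illustrative assumptions, not tested defaults:

```python
def page_route(page_text: str, min_chars: int = 50) -> str:
    """Route one page: trust the native PDF text layer if it is
    substantial, otherwise send the page image to OCR.
    The threshold is a rough heuristic (assumption)."""
    visible = "".join(page_text.split())  # ignore whitespace-only layers
    return "native" if len(visible) >= min_chars else "ocr"

# With PyMuPDF (assumption: installed as `pymupdf`), this would drive
# the split between native extraction and OCR:
#   import fitz
#   routes = [page_route(page.get_text()) for page in fitz.open("batch.pdf")]
```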

2. Layout zoning

Before extracting fields, segment the page into likely regions:

  • header
  • addresses
  • line-item region
  • totals region
  • footer

This makes the rest of the system much easier. If you can isolate the totals area from the item table, you reduce many false assignments immediately. Azure’s layout and invoice tooling explicitly emphasizes extracting text and layout information from documents, not only OCR text. (Microsoft Learn)

3. Header-field extraction

For fields like supplier, invoice number, dates, subtotal, tax, and total, use a layout-aware extractor.

A strong open baseline is LayoutLMv3. It is built for Document AI and combines text and image signals. This is much better suited to invoices than plain token NER because it can use both wording and spatial placement. (Hugging Face)

4. Line-item extraction

This is the hardest part, and it deserves its own subsystem.

Use two modes:

Mode 1. True table mode

When the invoice has a clear table, use a table detector and structure recognizer. Table Transformer is a good open-source building block here, and its official repository is also the home of PubTables-1M and the GriTS metric. (GitHub)

Mode 2. Implicit table mode

Many invoices do not have a clean bordered table. They use whitespace alignment. In that case:

  • find right-aligned numeric columns first
  • cluster text boxes by vertical overlap into candidate rows
  • treat left text as description
  • merge rows when the description continues but no new numeric anchors appear
  • carry row state across pages if the table continues

This is where many projects fail. DocILE’s separate line-item task is strong evidence that row grouping is not just post-processing noise. It is a central modeling problem. (arXiv)
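The row-clustering step above can be sketched as follows, assuming word boxes as (x0, y0, x1, y1, text) tuples from whatever extractor you use; the 0.5 overlap ratio is an illustrative choice:

```python
def group_rows(words, min_overlap=0.5):
    """Cluster word boxes (x0, y0, x1, y1, text) into candidate
    line-item rows by vertical overlap, then sort each row left
    to right. Sketch only: multiline-description merging and
    cross-page carryover are deliberately left out."""
    rows = []
    for w in sorted(words, key=lambda t: (t[1], t[0])):
        for row in rows:
            top = max(row["y0"], w[1])
            bottom = min(row["y1"], w[3])
            height = min(row["y1"] - row["y0"], w[3] - w[1])
            if height > 0 and (bottom - top) / height >= min_overlap:
                row["words"].append(w)
                row["y0"] = min(row["y0"], w[1])
                row["y1"] = max(row["y1"], w[3])
                break
        else:
            rows.append({"y0": w[1], "y1": w[3], "words": [w]})
    return [[t[4] for t in sorted(r["words"], key=lambda t: t[0])]
            for r in rows]
```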

5. Normalization

Convert raw text into canonical values:

  • dates to ISO format
  • amounts to decimals
  • currency to a standard code
  • supplier names to canonical vendor IDs

Example:

  • 1.234,56 and 1,234.56 should become the same internal number
  • ACME Ltd. and ACME LIMITED should map to the same vendor entity

This step is critical for downstream accounting, duplicate detection, and analytics. The commercial invoice parsers all return structured values because raw OCR text is not enough for workflow automation. (Microsoft Learn)
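A minimal normalizer for the two amount formats above might look like this. It is a sketch: a bare "1.234" (European thousands separator) stays ambiguous and would need locale hints in practice:

```python
from decimal import Decimal
import re

def parse_amount(raw: str) -> Decimal:
    """Normalize amount strings such as '1.234,56' and '1,234.56'
    to one canonical Decimal."""
    s = re.sub(r"[^\d.,-]", "", raw)  # drop currency symbols and spaces
    if "," in s and "." in s:
        if s.rfind(",") > s.rfind("."):      # comma is the decimal mark
            s = s.replace(".", "").replace(",", ".")
        else:                                # dot is the decimal mark
            s = s.replace(",", "")
    elif "," in s:
        head, _, tail = s.rpartition(",")
        # a single comma is a decimal mark only when exactly 2 digits follow
        s = head.replace(",", "") + ("." + tail if len(tail) == 2 else tail)
    return Decimal(s)
```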

6. Validation and reconciliation

This is the most important non-model part of the system.

Do not trust extraction because it “looks right.” Trust it only if it reconciles.

Checks should include:

  • sum(line amounts) ≈ subtotal
  • subtotal + tax + shipping − discount ≈ total
  • quantity × unit price ≈ line amount
  • currency is consistent across the document
  • page-level totals do not get mixed into line items

This is not just cleanup. It is your error detector. KIEval makes the same broad point from an evaluation perspective: industrial document extraction must assess grouped structured information, not just isolated entities. (arXiv)
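The checks above can be sketched as one gate function. The field names and the one-cent tolerance are illustrative assumptions, not a schema recommendation:

```python
from decimal import Decimal

TOL = Decimal("0.01")  # one cent of rounding slack per check (assumption)

def _close(a, b):
    return abs(Decimal(a) - Decimal(b)) <= TOL

def reconcile(inv: dict) -> list[str]:
    """Return the names of failed arithmetic checks for one invoice.
    An empty list means the invoice reconciles."""
    errors = []
    line_sum = sum(li["amount"] for li in inv["lines"])
    if not _close(line_sum, inv["subtotal"]):
        errors.append("lines_vs_subtotal")
    expected_total = (inv["subtotal"] + inv.get("tax", 0)
                      + inv.get("shipping", 0) - inv.get("discount", 0))
    if not _close(expected_total, inv["total"]):
        errors.append("subtotal_vs_total")
    for li in inv["lines"]:
        if not _close(li["qty"] * li["unit_price"], li["amount"]):
            errors.append("line_math:" + li["description"])
    return errors
```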

7. Accounting classification

Only after the invoice is structured should you predict accounting labels.

Inputs can include:

  • canonical vendor
  • line-item descriptions
  • extracted tax rate
  • amount ranges
  • vendor history
  • previous GL mappings for similar items

Start simple. A gradient boosting model or logistic regression over engineered features can work surprisingly well once the document is already structured. You do not need a giant end-to-end model for the accounting part on day one. That is a reasoning recommendation, supported by the fact that major invoice systems focus first on structured extraction and then on downstream workflow integration. (Google Cloud)

Best model choices

You have three realistic routes.

Route 1. Managed parser first

Use AWS Textract, Google Document AI, or Azure Document Intelligence as a production baseline.

This is the fastest way to get a working benchmark because those systems already parse invoices into header fields and line items. Google also supports uptraining its pretrained invoice processor on your own data. (AWS Document)

This route is best when:

  • you need speed
  • labeling data is limited
  • your differentiation is in accounting logic, not OCR research

Route 2. Modular open-source stack

This is the route I would recommend if you want control.

A solid stack is:

  • OCR: docTR or equivalent
  • header extractor: LayoutLMv3
  • line-item structure: Table Transformer
  • rules: normalization + validation

This combination matches the actual structure of the problem. docTR handles word localization and recognition. LayoutLMv3 handles layout-aware field extraction. Table Transformer handles table structure. (mindee.github.io)

Route 3. End-to-end document parser

If you want to benchmark a modern page-to-structured-output model, try an OCR-free or integrated document model such as Donut, PaddleOCR-VL-1.5, or GLM-OCR. Donut is explicitly OCR-free. PaddleOCR-VL-1.5 and GLM-OCR are current document-parsing models on Hugging Face focused on complex document understanding. (Hugging Face)

This route is attractive for fast prototyping, but I would still keep explicit validation and line-item logic around it. End-to-end models are useful front ends. They should not be the only safety mechanism in an accounting workflow. (Hugging Face)

What data to label

Start with a small, useful schema. Do not annotate 80 fields first.

Phase 1 fields

  • supplier_name
  • invoice_id
  • invoice_date
  • due_date
  • currency
  • subtotal
  • tax_amount
  • total_amount

Phase 2 line items

  • description
  • quantity
  • unit
  • unit_price
  • line_amount
  • tax_rate

Phase 3 accounting labels

  • vendor_id
  • GL_account
  • tax_code
  • cost_center

For public benchmarks and prototyping, DocILE is the closest fit to your problem because it is built on business documents and includes line-item recognition. FUNSD and CORD are useful smaller sets for form understanding and receipt-style parsing, but DocILE is the strongest conceptual match for invoices. (arXiv)

How to evaluate it

Do not evaluate only token F1.

Use at least four evaluation layers:

1. Field accuracy

Exact or normalized match for supplier, invoice number, dates, subtotal, tax, and total. (Microsoft Learn)

2. Line-item grouping accuracy

Did the right quantity, price, and amount end up in the same row? This is exactly the kind of structure KIEval argues should be evaluated explicitly. (arXiv)

3. Reconciliation pass rate

What percentage of invoices pass your arithmetic checks with no human correction? This is one of the best business metrics for this use case. It is a design recommendation, supported by the structured nature of invoice outputs and grouping-sensitive evaluation. (AWS Document)

4. Vendor-split testing

Hold out vendors or layouts, not only random pages. DocILE’s test design includes zero-shot and few-shot layouts, which reflects the real production risk: overfitting to common templates. (arXiv)

Practical pitfalls

The common failure modes are predictable:

  • multi-line descriptions split into fake extra rows
  • subtotal mistaken for grand total
  • unit price mistaken for line amount
  • tables continuing across pages
  • tax-inclusive vs tax-exclusive pricing
  • duplicate totals in summary boxes
  • supplier name variants
  • OCR noise on phone photos

This is why hybrid systems remain strong in practice. Template-based tools such as invoice2data still matter because they support line-item and table plugins, multiple regex patterns per field, and custom organizational fields. That is a useful reminder that deterministic rules still have value, especially for high-volume repeat vendors. (GitHub)

What I would do first

If I were building this from scratch, I would do it in this order:

  1. Build a canonical invoice schema
  2. Implement PDF text extraction plus OCR fallback
  3. Add layout zoning
  4. Train or fine-tune a header extractor
  5. Build a dedicated line-item subsystem
  6. Add normalization and reconciliation
  7. Only then add the accounting classifier

That sequence gives you a system that is explainable, measurable, and safe enough for financial workflows. It also mirrors how the mature invoice platforms structure the task. (AWS Document)

Bottom line

A plain NER model is not the best framing.

The better framing is:

Document OCR/parsing → layout understanding → header extraction → line-item extraction → normalization → validation → accounting classification

If you want the most practical custom baseline, I would choose:

  • docTR for OCR
  • LayoutLMv3 for header fields
  • Table Transformer for line items
  • rules plus reconciliation for acceptance
  • a separate classifier for the accounting chart

If you want the fastest benchmark, compare that against one managed parser such as Google, Azure, or AWS. (mindee.github.io)


Use the matrix below by asking one question first:

What is the dominant failure mode in my invoices?

That is the right selector. Mature invoice parsers already split the job into key fields and line items, so one baseline rarely fits every invoice population. AWS exposes SummaryFields and LineItemGroups, Google’s Invoice Parser extracts both header and line-item fields, and Azure’s invoice model does the same. (AWS Document)

Side-by-side decision matrix

Each case below lists the recommended custom baseline, why it fits, the labels or data needed, the cheapest first experiment, and the most likely failure mode.

Case 1. Mostly clean, born-digital PDFs

  • Baseline: PyMuPDF or pdfplumber → region rules → Table Transformer → reconciliation
  • Why: best when OCR is unnecessary. PyMuPDF can extract word boxes and tables directly; pdfplumber is built for detailed PDF geometry, table extraction, and visual debugging, and works best on machine-generated PDFs.
  • Labels/data: very little at first. Often enough to start with regex/rules plus a few manually checked examples.
  • First experiment: run native PDF extraction on 50 invoices. Compare field coverage and line-item recovery before adding any OCR.
  • Likely failure: hidden reading-order issues, merged text blocks, whitespace-only tables. (PyMuPDF)

Case 2. Scans, phone photos, skew, warping

  • Baseline: PP-DocLayoutV3 → GLM-OCR → Table Transformer → reconciliation
  • Why: good when geometry is the problem, not just text recognition. PP-DocLayoutV3 is designed for non-planar documents and reading order; GLM-OCR is a current multimodal OCR model for complex document understanding.
  • Labels/data: a small labeled set for validation is enough to start. Stronger gains come from representative distorted samples.
  • First experiment: benchmark 30 hard pages with and without the layout stage. Measure row recovery, not just OCR text quality.
  • Likely failure: curved pages, bad lighting, over-segmentation, wrong reading order. (Hugging Face)

Case 3. Few labels, many repeat vendors

  • Baseline: invoice2data templates + PDF/OCR backend + vendor normalizers
  • Why: strong when the same vendor layouts repeat. invoice2data supports templates, static fields, and plugins for line items and tables.
  • Labels/data: very low ML labeling need. You mainly need clean templates and vendor-specific cleanup rules.
  • First experiment: template the top 10 vendors that make up most volume. Track touchless rate before training anything.
  • Likely failure: template drift, unseen vendors, multiline descriptions that break template assumptions. (Invoice2data)

Case 4. Header fields are fine, line items are the blocker

  • Baseline: PDF text or OCR → Table Transformer-first pipeline → row repair rules
  • Why: best when supplier/date/total extraction is mostly solved but row grouping is not. Table Transformer is explicitly for table detection and structure recognition from PDF images.
  • Labels/data: you need labeled line items more than labeled headers. Focus annotation on row grouping and numeric columns.
  • First experiment: evaluate on 100 invoices using only line-item metrics: row grouping, numeric binding, subtotal reconciliation.
  • Likely failure: wrapped descriptions, implicit tables, page breaks, missing cell boundaries. (Hugging Face)

Case 5. You want one end-to-end trainable parser baseline

  • Baseline: Donut → schema normalization → reconciliation
  • Why: good as a clean benchmark for “how far can one model go?” Donut is OCR-free and directly maps document images to structured outputs.
  • Labels/data: needs paired page→target-schema examples. Best when you can supervise against JSON-like targets.
  • First experiment: fine-tune on a narrow schema first: supplier, invoice ID, date, subtotal, total, and one simple line-item format.
  • Likely failure: hallucinated structure, unstable long outputs, weak row grouping on dense tables. (Hugging Face)

Case 6. You want a current open document-parser baseline

  • Baseline: PaddleOCR-VL-1.5 → schema normalization → reconciliation
  • Why: good zero-shot benchmark when layouts vary a lot. The current model card positions it as a 0.9B document parser with strong table/text performance and robustness to scanning, skew, warping, screen photography, and illumination.
  • Labels/data: little task-specific labeling needed to start. You mainly need a holdout set for honest evaluation.
  • First experiment: run it on a representative vendor mix and compare only business outputs: valid totals, valid rows, review rate.
  • Likely failure: great parsed text but imperfect field binding, overconfident outputs on rare layouts. (Hugging Face)

Case 7. You want a fuller parsing system with less assembly work

  • Baseline: PP-StructureV3 → custom field mapping → reconciliation
  • Why: good when your goal is end-to-end document parsing rather than assembling many separate tools. PP-StructureV3 is presented as a document parsing solution that converts PDFs and document images to Markdown and JSON.
  • Labels/data: moderate. You still need business-specific mapping and validation, but less low-level pipeline glue.
  • First experiment: use it as a parser front end, then map its structure into your invoice schema and test on 20 messy multi-page invoices.
  • Likely failure: general parser output that is structurally rich but not yet aligned to accounting fields. (Hugging Face)

Case 8. You want the cheapest serious sanity-check baseline

  • Baseline: PyMuPDF/pdfplumber + regex/keywords for headers + column heuristics for lines + reconciliation
  • Why: best for finding out whether ML is even needed yet. If native PDF text and coordinates already solve most of the problem, you learn that before investing in training.
  • Labels/data: almost none initially. You need sample invoices and manual error review.
  • First experiment: try it on 100 digital PDFs. Count how many pass field extraction and arithmetic checks with no ML.
  • Likely failure: fails badly on scans, implicit multi-line rows, and vendor layouts with weak alignment. (PyMuPDF)

How to choose fast

If your invoices are mostly:

  • digital PDFs → start with PyMuPDF/pdfplumber
  • photos or distorted scans → start with PP-DocLayoutV3 + GLM-OCR
  • repeat vendors with low label budget → start with invoice2data
  • line-item-heavy → start with a Table Transformer-first stack
  • mixed layouts and you want a modern open benchmark → start with PaddleOCR-VL-1.5
  • one-model benchmark → try Donut as the clean end-to-end baseline (GitHub)

Default pick if you are unsure

If you do not yet know your dominant failure mode, I would test in this order:

  1. PyMuPDF/pdfplumber baseline on digital PDFs
  2. PaddleOCR-VL-1.5 as the modern open parser benchmark
  3. Table Transformer-first for line items
  4. invoice2data for high-volume repeat vendors
  5. Donut only as the end-to-end control baseline (PyMuPDF)

What about the costs of shipping, freight (which is different from shipping in our accounting system), and discounts? Those all affect the final invoice amount.

John6666 gives good advice. I tried to extract text from 500+ invoices in a single PDF file, and it didn’t go well. I would look for text like “invoice number \d{4,6}”, where the invoice number was 4-6 digits in a row, but the OCR software would often not extract the data in the same order. So it would not find the invoice number, and it missed much other data as well.

But I did that with free OCR software and not with AI designed to do OCR.


Hmm… While commercial OCR services may include such features, standalone OCR models are often not very good at properly interpreting multi-page data. This is because, in most cases, the models are primarily trained on pairs of a single page and the information to be extracted…

The most straightforward workaround is to split the document into individual pages before feeding them to the OCR model:


Your old approach broke for a structural reason, not just because the OCR was free.

You were effectively doing:

500+ invoices in one PDF → OCR everything → flatten to one text stream → run regex like invoice number \d{4,6}

That is brittle because PDF/OCR extraction often does not preserve normal reading order. PyMuPDF’s docs say plain PDF text extraction may come out “not in usual reading order,” with unexpected line breaks, and recommend using blocks or words with position data instead. (PyMuPDF)

So the main fix is not “use a better regex.” The main fix is:

split first, extract locally, keep coordinates, then validate the totals. Google’s Custom Splitter exists specifically to split packed PDFs into logical documents before extraction, and Google notes that bad splits are especially damaging because one split error causes downstream extraction errors. (Google Cloud Documentation)

What to do with shipping, freight, and discounts

Treat them as separate normalized fields in your schema. Do not fold them into one generic “total adjustment.”

A practical invoice schema for your case is:

{
  "invoice_id": "123456",
  "vendor_name": "ACME Supplies Ltd",
  "invoice_date": "2026-03-15",
  "currency": "USD",

  "subtotal": 1000.00,
  "line_item_discount_total": 20.00,
  "invoice_level_discount": 10.00,
  "shipping_charge": 15.00,
  "freight_charge": 40.00,
  "handling_charge": 0.00,
  "tax_total": 102.50,
  "invoice_total": 1127.50,
  "amount_due": 1127.50
}

And also store the raw label text that produced each field:

  • raw_label = "Shipping"
  • raw_label = "Shipping & Handling"
  • raw_label = "Freight"
  • raw_label = "Discount"

That matters because standard parsers do not always match your accounting distinctions exactly. AWS Textract explicitly standardizes DISCOUNT and SHIPPING_HANDLING_CHARGE. Azure’s invoice model extracts invoice fields and line items into structured JSON. Microsoft Dynamics’ invoice entity explicitly models FreightAmount, TotalDiscountAmount, TotalLineItemAmount, TotalAmountLessFreight, and TotalTax, which is close to the accounting structure you need. (AWS Document)

The formula to validate

Use arithmetic validation as a hard gate.

A practical rule is:

invoice_total
≈ subtotal
- line_item_discount_total
- invoice_level_discount
+ shipping_charge
+ freight_charge
+ handling_charge
+ tax_total
+ other_surcharges

And if the invoice has prior balance or prior credits:

amount_due
≈ invoice_total
+ previous_unpaid_balance
- credits_or_payments

This is not cosmetic cleanup. It is your error detector. If the parser confuses freight with a line item, or misses a discount, this check will usually fail.
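Using the schema field names from the example above, both identities reduce to one balance check. The two-cent tolerance is an assumption to absorb rounding; tune it per currency:

```python
from decimal import Decimal

TOL = Decimal("0.02")  # rounding slack (assumption; tune per currency)

def balances(inv: dict) -> bool:
    """Check both identities: invoice_total against its components,
    and amount_due against invoice_total plus prior balance/credits.
    Missing optional fields default to zero."""
    g = lambda k: Decimal(str(inv.get(k, 0)))
    expected_total = (g("subtotal")
                      - g("line_item_discount_total")
                      - g("invoice_level_discount")
                      + g("shipping_charge")
                      + g("freight_charge")
                      + g("handling_charge")
                      + g("tax_total"))
    if abs(expected_total - g("invoice_total")) > TOL:
        return False
    expected_due = (g("invoice_total")
                    + g("previous_unpaid_balance")
                    - g("credits_or_payments"))
    return abs(expected_due - g("amount_due")) <= TOL
```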

Why your invoice number was missed

Your regex expected the text to appear like this:

Invoice Number 123456

But OCR/PDF extraction often returns something more like:

Invoice
Date
123456
Number

or mixes it with neighboring text from another block. PyMuPDF’s docs describe exactly this kind of issue and recommend using block and word extraction with coordinates to rebuild reading order or search local rectangles instead of relying on one global text stream. (PyMuPDF)

So instead of searching the full document with:

invoice number \d{4,6}

do this:

  1. find the header region
  2. find labels such as Invoice No, Invoice Number, Invoice #
  3. collect candidate values near those labels
  4. rank them by distance and alignment
  5. then validate the winner with ^\d{4,6}$

That changes regex from a discovery method into a validator. That is much more reliable.
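A sketch of those steps, assuming word boxes as (x0, y0, x1, y1, text) tuples. The distance ranking is deliberately crude; real ranking should also weigh alignment and reading direction:

```python
import math
import re

VALUE = re.compile(r"^\d{4,6}$")  # the regex is now a validator, not a finder

def _center(w):
    return ((w[0] + w[2]) / 2, (w[1] + w[3]) / 2)

def find_invoice_number(words):
    """Collect numeric candidates that pass the validator regex and
    rank them by distance to the first 'Invoice'-like label word."""
    anchors = [w for w in words if "invoice" in w[4].lower()]
    candidates = [w for w in words if VALUE.match(w[4])]
    if not anchors or not candidates:
        return None
    ax, ay = _center(anchors[0])
    best = min(candidates, key=lambda w: math.dist((ax, ay), _center(w)))
    return best[4]
```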

The concrete pipeline I would use

1. Split the packed PDF into individual invoices

This is the first change.

Start with page-level signals:

  • Invoice near the top
  • an invoice-number/date block near the header
  • totals block near the bottom
  • repeated vendor header/logo
  • continuation pages with line-item tables but no new invoice header

Google’s Custom Splitter is built around exactly this use case: composite files containing multiple logical documents that then get routed to the appropriate extractor. (Google Cloud Documentation)

2. Use native PDF text before OCR when possible

If a page is born-digital, extract words and blocks directly from the PDF first. PyMuPDF recommends block and word extraction because plain text order may be wrong, and Page.get_text("blocks") / Page.get_text("words") preserve useful position information. (PyMuPDF)
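For example, once you have (x0, y0, x1, y1, text) tuples of the kind that word-level extraction returns, rebuilding a usable reading order is a small sort. The 3-point line tolerance is an assumption:

```python
def reading_order(words, line_tol=3.0):
    """Sort (x0, y0, x1, y1, text) word boxes top-to-bottom,
    left-to-right: group words into lines by similar y0, then
    sort each line by x0. Sketch; skewed scans need more care."""
    if not words:
        return []
    out = sorted(words, key=lambda w: (w[1], w[0]))
    lines, cur = [], [out[0]]
    for w in out[1:]:
        if abs(w[1] - cur[-1][1]) <= line_tol:
            cur.append(w)
        else:
            lines.append(sorted(cur, key=lambda w: w[0]))
            cur = [w]
    lines.append(sorted(cur, key=lambda w: w[0]))
    return [w[4] for line in lines for w in line]
```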

3. Use document OCR only for scanned pages

For scanned pages or images, use invoice/document AI OCR rather than generic OCR-only tooling. Azure’s invoice model is built to handle phone captures, scanned documents, and digital PDFs, and returns recognized text, tables, and invoice-specific fields plus line items. AWS Textract’s invoice/receipt path similarly outputs structured summary fields and line items instead of one text blob. (Microsoft Learn)

4. Keep coordinates in your intermediate data

For each word, keep:

  • page number
  • text
  • bounding box
  • line ID
  • block ID
  • confidence
  • source type: native PDF or OCR

This is what lets you ask useful questions like “what is near the invoice-number label?” instead of “does the whole OCR blob contain the pattern?”

5. Zone the page before extracting fields

Split each invoice into approximate regions:

  • header
  • vendor/bill-to area
  • line-item area
  • totals area
  • footer/remittance area

Then only search:

  • invoice number and date in the header
  • shipping/freight/discount/tax/total in the totals area
  • products, qty, price, amount in the line-item area

This mirrors how invoice parsers expose output: Azure returns text, tables, and invoice-specific fields; AWS separates summary fields and line items. (Microsoft Learn)

6. Treat charges as labeled totals lines

Inside the totals block, extract a list of labeled amount lines:

  • Shipping → shipping_charge
  • Shipping & Handling → shipping_charge (or split later)
  • Freight → freight_charge
  • Discount → invoice_level_discount
  • Rebate → invoice_level_discount

Because your accounting system distinguishes freight from shipping, do not collapse them automatically.

7. Reconstruct line items separately

Do not use header-field logic for line items.

For line items, use a table or pseudo-table approach:

  • detect numeric columns on the right
  • group words into rows by vertical overlap
  • treat left text as description
  • merge multiline descriptions when there is no new numeric anchor

That is where invoice extraction usually becomes hard.

Best practical options

Fastest path

Benchmark a purpose-built invoice parser first.

Good starting options are:

  • Google Document AI: Custom Splitter + Invoice Parser + uptraining/custom fields. Google explicitly says you can uptrain the Invoice Parser with your own data and add custom fields that are not supported by the pretrained model. That is directly useful for a field like freight_charge. (Google Cloud Documentation)
  • AWS Textract AnalyzeExpense: it already standardizes DISCOUNT and SHIPPING_HANDLING_CHARGE, plus summary fields and line items. (AWS Document)
  • Azure Document Intelligence invoice model: it handles scanned images, PDFs, and line items in structured JSON. (Microsoft Learn)

Strong custom path

If you want to own the stack:

  • split packed PDFs first
  • use PyMuPDF blocks/words for digital PDFs
  • use OCR only for scanned pages
  • keep coordinates
  • zone header/totals/items separately
  • extract local candidates near labels
  • normalize freight, shipping, discount, tax
  • reconcile the math before posting anything

My concrete advice for you

For your situation, I would do this in order:

Phase 1

Take 30 to 50 invoices from the packed PDF and manually create a small gold set:

  • correct invoice boundaries
  • correct invoice number
  • subtotal
  • discount
  • shipping
  • freight
  • tax
  • total
  • amount due

Phase 2

Test two paths:

  • managed invoice parser with splitting
  • native PDF extraction + local rules on already-split invoices

That will tell you very quickly whether your real bottleneck is:

  • split detection
  • OCR quality
  • reading order
  • totals parsing
  • line-item grouping

Phase 3

Lock your schema before tuning models:

  • subtotal
  • line_item_discount_total
  • invoice_level_discount
  • shipping_charge
  • freight_charge
  • tax_total
  • invoice_total
  • amount_due

Phase 4

Add reconciliation rules and reject anything that does not balance.

Bottom line

Your earlier failure does not mean invoice extraction is a bad fit.

It means the earlier workflow was fragile:

  • too many invoices in one PDF
  • flattened OCR text
  • regex dependent on OCR order

A stronger workflow is:

packed PDF → split into invoices → extract with coordinates → parse header/totals/lines separately → keep freight separate from shipping → keep discounts explicit → validate the math

That is the path I would take. (Google Cloud Documentation)

Thank you again. I think I will start with a workflow that extracts paragraphs of text to understand the workflow, as that sounds easier; there is no tabular data in the first case I will use as a test. The invoice sounds much more complicated. I know some Python but have not done AI or OCR with Python before.

If I do extract an invoice, the PDF we do get has multiple invoice summaries per page. I failed to mention this earlier. It’s an invoice summary from a major US shipper. There is one charge per shipment so they fit multiple invoices per page. We do 1000s of shipments every month with this shipper and other shippers.


I know some Python but have not done AI or OCR with Python before.

Oh. PyTorch, Transformers, and other libraries handle GPU-related tasks and the acceleration of bottleneck processes, effectively wrapping them up for us.

As a result, there isn’t much difference in how you actually use standard Python functions (methods) versus functions designed for AI models. The only real precaution is to ensure the hardware isn’t busy when calling AI functions; other than that, there aren’t many other things to watch out for. I think it’s easier than functions related to disk I/O…:sweat_smile:

However, generally speaking, even the lightest general-purpose AI models available on HF are somewhat heavier than plain code, so it might be best to create a prototype without AI first. The fastest approach is to avoid using AI unless absolutely necessary.


For your first learning pass, starting with text blocks or paragraph-like chunks is a good idea. But for the shipper summary PDF you just described, I would change one word:

Do not think in terms of paragraphs.
Think in terms of repeating summary blocks on a page.

That is the key difference.

If one page contains multiple shipment summaries, then the first problem is not “extract all text.” The first problem is segment the page into summary-sized regions, then extract each region separately. A page-level splitter alone will not solve that, because Google’s Custom Splitter is designed to identify logical documents in composite files and return page-level document boundaries; if multiple summaries live on the same page, you still need an intra-page segmentation step. (Google Cloud Documentation)

What I would do first

For your first implementation, I would not start with AI.

I would start with a simple Python workflow using native PDF text extraction, because if these shipper PDFs are machine-generated, that is usually easier and more reliable than OCR. PyMuPDF can extract text as blocks and words, and its docs explicitly note that plain text may not come out in natural reading order, while block/word extraction and sorting help recover usable structure. pdfplumber is also built for detailed PDF inspection and says it works best on machine-generated PDFs. (PyMuPDF)

So the beginner-friendly path is:

  1. open one PDF page
  2. extract blocks or words with coordinates
  3. detect the repeated shipment-summary regions on that page
  4. extract text inside each region
  5. parse one region at a time

That is much easier than OCR-first AI work, and it teaches the right workflow. (PyMuPDF)

Why your previous regex workflow failed

Your regex was probably not the main problem.

The main problem was that OCR or plain PDF extraction often returns text in an order that is not the order your regex expects. PyMuPDF’s docs say the output of plain text extraction may not match natural reading order, and they provide sort=True plus block/word extraction specifically to help with this. (PyMuPDF)

So instead of searching a giant text blob for:

invoice number \d{4,6}

you want to search inside one detected summary region, and only then look for local label/value pairs.

That is a very different workflow.

The right mental model for your shipper summary PDF

Because there is one charge per shipment and multiple summaries per page, your first real task is probably this:

page → repeated summary boxes/cards → one structured record per box

Not:

page → paragraphs

That matters because your data sounds closer to a repeating form layout than to a narrative document.

So I would define one shipment-summary record like this:

  • shipper account
  • shipment date
  • tracking number or shipment reference
  • invoice number or summary number
  • base charge
  • shipping charge
  • freight charge
  • discount
  • tax
  • total

Then repeat that extraction for every summary block on the page.
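As a concrete target for that record, here is a minimal sketch using a dataclass. The field names mirror the bullet list above and are assumptions; adjust them to whatever your shipper's layout actually calls these amounts.

```python
# One shipment-summary record as a dataclass.
# Every field starts as None so you can see which values were never extracted.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShipmentSummary:
    shipper_account: Optional[str] = None
    shipment_date: Optional[str] = None
    tracking_number: Optional[str] = None
    invoice_number: Optional[str] = None
    base_charge: Optional[float] = None
    shipping_charge: Optional[float] = None
    freight_charge: Optional[float] = None
    discount: Optional[float] = None
    tax: Optional[float] = None
    total: Optional[float] = None

    def missing_fields(self):
        """Names of fields still unset, useful for per-region QA."""
        return [k for k, v in self.__dict__.items() if v is None]

record = ShipmentSummary(invoice_number="48213", total=152.40)
print(record.missing_fields())  # everything you still have to extract
```

Filling one of these per summary region also gives you a natural place to attach the arithmetic validation later.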

The easiest MVP

I would build the MVP in four steps.

Step 1: decide whether you even need OCR

Try native PDF extraction first.

Use PyMuPDF or pdfplumber on a few sample pages and inspect whether the text comes out clean enough. pdfplumber explicitly says it works best on machine-generated PDFs, and PyMuPDF exposes blocks, words, and rectangles you can search within. (GitHub)

If that works, you just saved yourself a lot of complexity.

Only add OCR later for scanned files or mixed-quality inputs. OCRmyPDF is a good fallback because it adds a searchable text layer to scanned PDFs and is designed to tolerate files that mix scanned and born-digital content. (OCRmyPDF)
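The "do I need OCR?" decision can itself be a tiny heuristic. A sketch, assuming you have already pulled `page.get_text()` strings for each page with PyMuPDF: a page with almost no native text is probably a scan and would go through OCRmyPDF first. The 25-character threshold is an arbitrary starting point.

```python
# Flag pages whose native text layer looks empty (i.e., likely scans).
def pages_needing_ocr(page_texts, min_chars=25):
    """Return indices of pages with too little extractable text."""
    flagged = []
    for i, text in enumerate(page_texts):
        if len(text.strip()) < min_chars:
            flagged.append(i)
    return flagged

# Example: page 0 has real text, page 1 looks like a scan.
texts = ["Invoice 48213\nTotal 152.40 USD\nFreight 12.00", "  \n"]
print(pages_needing_ocr(texts))  # → [1]
```

If this flags nothing on your sample files, you can defer OCR entirely.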

Step 2: inspect one page visually

Use block extraction and plot the blocks or print their bounding boxes.

You want to answer:

  • do the shipment summaries appear as repeated vertical blocks?
  • are the invoice number and total in stable positions?
  • are the label/value pairs close together?
  • do all summaries have roughly the same width and height?

If yes, you can often segment the page with very simple geometry rules.

Step 3: segment one page into summary regions

Start with rules, not AI.

Examples:

  • cluster words/blocks by vertical gaps
  • detect repeated top labels like “Invoice,” “Shipment,” or “Tracking”
  • use horizontal rules or whitespace bands if the PDF has them
  • find repeated left edges and repeated heights

Because multiple summaries are on one page, this step is probably more important than OCR quality.
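The last rule in that list, repeated left edges and repeated heights, can be sketched with plain geometry and no PDF library at all. The coordinates below are made up for the demo; in practice they would come from block extraction.

```python
# Group bounding boxes whose left edge and height repeat across the page.
# Boxes are (x0, y0, x1, y1) tuples, as PyMuPDF block extraction provides.
def repeated_card_candidates(boxes, x_tol=3.0, h_tol=4.0):
    """Return groups of boxes that share a left edge and height."""
    groups = []
    for box in sorted(boxes, key=lambda b: b[1]):  # top to bottom
        x0, y0, x1, y1 = box
        h = y1 - y0
        for g in groups:
            if abs(x0 - g["x0"]) <= x_tol and abs(h - g["h"]) <= h_tol:
                g["boxes"].append(box)
                break
        else:
            groups.append({"x0": x0, "h": h, "boxes": [box]})
    # A "repeated" layout means at least two boxes share the pattern.
    return [g["boxes"] for g in groups if len(g["boxes"]) >= 2]

# Three same-shaped summary cards plus one page header.
boxes = [
    (72, 40, 520, 60),    # header: same left edge, different height
    (72, 90, 520, 190),   # card 1
    (72, 210, 520, 310),  # card 2
    (72, 330, 520, 430),  # card 3
]
cards = repeated_card_candidates(boxes)
print(len(cards), len(cards[0]))  # → 1 3  (one repeated pattern, three cards)
```

The header is excluded automatically because nothing else on the page shares its height, which is exactly the behavior you want from a summary-card detector.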

Step 4: parse one summary region locally

Once you isolate one region, do local extraction:

  • find the invoice-number label inside that region
  • look nearby for 4–6 digit candidates
  • validate the winner with regex
  • repeat for total, discount, freight, shipping

That local approach is much more robust than global regex over the whole document.
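A sketch of that local approach, with a made-up region text: restrict the label search to lines inside one region, then apply the value regex only on the matching line. The labels and sample values here are assumptions; real labels vary by shipper.

```python
# Local label/value extraction inside one isolated summary region.
import re

REGION_TEXT = """\
Invoice No: 48213
Tracking: 1Z999AA10123456784
Freight 12.00   Discount -3.50
Total Due 152.40
"""

def find_after_label(text, label_pattern, value_pattern):
    """Find a value on the same line as its label, within one region only."""
    for line in text.splitlines():
        if re.search(label_pattern, line, re.IGNORECASE):
            m = re.search(value_pattern, line)
            if m:
                return m.group(0)
    return None

invoice = find_after_label(REGION_TEXT, r"invoice", r"\b\d{4,6}\b")
total = find_after_label(REGION_TEXT, r"total", r"\d+\.\d{2}")
print(invoice, total)  # → 48213 152.40
```

Because the search never leaves the region, a stray 4–6 digit number in a neighboring summary can no longer be mistaken for this shipment's invoice number.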

My recommendation about “paragraph extraction”

For a first exercise, yes, extracting paragraph-like text chunks is fine because it teaches:

  • how to open a PDF in Python
  • how to inspect blocks and words
  • how to handle coordinates
  • how to build a parser incrementally

But for your real shipper-summary use case, I would upgrade that idea to:

extract repeated blocks, not paragraphs

That is the version that matches your document structure.

A practical beginner roadmap

Phase 1: no AI, no OCR

Use PyMuPDF.

Goal:

  • extract blocks and words from one page
  • print their coordinates
  • manually identify where one shipment summary starts and ends

PyMuPDF’s text recipes include block extraction, word extraction, extraction inside rectangles, and sorted text output. (PyMuPDF)

Phase 2: rule-based region segmentation

Write a small function that groups blocks into summary regions.

Goal:

  • get from “one page” to “N shipment summaries on that page”

Phase 3: field extraction inside each region

Use local rules:

  • regex only inside the region
  • proximity to labels
  • fallback rules if a label is missing

Phase 4: arithmetic validation

Because shipping, freight, discounts, and total all matter, add a check like:

total ≈ base_charge - discounts + shipping + freight + tax

The exact formula depends on the shipper’s layout, but the principle is stable: do not trust extracted numbers until they balance.
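The balance check is a few lines of Python. This sketch assumes the sign convention shown above (discount subtracted, everything else added); swap the signs to match your shipper. The tolerance exists because extracted amounts are parsed floats.

```python
# Validate that extracted charges add up to the stated total.
def charges_balance(base, discount, shipping, freight, tax, total, tol=0.01):
    """True when base - discount + shipping + freight + tax matches total."""
    expected = base - discount + shipping + freight + tax
    return abs(expected - total) <= tol

print(charges_balance(140.00, 3.50, 10.00, 2.00, 3.90, 152.40))  # → True
print(charges_balance(140.00, 3.50, 10.00, 2.00, 3.90, 160.00))  # → False
```

A record that fails this check should be flagged for review rather than written to your accounting output.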

Where AI helps later

AI becomes useful after you understand the document shape.

For your case, I would add AI later for one of three reasons:

  • some PDFs are scans, so you need OCR
  • the summary-region segmentation is inconsistent
  • local rules for labels and fields become too brittle across shippers

At that point, current document models on Hugging Face such as PaddleOCR-VL-1.5, GLM-OCR, or layout companions like PP-DocLayoutV3 become relevant, but I would not start there if your immediate goal is to learn the workflow and get a first success in Python. Those tools are better once you already know what one correct extracted record should look like. (Google Cloud)

What I would choose for you right now

I would start with this exact stack:

  • PyMuPDF for page, block, and word extraction
  • pdfplumber only as a visual debugging helper when needed
  • no OCR unless the sample PDFs turn out to be image-only
  • rule-based region segmentation
  • local regex + label proximity for field extraction
  • math validation for shipping, freight, discounts, and totals

That is the simplest path that still matches your real document structure. PyMuPDF gives you the coordinates and block-level tools you need, and pdfplumber is helpful when you want to inspect how the PDF is laid out. (PyMuPDF)

The one change I would make to your plan

Your instinct to start simpler is correct.

I would just change the target from:

paragraph extraction

to:

summary-block extraction

That single change aligns the project with the actual structure of your shipper PDF and gives you a much better chance of getting an early win.


Start with PyMuPDF only. It can extract blocks and words with coordinates, and sort=True can reorder output roughly from top-left to bottom-right. That is a much better first step than OCR for machine-generated PDFs, especially when one page contains multiple repeated shipment summaries. (PyMuPDF)

# deps:
#   pip install pymupdf
#
# notes:
# - No AI model. No OCR. CPU-safe.
# - Replace SAMPLE_PDF_URL later with your own PDF path or URL.
# - This is a first workflow script: download/open PDF -> extract text blocks ->
#   group nearby blocks into rough "summary regions" -> print/save results.

import json
import os
import urllib.request
import fitz  # PyMuPDF

# Public sample PDF for demo.
# Replace with your own local PDF path later, for example:
# PDF_SOURCE = "my_shipper_summary.pdf"
PDF_SOURCE = "https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf"

OUT_DIR = "demo_pdf_blocks"
PAGE_INDEX = 0          # first page only for the first experiment
GAP_THRESHOLD = 18.0    # larger => fewer/larger grouped regions

os.makedirs(OUT_DIR, exist_ok=True)

def ensure_local_pdf(src: str) -> str:
    """Download the PDF if src is a URL. Otherwise return the local path."""
    if src.startswith("http://") or src.startswith("https://"):
        local_path = os.path.join(OUT_DIR, "sample.pdf")
        if not os.path.exists(local_path):
            print(f"Downloading sample PDF to: {local_path}")
            urllib.request.urlretrieve(src, local_path)
        return local_path
    return src

def clean_text(s: str) -> str:
    """Normalize block text for easier printing."""
    return " ".join(s.replace("\x00", " ").split())

def group_blocks_into_regions(blocks, gap_threshold=18.0):
    """
    Very simple region grouping.
    Assumes blocks arrive already sorted top-to-bottom (get_text(..., sort=True));
    a new region starts whenever the vertical gap to the previous block is large.
    This is only a first heuristic for repeated summary blocks.
    """
    regions = []
    current = []

    for block in blocks:
        x0, y0, x1, y1, text, block_no, block_type = block
        if block_type != 0:  # keep text blocks only
            continue
        text = clean_text(text)
        if not text:
            continue

        item = {
            "bbox": [round(x0, 1), round(y0, 1), round(x1, 1), round(y1, 1)],
            "text": text,
            "block_no": int(block_no),
        }

        if not current:
            current.append(item)
            continue

        prev_y1 = current[-1]["bbox"][3]
        current_y0 = item["bbox"][1]
        vertical_gap = current_y0 - prev_y1

        if vertical_gap > gap_threshold:
            regions.append(current)
            current = [item]
        else:
            current.append(item)

    if current:
        regions.append(current)

    # Add combined region bbox + joined text
    packed = []
    for idx, region in enumerate(regions):
        xs0 = [b["bbox"][0] for b in region]
        ys0 = [b["bbox"][1] for b in region]
        xs1 = [b["bbox"][2] for b in region]
        ys1 = [b["bbox"][3] for b in region]
        packed.append({
            "region_id": idx,
            "bbox": [min(xs0), min(ys0), max(xs1), max(ys1)],
            "text": "\n".join(b["text"] for b in region),
            "blocks": region,
        })
    return packed

# 1) Load PDF
pdf_path = ensure_local_pdf(PDF_SOURCE)
doc = fitz.open(pdf_path)
page = doc[PAGE_INDEX]

# 2) Extract blocks with sort=True
# PyMuPDF can also do get_text("words", sort=True) later if you need finer control.
blocks = page.get_text("blocks", sort=True)

# 3) Group blocks into rough page regions
regions = group_blocks_into_regions(blocks, gap_threshold=GAP_THRESHOLD)

# 4) Save raw outputs
raw_blocks_path = os.path.join(OUT_DIR, "page_blocks.json")
regions_path = os.path.join(OUT_DIR, "page_regions.json")
page_text_path = os.path.join(OUT_DIR, "page_text.txt")

with open(raw_blocks_path, "w", encoding="utf-8") as f:
    json.dump(blocks, f, indent=2, ensure_ascii=False)

with open(regions_path, "w", encoding="utf-8") as f:
    json.dump(regions, f, indent=2, ensure_ascii=False)

with open(page_text_path, "w", encoding="utf-8") as f:
    for region in regions:
        f.write(f"\n=== REGION {region['region_id']} ===\n")
        f.write(region["text"])
        f.write("\n")

# 5) Print a compact summary
print(f"\nPDF: {pdf_path}")
print(f"Pages: {doc.page_count}")
print(f"Using page index: {PAGE_INDEX}")
print(f"Text blocks found: {sum(1 for b in blocks if b[6] == 0)}")
print(f"Rough regions found: {len(regions)}")

for region in regions:
    x0, y0, x1, y1 = region["bbox"]
    preview = region["text"][:200].replace("\n", " | ")
    print(f"\nREGION {region['region_id']}  bbox=({x0}, {y0}, {x1}, {y1})")
    print(f"Preview: {preview}")

print("\nSaved:")
print(" -", raw_blocks_path)
print(" -", regions_path)
print(" -", page_text_path)

doc.close()

For your shipper-summary PDF, the next step after this is to replace the simple vertical-gap grouping with repeating summary-block detection, then extract fields like invoice number, shipping, freight, discount, and total inside each region only. That avoids the reading-order problem that broke the whole-document regex approach. (PyMuPDF)