Invoice Data Recognition

I would like to build a model capable of taking the data from an invoice and detecting different tags to automatically build an accounting chart. The tags would be the name of the provider, product tags and their respective price tags, the total amount… Any recommendations on how you would approach this model? I know the most natural would be some kind of NER model, but I would like to know if any specifics would be helpful for this problem.

Thank you!


While there are plenty of good existing OCR models, you shouldn’t expect a single model to work well on its own when dealing with extremely messy invoices. It’s better to use them in combination.

How heavy the OCR model or other models need to be depends on just how messy the invoices are…


Build it as a document understanding pipeline, not as a plain NER model.

That is the main recommendation.

Invoices are visually structured documents. They contain header fields, totals blocks, and line-item tables. That is why the mature invoice systems from AWS, Google, and Azure all treat invoices as a mix of summary fields and line items, not as one flat sequence-labeling problem. AWS Textract returns SummaryFields and LineItemGroups. Google’s Invoice Parser extracts both header and line-item fields. Azure’s invoice model also extracts key fields plus line items in structured JSON. (AWS Document)

Why plain NER is not enough

Classic NER assumes text is mostly a sequence. In invoices, meaning depends heavily on position and grouping:

  • the same number can be a unit price, tax amount, subtotal, or total
  • a product description may wrap across multiple lines
  • values belonging to one row may be far apart in reading order but aligned visually
  • totals often appear multiple times in different boxes

That is why document models such as LayoutLMv3 use both text and image/layout information, and why DocILE evaluates Key Information Localization and Extraction separately from Line Item Recognition. Line-item recognition exists as a separate task because finding fields is easier than grouping them into the correct item rows. (Hugging Face)

The right way to think about the problem

Your real goal is not only “tag supplier, products, prices, total.”

Your real goal is:

  1. read the invoice correctly
  2. recover its structure
  3. normalize the values
  4. validate the math
  5. map the result to accounting categories

So I would split the system into two major outputs:

Output A. Structured invoice extraction

This produces:

  • supplier name
  • supplier tax ID
  • invoice number
  • invoice date
  • due date
  • currency
  • subtotal
  • tax
  • total
  • line items

Output B. Accounting decision

This uses the structured output to predict:

  • GL account
  • expense category
  • tax code
  • cost center
  • approval or exception flags

That separation is important. Extraction answers “what is on the invoice.” Accounting classification answers “how finance should code it.” Google’s Document AI flow reflects this distinction too: you can use a pretrained invoice parser and then uptrain it with your own fields and data when the generic parser is not enough. (Google Cloud Documentation)

The architecture I would recommend

1. Ingestion and document triage

First decide what kind of file you have:

  • born-digital PDF with selectable text
  • scanned PDF
  • photo or image
  • multi-page mixed document

For born-digital PDFs, extract the text layer and coordinates first. For scans or photos, run OCR. A hybrid setup is better than forcing every document through image OCR, because clean PDF text is usually more accurate than OCR. OCR stacks such as docTR are useful here because they return localized word predictions, not just a plain string. (mindee.github.io)
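As a sketch, the per-page routing decision can be as simple as checking how much selectable text a page carries. The 50-character threshold and the PyMuPDF usage in the comment are illustrative assumptions, not tested defaults:

```python
def page_route(page_text: str, min_chars: int = 50) -> str:
    """Route one page: trust the native PDF text layer if it is
    substantial, otherwise send the page image to OCR.
    The threshold is a rough heuristic (assumption)."""
    visible = "".join(page_text.split())  # ignore whitespace-only layers
    return "native" if len(visible) >= min_chars else "ocr"

# With PyMuPDF (assumption: installed as `pymupdf`), this would drive
# the split between native extraction and OCR:
#   import fitz
#   routes = [page_route(page.get_text()) for page in fitz.open("batch.pdf")]
```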

2. Layout zoning

Before extracting fields, segment the page into likely regions:

  • header
  • addresses
  • line-item region
  • totals region
  • footer

This makes the rest of the system much easier. If you can isolate the totals area from the item table, you reduce many false assignments immediately. Azure’s layout and invoice tooling explicitly emphasizes extracting text and layout information from documents, not only OCR text. (Microsoft Learn)

3. Header-field extraction

For fields like supplier, invoice number, dates, subtotal, tax, and total, use a layout-aware extractor.

A strong open baseline is LayoutLMv3. It is built for Document AI and combines text and image signals. This is much better suited to invoices than plain token NER because it can use both wording and spatial placement. (Hugging Face)

4. Line-item extraction

This is the hardest part, and it deserves its own subsystem.

Use two modes:

Mode 1. True table mode

When the invoice has a clear table, use a table detector and structure recognizer. Table Transformer is a good open-source building block here, and its official repository is also the home of PubTables-1M and the GriTS metric. (GitHub)

Mode 2. Implicit table mode

Many invoices do not have a clean bordered table. They use whitespace alignment. In that case:

  • find right-aligned numeric columns first
  • cluster text boxes by vertical overlap into candidate rows
  • treat left text as description
  • merge rows when the description continues but no new numeric anchors appear
  • carry row state across pages if the table continues

This is where many projects fail. DocILE’s separate line-item task is strong evidence that row grouping is not just post-processing noise. It is a central modeling problem. (arXiv)
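The row-clustering step above can be sketched as follows, assuming word boxes as (x0, y0, x1, y1, text) tuples from whatever extractor you use; the 0.5 overlap ratio is an illustrative choice:

```python
def group_rows(words, min_overlap=0.5):
    """Cluster word boxes (x0, y0, x1, y1, text) into candidate
    line-item rows by vertical overlap, then sort each row left
    to right. Sketch only: multiline-description merging and
    cross-page carryover are deliberately left out."""
    rows = []
    for w in sorted(words, key=lambda t: (t[1], t[0])):
        for row in rows:
            top = max(row["y0"], w[1])
            bottom = min(row["y1"], w[3])
            height = min(row["y1"] - row["y0"], w[3] - w[1])
            if height > 0 and (bottom - top) / height >= min_overlap:
                row["words"].append(w)
                row["y0"] = min(row["y0"], w[1])
                row["y1"] = max(row["y1"], w[3])
                break
        else:
            rows.append({"y0": w[1], "y1": w[3], "words": [w]})
    return [[t[4] for t in sorted(r["words"], key=lambda t: t[0])]
            for r in rows]
```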

5. Normalization

Convert raw text into canonical values:

  • dates to ISO format
  • amounts to decimals
  • currency to a standard code
  • supplier names to canonical vendor IDs

Example:

  • 1.234,56 and 1,234.56 should become the same internal number
  • ACME Ltd. and ACME LIMITED should map to the same vendor entity

This step is critical for downstream accounting, duplicate detection, and analytics. The commercial invoice parsers all return structured values because raw OCR text is not enough for workflow automation. (Microsoft Learn)
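A minimal normalizer for the two amount formats above might look like this. It is a sketch: a bare "1.234" (European thousands separator) stays ambiguous and would need locale hints in practice:

```python
from decimal import Decimal
import re

def parse_amount(raw: str) -> Decimal:
    """Normalize amount strings such as '1.234,56' and '1,234.56'
    to one canonical Decimal."""
    s = re.sub(r"[^\d.,-]", "", raw)  # drop currency symbols and spaces
    if "," in s and "." in s:
        if s.rfind(",") > s.rfind("."):      # comma is the decimal mark
            s = s.replace(".", "").replace(",", ".")
        else:                                # dot is the decimal mark
            s = s.replace(",", "")
    elif "," in s:
        head, _, tail = s.rpartition(",")
        # a single comma is a decimal mark only when exactly 2 digits follow
        s = head.replace(",", "") + ("." + tail if len(tail) == 2 else tail)
    return Decimal(s)
```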

6. Validation and reconciliation

This is the most important non-model part of the system.

Do not trust extraction because it “looks right.” Trust it only if it reconciles.

Checks should include:

  • sum(line amounts) ≈ subtotal
  • subtotal + tax + shipping − discount ≈ total
  • quantity × unit price ≈ line amount
  • currency is consistent across the document
  • page-level totals do not get mixed into line items

This is not just cleanup. It is your error detector. KIEval makes the same broad point from an evaluation perspective: industrial document extraction must assess grouped structured information, not just isolated entities. (arXiv)
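The checks above can be sketched as one gate function. The field names and the one-cent tolerance are illustrative assumptions, not a schema recommendation:

```python
from decimal import Decimal

TOL = Decimal("0.01")  # one cent of rounding slack per check (assumption)

def _close(a, b):
    return abs(Decimal(a) - Decimal(b)) <= TOL

def reconcile(inv: dict) -> list[str]:
    """Return the names of failed arithmetic checks for one invoice.
    An empty list means the invoice reconciles."""
    errors = []
    line_sum = sum(li["amount"] for li in inv["lines"])
    if not _close(line_sum, inv["subtotal"]):
        errors.append("lines_vs_subtotal")
    expected_total = (inv["subtotal"] + inv.get("tax", 0)
                      + inv.get("shipping", 0) - inv.get("discount", 0))
    if not _close(expected_total, inv["total"]):
        errors.append("subtotal_vs_total")
    for li in inv["lines"]:
        if not _close(li["qty"] * li["unit_price"], li["amount"]):
            errors.append("line_math:" + li["description"])
    return errors
```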

7. Accounting classification

Only after the invoice is structured should you predict accounting labels.

Inputs can include:

  • canonical vendor
  • line-item descriptions
  • extracted tax rate
  • amount ranges
  • vendor history
  • previous GL mappings for similar items

Start simple. A gradient boosting model or logistic regression over engineered features can work surprisingly well once the document is already structured. You do not need a giant end-to-end model for the accounting part on day one. That is a reasoning recommendation, supported by the fact that major invoice systems focus first on structured extraction and then on downstream workflow integration. (Google Cloud)

Best model choices

You have three realistic routes.

Route 1. Managed parser first

Use AWS Textract, Google Document AI, or Azure Document Intelligence as a production baseline.

This is the fastest way to get a working benchmark because those systems already parse invoices into header fields and line items. Google also supports uptraining its pretrained invoice processor on your own data. (AWS Document)

This route is best when:

  • you need speed
  • labeling data is limited
  • your differentiation is in accounting logic, not OCR research

Route 2. Modular open-source stack

This is the route I would recommend if you want control.

A solid stack is:

  • OCR: docTR or equivalent
  • header extractor: LayoutLMv3
  • line-item structure: Table Transformer
  • rules: normalization + validation

This combination matches the actual structure of the problem. docTR handles word localization and recognition. LayoutLMv3 handles layout-aware field extraction. Table Transformer handles table structure. (mindee.github.io)

Route 3. End-to-end document parser

If you want to benchmark a modern page-to-structured-output model, try an OCR-free or integrated document model such as Donut, PaddleOCR-VL-1.5, or GLM-OCR. Donut is explicitly OCR-free. PaddleOCR-VL-1.5 and GLM-OCR are current document-parsing models on Hugging Face focused on complex document understanding. (Hugging Face)

This route is attractive for fast prototyping, but I would still keep explicit validation and line-item logic around it. End-to-end models are useful front ends. They should not be the only safety mechanism in an accounting workflow. (Hugging Face)

What data to label

Start with a small, useful schema. Do not annotate 80 fields first.

Phase 1 fields

  • supplier_name
  • invoice_id
  • invoice_date
  • due_date
  • currency
  • subtotal
  • tax_amount
  • total_amount

Phase 2 line items

  • description
  • quantity
  • unit
  • unit_price
  • line_amount
  • tax_rate

Phase 3 accounting labels

  • vendor_id
  • GL_account
  • tax_code
  • cost_center

For public benchmarks and prototyping, DocILE is the closest fit to your problem because it is built on business documents and includes line-item recognition. FUNSD and CORD are useful smaller sets for form understanding and receipt-style parsing, but DocILE is the strongest conceptual match for invoices. (arXiv)

How to evaluate it

Do not evaluate only token F1.

Use at least four evaluation layers:

1. Field accuracy

Exact or normalized match for supplier, invoice number, dates, subtotal, tax, and total. (Microsoft Learn)

2. Line-item grouping accuracy

Did the right quantity, price, and amount end up in the same row? This is exactly the kind of structure KIEval argues should be evaluated explicitly. (arXiv)

3. Reconciliation pass rate

What percentage of invoices pass your arithmetic checks with no human correction? This is one of the best business metrics for this use case. It is a design recommendation, supported by the structured nature of invoice outputs and grouping-sensitive evaluation. (AWS Document)

4. Vendor-split testing

Hold out vendors or layouts, not only random pages. DocILE’s test design includes zero-shot and few-shot layouts, which reflects the real production risk: overfitting to common templates. (arXiv)

Practical pitfalls

The common failure modes are predictable:

  • multi-line descriptions split into fake extra rows
  • subtotal mistaken for grand total
  • unit price mistaken for line amount
  • tables continuing across pages
  • tax-inclusive vs tax-exclusive pricing
  • duplicate totals in summary boxes
  • supplier name variants
  • OCR noise on phone photos

This is why hybrid systems remain strong in practice. Template-based tools such as invoice2data still matter because they support line-item and table plugins, multiple regex patterns per field, and custom organizational fields. That is a useful reminder that deterministic rules still have value, especially for high-volume repeat vendors. (GitHub)

What I would do first

If I were building this from scratch, I would do it in this order:

  1. Build a canonical invoice schema
  2. Implement PDF text extraction plus OCR fallback
  3. Add layout zoning
  4. Train or fine-tune a header extractor
  5. Build a dedicated line-item subsystem
  6. Add normalization and reconciliation
  7. Only then add the accounting classifier

That sequence gives you a system that is explainable, measurable, and safe enough for financial workflows. It also mirrors how the mature invoice platforms structure the task. (AWS Document)

Bottom line

A plain NER model is not the best framing.

The better framing is:

Document OCR/parsing → layout understanding → header extraction → line-item extraction → normalization → validation → accounting classification

If you want the most practical custom baseline, I would choose:

  • docTR for OCR
  • LayoutLMv3 for header fields
  • Table Transformer for line items
  • rules plus reconciliation for acceptance
  • a separate classifier for the accounting chart

If you want the fastest benchmark, compare that against one managed parser such as Google, Azure, or AWS. (mindee.github.io)


Use the matrix below by asking one question first:

What is the dominant failure mode in my invoices?

That is the right selector. Mature invoice parsers already split the job into key fields and line items, so one baseline rarely fits every invoice population. AWS exposes SummaryFields and LineItemGroups, Google’s Invoice Parser extracts both header and line-item fields, and Azure’s invoice model does the same. (AWS Document)

Side-by-side decision matrix

Each case below lists the recommended custom baseline, why it fits, the labels or data needed, the cheapest first experiment, and the most likely failure mode.

Case 1. Mostly clean, born-digital PDFs

  • Baseline: PyMuPDF or pdfplumber → region rules → Table Transformer → reconciliation
  • Why: best when OCR is unnecessary. PyMuPDF can extract word boxes and tables directly; pdfplumber is built for detailed PDF geometry, table extraction, and visual debugging, and works best on machine-generated PDFs.
  • Labels/data: very little at first. Often enough to start with regex/rules plus a few manually checked examples.
  • First experiment: run native PDF extraction on 50 invoices. Compare field coverage and line-item recovery before adding any OCR.
  • Likely failure: hidden reading-order issues, merged text blocks, whitespace-only tables. (PyMuPDF)

Case 2. Scans, phone photos, skew, warping

  • Baseline: PP-DocLayoutV3 → GLM-OCR → Table Transformer → reconciliation
  • Why: good when geometry is the problem, not just text recognition. PP-DocLayoutV3 is designed for non-planar documents and reading order; GLM-OCR is a current multimodal OCR model for complex document understanding.
  • Labels/data: a small labeled set for validation is enough to start. Stronger gains come from representative distorted samples.
  • First experiment: benchmark 30 hard pages with and without the layout stage. Measure row recovery, not just OCR text quality.
  • Likely failure: curved pages, bad lighting, over-segmentation, wrong reading order. (Hugging Face)

Case 3. Few labels, many repeat vendors

  • Baseline: invoice2data templates + PDF/OCR backend + vendor normalizers
  • Why: strong when the same vendor layouts repeat. invoice2data supports templates, static fields, and plugins for line items and tables.
  • Labels/data: very low ML labeling need. You mainly need clean templates and vendor-specific cleanup rules.
  • First experiment: template the top 10 vendors that make up most volume. Track touchless rate before training anything.
  • Likely failure: template drift, unseen vendors, multiline descriptions that break template assumptions. (Invoice2data)

Case 4. Header fields are fine, line items are the blocker

  • Baseline: PDF text or OCR → Table Transformer-first pipeline → row repair rules
  • Why: best when supplier/date/total extraction is mostly solved but row grouping is not. Table Transformer is explicitly for table detection and structure recognition from PDF images.
  • Labels/data: you need labeled line items more than labeled headers. Focus annotation on row grouping and numeric columns.
  • First experiment: evaluate on 100 invoices using only line-item metrics: row grouping, numeric binding, subtotal reconciliation.
  • Likely failure: wrapped descriptions, implicit tables, page breaks, missing cell boundaries. (Hugging Face)

Case 5. You want one end-to-end trainable parser baseline

  • Baseline: Donut → schema normalization → reconciliation
  • Why: good as a clean benchmark for “how far can one model go?” Donut is OCR-free and directly maps document images to structured outputs.
  • Labels/data: needs paired page→target-schema examples. Best when you can supervise against JSON-like targets.
  • First experiment: fine-tune on a narrow schema first: supplier, invoice ID, date, subtotal, total, and one simple line-item format.
  • Likely failure: hallucinated structure, unstable long outputs, weak row grouping on dense tables. (Hugging Face)

Case 6. You want a current open document-parser baseline

  • Baseline: PaddleOCR-VL-1.5 → schema normalization → reconciliation
  • Why: good zero-shot benchmark when layouts vary a lot. The current model card positions it as a 0.9B document parser with strong table/text performance and robustness to scanning, skew, warping, screen photography, and illumination.
  • Labels/data: little task-specific labeling needed to start. You mainly need a holdout set for honest evaluation.
  • First experiment: run it on a representative vendor mix and compare only business outputs: valid totals, valid rows, review rate.
  • Likely failure: great parsed text but imperfect field binding, overconfident outputs on rare layouts. (Hugging Face)

Case 7. You want a fuller parsing system with less assembly work

  • Baseline: PP-StructureV3 → custom field mapping → reconciliation
  • Why: good when your goal is end-to-end document parsing rather than assembling many separate tools. PP-StructureV3 is presented as a document parsing solution that converts PDFs and document images to Markdown and JSON.
  • Labels/data: moderate. You still need business-specific mapping and validation, but less low-level pipeline glue.
  • First experiment: use it as a parser front end, then map its structure into your invoice schema and test on 20 messy multi-page invoices.
  • Likely failure: general parser output that is structurally rich but not yet aligned to accounting fields. (Hugging Face)

Case 8. You want the cheapest serious sanity-check baseline

  • Baseline: PyMuPDF/pdfplumber + regex/keywords for headers + column heuristics for lines + reconciliation
  • Why: best for finding out whether ML is even needed yet. If native PDF text and coordinates already solve most of the problem, you learn that before investing in training.
  • Labels/data: almost none initially. You need sample invoices and manual error review.
  • First experiment: try it on 100 digital PDFs. Count how many pass field extraction and arithmetic checks with no ML.
  • Likely failure: fails badly on scans, implicit multi-line rows, and vendor layouts with weak alignment. (PyMuPDF)

How to choose fast

If your invoices are mostly:

  • digital PDFs → start with PyMuPDF/pdfplumber
  • photos or distorted scans → start with PP-DocLayoutV3 + GLM-OCR
  • repeat vendors with low label budget → start with invoice2data
  • line-item-heavy → start with a Table Transformer-first stack
  • mixed layouts and you want a modern open benchmark → start with PaddleOCR-VL-1.5
  • one-model benchmark → try Donut as the clean end-to-end baseline (GitHub)

Default pick if you are unsure

If you do not yet know your dominant failure mode, I would test in this order:

  1. PyMuPDF/pdfplumber baseline on digital PDFs
  2. PaddleOCR-VL-1.5 as the modern open parser benchmark
  3. Table Transformer-first for line items
  4. invoice2data for high-volume repeat vendors
  5. Donut only as the end-to-end control baseline (PyMuPDF)

What about the costs of shipping, freight (which is different from shipping in our accounting system), and discounts? Those all affect the final invoice amount.

John6666 gives good advice. I tried to extract text from 500+ invoices in a single PDF file, and it didn’t go well. I would look for text like “invoice number \d{4,6}”, where the invoice number was 4-6 digits in a row, but the OCR software would often not extract the data in the same order. So it would not find the invoice number, and it missed much other data as well.

But I did that with free OCR software and not with AI designed to do OCR.


Hmm… While commercial OCR services may include such features, standalone OCR models are often not very good at properly interpreting multi-page data. This is because, in most cases, the models are primarily trained on pairs of a single page and the information to be extracted…

The most straightforward workaround is to split the document into individual pages before feeding them to the OCR model:


Your old approach broke for a structural reason, not just because the OCR was free.

You were effectively doing:

500+ invoices in one PDF → OCR everything → flatten to one text stream → run regex like invoice number \d{4,6}

That is brittle because PDF/OCR extraction often does not preserve normal reading order. PyMuPDF’s docs say plain PDF text extraction may come out “not in usual reading order,” with unexpected line breaks, and recommend using blocks or words with position data instead. (PyMuPDF)

So the main fix is not “use a better regex.” The main fix is:

split first, extract locally, keep coordinates, then validate the totals. Google’s Custom Splitter exists specifically to split packed PDFs into logical documents before extraction, and Google notes that bad splits are especially damaging because one split error causes downstream extraction errors. (Google Cloud Documentation)

What to do with shipping, freight, and discounts

Treat them as separate normalized fields in your schema. Do not fold them into one generic “total adjustment.”

A practical invoice schema for your case is:

{
  "invoice_id": "123456",
  "vendor_name": "ACME Supplies Ltd",
  "invoice_date": "2026-03-15",
  "currency": "USD",

  "subtotal": 1000.00,
  "line_item_discount_total": 20.00,
  "invoice_level_discount": 10.00,
  "shipping_charge": 15.00,
  "freight_charge": 40.00,
  "handling_charge": 0.00,
  "tax_total": 102.50,
  "invoice_total": 1127.50,
  "amount_due": 1127.50
}

And also store the raw label text that produced each field:

  • raw_label = "Shipping"
  • raw_label = "Shipping & Handling"
  • raw_label = "Freight"
  • raw_label = "Discount"

That matters because standard parsers do not always match your accounting distinctions exactly. AWS Textract explicitly standardizes DISCOUNT and SHIPPING_HANDLING_CHARGE. Azure’s invoice model extracts invoice fields and line items into structured JSON. Microsoft Dynamics’ invoice entity explicitly models FreightAmount, TotalDiscountAmount, TotalLineItemAmount, TotalAmountLessFreight, and TotalTax, which is close to the accounting structure you need. (AWS Document)

The formula to validate

Use arithmetic validation as a hard gate.

A practical rule is:

invoice_total
≈ subtotal
- line_item_discount_total
- invoice_level_discount
+ shipping_charge
+ freight_charge
+ handling_charge
+ tax_total
+ other_surcharges

And if the invoice has prior balance or prior credits:

amount_due
≈ invoice_total
+ previous_unpaid_balance
- credits_or_payments

This is not cosmetic cleanup. It is your error detector. If the parser confuses freight with a line item, or misses a discount, this check will usually fail.
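Using the schema field names from the example above, both identities reduce to one balance check. The two-cent tolerance is an assumption to absorb rounding; tune it per currency:

```python
from decimal import Decimal

TOL = Decimal("0.02")  # rounding slack (assumption; tune per currency)

def balances(inv: dict) -> bool:
    """Check both identities: invoice_total against its components,
    and amount_due against invoice_total plus prior balance/credits.
    Missing optional fields default to zero."""
    g = lambda k: Decimal(str(inv.get(k, 0)))
    expected_total = (g("subtotal")
                      - g("line_item_discount_total")
                      - g("invoice_level_discount")
                      + g("shipping_charge")
                      + g("freight_charge")
                      + g("handling_charge")
                      + g("tax_total"))
    if abs(expected_total - g("invoice_total")) > TOL:
        return False
    expected_due = (g("invoice_total")
                    + g("previous_unpaid_balance")
                    - g("credits_or_payments"))
    return abs(expected_due - g("amount_due")) <= TOL
```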

Why your invoice number was missed

Your regex expected the text to appear like this:

Invoice Number 123456

But OCR/PDF extraction often returns something more like:

Invoice
Date
123456
Number

or mixes it with neighboring text from another block. PyMuPDF’s docs describe exactly this kind of issue and recommend using block and word extraction with coordinates to rebuild reading order or search local rectangles instead of relying on one global text stream. (PyMuPDF)

So instead of searching the full document with:

invoice number \d{4,6}

do this:

  1. find the header region
  2. find labels such as Invoice No, Invoice Number, Invoice #
  3. collect candidate values near those labels
  4. rank them by distance and alignment
  5. then validate the winner with ^\d{4,6}$

That changes regex from a discovery method into a validator. That is much more reliable.
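A sketch of those steps, assuming word boxes as (x0, y0, x1, y1, text) tuples. The distance ranking is deliberately crude; real ranking should also weigh alignment and reading direction:

```python
import math
import re

VALUE = re.compile(r"^\d{4,6}$")  # the regex is now a validator, not a finder

def _center(w):
    return ((w[0] + w[2]) / 2, (w[1] + w[3]) / 2)

def find_invoice_number(words):
    """Collect numeric candidates that pass the validator regex and
    rank them by distance to the first 'Invoice'-like label word."""
    anchors = [w for w in words if "invoice" in w[4].lower()]
    candidates = [w for w in words if VALUE.match(w[4])]
    if not anchors or not candidates:
        return None
    ax, ay = _center(anchors[0])
    best = min(candidates, key=lambda w: math.dist((ax, ay), _center(w)))
    return best[4]
```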

The concrete pipeline I would use

1. Split the packed PDF into individual invoices

This is the first change.

Start with page-level signals:

  • Invoice near the top
  • an invoice-number/date block near the header
  • totals block near the bottom
  • repeated vendor header/logo
  • continuation pages with line-item tables but no new invoice header

Google’s Custom Splitter is built around exactly this use case: composite files containing multiple logical documents that then get routed to the appropriate extractor. (Google Cloud Documentation)

2. Use native PDF text before OCR when possible

If a page is born-digital, extract words and blocks directly from the PDF first. PyMuPDF recommends block and word extraction because plain text order may be wrong, and Page.get_text("blocks") / Page.get_text("words") preserve useful position information. (PyMuPDF)
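For example, once you have (x0, y0, x1, y1, text) tuples of the kind that word-level extraction returns, rebuilding a usable reading order is a small sort. The 3-point line tolerance is an assumption:

```python
def reading_order(words, line_tol=3.0):
    """Sort (x0, y0, x1, y1, text) word boxes top-to-bottom,
    left-to-right: group words into lines by similar y0, then
    sort each line by x0. Sketch; skewed scans need more care."""
    if not words:
        return []
    out = sorted(words, key=lambda w: (w[1], w[0]))
    lines, cur = [], [out[0]]
    for w in out[1:]:
        if abs(w[1] - cur[-1][1]) <= line_tol:
            cur.append(w)
        else:
            lines.append(sorted(cur, key=lambda w: w[0]))
            cur = [w]
    lines.append(sorted(cur, key=lambda w: w[0]))
    return [w[4] for line in lines for w in line]
```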

3. Use document OCR only for scanned pages

For scanned pages or images, use invoice/document AI OCR rather than generic OCR-only tooling. Azure’s invoice model is built to handle phone captures, scanned documents, and digital PDFs, and returns recognized text, tables, and invoice-specific fields plus line items. AWS Textract’s invoice/receipt path similarly outputs structured summary fields and line items instead of one text blob. (Microsoft Learn)

4. Keep coordinates in your intermediate data

For each word, keep:

  • page number
  • text
  • bounding box
  • line ID
  • block ID
  • confidence
  • source type: native PDF or OCR

This is what lets you ask useful questions like “what is near the invoice-number label?” instead of “does the whole OCR blob contain the pattern?”

5. Zone the page before extracting fields

Split each invoice into approximate regions:

  • header
  • vendor/bill-to area
  • line-item area
  • totals area
  • footer/remittance area

Then only search:

  • invoice number and date in the header
  • shipping/freight/discount/tax/total in the totals area
  • products, qty, price, amount in the line-item area

This mirrors how invoice parsers expose output: Azure returns text, tables, and invoice-specific fields; AWS separates summary fields and line items. (Microsoft Learn)

6. Treat charges as labeled totals lines

Inside the totals block, extract a list of labeled amount lines:

  • Shipping → shipping_charge
  • Shipping & Handling → shipping_charge (or split later)
  • Freight → freight_charge
  • Discount → invoice_level_discount
  • Rebate → invoice_level_discount

Because your accounting system distinguishes freight from shipping, do not collapse them automatically.

7. Reconstruct line items separately

Do not use header-field logic for line items.

For line items, use a table or pseudo-table approach:

  • detect numeric columns on the right
  • group words into rows by vertical overlap
  • treat left text as description
  • merge multiline descriptions when there is no new numeric anchor

That is where invoice extraction usually becomes hard.

Best practical options

Fastest path

Benchmark a purpose-built invoice parser first.

Good starting options are:

  • Google Document AI: Custom Splitter + Invoice Parser + uptraining/custom fields. Google explicitly says you can uptrain the Invoice Parser with your own data and add custom fields that are not supported by the pretrained model. That is directly useful for a field like freight_charge. (Google Cloud Documentation)
  • AWS Textract AnalyzeExpense: it already standardizes DISCOUNT and SHIPPING_HANDLING_CHARGE, plus summary fields and line items. (AWS Document)
  • Azure Document Intelligence invoice model: it handles scanned images, PDFs, and line items in structured JSON. (Microsoft Learn)

Strong custom path

If you want to own the stack:

  • split packed PDFs first
  • use PyMuPDF blocks/words for digital PDFs
  • use OCR only for scanned pages
  • keep coordinates
  • zone header/totals/items separately
  • extract local candidates near labels
  • normalize freight, shipping, discount, tax
  • reconcile the math before posting anything

My concrete advice for you

For your situation, I would do this in order:

Phase 1

Take 30 to 50 invoices from the packed PDF and manually create a small gold set:

  • correct invoice boundaries
  • correct invoice number
  • subtotal
  • discount
  • shipping
  • freight
  • tax
  • total
  • amount due

Phase 2

Test two paths:

  • managed invoice parser with splitting
  • native PDF extraction + local rules on already-split invoices

That will tell you very quickly whether your real bottleneck is:

  • split detection
  • OCR quality
  • reading order
  • totals parsing
  • line-item grouping

Phase 3

Lock your schema before tuning models:

  • subtotal
  • line_item_discount_total
  • invoice_level_discount
  • shipping_charge
  • freight_charge
  • tax_total
  • invoice_total
  • amount_due

Phase 4

Add reconciliation rules and reject anything that does not balance.

Bottom line

Your earlier failure does not mean invoice extraction is a bad fit.

It means the earlier workflow was fragile:

  • too many invoices in one PDF
  • flattened OCR text
  • regex dependent on OCR order

A stronger workflow is:

packed PDF → split into invoices → extract with coordinates → parse header/totals/lines separately → keep freight separate from shipping → keep discounts explicit → validate the math

That is the path I would take. (Google Cloud Documentation)

Thank you again. I think I will start with a workflow that extracts paragraphs of text to understand the workflow, as that sounds easier; there is no tabular data in the first case I will use as a test. The invoice sounds much more complicated. I know some Python but have not done AI or OCR with Python before.

If I do extract an invoice, the PDF we do get has multiple invoice summaries per page. I failed to mention this earlier. It’s an invoice summary from a major US shipper. There is one charge per shipment so they fit multiple invoices per page. We do 1000s of shipments every month with this shipper and other shippers.


I know some Python but have not done AI or OCR with Python before.

Oh. PyTorch, Transformers, and other libraries handle GPU-related tasks and the acceleration of bottleneck processes, effectively wrapping them up for us.

As a result, there isn’t much difference in how you actually use standard Python functions (methods) versus functions designed for AI models. The only real precaution is to ensure the hardware isn’t busy when calling AI functions; other than that, there aren’t many other things to watch out for. I think it’s easier than functions related to disk I/O…:sweat_smile:

However, generally speaking, even the lightest general-purpose AI models available on HF are somewhat heavier than plain code, so it might be best to create a prototype without AI first. The fastest approach is to avoid using AI unless absolutely necessary.


For your first learning pass, starting with text blocks or paragraph-like chunks is a good idea. But for the shipper summary PDF you just described, I would change one word:

Do not think in terms of paragraphs.
Think in terms of repeating summary blocks on a page.

That is the key difference.

If one page contains multiple shipment summaries, then the first problem is not “extract all text.” The first problem is segment the page into summary-sized regions, then extract each region separately. A page-level splitter alone will not solve that, because Google’s Custom Splitter is designed to identify logical documents in composite files and return page-level document boundaries; if multiple summaries live on the same page, you still need an intra-page segmentation step. (Google Cloud Documentation)

What I would do first

For your first implementation, I would not start with AI.

I would start with a simple Python workflow using native PDF text extraction, because if these shipper PDFs are machine-generated, that is usually easier and more reliable than OCR. PyMuPDF can extract text as blocks and words, and its docs explicitly note that plain text may not come out in natural reading order, while block/word extraction and sorting help recover usable structure. pdfplumber is also built for detailed PDF inspection and says it works best on machine-generated PDFs. (PyMuPDF)

So the beginner-friendly path is:

  1. open one PDF page
  2. extract blocks or words with coordinates
  3. detect the repeated shipment-summary regions on that page
  4. extract text inside each region
  5. parse one region at a time

That is much easier than OCR-first AI work, and it teaches the right workflow. (PyMuPDF)

Why your previous regex workflow failed

Your regex was probably not the main problem.

The main problem was that OCR or plain PDF extraction often returns text in an order that is not the order your regex expects. PyMuPDF’s docs say the output of plain text extraction may not match natural reading order, and they provide sort=True plus block/word extraction specifically to help with this. (PyMuPDF)

So instead of searching a giant text blob for:

invoice number \d{4,6}

you want to search inside one detected summary region, and only then look for local label/value pairs.

That is a very different workflow.

The right mental model for your shipper summary PDF

Because there is one charge per shipment and multiple summaries per page, your first real task is probably this:

page → repeated summary boxes/cards → one structured record per box

Not:

page → paragraphs

That matters because your data sounds closer to a repeating form layout than to a narrative document.

So I would define one shipment-summary record like this:

  • shipper account
  • shipment date
  • tracking number or shipment reference
  • invoice number or summary number
  • base charge
  • shipping charge
  • freight charge
  • discount
  • tax
  • total

Then repeat that extraction for every summary block on the page.
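As a concrete target for that record, here is a minimal sketch using a dataclass. The field names mirror the bullet list above and are assumptions; adjust them to whatever your shipper's layout actually calls these amounts.

```python
# One shipment-summary record as a dataclass.
# Every field starts as None so you can see which values were never extracted.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShipmentSummary:
    shipper_account: Optional[str] = None
    shipment_date: Optional[str] = None
    tracking_number: Optional[str] = None
    invoice_number: Optional[str] = None
    base_charge: Optional[float] = None
    shipping_charge: Optional[float] = None
    freight_charge: Optional[float] = None
    discount: Optional[float] = None
    tax: Optional[float] = None
    total: Optional[float] = None

    def missing_fields(self):
        """Names of fields still unset, useful for per-region QA."""
        return [k for k, v in self.__dict__.items() if v is None]

record = ShipmentSummary(invoice_number="48213", total=152.40)
print(record.missing_fields())  # everything you still have to extract
```

Filling one of these per summary region also gives you a natural place to attach the arithmetic validation later.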

The easiest MVP

I would build the MVP in four steps.

Step 1: decide whether you even need OCR

Try native PDF extraction first.

Use PyMuPDF or pdfplumber on a few sample pages and inspect whether the text comes out clean enough. pdfplumber explicitly says it works best on machine-generated PDFs, and PyMuPDF exposes blocks, words, and rectangles you can search within. (GitHub)

If that works, you just saved yourself a lot of complexity.

Only add OCR later for scanned files or mixed-quality inputs. OCRmyPDF is a good fallback because it adds a searchable text layer to scanned PDFs and is designed to tolerate files that mix scanned and born-digital content. (OCRmyPDF)
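The "do I need OCR?" decision can itself be a tiny heuristic. A sketch, assuming you have already pulled `page.get_text()` strings for each page with PyMuPDF: a page with almost no native text is probably a scan and would go through OCRmyPDF first. The 25-character threshold is an arbitrary starting point.

```python
# Flag pages whose native text layer looks empty (i.e., likely scans).
def pages_needing_ocr(page_texts, min_chars=25):
    """Return indices of pages with too little extractable text."""
    flagged = []
    for i, text in enumerate(page_texts):
        if len(text.strip()) < min_chars:
            flagged.append(i)
    return flagged

# Example: page 0 has real text, page 1 looks like a scan.
texts = ["Invoice 48213\nTotal 152.40 USD\nFreight 12.00", "  \n"]
print(pages_needing_ocr(texts))  # → [1]
```

If this flags nothing on your sample files, you can defer OCR entirely.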

Step 2: inspect one page visually

Use block extraction and plot the blocks or print their bounding boxes.

You want to answer:

  • do the shipment summaries appear as repeated vertical blocks?
  • are the invoice number and total in stable positions?
  • are the label/value pairs close together?
  • do all summaries have roughly the same width and height?

If yes, you can often segment the page with very simple geometry rules.

Step 3: segment one page into summary regions

Start with rules, not AI.

Examples:

  • cluster words/blocks by vertical gaps
  • detect repeated top labels like “Invoice,” “Shipment,” or “Tracking”
  • use horizontal rules or whitespace bands if the PDF has them
  • find repeated left edges and repeated heights

Because multiple summaries are on one page, this step is probably more important than OCR quality.
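The last rule in that list, repeated left edges and repeated heights, can be sketched with plain geometry and no PDF library at all. The coordinates below are made up for the demo; in practice they would come from block extraction.

```python
# Group bounding boxes whose left edge and height repeat across the page.
# Boxes are (x0, y0, x1, y1) tuples, as PyMuPDF block extraction provides.
def repeated_card_candidates(boxes, x_tol=3.0, h_tol=4.0):
    """Return groups of boxes that share a left edge and height."""
    groups = []
    for box in sorted(boxes, key=lambda b: b[1]):  # top to bottom
        x0, y0, x1, y1 = box
        h = y1 - y0
        for g in groups:
            if abs(x0 - g["x0"]) <= x_tol and abs(h - g["h"]) <= h_tol:
                g["boxes"].append(box)
                break
        else:
            groups.append({"x0": x0, "h": h, "boxes": [box]})
    # A "repeated" layout means at least two boxes share the pattern.
    return [g["boxes"] for g in groups if len(g["boxes"]) >= 2]

# Three same-shaped summary cards plus one page header.
boxes = [
    (72, 40, 520, 60),    # header: same left edge, different height
    (72, 90, 520, 190),   # card 1
    (72, 210, 520, 310),  # card 2
    (72, 330, 520, 430),  # card 3
]
cards = repeated_card_candidates(boxes)
print(len(cards), len(cards[0]))  # → 1 3  (one repeated pattern, three cards)
```

The header is excluded automatically because nothing else on the page shares its height, which is exactly the behavior you want from a summary-card detector.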

Step 4: parse one summary region locally

Once you isolate one region, do local extraction:

  • find the invoice-number label inside that region
  • look nearby for 4–6 digit candidates
  • validate the winner with regex
  • repeat for total, discount, freight, shipping

That local approach is much more robust than global regex over the whole document.
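A sketch of that local approach, with a made-up region text: restrict the label search to lines inside one region, then apply the value regex only on the matching line. The labels and sample values here are assumptions; real labels vary by shipper.

```python
# Local label/value extraction inside one isolated summary region.
import re

REGION_TEXT = """\
Invoice No: 48213
Tracking: 1Z999AA10123456784
Freight 12.00   Discount -3.50
Total Due 152.40
"""

def find_after_label(text, label_pattern, value_pattern):
    """Find a value on the same line as its label, within one region only."""
    for line in text.splitlines():
        if re.search(label_pattern, line, re.IGNORECASE):
            m = re.search(value_pattern, line)
            if m:
                return m.group(0)
    return None

invoice = find_after_label(REGION_TEXT, r"invoice", r"\b\d{4,6}\b")
total = find_after_label(REGION_TEXT, r"total", r"\d+\.\d{2}")
print(invoice, total)  # → 48213 152.40
```

Because the search never leaves the region, a stray 4–6 digit number in a neighboring summary can no longer be mistaken for this shipment's invoice number.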

My recommendation about “paragraph extraction”

For a first exercise, yes, extracting paragraph-like text chunks is fine because it teaches:

  • how to open a PDF in Python
  • how to inspect blocks and words
  • how to handle coordinates
  • how to build a parser incrementally

But for your real shipper-summary use case, I would upgrade that idea to:

extract repeated blocks, not paragraphs

That is the version that matches your document structure.

A practical beginner roadmap

Phase 1: no AI, no OCR

Use PyMuPDF.

Goal:

  • extract blocks and words from one page
  • print their coordinates
  • manually identify where one shipment summary starts and ends

PyMuPDF’s text recipes include block extraction, word extraction, extraction inside rectangles, and sorted text output. (PyMuPDF)

Phase 2: rule-based region segmentation

Write a small function that groups blocks into summary regions.

Goal:

  • get from “one page” to “N shipment summaries on that page”

Phase 3: field extraction inside each region

Use local rules:

  • regex only inside the region
  • proximity to labels
  • fallback rules if a label is missing

Phase 4: arithmetic validation

Because shipping, freight, discounts, and total all matter, add a check like:

total ≈ base_charge - discounts + shipping + freight + tax

The exact formula depends on the shipper’s layout, but the principle is stable: do not trust extracted numbers until they balance.
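The balance check is a few lines of Python. This sketch assumes the sign convention shown above (discount subtracted, everything else added); swap the signs to match your shipper. The tolerance exists because extracted amounts are parsed floats.

```python
# Validate that extracted charges add up to the stated total.
def charges_balance(base, discount, shipping, freight, tax, total, tol=0.01):
    """True when base - discount + shipping + freight + tax matches total."""
    expected = base - discount + shipping + freight + tax
    return abs(expected - total) <= tol

print(charges_balance(140.00, 3.50, 10.00, 2.00, 3.90, 152.40))  # → True
print(charges_balance(140.00, 3.50, 10.00, 2.00, 3.90, 160.00))  # → False
```

A record that fails this check should be flagged for review rather than written to your accounting output.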

Where AI helps later

AI becomes useful after you understand the document shape.

For your case, I would add AI later for one of three reasons:

  • some PDFs are scans, so you need OCR
  • the summary-region segmentation is inconsistent
  • local rules for labels and fields become too brittle across shippers

At that point, current document models on Hugging Face such as PaddleOCR-VL-1.5, GLM-OCR, or layout companions like PP-DocLayoutV3 become relevant, but I would not start there if your immediate goal is to learn the workflow and get a first success in Python. Those tools are better once you already know what one correct extracted record should look like. (Google Cloud)

What I would choose for you right now

I would start with this exact stack:

  • PyMuPDF for page, block, and word extraction
  • pdfplumber only as a visual debugging helper when needed
  • no OCR unless the sample PDFs turn out to be image-only
  • rule-based region segmentation
  • local regex + label proximity for field extraction
  • math validation for shipping, freight, discounts, and totals

That is the simplest path that still matches your real document structure. PyMuPDF gives you the coordinates and block-level tools you need, and pdfplumber is helpful when you want to inspect how the PDF is laid out. (PyMuPDF)

The one change I would make to your plan

Your instinct to start simpler is correct.

I would just change the target from:

paragraph extraction

to:

summary-block extraction

That single change aligns the project with the actual structure of your shipper PDF and gives you a much better chance of getting an early win.


Start with PyMuPDF only. It can extract blocks and words with coordinates, and sort=True can reorder output roughly from top-left to bottom-right. That is a much better first step than OCR for machine-generated PDFs, especially when one page contains multiple repeated shipment summaries. (PyMuPDF)

# deps:
#   pip install pymupdf
#
# notes:
# - No AI model. No OCR. CPU-safe.
# - Replace SAMPLE_PDF_URL later with your own PDF path or URL.
# - This is a first workflow script: download/open PDF -> extract text blocks ->
#   group nearby blocks into rough "summary regions" -> print/save results.

import json
import os
import urllib.request
import fitz  # PyMuPDF

# Public sample PDF for demo.
# Replace with your own local PDF path later, for example:
# PDF_SOURCE = "my_shipper_summary.pdf"
PDF_SOURCE = "https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf"

OUT_DIR = "demo_pdf_blocks"
PAGE_INDEX = 0          # first page only for the first experiment
GAP_THRESHOLD = 18.0    # larger => fewer/larger grouped regions

os.makedirs(OUT_DIR, exist_ok=True)

def ensure_local_pdf(src: str) -> str:
    """Download the PDF if src is a URL. Otherwise return the local path."""
    if src.startswith("http://") or src.startswith("https://"):
        local_path = os.path.join(OUT_DIR, "sample.pdf")
        if not os.path.exists(local_path):
            print(f"Downloading sample PDF to: {local_path}")
            urllib.request.urlretrieve(src, local_path)
        return local_path
    return src

def clean_text(s: str) -> str:
    """Normalize block text for easier printing."""
    return " ".join(s.replace("\x00", " ").split())

def group_blocks_into_regions(blocks, gap_threshold=18.0):
    """
    Very simple region grouping.
    Assumes blocks arrive already sorted top-to-bottom (get_text(..., sort=True));
    a new region starts whenever the vertical gap to the previous block is large.
    This is only a first heuristic for repeated summary blocks.
    """
    regions = []
    current = []

    for block in blocks:
        x0, y0, x1, y1, text, block_no, block_type = block
        if block_type != 0:  # keep text blocks only
            continue
        text = clean_text(text)
        if not text:
            continue

        item = {
            "bbox": [round(x0, 1), round(y0, 1), round(x1, 1), round(y1, 1)],
            "text": text,
            "block_no": int(block_no),
        }

        if not current:
            current.append(item)
            continue

        prev_y1 = current[-1]["bbox"][3]
        current_y0 = item["bbox"][1]
        vertical_gap = current_y0 - prev_y1

        if vertical_gap > gap_threshold:
            regions.append(current)
            current = [item]
        else:
            current.append(item)

    if current:
        regions.append(current)

    # Add combined region bbox + joined text
    packed = []
    for idx, region in enumerate(regions):
        xs0 = [b["bbox"][0] for b in region]
        ys0 = [b["bbox"][1] for b in region]
        xs1 = [b["bbox"][2] for b in region]
        ys1 = [b["bbox"][3] for b in region]
        packed.append({
            "region_id": idx,
            "bbox": [min(xs0), min(ys0), max(xs1), max(ys1)],
            "text": "\n".join(b["text"] for b in region),
            "blocks": region,
        })
    return packed

# 1) Load PDF
pdf_path = ensure_local_pdf(PDF_SOURCE)
doc = fitz.open(pdf_path)
page = doc[PAGE_INDEX]

# 2) Extract blocks with sort=True
# PyMuPDF can also do get_text("words", sort=True) later if you need finer control.
blocks = page.get_text("blocks", sort=True)

# 3) Group blocks into rough page regions
regions = group_blocks_into_regions(blocks, gap_threshold=GAP_THRESHOLD)

# 4) Save raw outputs
raw_blocks_path = os.path.join(OUT_DIR, "page_blocks.json")
regions_path = os.path.join(OUT_DIR, "page_regions.json")
page_text_path = os.path.join(OUT_DIR, "page_text.txt")

with open(raw_blocks_path, "w", encoding="utf-8") as f:
    json.dump(blocks, f, indent=2, ensure_ascii=False)

with open(regions_path, "w", encoding="utf-8") as f:
    json.dump(regions, f, indent=2, ensure_ascii=False)

with open(page_text_path, "w", encoding="utf-8") as f:
    for region in regions:
        f.write(f"\n=== REGION {region['region_id']} ===\n")
        f.write(region["text"])
        f.write("\n")

# 5) Print a compact summary
print(f"\nPDF: {pdf_path}")
print(f"Pages: {doc.page_count}")
print(f"Using page index: {PAGE_INDEX}")
print(f"Text blocks found: {sum(1 for b in blocks if b[6] == 0)}")
print(f"Rough regions found: {len(regions)}")

for region in regions:
    x0, y0, x1, y1 = region["bbox"]
    preview = region["text"][:200].replace("\n", " | ")
    print(f"\nREGION {region['region_id']}  bbox=({x0}, {y0}, {x1}, {y1})")
    print(f"Preview: {preview}")

print("\nSaved:")
print(" -", raw_blocks_path)
print(" -", regions_path)
print(" -", page_text_path)

doc.close()

For your shipper-summary PDF, the next step after this is to replace the simple vertical-gap grouping with repeating summary-block detection, then extract fields like invoice number, shipping, freight, discount, and total inside each region only. That avoids the reading-order problem that broke the whole-document regex approach. (PyMuPDF)