I know some Python but have not done AI or OCR with Python before.
PyTorch, Transformers, and similar libraries handle the GPU work and the acceleration of bottleneck operations for you, effectively wrapping it all up.
As a result, there isn't much practical difference between calling a standard Python function and calling one that runs an AI model. The main precaution is to make sure the hardware isn't already busy when you call AI functions; beyond that, there is little to watch out for. In some ways it's easier than code that deals with disk I/O…
That said, even the lightest general-purpose AI models on HF are heavier than plain code, so it is usually best to build a prototype without AI first. The fastest approach is to avoid AI unless it is genuinely necessary.
For your first learning pass, starting with text blocks or paragraph-like chunks is a good idea. But for the shipper summary PDF you just described, I would change one word:
Do not think in terms of paragraphs.
Think in terms of repeating summary blocks on a page.
That is the key difference.
If one page contains multiple shipment summaries, then the first problem is not “extract all text.” The first problem is segment the page into summary-sized regions, then extract each region separately. A page-level splitter alone will not solve that, because Google’s Custom Splitter is designed to identify logical documents in composite files and return page-level document boundaries; if multiple summaries live on the same page, you still need an intra-page segmentation step. (Google Cloud Documentation)
What I would do first
For your first implementation, I would not start with AI.
I would start with a simple Python workflow using native PDF text extraction, because if these shipper PDFs are machine-generated, that is usually easier and more reliable than OCR. PyMuPDF can extract text as blocks and words, and its docs explicitly note that plain text may not come out in natural reading order, while block/word extraction and sorting help recover usable structure. pdfplumber is also built for detailed PDF inspection and says it works best on machine-generated PDFs. (PyMuPDF)
So the beginner-friendly path is:
- open one PDF page
- extract blocks or words with coordinates
- detect the repeated shipment-summary regions on that page
- extract text inside each region
- parse one region at a time
That is much easier than OCR-first AI work, and it teaches the right workflow. (PyMuPDF)
Why your previous regex workflow failed
Your regex was probably not the main problem.
The main problem was that OCR or plain PDF extraction often returns text in an order that is not the order your regex expects. PyMuPDF’s docs say the output of plain text extraction may not match natural reading order, and they provide sort=True plus block/word extraction specifically to help with this. (PyMuPDF)
So instead of searching a giant text blob for:
invoice number \d{4,6}
you want to search inside one detected summary region, and only then look for local label/value pairs.
That is a very different workflow.
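As a sketch, the region-local search might look like this; the region dict here is a hypothetical output of an earlier segmentation step:

```python
import re

# Hypothetical region produced by a segmentation step: one summary block's text.
region = {"text": "Invoice Number 48213 Total 152.40"}

# Search only inside this one summary region, not the whole document blob.
m = re.search(r"invoice\s+number\s+(\d{4,6})", region["text"], re.IGNORECASE)
invoice_number = m.group(1) if m else None
print(invoice_number)  # -> 48213
```

Because the pattern only ever sees one summary's worth of text, reading-order problems elsewhere on the page cannot break it.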
The right mental model for your shipper summary PDF
Because there is one charge per shipment and multiple summaries per page, your first real task is probably this:
page → repeated summary boxes/cards → one structured record per box
Not:
page → paragraphs
That matters because your data sounds closer to a repeating form layout than to a narrative document.
So I would define one shipment-summary record like this:
- shipper account
- shipment date
- tracking number or shipment reference
- invoice number or summary number
- base charge
- shipping charge
- freight charge
- discount
- tax
- total
Then repeat that extraction for every summary block on the page.
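That record maps naturally onto a small dataclass; the field names below are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

# One structured record per summary block on the page.
# All fields optional so a partially-parsed region is still representable.
@dataclass
class ShipmentSummary:
    shipper_account: Optional[str] = None
    shipment_date: Optional[str] = None
    tracking_number: Optional[str] = None
    invoice_number: Optional[str] = None
    base_charge: Optional[float] = None
    shipping_charge: Optional[float] = None
    freight_charge: Optional[float] = None
    discount: Optional[float] = None
    tax: Optional[float] = None
    total: Optional[float] = None

record = ShipmentSummary(invoice_number="48213", total=152.40)
print(record)
```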
The easiest MVP
I would build the MVP in four steps.
Step 1: decide whether you even need OCR
Try native PDF extraction first.
Use PyMuPDF or pdfplumber on a few sample pages and inspect whether the text comes out clean enough. pdfplumber explicitly says it works best on machine-generated PDFs, and PyMuPDF exposes blocks, words, and rectangles you can search within. (GitHub)
If that works, you just saved yourself a lot of complexity.
Only add OCR later for scanned files or mixed-quality inputs. OCRmyPDF is a good fallback because it adds a searchable text layer to scanned PDFs and is designed to tolerate files that mix scanned and born-digital content. (OCRmyPDF)
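If OCR does become necessary, a minimal OCRmyPDF invocation looks like this (assuming Tesseract is installed system-wide; filenames are placeholders):

```shell
pip install ocrmypdf

# Add a searchable text layer only to pages that lack one.
# --skip-text leaves born-digital pages untouched, which suits mixed files.
ocrmypdf --skip-text scanned_input.pdf searchable_output.pdf
```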
Step 2: inspect one page visually
Use block extraction and plot the blocks or print their bounding boxes.
You want to answer:
- do the shipment summaries appear as repeated vertical blocks?
- are the invoice number and total in stable positions?
- are the label/value pairs close together?
- do all summaries have roughly the same width and height?
If yes, you can often segment the page with very simple geometry rules.
Step 3: segment one page into summary regions
Start with rules, not AI.
Examples:
- cluster words/blocks by vertical gaps
- detect repeated top labels like “Invoice,” “Shipment,” or “Tracking”
- use horizontal rules or whitespace bands if the PDF has them
- find repeated left edges and repeated heights
Because multiple summaries are on one page, this step is probably more important than OCR quality.
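One of these rules can be sketched in a few lines; the bbox tuples below are hypothetical stand-ins for what `page.get_text("blocks")` would return:

```python
from collections import Counter

# Hypothetical block bboxes (x0, y0, x1, y1) for one page.
bboxes = [(72.0, 100, 300, 160), (72.0, 180, 300, 240), (72.1, 260, 300, 320)]

# Round left edges and count repeats: one dominant x0 appearing N times
# suggests N summary cards stacked down the page.
left_edges = Counter(round(b[0]) for b in bboxes)
most_common_x0, count = left_edges.most_common(1)[0]
print(most_common_x0, count)  # -> 72 3
```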
Step 4: parse one summary region locally
Once you isolate one region, do local extraction:
- find the invoice-number label inside that region
- look nearby for 4–6 digit candidates
- validate the winner with regex
- repeat for total, discount, freight, shipping
That local approach is much more robust than global regex over the whole document.
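A minimal sketch of label-proximity extraction; the word tuples are simplified stand-ins for PyMuPDF's word output (which has extra block/line/word indices at the end):

```python
import re

# Hypothetical word tuples (x0, y0, x1, y1, text) for ONE summary region.
words = [(50, 10, 95, 20, "Invoice"), (100, 10, 140, 20, "48213"),
         (50, 30, 80, 40, "Total"), (100, 30, 140, 40, "152.40")]

def value_near_label(words, label, pattern):
    """Find the label word, then the nearest word to its right on the
    same line whose text fully matches the regex pattern."""
    for x0, y0, x1, y1, text in words:
        if text.lower() == label:
            candidates = [w for w in words
                          if abs(w[1] - y0) < 5     # same line, roughly
                          and w[0] > x1             # to the right of the label
                          and re.fullmatch(pattern, w[4])]
            if candidates:
                return min(candidates, key=lambda w: w[0])[4]
    return None

print(value_near_label(words, "invoice", r"\d{4,6}"))   # -> 48213
print(value_near_label(words, "total", r"\d+\.\d{2}"))  # -> 152.40
```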
My recommendation about “paragraph extraction”
For a first exercise, yes, extracting paragraph-like text chunks is fine because it teaches:
- how to open a PDF in Python
- how to inspect blocks and words
- how to handle coordinates
- how to build a parser incrementally
But for your real shipper-summary use case, I would upgrade that idea to:
extract repeated blocks, not paragraphs
That is the version that matches your document structure.
A practical beginner roadmap
Phase 1: no AI, no OCR
Use PyMuPDF.
Goal:
- extract blocks and words from one page
- print their coordinates
- manually identify where one shipment summary starts and ends
PyMuPDF’s text recipes include block extraction, word extraction, extraction inside rectangles, and sorted text output. (PyMuPDF)
Phase 2: rule-based region segmentation
Write a small function that groups blocks into summary regions.
Goal:
- get from “one page” to “N shipment summaries on that page”
Phase 3: field extraction inside each region
Use local rules:
- regex only inside the region
- proximity to labels
- fallback rules if a label is missing
Phase 4: arithmetic validation
Because shipping, freight, discounts, and total all matter, add a check like:
total ≈ base_charge - discounts + shipping + freight + tax
The exact formula depends on the shipper’s layout, but the principle is stable: do not trust extracted numbers until they balance.
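A minimal balance check might look like this; the field names and the formula itself are assumptions you would adjust to the shipper's actual layout:

```python
def balances(record, tol=0.01):
    """Check total ≈ base - discount + shipping + freight + tax,
    within a small tolerance for rounding. Formula is layout-dependent."""
    expected = (record["base_charge"] - record["discount"]
                + record["shipping"] + record["freight"] + record["tax"])
    return abs(expected - record["total"]) <= tol

rec = {"base_charge": 100.00, "discount": 5.00, "shipping": 12.50,
       "freight": 30.00, "tax": 13.75, "total": 151.25}
print(balances(rec))  # -> True
```

A record that fails the check goes to a review pile instead of being trusted.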
Where AI helps later
AI becomes useful after you understand the document shape.
For your case, I would add AI later for one of three reasons:
- some PDFs are scans, so you need OCR
- the summary-region segmentation is inconsistent
- local rules for labels and fields become too brittle across shippers
At that point, current document models on Hugging Face such as PaddleOCR-VL-1.5, GLM-OCR, or layout companions like PP-DocLayoutV3 become relevant, but I would not start there if your immediate goal is to learn the workflow and get a first success in Python. Those tools are better once you already know what one correct extracted record should look like. (Google Cloud)
What I would choose for you right now
I would start with this exact stack:
- PyMuPDF for page, block, and word extraction
- pdfplumber only as a visual debugging helper when needed
- no OCR unless the sample PDFs turn out to be image-only
- rule-based region segmentation
- local regex + label proximity for field extraction
- math validation for shipping, freight, discounts, and totals
That is the simplest path that still matches your real document structure. PyMuPDF gives you the coordinates and block-level tools you need, and pdfplumber is helpful when you want to inspect how the PDF is laid out. (PyMuPDF)
The one change I would make to your plan
Your instinct to start simpler is correct.
I would just change the target from:
paragraph extraction
to:
summary-block extraction
That single change aligns the project with the actual structure of your shipper PDF and gives you a much better chance of getting an early win.
Start with PyMuPDF only. It can extract blocks and words with coordinates, and sort=True can reorder output roughly from top-left to bottom-right. That is a much better first step than OCR for machine-generated PDFs, especially when one page contains multiple repeated shipment summaries. (PyMuPDF)
# deps:
# pip install pymupdf
#
# notes:
# - No AI model. No OCR. CPU-safe.
# - Replace SAMPLE_PDF_URL later with your own PDF path or URL.
# - This is a first workflow script: download/open PDF -> extract text blocks ->
# group nearby blocks into rough "summary regions" -> print/save results.
import json
import os
import urllib.request
import fitz # PyMuPDF
# Public sample PDF for demo.
# Replace with your own local PDF path later, for example:
# PDF_SOURCE = "my_shipper_summary.pdf"
PDF_SOURCE = "https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf"
OUT_DIR = "demo_pdf_blocks"
PAGE_INDEX = 0 # first page only for the first experiment
GAP_THRESHOLD = 18.0 # larger => fewer/larger grouped regions
os.makedirs(OUT_DIR, exist_ok=True)
def ensure_local_pdf(src: str) -> str:
    """Download the PDF if src is a URL. Otherwise return the local path."""
    if src.startswith("http://") or src.startswith("https://"):
        local_path = os.path.join(OUT_DIR, "sample.pdf")
        if not os.path.exists(local_path):
            print(f"Downloading sample PDF to: {local_path}")
            urllib.request.urlretrieve(src, local_path)
        return local_path
    return src

def clean_text(s: str) -> str:
    """Normalize block text for easier printing."""
    return " ".join(s.replace("\x00", " ").split())

def group_blocks_into_regions(blocks, gap_threshold=18.0):
    """
    Very simple region grouping:
    - sort blocks top-to-bottom, then left-to-right
    - start a new region when the vertical gap is large
    This is only a first heuristic for repeated summary blocks.
    """
    regions = []
    current = []
    for block in blocks:
        x0, y0, x1, y1, text, block_no, block_type = block
        if block_type != 0:  # keep text blocks only
            continue
        text = clean_text(text)
        if not text:
            continue
        item = {
            "bbox": [round(x0, 1), round(y0, 1), round(x1, 1), round(y1, 1)],
            "text": text,
            "block_no": int(block_no),
        }
        if not current:
            current.append(item)
            continue
        prev_y1 = current[-1]["bbox"][3]
        current_y0 = item["bbox"][1]
        vertical_gap = current_y0 - prev_y1
        if vertical_gap > gap_threshold:
            regions.append(current)
            current = [item]
        else:
            current.append(item)
    if current:
        regions.append(current)

    # Add combined region bbox + joined text
    packed = []
    for idx, region in enumerate(regions):
        xs0 = [b["bbox"][0] for b in region]
        ys0 = [b["bbox"][1] for b in region]
        xs1 = [b["bbox"][2] for b in region]
        ys1 = [b["bbox"][3] for b in region]
        packed.append({
            "region_id": idx,
            "bbox": [min(xs0), min(ys0), max(xs1), max(ys1)],
            "text": "\n".join(b["text"] for b in region),
            "blocks": region,
        })
    return packed

# 1) Load PDF
pdf_path = ensure_local_pdf(PDF_SOURCE)
doc = fitz.open(pdf_path)
page = doc[PAGE_INDEX]
# 2) Extract blocks with sort=True
# PyMuPDF can also do get_text("words", sort=True) later if you need finer control.
blocks = page.get_text("blocks", sort=True)
# 3) Group blocks into rough page regions
regions = group_blocks_into_regions(blocks, gap_threshold=GAP_THRESHOLD)
# 4) Save raw outputs
raw_blocks_path = os.path.join(OUT_DIR, "page_blocks.json")
regions_path = os.path.join(OUT_DIR, "page_regions.json")
page_text_path = os.path.join(OUT_DIR, "page_text.txt")
with open(raw_blocks_path, "w", encoding="utf-8") as f:
    json.dump(blocks, f, indent=2, ensure_ascii=False)
with open(regions_path, "w", encoding="utf-8") as f:
    json.dump(regions, f, indent=2, ensure_ascii=False)
with open(page_text_path, "w", encoding="utf-8") as f:
    for region in regions:
        f.write(f"\n=== REGION {region['region_id']} ===\n")
        f.write(region["text"])
        f.write("\n")
# 5) Print a compact summary
print(f"\nPDF: {pdf_path}")
print(f"Pages: {doc.page_count}")
print(f"Using page index: {PAGE_INDEX}")
print(f"Text blocks found: {sum(1 for b in blocks if b[6] == 0)}")
print(f"Rough regions found: {len(regions)}")
for region in regions:
    x0, y0, x1, y1 = region["bbox"]
    preview = region["text"][:200].replace("\n", " | ")
    print(f"\nREGION {region['region_id']} bbox=({x0}, {y0}, {x1}, {y1})")
    print(f"Preview: {preview}")
print("\nSaved:")
print(" -", raw_blocks_path)
print(" -", regions_path)
print(" -", page_text_path)
doc.close()
For your shipper-summary PDF, the next step after this is to replace the simple vertical-gap grouping with repeating summary-block detection, then extract fields like invoice number, shipping, freight, discount, and total inside each region only. That avoids the reading-order problem that broke the whole-document regex approach. (PyMuPDF)