Is it possible to create a Résumé parser using a Huggingface model?

In other words, is it possible to train a supervised transformer model to pull out specific information from unstructured or semi-structured text, and if so, which pretrained model would be best for this?

In the resume example, I’d want to input the text version of a person’s resume and get JSON like the following as output: {"Education": ["BS Harvard University 2010", "MS Stanford University 2012"], "Experience": ["Microsoft, 2012-2016", "Google, 2016 - Present"]}

Obviously, I’ll need to label hundreds or thousands of resumes with their relevant Education and Experience fields before I’ll have a model that is capable of the above.

Here’s another example of the kind of solution I’m talking about, although this person seems to be using GPT-3 and didn’t provide any code. Is this something that any of the Hugging Face pipelines is capable of, and if so, which pipeline would be most appropriate?

Is there any reason you’re looking to do this with a transformer?
This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.

I’m looking to do this with a transformer because I’ll be receiving the raw text, not images, as input.

And yes, I’m using the Resume example as a proxy for a confidential use case at my company.

Yes, we’ve seen companies using transformers for similar use cases. If you don’t have a lot of labelled data, it usually involves a mix of zero-shot classification to understand the sections (ex: https://huggingface.co/facebook/bart-large-mnli?text=2001+-+2003+harvard%2C+master+in+management&labels=school%2C+job%2C+hobbies&multiclass=false) and then NER to extract the right information for the right classes (https://huggingface.co/dslim/bert-base-NER?text=I+worked+at+Facebook). Sometimes you want to add entity linking to the mix depending on how elaborate a system you need.
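A minimal sketch of that two-step idea with the two models linked above, assuming the transformers library is installed. The function names and the label list are illustrative only, and the models are passed in as plain callables so they can be swapped out or mocked.

```python
from typing import Any, Callable, Dict, List

# Candidate section labels taken from the zero-shot demo link above.
SECTION_LABELS = ["school", "job", "hobbies"]


def build_pipelines():
    """Load both models (downloads weights on first call)."""
    # Imported lazily so the rest of the sketch works without transformers.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
    return classifier, ner


def parse_line(line: str, classifier: Callable, ner: Callable,
               labels: List[str] = SECTION_LABELS) -> Dict[str, Any]:
    """Classify one resume line into a section, then pull entities from it."""
    result = classifier(line, candidate_labels=labels)
    section = result["labels"][0]  # labels come back sorted by score, best first
    entities = [(e["entity_group"], e["word"]) for e in ner(line)]
    return {"section": section, "text": line, "entities": entities}
```

Typical use would be `classifier, ner = build_pipelines()` followed by `parse_line("I worked at Facebook", classifier, ner)`. Note that dslim/bert-base-NER only knows generic entity types (PER, ORG, LOC, MISC), so resume-specific fields such as degrees would still need fine-tuning.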

Would you be open to jumping on a call to see if some of our commercial offerings could be useful, or are you looking to do it all by yourself?

This is a common NLP problem, and transformers are a good first port of call for a problem like this.

You should google these terms:

  • Named entity recognition
  • Chunking

You’ll need to have labeled data, usually marking every token in your document with perhaps IOB tags that can demarcate the start and end of a coherent chunk of text.
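To make the IOB idea concrete, here is a small sketch that turns token-level spans into one tag per token. The label names EDU and EXP are made up for the resume example, and the spans are assumed to come from a human annotator.

```python
from typing import List, Tuple


def iob_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end_exclusive, label) token spans into one IOB tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the chunk
    return tags


tokens = "BS Harvard University 2010 Microsoft , 2012-2016".split()
spans = [(0, 4, "EDU"), (4, 7, "EXP")]  # hypothetical label names
for token, tag in zip(tokens, iob_tags(tokens, spans)):
    print(f"{token}\t{tag}")
```

The B- tag marks where a chunk starts, so two adjacent chunks of the same type stay separable.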

The problem I’m trying to solve, in the most general sense, is that you’re given a set of documents and each document in your set has specific information that you’re trying to pull out. Examples of specific information to pull out include:

  • Author of the document
  • When was the document written
  • Who is the recipient of the document
  • etc.

And keep in mind information in the document may not always be stated in an obvious way. In one document, the author may be given as “Author: Joe Shmo”. In another one it might say: “From Jane Doe”. But let’s assume that every document has an author and any normal adult human with an average comprehension of English can pull out the author even though the author may not be stated in the exact same way in each document (and let’s assume there are countless ways of saying who the author is and no reasonable Regex pattern can be used to pull it out.) Ditto to the other fields like the date when it’s written and the intended recipient of the document.

I originally thought of using a Question Answering model as a basis for this task but it might be overkill. Regarding the Resume example, I might end up training the model on just two questions and their respective answers for each resume in the dataset:

  • What is the education?
  • What is the experience?

I suppose training a custom NER model might be another route to take.
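For the question-answering framing, each labeled resume could be turned into SQuAD-style records roughly like this. This is only a sketch: it assumes the answer text appears verbatim in the resume, and the helper name is made up.

```python
from typing import Dict


def make_qa_example(context: str, question: str, answer: str) -> Dict:
    """Build one SQuAD-style record; the answer must appear verbatim in the context."""
    start = context.find(answer)
    if start == -1:
        raise ValueError(f"answer not found in context: {answer!r}")
    return {
        "context": context,
        "question": question,
        "answers": {"text": [answer], "answer_start": [start]},
    }


resume = "Jane Doe. BS Harvard University 2010. Microsoft, 2012-2016."
example = make_qa_example(resume, "What is the education?", "BS Harvard University 2010")
```

An extractive QA model returns one span per question, so multi-valued fields (two degrees, several jobs) would need one record per value or a token-classification framing instead.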

Is there any reason you’re looking to do this with a transformer?
This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.

LOL. What?

I misread; I thought he had the physical resumes and was jumping straight to: https://arxiv.org/abs/2010.11929

Hi @clem would be very interested in checking out HF’s commercial offering for this. Can we chat somehow?

Hey clem. Can we jump on a call to look into your offerings regarding the same? Thanks.

Hi,

We do have several models available for that. These include (at the time of writing):

LayoutLM is a BERT-like model by Microsoft that adds layout information of the text tokens to improve performance on document processing tasks, like information extraction from documents, document image classification and document visual question answering. Its successors are LayoutLMv2 and v3 which improve its performance. Notebooks for all of those can be found in my Github repo.

Then there are also the Document Image Transformer, which is mostly suited for document image classification and layout analysis (like table detection in PDFs), and TrOCR, which is a Transformer-based encoder-decoder model for optical character recognition on single text-line images. Notebooks for those can also be found in my Github repo.

Yes, it is possible, and in fact I have a demo available on my Space here. It was an initial version, and I uploaded it almost a year ago.

@Sybghat Can you share your source code, if it is uploaded on GitHub or somewhere similar?

You can see the source here: Sybghat/resume-parser at main

  1. I found deepset/tinyroberta-squad2, but it only works when the resume contains field labels. I used it through Haystack.

  2. impira/layoutlm-document-qa, which uses LayoutLM under the hood, works fine, but again, when I passed a resume that had only the values and no “Name” label, it could not detect them.

How accurate is your model?

@xjdeng were you able to find something accurate to extract that info?

Hey,

It’s crazy that the thread started in 2020, got a reasonable answer after 2 years by @nielsr, and it’s still ongoing in 2026.
I’m also working on a project similar to @xjdeng’s using LayoutLMv3, but I’m currently stuck at the training phase because I haven’t been able to find a decent resume dataset to train on.

If anyone can provide some guidance or point me toward suitable resources, I would be very grateful.

Thanks

Here is my current take, as of June 2026. The answer probably changes quite a bit depending on whether you specifically need to make it work with LayoutLMv3, or whether any model/tool is acceptable as long as the resume parsing goal is met:


TL;DR

I would split this into two different tracks:

  1. If you specifically need LayoutLMv3, the main problem is not just “which model?” or “which dataset?”. You need a dataset/pipeline with:

    • page images or rendered PDF pages,
    • OCR words,
    • word-level bounding boxes,
    • normalized LayoutLM-style boxes,
    • word/token labels,
    • reading order,
    • and the same OCR/bbox preprocessing at training and inference time.

    I did not find a clean public resume-specific dataset that is already “LayoutLMv3-ready” in that full sense.

  2. If the goal is simply resume parsing / resume information extraction, then there are more practical resources now. I would look at the resources covered under Track B below.

The most important practical distinction is:

LayoutLMv3 wants image + OCR words + boxes + labels.
Many newer resume resources are text-to-JSON, NER, OCR, or document-parsing resources. Useful, but not the same training format.


Track A — If you must use LayoutLMv3

For LayoutLMv3, I would first check the expected input contract carefully.

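Relevant docs: the LayoutLMv3 model docs and the LayoutLMv3 forum thread (both listed in the resource table below).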

The important part is that LayoutLMv3 token classification is not just ordinary text NER. The processor/model path expects layout-aware inputs, usually something like:

{
    "image": "<page image>",
    "words": ["John", "Doe", "Software", "Engineer", "..."],
    "boxes": [[x0, y0, x1, y1], ...],
    "word_labels": ["B-NAME", "I-NAME", "B-TITLE", "I-TITLE", "..."]
}

The boxes should be word-level bounding boxes, normalized in the LayoutLM-style coordinate system, usually 0–1000 scale. The labels need to align with the OCR words, and then the tokenizer has to propagate word labels to subword tokens.
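As a rough illustration of the box normalization step, assuming the OCR engine returns pixel coordinates as (x0, y0, x1, y1) and you know the page size in the same units:

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], page_width: float, page_height: float) -> List[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        1000 * x0 / page_width,
        1000 * y0 / page_height,
        1000 * x1 / page_width,
        1000 * y1 / page_height,
    ]
    # Clamp so OCR boxes that spill slightly past the page edge stay valid.
    return [max(0, min(1000, int(v))) for v in scaled]
```

For example, `normalize_box((50, 100, 150, 200), 1000, 2000)` gives `[50, 50, 150, 100]`. Whatever rounding and clamping you choose, use the same function at training and inference time.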

The forum thread above is worth reading because it points out a common failure mode: training with one kind of box/annotation setup, then using a different OCR/bbox setup at inference. That can break the model even if the training code looked fine.

Practical LayoutLMv3 checklist

If I had to make the LayoutLMv3 route work, I would do something like this:

1. Choose one OCR engine and freeze it.
   Examples: Tesseract, EasyOCR, PaddleOCR, pdfplumber/PDF text extraction, etc.

2. Convert each resume page into:
   - page image
   - OCR words
   - word-level bounding boxes
   - reading order

3. Annotate those OCR words/boxes.
   Do not annotate totally separate hand-drawn boxes unless you can reproduce the same boxes at inference.

4. Convert annotations into BIO/BILOU labels.
   Example labels:
   - B-NAME / I-NAME
   - B-EMAIL / I-EMAIL
   - B-PHONE / I-PHONE
   - B-COMPANY / I-COMPANY
   - B-JOB_TITLE / I-JOB_TITLE
   - B-DEGREE / I-DEGREE
   - B-INSTITUTION / I-INSTITUTION
   - B-SKILL / I-SKILL
   - O

5. Normalize boxes exactly as LayoutLMv3 expects.

6. Keep OCR, ordering, box normalization, truncation, and page splitting identical at training and inference.

7. Start with a small gold evaluation set before scaling.

In other words, for LayoutLMv3 the dataset problem is really an annotation and preprocessing contract problem.
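One way to enforce that contract is to store a fingerprint of the preprocessing settings next to the trained model and compare it at inference time. This is just a sketch; the field names and defaults are illustrative, not something LayoutLMv3 itself defines.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessConfig:
    """Everything that must match between training and inference (illustrative fields)."""
    ocr_engine: str
    ocr_version: str
    render_dpi: int
    box_scale: int = 1000
    max_tokens: int = 512

    def fingerprint(self) -> str:
        # Stable short hash of the settings, independent of field order.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def check_contract(train_cfg: PreprocessConfig, infer_cfg: PreprocessConfig) -> None:
    """Fail fast if inference preprocessing drifted from what the model was trained on."""
    if train_cfg.fingerprint() != infer_cfg.fingerprint():
        raise ValueError(f"preprocessing mismatch: {train_cfg} vs {infer_cfg}")
```

Saving `train_cfg.fingerprint()` with the checkpoint makes a silent OCR engine or DPI change show up as an error instead of as mysteriously bad predictions.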

Why ordinary resume text datasets are not enough for LayoutLMv3

A dataset like:

{
  "text": "John Doe\nSoftware Engineer\n...",
  "json": {
    "name": "John Doe",
    "title": "Software Engineer"
  }
}

can be useful for text-to-JSON models or LLM fine-tuning, but it is not directly enough for LayoutLMv3 token classification because it is missing:

  • page image,
  • OCR words,
  • word bounding boxes,
  • word-level labels,
  • reading order,
  • box normalization,
  • page-level segmentation.

You might still use a text-to-JSON model to create weak labels. For example:

resume PDF
→ OCR words + boxes
→ text-to-JSON extractor
→ extracted field values
→ string-match field values back to OCR words
→ weak BIO labels
→ human correction
→ LayoutLMv3 fine-tuning

But I would treat that as a weak-labeling/bootstrap approach, not as a clean substitute for a real gold dataset.
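The "string-match field values back to OCR words" step could look something like the sketch below. It assumes exact, case-insensitive word matches and only labels the first occurrence of each value, so real OCR noise would need fuzzier matching. The label names are examples.

```python
from typing import Dict, List


def weak_bio_labels(words: List[str], fields: Dict[str, str]) -> List[str]:
    """Project extracted field values onto OCR words as weak BIO labels."""
    labels = ["O"] * len(words)
    lowered = [w.lower() for w in words]
    for label, value in fields.items():
        target = value.lower().split()
        n = len(target)
        if n == 0:
            continue
        for i in range(len(words) - n + 1):
            window_free = all(tag == "O" for tag in labels[i:i + n])
            if lowered[i:i + n] == target and window_free:
                labels[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{label}"
                break  # only the first occurrence is labeled
    return labels


words = ["John", "Doe", "Software", "Engineer", "at", "Acme"]
fields = {"NAME": "John Doe", "TITLE": "Software Engineer"}
print(weak_bio_labels(words, fields))
```

Values that cannot be matched simply stay unlabeled, which is one reason the human correction step matters.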

A LayoutLMv3-adjacent resource

One interesting LayoutLMv3-related resume resource is:

I would not treat it as a complete resume field parser, but it is relevant because it uses LayoutLMv3 on resume PDFs and discusses a resume-oriented preprocessing pattern. It may be useful as a reference if you want to see how someone handled PDF words, boxes, and LayoutLMv3-style representations in the resume domain.


Track B — If any model/tool is acceptable

If the goal is simply “parse resumes into structured fields”, I would probably not start with LayoutLMv3. I would start with a pipeline view:

PDF / DOCX / image resume
→ OCR / PDF parsing / layout reconstruction
→ clean text or Markdown with reading order
→ section routing
→ structured extraction
→ JSON validation
→ evaluation on a small hand-checked set

A resume parser is often not one model. It is a pipeline.

The hard part may be upstream: converting a visually complex resume PDF into faithful text/Markdown/layout before extracting fields.


Strongest current resource: SmartResume

I would look at SmartResume first:

Why it matters:

SmartResume is very close to the actual problem. It is not just a generic NER model. It treats resume parsing as a layout-aware pipeline:

resume PDF / image / Office document
→ OCR + PDF metadata extraction
→ layout detection
→ reading order reconstruction
→ structured information extraction with a compact LLM

The paper is especially useful because it frames the problem correctly:

  • resumes have diverse layouts,
  • resumes often have multi-column structures,
  • reading order matters,
  • LLM-only extraction can be expensive or brittle,
  • standardized resume extraction datasets/evaluation tools are limited.

This is probably the best “goal-first” starting point I found.


General structured extraction: NuExtract3

Another strong candidate is NuExtract3.

This is not resume-specific, but it is very relevant. It is a vision-language document understanding model for structured extraction and image-to-Markdown conversion.

The useful pattern is:

input document + JSON template + optional instructions
→ structured JSON output

For a resume, the template might look like:

{
  "name": "verbatim-string",
  "email": "email",
  "phone": "verbatim-string",
  "location": "verbatim-string",
  "summary": "string",
  "skills": ["verbatim-string"],
  "education": [
    {
      "institution": "verbatim-string",
      "degree": "verbatim-string",
      "field": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time"
    }
  ],
  "experience": [
    {
      "company": "verbatim-string",
      "title": "verbatim-string",
      "location": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time",
      "responsibilities": ["string"],
      "achievements": ["string"]
    }
  ],
  "certifications": [
    {
      "name": "verbatim-string",
      "issuer": "verbatim-string",
      "date": "date-time"
    }
  ]
}

I would still evaluate it on real resumes. But as a modern structured-extraction route, it is very relevant.
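Whatever model fills the template, a cheap structural check on the returned JSON catches a lot of problems early. A minimal sketch, assuming the output mirrors the template's keys and nesting, that missing values come back as null, and that every list in the template holds exactly one prototype item:

```python
from typing import Any


def matches_template(output: Any, template: Any) -> bool:
    """Check that `output` has the same keys and nesting as `template`.

    Leaves only need to be a string or None; dates and emails are not validated.
    """
    if isinstance(template, dict):
        return (
            isinstance(output, dict)
            and set(output) == set(template)
            and all(matches_template(output[k], template[k]) for k in template)
        )
    if isinstance(template, list):
        # The template list holds one prototype item that every output item must match.
        return isinstance(output, list) and all(
            matches_template(item, template[0]) for item in output
        )
    return output is None or isinstance(output, str)
```

This only checks shape. Whether "start_date" really is a date, or whether a "verbatim-string" really appears in the document, needs separate checks.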


OCR / document parsing layer: PaddleOCR-VL, PaddleOCR 3.5, Docling, olmOCR

If your input is PDF/image resumes, I would also look at current OCR/document parsing tools. These are not resume parsers by themselves, but they may solve the most painful upstream step.

Useful resources: PaddleOCR-VL, PaddleOCR 3.5, Docling, and olmOCR.

Why this matters:

A two-column resume, a sidebar resume, or a scanned resume can fail before the extraction model even sees the content correctly. If the OCR/layout step scrambles the reading order, a good NER or LLM extractor may still produce bad JSON.

So I would separate:

document parsing quality

from:

field extraction quality

They are related, but not the same problem.


Resume text → JSON resources

If you can already get reasonably clean text from the resume, there are more direct resources.

Qwen3 resume JSON dataset/model

This route is useful if your pipeline is:

PDF/DOCX/image
→ text extraction
→ raw resume text
→ structured JSON extractor

The dataset is raw resume text to structured JSON. The model is a Qwen3-0.6B LoRA adapter for resume JSON extraction.

Caveats:

  • It is not a LayoutLMv3 dataset.
  • It does not solve OCR/layout.
  • The model repo contains the LoRA adapter, so the base model is also needed.
  • Long resumes and unusual formats still need evaluation.

Small local resume extractor

This is a resume/CV structured extraction model based on NuExtract-tiny / Qwen2.5-0.5B. It is useful if you want a small local route, especially for raw text to JSON.

Caveat: check the model card carefully. It is trained on synthetic resumes, so I would not trust it without testing on real resumes from your target distribution.


NER / section-routing route

If you want something more deterministic and easier to debug than “LLM returns JSON”, a section classifier + NER pipeline may be easier to control.

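Useful resources: oksomu/resume-ner, the amosify section classifier, and the amosify resume NER model (see the resource table below).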

A practical version of this route could be:

OCR/PDF text chunks
→ classify chunks into sections:
   contact / summary / experience / education / skills / certifications / projects / etc.
→ run section-aware NER
→ normalize dates, phone, email, skills, company names
→ group entities into experience[] and education[]
→ validate JSON

This is less glamorous than a single end-to-end model, but easier to debug.

For example:

  • Contact fields can often be handled with regex + NER.
  • Skills can use NER + skill dictionaries.
  • Experience needs grouping: company, title, dates, bullet points.
  • Education needs grouping: institution, degree, field, dates/GPA.
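As an example of the regex part for contact fields, here is a rough sketch. The patterns are deliberately simple and are my own assumptions, not taken from any of the models above; phone formats in particular vary a lot by country.

```python
import re
from typing import Dict, Optional

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contact(text: str) -> Dict[str, Optional[str]]:
    """Pull the first email and the first plausible phone number out of raw text."""
    email = EMAIL_RE.search(text)
    phone = None
    for match in PHONE_RE.finditer(text):
        # Require at least 10 digits so date ranges like "2012-2016" are skipped.
        if len(re.sub(r"\D", "", match.group())) >= 10:
            phone = match.group()
            break
    return {"email": email.group() if email else None, "phone": phone}


header = "John Doe | john.doe@example.com | +1 (555) 123-4567 | Microsoft, 2012-2016"
print(extract_contact(header))
```

NER can then handle the fields that regex cannot express, such as the person's name or location.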

The important caveat with many resume NER models is that reported scores may come from internal or narrow test sets. I would always create a small hand-labeled evaluation set from your actual target resumes.


Resource table

| Resource | Type | Best for | Not for | Notes |
|---|---|---|---|---|
| LayoutLMv3 docs | Model docs | Understanding LayoutLMv3 input contract | Finding resume data | Essential for image/words/boxes/labels |
| LayoutLMv3 forum thread | Forum/debugging | OCR/bbox train-inference consistency | Turnkey solution | Very relevant practical warning |
| SmartResume | Resume-specific system | Full resume parsing pipeline | Pure LayoutLMv3 training | Strongest goal-first candidate |
| Alibaba-EI/SmartResume | Model repo | Weights/resources for SmartResume | General NER | Includes resume extraction/layout components |
| Layout-Aware Parsing Meets Efficient LLMs | Paper | Modern resume extraction architecture | Drop-in code alone | Useful framing and evaluation discussion |
| NuExtract3 | General document extraction VLM | Template-based JSON extraction | Resume-specific guarantee | Strong candidate if model choice is flexible |
| PaddleOCR-VL | OCR/document parsing | Upstream PDF/image parsing | Resume field extraction by itself | Strong document parsing candidate |
| PaddleOCR GitHub | OCR/document stack | OCR/layout/table/formula/chart extraction | Resume-specific schema | Good ingestion layer |
| Docling | Document parser | PDF/DOCX/image to structured text/Markdown/JSON | Resume labels | Useful preprocessing layer |
| olmOCR | PDF-to-text/Markdown OCR | Clean reading order / linearized text | Resume JSON fields | Useful before extraction |
| resume-json-extraction-5k | Dataset | Resume text → JSON SFT | LayoutLMv3 training | Directly relevant for text route |
| qwen3-0.6b-resume-json | LoRA adapter | Lightweight resume JSON extraction | OCR/layout | Needs base Qwen3 model |
| NuExtract-tiny Resume | Small local extractor | Local raw-text resume JSON extraction | Robust PDF layout | Synthetic-data caveat |
| oksomu/resume-ner | NER + postprocess | Deterministic entity extraction route | Full layout parsing | Detailed card; evaluate externally |
| amosify section classifier | Text classifier | Section routing | Field extraction alone | Useful middle layer |
| amosify resume NER | NER model | Section-aware NER | Nested JSON alone | Pair with section routing |
| resume-parsing model tag | HF model search | Discovering current resume models | Exhaustive coverage | Some models are not tagged consistently |

Suggested practical pipelines

Pipeline 1: LayoutLMv3-only route

Use this if LayoutLMv3 is required.

resume PDF/image
→ render pages
→ OCR with one fixed engine
→ words + word boxes + reading order
→ annotate OCR words/boxes
→ BIO labels
→ LayoutLMv3 token classification
→ field grouping + post-processing
→ JSON validation

This is the most faithful LayoutLMv3 route, but also the most annotation-heavy.
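The "field grouping" step after token classification is mostly about merging B-/I- tagged words back into spans. A small sketch, assuming the predictions have already been mapped from subword tokens back to one label per OCR word:

```python
from typing import Dict, List


def group_bio(words: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Merge B-/I- tagged words into text spans, grouped by field name."""
    fields: Dict[str, List[str]] = {}
    span_label, span_words = None, []
    # A trailing "O" sentinel flushes the last open span.
    for word, tag in zip(words + [""], labels + ["O"]):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == span_label:
            span_words.append(word)          # continue the open span
            continue
        if span_label is not None:           # close whatever span was open
            fields.setdefault(span_label, []).append(" ".join(span_words))
        if prefix == "B":
            span_label, span_words = name, [word]
        else:                                # "O" or a stray I- tag
            span_label, span_words = None, []
    return fields


words = ["John", "Doe", "Engineer", "at", "Acme", "Corp"]
labels = ["B-NAME", "I-NAME", "B-JOB_TITLE", "O", "B-COMPANY", "I-COMPANY"]
print(group_bio(words, labels))
```

Stray I- tags without a preceding B- are dropped here; a more forgiving decoder might start a new span instead. Pairing each company with its title and dates is a separate step on top of this.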

Pipeline 2: Modern model-agnostic route

Use this if the goal is just accurate resume parsing.

resume PDF/DOCX/image
→ Docling / PaddleOCR / olmOCR / SmartResume-style parsing
→ Markdown or layout-preserving text
→ NuExtract3 / SmartResume / Qwen3-resume-json / NER pipeline
→ structured JSON
→ validation and evaluation

This is probably the more practical route for most projects.

Pipeline 3: Hybrid route

Use this if you want to eventually train LayoutLMv3, but need a bootstrap path.

resume PDF/image
→ OCR words + boxes
→ text-to-JSON or NER extractor
→ map extracted field values back to OCR words
→ create weak BIO labels
→ manually correct a subset
→ train LayoutLMv3
→ evaluate against gold set

This can reduce annotation cost, but weak labels can be noisy.


Evaluation checklist

For resume parsing, I would not evaluate only “does it produce JSON?”. I would check:

| Aspect | Example |
|---|---|
| JSON validity | Does it always return parseable JSON? |
| Schema compliance | Does it follow the target schema exactly? |
| Field exact match | email, phone, URLs |
| Normalized match | dates, locations, company names |
| Semantic match | job titles, degree names, responsibilities |
| Array alignment | does each title match the correct company/date range? |
| Omission | did it miss an experience item or degree? |
| Hallucination | did it invent a company, skill, date, or degree? |
| Layout robustness | two-column resumes, sidebars, scanned PDFs |
| Long-document handling | multi-page resumes, truncation, repeated headers |
| Privacy handling | PII, consent, local processing, data retention |
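For the omission and hallucination rows, a simple per-field comparison against a hand-checked gold set already goes a long way. A sketch, assuming list-valued fields of strings. Anything predicted but absent from the gold list is counted as hallucinated here, which is a simplification (it may just be a formatting variant).

```python
from typing import Dict, List


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(value.lower().split())


def compare_field(predicted: List[str], gold: List[str]) -> Dict[str, List[str]]:
    """Split one field's values into matched, omitted, and hallucinated."""
    pred = {normalize(v) for v in predicted}
    ref = {normalize(v) for v in gold}
    return {
        "matched": sorted(pred & ref),
        "omitted": sorted(ref - pred),        # in gold, missing from prediction
        "hallucinated": sorted(pred - ref),   # predicted, not supported by gold
    }


report = compare_field(predicted=["microsoft ", "Amazon"], gold=["Microsoft", "Google"])
print(report)
```

Dates, locations, and company names usually need a stronger `normalize` than this one before the comparison is meaningful.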

For structured extraction evaluation ideas, you can also look at:

These are not resume-specific, but they are useful for thinking about PDF-to-JSON and document parsing evaluation.


How I would keep searching

I would not rely only on the resume-parsing tag. Some newer models are weakly tagged or not tagged consistently.

Search across:

Hugging Face Models

Hugging Face Spaces

Search for resume parser demos, but inspect the implementation:

  • app.py
  • requirements.txt
  • model calls
  • OCR/PDF parsing method
  • whether it handles PDF, DOCX, image, or only raw text
  • whether there is any evaluation

Spaces are useful for implementation patterns, but I would not treat a demo as evidence of model quality.

Blogs / Papers / Posts

Also watch document parsing and OCR releases, not just resume-specific models.

Useful entry points:

The reason is that the hardest part may be converting the resume PDF/image into a faithful text/layout representation before the actual field extraction step.


My practical recommendation

If I were trying to solve this now, I would do this:

  1. If LayoutLMv3 is mandatory:

    • stop looking only for ordinary resume text datasets;
    • build a small LayoutLMv3-style gold dataset with images, OCR words, boxes, and labels;
    • keep train/inference OCR identical;
    • possibly use a text extractor/NER model to create weak labels, then manually correct them.
  2. If any model/tool is acceptable:

  3. In both cases:

    • create a small hand-checked evaluation set from your actual target resumes;
    • test two-column resumes, scanned resumes, multi-page resumes, and unusual layouts;
    • evaluate omissions and hallucinations, not only JSON validity.

So my short answer would be:

I did not find a perfect public LayoutLMv3-ready resume dataset.
But if the goal is resume parsing rather than specifically LayoutLMv3 training, the ecosystem is much better now: SmartResume, NuExtract3, PaddleOCR-VL/PaddleOCR, Qwen3 resume JSON, NuExtract-tiny Resume, and resume NER/section-routing models are all worth checking.