Is it possible to create a Résumé parser using a Huggingface model?

xjdeng · December 15, 2020, 10:48pm

In other words, is it possible to train a supervised transformer model to pull out specific information from unstructured or semi-structured text, and if so, which pretrained model would be best for this?

In the resume example, I’d want to input the text version of a person’s resume and get JSON like the following as output: {"Education": ["BS Harvard University 2010", "MS Stanford University 2012"], "Experience": ["Microsoft, 2012-2016", "Google, 2016 - Present"]}

Obviously, I’ll need to label hundreds or thousands of resumes with their relevant Education and Experience fields before I’ll have a model that is capable of the above.

Here’s another example of the solution that I’m talking about although this person seems to be using GPT-3 and didn’t have any code provided. Is this something that any of the huggingface pipelines is capable of and if so, which pipeline would be most appropriate?

FL33TW00D · December 15, 2020, 10:54pm

Is there any reason you’re looking to do this with a transformer?
This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.

xjdeng · December 15, 2020, 10:56pm

I’m looking to do this with a transformer because I’ll be receiving the raw text, not images, as input.

And yes, I’m using the Resume example as a proxy for a confidential use case at my company.

clem · December 16, 2020, 1:00am

Yes, we’ve seen companies using transformers for similar use cases. If you don’t have a lot of labelled data, it usually involves a mix of zero-shot classification to understand the sections (ex: https://huggingface.co/facebook/bart-large-mnli?text=2001+-+2003+harvard%2C+master+in+management&labels=school%2C+job%2C+hobbies&multiclass=false) and then NER to extract the right information for the right classes (https://huggingface.co/dslim/bert-base-NER?text=I+worked+at+Facebook). Sometimes you want to add entity linking to the mix depending on how elaborate a system you need.
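A rough sketch of the two-step idea above (route each line to a section with zero-shot classification, then run NER on it). The function names route_and_extract and build_hf_components are made up for illustration, and the section labels are simply the ones from the demo link. The classifier and tagger are passed in as plain callables so the routing logic can be tried with stubs before downloading any model.

```python
from typing import Callable, Dict, List

SECTION_LABELS = ["school", "job", "hobbies"]  # labels taken from the demo link


def route_and_extract(
    lines: List[str],
    classify: Callable[[str, List[str]], dict],
    ner: Callable[[str], List[dict]],
) -> Dict[str, List[dict]]:
    """Assign each non-empty line to its top-scoring section, then run NER on it."""
    result: Dict[str, List[dict]] = {label: [] for label in SECTION_LABELS}
    for line in lines:
        if not line.strip():
            continue
        scored = classify(line, SECTION_LABELS)
        section = scored["labels"][0]  # zero-shot output lists labels by descending score
        entities = [{"type": e["entity_group"], "text": e["word"]} for e in ner(line)]
        result[section].append({"line": line, "entities": entities})
    return result


def build_hf_components():
    """Load the two models linked above (weights are downloaded on first use)."""
    from transformers import pipeline  # imported lazily because it is a heavy dependency

    zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    tagger = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

    def classify(text: str, labels: List[str]) -> dict:
        return zero_shot(text, candidate_labels=labels)

    return classify, tagger
```

Usage would be something like `classify, tagger = build_hf_components()` followed by `route_and_extract(resume_text.splitlines(), classify, tagger)`. Whether line-level routing is good enough depends on how the resume text was extracted.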

Would you be open to jumping on a call to see if some of our commercial offerings could be useful, or are you looking to do it all by yourself?

facehugger2020 · December 16, 2020, 1:02am

This is a common NLP problem, and transformers are a good first port of call for a problem like this.

You should google these terms:

Named entity recognition
Chunking

You’ll need to have labeled data, usually marking every token in your document with perhaps IOB tags that can demarcate the start and end of a coherent chunk of text.
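To make the IOB idea concrete, here is a minimal sketch that turns labeled token spans into one tag per token. The helper name spans_to_iob and the EDU/EXP labels are only illustrative, and whitespace tokenization is a simplification.

```python
from typing import List, Tuple


def spans_to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into IOB tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span ({start}, {end}) is outside the token list")
        tags[start] = f"B-{label}"  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the same chunk
    return tags


tokens = "BS Harvard University 2010 Microsoft , 2012-2016".split()
spans = [(0, 4, "EDU"), (4, 7, "EXP")]
for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(f"{token}\t{tag}")
```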

xjdeng · December 16, 2020, 2:48am

The problem I’m trying to solve, in the most general sense, is that you’re given a set of documents and each document in your set has specific information that you’re trying to pull out. Examples of specific information to pull out include:

Author of the document
When was the document written
Who is the recipient of the document
etc.

And keep in mind information in the document may not always be stated in an obvious way. In one document, the author may be given as “Author: Joe Shmo”. In another one it might say: “From Jane Doe”. But let’s assume that every document has an author and any normal adult human with an average comprehension of English can pull out the author even though the author may not be stated in the exact same way in each document (and let’s assume there are countless ways of saying who the author is and no reasonable Regex pattern can be used to pull it out.) Ditto to the other fields like the date when it’s written and the intended recipient of the document.

I originally thought of using a Question Answering model as a basis for this task but it might be overkill. Regarding the Resume example, I might end up training the model on just two questions and their respective answers for each resume in the dataset:

What is the education?
What is the experience?

I suppose training a custom NER model might be another route to take.

facehugger2020 · December 16, 2020, 10:18pm

FL33TW00D wrote: “Is there any reason you’re looking to do this with a transformer? This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.”

LOL. What?

FL33TW00D · December 16, 2020, 11:28pm

I misread; I thought he had the physical resumes and was jumping straight to: https://arxiv.org/abs/2010.11929

meditations · July 15, 2021, 1:18am

Hi @clem would be very interested in checking out HF’s commercial offering for this. Can we chat somehow?

tnavin · August 3, 2022, 12:05am

Hey clem. Can we jump on a call to look into your offerings regarding the same? Thanks.

nielsr · August 3, 2022, 9:44am

Hi,

We do have several models available for that. These include (at the time of writing):

LayoutLM is a BERT-like model by Microsoft that adds layout information of the text tokens to improve performance on document processing tasks, like information extraction from documents, document image classification and document visual question answering. Its successors are LayoutLMv2 and v3 which improve its performance. Notebooks for all of those can be found in my Github repo.

Then there are also the Document Image Transformer, which is mostly suited for document image classification and layout analysis (like table detection in PDFs), and TrOCR, which is a Transformer-based encoder-decoder model for optical character recognition on single text-line images. Notebooks for those can also be found in my Github repo.

Sybghat · April 14, 2023, 9:28am

Yes, it is possible, and in fact I have a demo available on my Space here. It was an initial version that I uploaded almost a year ago.

rakin061 · October 12, 2023, 5:32am

@Sybghat Can you share your source code, if it is uploaded to GitHub or somewhere similar?

radames · October 12, 2023, 8:12pm

You can see the source here: Sybghat/resume-parser at main

umesh19 · October 18, 2023, 5:55am

I have found deepset/tinyroberta-squad2, which I used through Haystack, but it only works when the resume contains field labels.
impira/layoutlm-document-qa, which uses LayoutLM behind the scenes, works fine, but again, when I passed a resume that had only values and no field labels, it could not detect them.

ddeisadze · March 22, 2024, 12:36am

How accurate is your model?

ddeisadze · March 22, 2024, 12:36am

@xjdeng were you able to find something accurate to extract that info?

jiroxin · May 31, 2026, 4:09pm

Hey,

It’s crazy that the thread started in 2020, got a reasonable answer after 2 years by @nielsr, and it’s still ongoing in 2026.
I’m also working on a similar project to @xjdeng’s using LayoutLMv3, but I’m currently stuck at the training phase because I haven’t been able to find a decent resume dataset to train on.

If anyone can provide some guidance or point me toward suitable resources, I would be very grateful.

Thanks

John6666 · June 3, 2026, 9:04pm

This is the picture as of June 2026. The answer probably changes quite a bit depending on whether you specifically need to make it work with LayoutLMv3, or whether any model/tool is acceptable as long as the resume parsing goal is met:

TL;DR

I would split this into two different tracks:

If you specifically need LayoutLMv3, the main problem is not just “which model?” or “which dataset?”. You need a dataset/pipeline with:
- page images or rendered PDF pages,
- OCR words,
- word-level bounding boxes,
- normalized LayoutLM-style boxes,
- word/token labels,
- reading order,
- and the same OCR/bbox preprocessing at training and inference time.
I did not find a clean public resume-specific dataset that is already “LayoutLMv3-ready” in that full sense.
If the goal is simply resume parsing / resume information extraction, then there are more practical resources now; see Track B below.

The most important practical distinction is:

LayoutLMv3 wants image + OCR words + boxes + labels.
Many newer resume resources are text-to-JSON, NER, OCR, or document-parsing resources. Useful, but not the same training format.

Track A — If you must use LayoutLMv3

For LayoutLMv3, I would first check the expected input contract carefully.

Relevant docs: the LayoutLMv3 docs (see the resource table below).

The important part is that LayoutLMv3 token classification is not just ordinary text NER. The processor/model path expects layout-aware inputs, usually something like:

{
    "image": "<page image>",
    "words": ["John", "Doe", "Software", "Engineer", "..."],
    "boxes": [[x0, y0, x1, y1], ...],
    "word_labels": ["B-NAME", "I-NAME", "B-TITLE", "I-TITLE", "..."]
}

The boxes should be word-level bounding boxes, normalized in the LayoutLM-style coordinate system, usually 0–1000 scale. The labels need to align with the OCR words, and then the tokenizer has to propagate word labels to subword tokens.
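As a small illustration of the 0–1000 scaling, here is a sketch of the usual normalization from pixel coordinates. The helper name normalize_box is mine, and clamping to the 0–1000 range is an extra safety step I am assuming is harmless for OCR boxes that slightly overshoot the page.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], page_width: float, page_height: float) -> List[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 LayoutLM-style grid."""
    x0, y0, x1, y1 = box

    def scale(value: float, size: float) -> int:
        # Clamp so boxes that overshoot the page edge stay inside the grid.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    ]


# A word box on a 1000 x 2000 pixel page image.
print(normalize_box((50, 100, 200, 150), page_width=1000, page_height=2000))
```

The same function has to be used with the same page size convention (rendered image size vs. PDF points) at training and inference.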

The LayoutLMv3 forum thread (see the resource table below) is worth reading because it points out a common failure mode: training with one kind of box/annotation setup, then using a different OCR/bbox setup at inference. That can break the model even if the training code looked fine.

Practical LayoutLMv3 checklist

If I had to make the LayoutLMv3 route work, I would do something like this:

1. Choose one OCR engine and freeze it.
   Examples: Tesseract, EasyOCR, PaddleOCR, pdfplumber/PDF text extraction, etc.

2. Convert each resume page into:
   - page image
   - OCR words
   - word-level bounding boxes
   - reading order

3. Annotate those OCR words/boxes.
   Do not annotate totally separate hand-drawn boxes unless you can reproduce the same boxes at inference.

4. Convert annotations into BIO/BILOU labels.
   Example labels:
   - B-NAME / I-NAME
   - B-EMAIL / I-EMAIL
   - B-PHONE / I-PHONE
   - B-COMPANY / I-COMPANY
   - B-JOB_TITLE / I-JOB_TITLE
   - B-DEGREE / I-DEGREE
   - B-INSTITUTION / I-INSTITUTION
   - B-SKILL / I-SKILL
   - O

5. Normalize boxes exactly as LayoutLMv3 expects.

6. Keep OCR, ordering, box normalization, truncation, and page splitting identical at training and inference.

7. Start with a small gold evaluation set before scaling.
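One way to enforce steps 1 and 6 is to put every preprocessing choice into a single frozen config and compare a fingerprint of it at training and inference time. This is just a sketch of that idea: PreprocessConfig, its fields, and the default values (including the OCR version string) are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessConfig:
    """Every choice that affects words, boxes, or truncation (values are examples)."""
    ocr_engine: str = "tesseract"
    ocr_version: str = "5.3.0"  # hypothetical version string
    render_dpi: int = 200
    box_scale: int = 1000
    max_seq_length: int = 512
    reading_order: str = "top-to-bottom-left-to-right"

    def fingerprint(self) -> str:
        # Stable hash of the sorted config so it can be stored next to the checkpoint.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def check_same_preprocessing(train_cfg: PreprocessConfig, infer_cfg: PreprocessConfig) -> None:
    """Fail loudly if inference preprocessing differs from training preprocessing."""
    if train_cfg.fingerprint() != infer_cfg.fingerprint():
        raise RuntimeError(
            f"preprocessing mismatch: train={train_cfg.fingerprint()} "
            f"inference={infer_cfg.fingerprint()}"
        )
```

Saving the fingerprint alongside the trained model makes a silent OCR or DPI change show up as an error instead of as mysteriously bad predictions.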

In other words, for LayoutLMv3 the dataset problem is really an annotation and preprocessing contract problem.

Why ordinary resume text datasets are not enough for LayoutLMv3

A dataset like:

{
  "text": "John Doe\nSoftware Engineer\n...",
  "json": {
    "name": "John Doe",
    "title": "Software Engineer"
  }
}

can be useful for text-to-JSON models or LLM fine-tuning, but it is not directly enough for LayoutLMv3 token classification because it is missing:

page image,
OCR words,
word bounding boxes,
word-level labels,
reading order,
box normalization,
page-level segmentation.

You might still use a text-to-JSON model to create weak labels. For example:

resume PDF
→ OCR words + boxes
→ text-to-JSON extractor
→ extracted field values
→ string-match field values back to OCR words
→ weak BIO labels
→ human correction
→ LayoutLMv3 fine-tuning

But I would treat that as a weak-labeling/bootstrap approach, not as a clean substitute for a real gold dataset.
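The "string-match field values back to OCR words" step could look roughly like the sketch below. It assumes exact matching after lowercasing and stripping some punctuation, and it labels only the first match per field; weak_bio_labels is a hypothetical helper, and real OCR noise would need fuzzier matching plus the human correction pass.

```python
from typing import Dict, List


def _norm(token: str) -> str:
    """Lowercase and strip surrounding punctuation for a forgiving comparison."""
    return token.lower().strip(".,;:()")


def weak_bio_labels(words: List[str], fields: Dict[str, str]) -> List[str]:
    """Project extracted field values onto OCR words as weak BIO labels."""
    labels = ["O"] * len(words)
    normalized = [_norm(w) for w in words]
    for field, value in fields.items():
        target = [_norm(t) for t in value.split()]
        if not target:
            continue
        n = len(target)
        for i in range(len(words) - n + 1):
            window_is_free = all(label == "O" for label in labels[i:i + n])
            if window_is_free and normalized[i:i + n] == target:
                labels[i] = f"B-{field}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{field}"
                break  # label only the first occurrence; later ones stay "O"
    return labels


words = ["John", "Doe", "Software", "Engineer", "at", "Acme"]
fields = {"NAME": "John Doe", "TITLE": "Software Engineer"}
print(list(zip(words, weak_bio_labels(words, fields))))
```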

A LayoutLMv3-adjacent resource

One interesting LayoutLMv3-related resume resource is:

Smutypi3/applai-layoutlmv3

I would not treat it as a complete resume field parser, but it is relevant because it uses LayoutLMv3 on resume PDFs and discusses a resume-oriented preprocessing pattern. It may be useful as a reference if you want to see how someone handled PDF words, boxes, and LayoutLMv3-style representations in the resume domain.

Track B — If any model/tool is acceptable

If the goal is simply “parse resumes into structured fields”, I would probably not start with LayoutLMv3. I would start with a pipeline view:

PDF / DOCX / image resume
→ OCR / PDF parsing / layout reconstruction
→ clean text or Markdown with reading order
→ section routing
→ structured extraction
→ JSON validation
→ evaluation on a small hand-checked set

A resume parser is often not one model. It is a pipeline.

The hard part may be upstream: converting a visually complex resume PDF into faithful text/Markdown/layout before extracting fields.

Strongest current resource: SmartResume

I would look at SmartResume first:

Why it matters:

SmartResume is very close to the actual problem. It is not just a generic NER model. It treats resume parsing as a layout-aware pipeline:

resume PDF / image / Office document
→ OCR + PDF metadata extraction
→ layout detection
→ reading order reconstruction
→ structured information extraction with a compact LLM

The paper is especially useful because it frames the problem correctly:

resumes have diverse layouts,
resumes often have multi-column structures,
reading order matters,
LLM-only extraction can be expensive or brittle,
standardized resume extraction datasets/evaluation tools are limited.

This is probably the best “goal-first” starting point I found.

General structured extraction: NuExtract3

Another strong candidate is:

numind/NuExtract3

This is not resume-specific, but it is very relevant. It is a vision-language document understanding model for structured extraction and image-to-Markdown conversion.

The useful pattern is:

input document + JSON template + optional instructions
→ structured JSON output

For a resume, the template might look like:

{
  "name": "verbatim-string",
  "email": "email",
  "phone": "verbatim-string",
  "location": "verbatim-string",
  "summary": "string",
  "skills": ["verbatim-string"],
  "education": [
    {
      "institution": "verbatim-string",
      "degree": "verbatim-string",
      "field": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time"
    }
  ],
  "experience": [
    {
      "company": "verbatim-string",
      "title": "verbatim-string",
      "location": "verbatim-string",
      "start_date": "date-time",
      "end_date": "date-time",
      "responsibilities": ["string"],
      "achievements": ["string"]
    }
  ],
  "certifications": [
    {
      "name": "verbatim-string",
      "issuer": "verbatim-string",
      "date": "date-time"
    }
  ]
}

I would still evaluate it on real resumes. But as a modern structured-extraction route, it is very relevant.
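One cheap check to run before any accuracy evaluation is whether the returned JSON has the same shape as the template. The sketch below assumes leaf values come back as strings and missing values as null; both are my assumptions, not something taken from the model card, and shape_errors is a hypothetical helper.

```python
from typing import Any, List


def shape_errors(template: Any, output: Any, path: str = "$") -> List[str]:
    """Return a list of places where output does not follow the template's shape."""
    if isinstance(template, dict):
        if not isinstance(output, dict):
            return [f"{path}: expected object"]
        errors: List[str] = []
        for key in template:
            if key not in output:
                errors.append(f"{path}.{key}: missing")
            else:
                errors.extend(shape_errors(template[key], output[key], f"{path}.{key}"))
        for key in output:
            if key not in template:
                errors.append(f"{path}.{key}: unexpected key")
        return errors
    if isinstance(template, list):
        if not isinstance(output, list):
            return [f"{path}: expected array"]
        if not template:
            return []  # no item shape given, so nothing to check
        errors = []
        for i, item in enumerate(output):
            errors.extend(shape_errors(template[0], item, f"{path}[{i}]"))
        return errors
    # Leaf: the template holds a type name; assume the value is a string or null.
    if output is not None and not isinstance(output, str):
        return [f"{path}: expected string or null"]
    return []


template = {"name": "verbatim-string", "skills": ["verbatim-string"]}
print(shape_errors(template, {"name": "John Doe", "skills": "Python"}))
```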

OCR / document parsing layer: PaddleOCR-VL, PaddleOCR 3.5, Docling, olmOCR

If your input is PDF/image resumes, I would also look at current OCR/document parsing tools. These are not resume parsers by themselves, but they may solve the most painful upstream step.

Useful resources: PaddleOCR-VL, PaddleOCR, Docling, and olmOCR (see the resource table below).

Why this matters:

A two-column resume, a sidebar resume, or a scanned resume can fail before the extraction model even sees the content correctly. If the OCR/layout step scrambles the reading order, a good NER or LLM extractor may still produce bad JSON.

So I would separate:

document parsing quality

from:

field extraction quality

They are related, but not the same problem.
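To show how reading order alone can go wrong, here is a toy sketch with a two-column page. Sorting words top-to-bottom interleaves the columns, while splitting at a column boundary first keeps each column together. The fixed split_x is a stand-in I made up; real tools detect columns from the layout rather than taking a hard-coded value.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, x0, y0) in page coordinates


def naive_order(words: List[Word]) -> List[str]:
    """Sort purely top-to-bottom, then left-to-right (breaks on multi-column pages)."""
    return [w[0] for w in sorted(words, key=lambda w: (w[2], w[1]))]


def two_column_order(words: List[Word], split_x: float) -> List[str]:
    """Read the whole left column first, then the whole right column."""
    def by_position(w: Word) -> Tuple[float, float]:
        return (w[2], w[1])

    left = sorted((w for w in words if w[1] < split_x), key=by_position)
    right = sorted((w for w in words if w[1] >= split_x), key=by_position)
    return [w[0] for w in left + right]


page = [
    ("Skills", 50, 100), ("Experience", 300, 100),
    ("Python", 50, 120), ("Google", 300, 120),
]
print(naive_order(page))
print(two_column_order(page, split_x=200))
```

With the naive ordering, "Python" ends up after "Experience", which is the kind of scrambling that makes a good extractor produce bad JSON.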

Resume text → JSON resources

If you can already get reasonably clean text from the resume, there are more direct resources.

Qwen3 resume JSON dataset/model

This route is useful if your pipeline is:

PDF/DOCX/image
→ text extraction
→ raw resume text
→ structured JSON extractor

The dataset is raw resume text to structured JSON. The model is a Qwen3-0.6B LoRA adapter for resume JSON extraction.

Caveats:

It is not a LayoutLMv3 dataset.
It does not solve OCR/layout.
The model repo contains the LoRA adapter, so the base model is also needed.
Long resumes and unusual formats still need evaluation.

Small local resume extractor

nimendraai/NuExtract-tiny-Resume-Data-Extractor

This is a resume/CV structured extraction model based on NuExtract-tiny / Qwen2.5-0.5B. It is useful if you want a small local route, especially for raw text to JSON.

Caveat: check the model card carefully. It is trained on synthetic resumes, so I would not trust it without testing on real resumes from your target distribution.

NER / section-routing route

If you want something more deterministic and easier to debug than “LLM returns JSON”, a section classifier + NER pipeline may be easier to control.

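Useful resources: oksomu/resume-ner, the amosify section classifier, and the amosify resume NER model (see the resource table below).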

A practical version of this route could be:

OCR/PDF text chunks
→ classify chunks into sections:
   contact / summary / experience / education / skills / certifications / projects / etc.
→ run section-aware NER
→ normalize dates, phone, email, skills, company names
→ group entities into experience[] and education[]
→ validate JSON

This is less glamorous than a single end-to-end model, but easier to debug.

For example:

Contact fields can often be handled with regex + NER.
Skills can use NER + skill dictionaries.
Experience needs grouping: company, title, dates, bullet points.
Education needs grouping: institution, degree, field, dates/GPA.
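For the contact-field case, a regex-only sketch might look like this. The patterns are deliberately simple and are my own, not from any of the models above; the 9 to 15 digit filter is an assumption that keeps date ranges like "2012-2016" from being picked up as phone numbers.

```python
import re
from typing import Dict, Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Digits with optional spaces, dots, hyphens, and parentheses, on a single line.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def find_phone(text: str) -> Optional[str]:
    """Return the first phone-like candidate that has a plausible digit count."""
    for match in PHONE_RE.finditer(text):
        digits = sum(ch.isdigit() for ch in match.group(0))
        if 9 <= digits <= 15:  # skips short runs such as year ranges
            return match.group(0)
    return None


def extract_contact(text: str) -> Dict[str, Optional[str]]:
    """Pull the first email and phone number out of raw resume text."""
    email = EMAIL_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": find_phone(text),
    }


resume = "Jane Doe\nExperience: 2012-2016 Microsoft\njane.doe@example.com | +1 (555) 010-9999"
print(extract_contact(resume))
```

Names, companies, and titles are where NER is still needed; regexes like these only cover the fields with a rigid surface form.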

The important caveat with many resume NER models is that reported scores may come from internal or narrow test sets. I would always create a small hand-labeled evaluation set from your actual target resumes.

Resource table

Resource	Type	Best for	Not for	Notes
LayoutLMv3 docs	Model docs	Understanding LayoutLMv3 input contract	Finding resume data	Essential for image/words/boxes/labels
LayoutLMv3 forum thread	Forum/debugging	OCR/bbox train-inference consistency	Turnkey solution	Very relevant practical warning
SmartResume	Resume-specific system	Full resume parsing pipeline	Pure LayoutLMv3 training	Strongest goal-first candidate
Alibaba-EI/SmartResume	Model repo	Weights/resources for SmartResume	General NER	Includes resume extraction/layout components
Layout-Aware Parsing Meets Efficient LLMs	Paper	Modern resume extraction architecture	Drop-in code alone	Useful framing and evaluation discussion
NuExtract3	General document extraction VLM	Template-based JSON extraction	Resume-specific guarantee	Strong candidate if model choice is flexible
PaddleOCR-VL	OCR/document parsing	Upstream PDF/image parsing	Resume field extraction by itself	Strong document parsing candidate
PaddleOCR GitHub	OCR/document stack	OCR/layout/table/formula/chart extraction	Resume-specific schema	Good ingestion layer
Docling	Document parser	PDF/DOCX/image to structured text/Markdown/JSON	Resume labels	Useful preprocessing layer
olmOCR	PDF-to-text/Markdown OCR	Clean reading order / linearized text	Resume JSON fields	Useful before extraction
resume-json-extraction-5k	Dataset	Resume text → JSON SFT	LayoutLMv3 training	Directly relevant for text route
qwen3-0.6b-resume-json	LoRA adapter	Lightweight resume JSON extraction	OCR/layout	Needs base Qwen3 model
NuExtract-tiny Resume	Small local extractor	Local raw-text resume JSON extraction	Robust PDF layout	Synthetic-data caveat
oksomu/resume-ner	NER + postprocess	Deterministic entity extraction route	Full layout parsing	Detailed card; evaluate externally
amosify section classifier	Text classifier	Section routing	Field extraction alone	Useful middle layer
amosify resume NER	NER model	Section-aware NER	Nested JSON alone	Pair with section routing
resume-parsing model tag	HF model search	Discovering current resume models	Exhaustive coverage	Some models are not tagged consistently

Suggested practical pipelines

Pipeline 1: LayoutLMv3-only route

Use this if LayoutLMv3 is required.

resume PDF/image
→ render pages
→ OCR with one fixed engine
→ words + word boxes + reading order
→ annotate OCR words/boxes
→ BIO labels
→ LayoutLMv3 token classification
→ field grouping + post-processing
→ JSON validation

This is the most faithful LayoutLMv3 route, but also the most annotation-heavy.
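The "field grouping" step after token classification starts with collapsing per-word BIO predictions back into entity strings. A minimal sketch, with bio_to_entities as a hypothetical helper name; it treats a stray I- tag with no matching B- tag as the start of a new entity, which is one of several reasonable choices.

```python
from typing import List, Optional, Tuple


def bio_to_entities(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse word-level BIO tags into (label, text) entities."""
    entities: List[Tuple[str, str]] = []
    label: Optional[str] = None
    chunk: List[str] = []

    def flush() -> None:
        if label is not None:
            entities.append((label, " ".join(chunk)))

    for word, tag in zip(words, tags):
        if tag.startswith("I-") and tag[2:] == label:
            chunk.append(word)  # continue the current entity
        elif tag.startswith(("B-", "I-")):
            flush()  # close the previous entity and start a new one
            label, chunk = tag[2:], [word]
        else:
            flush()  # an "O" tag ends any open entity
            label, chunk = None, []
    flush()
    return entities


words = ["John", "Doe", "Acme", "Corp", "Engineer"]
tags = ["B-NAME", "I-NAME", "B-COMPANY", "I-COMPANY", "B-JOB_TITLE"]
print(bio_to_entities(words, tags))
```

Grouping these flat entities into experience[] and education[] items (which title goes with which company and date range) is a separate step on top of this.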

Pipeline 2: Modern goal-first route

Use this if the goal is just accurate resume parsing.

resume PDF/DOCX/image
→ Docling / PaddleOCR / olmOCR / SmartResume-style parsing
→ Markdown or layout-preserving text
→ NuExtract3 / SmartResume / Qwen3-resume-json / NER pipeline
→ structured JSON
→ validation and evaluation

This is probably the more practical route for most projects.

Pipeline 3: Hybrid route

Use this if you want to eventually train LayoutLMv3, but need a bootstrap path.

resume PDF/image
→ OCR words + boxes
→ text-to-JSON or NER extractor
→ map extracted field values back to OCR words
→ create weak BIO labels
→ manually correct a subset
→ train LayoutLMv3
→ evaluate against gold set

This can reduce annotation cost, but weak labels can be noisy.

Evaluation checklist

For resume parsing, I would not evaluate only “does it produce JSON?”. I would check:

Aspect	Example
JSON validity	Does it always return parseable JSON?
Schema compliance	Does it follow the target schema exactly?
Field exact match	email, phone, URLs
Normalized match	dates, locations, company names
Semantic match	job titles, degree names, responsibilities
Array alignment	does each title match the correct company/date range?
Omission	did it miss an experience item or degree?
Hallucination	did it invent a company, skill, date, or degree?
Layout robustness	two-column resumes, sidebars, scanned PDFs
Long-document handling	multi-page resumes, truncation, repeated headers
Privacy handling	PII, consent, local processing, data retention
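For list-valued fields such as companies or skills, omissions and hallucinations can be counted with plain set arithmetic against a hand-checked gold list. This is only a sketch: score_items is a hypothetical helper, matching is exact after lowercasing and whitespace cleanup, and "hallucinated" here simply means "predicted but absent from the gold list".

```python
from typing import Dict, List


def score_items(predicted: List[str], gold: List[str]) -> Dict[str, object]:
    """Compare predicted items to gold items after light normalization."""
    def norm(item: str) -> str:
        return " ".join(item.lower().split())

    pred_set = {norm(p) for p in predicted}
    gold_set = {norm(g) for g in gold}
    matched = pred_set & gold_set
    return {
        "precision": len(matched) / len(pred_set) if pred_set else 1.0,
        "recall": len(matched) / len(gold_set) if gold_set else 1.0,
        "omitted": sorted(gold_set - pred_set),       # in gold, missing from output
        "hallucinated": sorted(pred_set - gold_set),  # in output, not in gold
    }


report = score_items(
    predicted=["Google", "microsoft", "Acme"],
    gold=["Microsoft", "Google", "Amazon"],
)
print(report)
```

Normalized and semantic matches (dates, job titles) need field-specific comparison on top of this, and array alignment needs the items compared as whole records rather than as independent strings.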

For structured extraction evaluation ideas, you can also look at:

These are not resume-specific, but they are useful for thinking about PDF-to-JSON and document parsing evaluation.

How I would keep searching

I would not rely only on the resume-parsing tag. Some newer models are weakly tagged or not tagged consistently.

Search across:

Hugging Face Models

Hugging Face Spaces

Search for resume parser demos, but inspect the implementation:

app.py
requirements.txt
model calls
OCR/PDF parsing method
whether it handles PDF, DOCX, image, or only raw text
whether there is any evaluation

Spaces are useful for implementation patterns, but I would not treat a demo as evidence of model quality.

Blogs / Papers / Posts

Also watch document parsing and OCR releases, not just resume-specific models.

Useful entry points:

The reason is that the hardest part may be converting the resume PDF/image into a faithful text/layout representation before the actual field extraction step.

My practical recommendation

If I were trying to solve this now, I would do this:

If LayoutLMv3 is mandatory:
- stop looking only for ordinary resume text datasets;
- build a small LayoutLMv3-style gold dataset with images, OCR words, boxes, and labels;
- keep train/inference OCR identical;
- possibly use a text extractor/NER model to create weak labels, then manually correct them.
If any model/tool is acceptable:
- start with SmartResume;
- test NuExtract3 with a resume JSON schema;
- use PaddleOCR-VL, Docling, or olmOCR if PDF/image ingestion is the bottleneck;
- compare against simpler text-to-JSON or NER routes like qwen3-0.6b-resume-json, NuExtract-tiny Resume, or oksomu/resume-ner.
In both cases:
- create a small hand-checked evaluation set from your actual target resumes;
- test two-column resumes, scanned resumes, multi-page resumes, and unusual layouts;
- evaluate omissions and hallucinations, not only JSON validity.

So my short answer would be:

I did not find a perfect public LayoutLMv3-ready resume dataset.
But if the goal is resume parsing rather than specifically LayoutLMv3 training, the ecosystem is much better now: SmartResume, NuExtract3, PaddleOCR-VL/PaddleOCR, Qwen3 resume JSON, NuExtract-tiny Resume, and resume NER/section-routing models are all worth checking.
