🧬 High-Accuracy BioBERT NER Model for Disease Detection (F1: 89.04%) – NCBI Fine-Tuned

Ishan0612 · July 24, 2025, 12:10pm

BioBERT-based Disease NER Model Fine-Tuned on NCBI Dataset (89.04% F1, Apache 2.0)

Hi everyone!

I’m excited to share a BioBERT-based Named Entity Recognition (NER) model fine-tuned on the NCBI Disease Corpus. It extracts disease mentions from biomedical and clinical text and reaches an F1 score of 89.04%.

Model Overview

Model on Hugging Face
Trained on: NCBI Disease Dataset (793 PubMed abstracts, 6,892 disease mentions)
Task: Token classification (BIO tagging: B-Disease, I-Disease, O)
Use Case: Biomedical text mining, clinical NLP, disease tagging in healthcare records

Performance

F1-Score: 89.04%
Precision: 86.80%
Recall: 91.39%
Accuracy: 98.64%
Trained over 5 epochs using BioBERT (dmis-lab/biobert-base-cased-v1.1)
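
As a quick sanity check, the F1 figure above is consistent with the reported precision and recall, since F1 is their harmonic mean. A minimal sketch using only the rounded numbers from this post (the variable names are just for illustration):

```python
# Reported metrics from this post, as fractions.
precision = 0.8680
recall = 0.9139

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.2%}")  # prints "F1 = 89.04%"
```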

How to Use

from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
nlp = pipeline(
    "ner",
    model="Ishan0612/biobert-ner-disease-ncbi",
    tokenizer="Ishan0612/biobert-ner-disease-ncbi",
    aggregation_strategy="simple",
)

text = "The patient has signs of diabetes mellitus and chronic obstructive pulmonary disease."
results = nlp(text)

# Print each predicted entity together with its label
for entity in results:
    print(f"{entity['word']} - ({entity['entity_group']})")

Sample output:
the - (LABEL_0)
patient - (LABEL_0)
has - (LABEL_0)
signs - (LABEL_0)
of - (LABEL_0)
diabetes - (LABEL_1)
mellitus - (LABEL_2)
and - (LABEL_0)
chronic - (LABEL_1)
obstructive - (LABEL_2)
pulmonary - (LABEL_2)
disease - (LABEL_2)
. - (LABEL_0)
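
If you want whole disease mentions rather than per-word labels, the output above can be merged with a few lines of plain Python. This is only a sketch: it assumes the (word, label) pairs look like the sample output and that the labels mean what the Notes section says (LABEL_1 begins a mention, LABEL_2 continues it). The names `ID2TAG` and `merge_disease_spans` are made up for this example and are not part of the model or of transformers.

```python
# Label meanings as listed in the Notes section of this post.
ID2TAG = {"LABEL_0": "O", "LABEL_1": "B-Disease", "LABEL_2": "I-Disease"}


def merge_disease_spans(tagged_words):
    """Merge (word, label) pairs into disease mention strings using BIO rules."""
    spans, current = [], []
    for word, label in tagged_words:
        tag = ID2TAG.get(label, "O")
        if tag == "B-Disease":
            # A new mention starts; close any mention that is still open.
            if current:
                spans.append(" ".join(current))
            current = [word]
        elif tag == "I-Disease":
            # Continue the open mention (a stray I- tag starts a new one).
            current.append(word)
        else:
            # Outside tag: close the open mention, if any.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# The (word, label) pairs from the sample output above.
sample = [
    ("the", "LABEL_0"), ("patient", "LABEL_0"), ("has", "LABEL_0"),
    ("signs", "LABEL_0"), ("of", "LABEL_0"),
    ("diabetes", "LABEL_1"), ("mellitus", "LABEL_2"),
    ("and", "LABEL_0"),
    ("chronic", "LABEL_1"), ("obstructive", "LABEL_2"),
    ("pulmonary", "LABEL_2"), ("disease", "LABEL_2"),
    (".", "LABEL_0"),
]

print(merge_disease_spans(sample))
# ['diabetes mellitus', 'chronic obstructive pulmonary disease']
```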

Notes

LABEL_0 = Outside (O)
LABEL_1 = Beginning of Disease (B-Disease)
LABEL_2 = Inside Disease (I-Disease)
Recommended token limit: < 450 tokens to stay under BioBERT’s 512-token limit. For long documents, split into chunks manually.
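
For the manual chunking mentioned above, one possible approach is to pack whole sentences greedily until the token budget is reached. The sketch below is an assumption about how you might do it, not something shipped with the model: `chunk_sentences` and `count_tokens` are hypothetical names, and the example counts whitespace-separated words only so that it runs without downloading anything. In practice you would pass a function that counts subword tokens with the model's tokenizer, for example `lambda s: len(tokenizer.tokenize(s))`.

```python
def chunk_sentences(sentences, count_tokens, max_tokens=450):
    """Greedily pack sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens still ends up in its own
    chunk, so it would need truncation or further splitting afterwards.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy example: word counts stand in for subword token counts.
sentences = [
    "Patient has diabetes mellitus.",
    "No signs of asthma.",
    "COPD was ruled out.",
]
chunks = chunk_sentences(sentences, lambda s: len(s.split()), max_tokens=8)
for chunk in chunks:
    print(chunk)
```

Each chunk can then be passed to the pipeline separately. Splitting on sentence boundaries keeps a disease mention from being cut in half, which a fixed-width character split could do.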

License

Licensed under the Apache 2.0 License, consistent with the original BioBERT base model.

Feedback & Collaboration

Would love your thoughts or suggestions!

Open to collaboration if you’re interested in:

Extending this to drug, gene, or chemical entity recognition
Building a spaCy wrapper or LangChain RAG component
Benchmarking vs. SciSpacy, Stanza, or other biomedical NER models

Thanks for reading!
