BioBERT-based Disease NER Model Fine-Tuned on NCBI Dataset (89.04% F1, Apache 2.0)
Hi everyone!
I’m excited to share a BioBERT-based Named Entity Recognition (NER) model fine-tuned on the NCBI Disease Corpus. It is designed to extract disease mentions from biomedical and clinical text and reaches an F1-score of 89.04%.
Model Overview
Model on Hugging Face
Trained on: NCBI Disease Dataset (793 PubMed abstracts, 6,892 disease mentions)
Task: Token classification (BIO tagging: B-Disease, I-Disease, O)
Use Case: Biomedical text mining, clinical NLP, disease tagging in healthcare records
Performance
- F1-Score: 89.04%
- Precision: 86.80%
- Recall: 91.39%
- Accuracy: 98.64%
- Trained over 5 epochs using BioBERT (dmis-lab/biobert-base-cased-v1.1)
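As a quick consistency check on the figures above: F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. This is only a small sketch; the helper name `f1_score` is illustrative, and it says nothing about whether the scores were measured at the entity or token level.

```python
# Reported precision and recall from this post, expressed as fractions.
precision = 0.8680
recall = 0.9139


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2%}")  # F1 = 89.04%
```

The much higher accuracy figure is not surprising for a tagging task like this, since most tokens in an abstract are outside any disease mention.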
How to Use
from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
nlp = pipeline(
    "ner",
    model="Ishan0612/biobert-ner-disease-ncbi",
    tokenizer="Ishan0612/biobert-ner-disease-ncbi",
    aggregation_strategy="simple",
)

text = "The patient has signs of diabetes mellitus and chronic obstructive pulmonary disease."
results = nlp(text)

# Print each predicted span together with its label.
for entity in results:
    print(f"{entity['word']} - ({entity['entity_group']})")
Sample Output
the - (LABEL_0)
patient - (LABEL_0)
has - (LABEL_0)
signs - (LABEL_0)
of - (LABEL_0)
diabetes - (LABEL_1)
mellitus - (LABEL_2)
and - (LABEL_0)
chronic - (LABEL_1)
obstructive - (LABEL_2)
pulmonary - (LABEL_2)
disease - (LABEL_2)
. - (LABEL_0)
Notes
- LABEL_0 = Outside (O)
- LABEL_1 = Beginning of Disease (B-Disease)
- LABEL_2 = Inside Disease (I-Disease)
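If you want whole disease mentions rather than per-token labels, the mapping above can be applied in a small post-processing step. The following is a minimal sketch, not part of the model itself: `LABEL_TO_BIO` and `merge_disease_spans` are hypothetical helper names, and it assumes you already have `(word, label)` pairs in text order, like the sample output shown earlier.

```python
# Mapping from the model's generic label names to BIO tags (from the notes above).
LABEL_TO_BIO = {"LABEL_0": "O", "LABEL_1": "B-Disease", "LABEL_2": "I-Disease"}


def merge_disease_spans(tokens):
    """Merge (word, label) pairs into a list of disease mention strings."""
    mentions = []
    current = []
    for word, label in tokens:
        tag = LABEL_TO_BIO.get(label, "O")
        if tag == "B-Disease":
            # A B- tag always starts a new mention; close any open one first.
            if current:
                mentions.append(" ".join(current))
            current = [word]
        elif tag == "I-Disease":
            # Continue the open mention (a stray I- tag starts a new one).
            current.append(word)
        else:
            # An O tag closes any open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# The (word, label) pairs from the sample output in this post.
sample = [
    ("the", "LABEL_0"), ("patient", "LABEL_0"), ("has", "LABEL_0"),
    ("signs", "LABEL_0"), ("of", "LABEL_0"), ("diabetes", "LABEL_1"),
    ("mellitus", "LABEL_2"), ("and", "LABEL_0"), ("chronic", "LABEL_1"),
    ("obstructive", "LABEL_2"), ("pulmonary", "LABEL_2"),
    ("disease", "LABEL_2"), (".", "LABEL_0"),
]
print(merge_disease_spans(sample))
# ['diabetes mellitus', 'chronic obstructive pulmonary disease']
```

Joining words with a single space is a simplification; if you need exact character offsets, the `start` and `end` fields returned by the pipeline are a better basis.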
Recommended input length: under 450 tokens, which leaves headroom below BioBERT’s 512-token limit. For longer documents, split the text into chunks manually.
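One possible way to do the manual chunking is to pack whole sentences greedily until the token budget is reached. The sketch below is an assumption about how you might approach it, not something shipped with the model: `chunk_text` and `whitespace_count` are hypothetical helpers, the sentence splitter is a naive regex, and the whitespace counter is only a stand-in. With the real model you would more likely pass something like `lambda s: len(tokenizer.tokenize(s))`, using the model's own tokenizer.

```python
import re


def chunk_text(text, count_tokens, max_tokens=450):
    """Greedily pack whole sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens is kept as its own
    (oversized) chunk rather than being cut mid-sentence.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def whitespace_count(s):
    """Stand-in token counter: number of whitespace-separated words."""
    return len(s.split())


# Tiny demonstration with an artificially small budget of 6 "tokens".
demo = chunk_text("A b c. D e f. G h i.", whitespace_count, max_tokens=6)
print(demo)  # ['A b c. D e f.', 'G h i.']
```

Each chunk can then be passed to the pipeline separately. Splitting on sentence boundaries keeps a disease mention from being cut in half, which a fixed-width character split could do.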
License
Licensed under the Apache 2.0 License, consistent with the original BioBERT base model.
Feedback & Collaboration
Would love your thoughts or suggestions!
Open to collaboration if you’re interested in:
- Extending this to drug, gene, or chemical entity recognition
- Building a spaCy wrapper or LangChain RAG component
- Benchmarking vs. SciSpacy, Stanza, or other biomedical NER models
Thanks for reading!
