🧬 High-Accuracy BioBERT NER Model for Disease Detection (F1: 89.04%) – NCBI Fine-Tuned

:rocket: BioBERT-based Disease NER Model Fine-Tuned on NCBI Dataset (89.04% F1, Apache 2.0)

Hi everyone! :waving_hand:

I’m excited to share a BioBERT-based Named Entity Recognition (NER) model fine-tuned on the NCBI Disease Corpus. It extracts disease mentions from biomedical and clinical text and reaches an F1 score of 89.04% (full metrics below).


:pushpin: Model Overview

  • :link: Model on Hugging Face
  • :medical_symbol: Trained on: NCBI Disease Dataset (793 PubMed abstracts, 6,892 disease mentions)
  • :bullseye: Task: Token classification (BIO tagging: B-Disease, I-Disease, O)
  • :magnifying_glass_tilted_left: Use Case: Biomedical text mining, clinical NLP, disease tagging in healthcare records

:bar_chart: Performance

  • F1-Score: 89.04%
  • Precision: 86.80%
  • Recall: 91.39%
  • Accuracy: 98.64%
  • Trained over 5 epochs using BioBERT (dmis-lab/biobert-base-cased-v1.1)
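
As a quick sanity check, the reported F1 can be recomputed from the precision and recall above, since F1 is their harmonic mean. This is only a minimal sketch of the arithmetic; it does not reproduce the evaluation itself, and how the scores were originally computed is not stated here.

```python
# Reported metrics from this post, written as fractions.
precision = 0.8680
recall = 0.9139

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.2%}")  # F1 = 89.04%
```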

:hammer_and_wrench: How to Use

from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
nlp = pipeline(
    "ner",
    model="Ishan0612/biobert-ner-disease-ncbi",
    tokenizer="Ishan0612/biobert-ner-disease-ncbi",
    aggregation_strategy="simple"  # merge word pieces back into words
)

text = "The patient has signs of diabetes mellitus and chronic obstructive pulmonary disease."
results = nlp(text)

# Each result is a dict; print the text and its predicted label.
for entity in results:
    print(f"{entity['word']} - ({entity['entity_group']})")

Sample output:
the - (LABEL_0)
patient - (LABEL_0)
has - (LABEL_0)
signs - (LABEL_0)
of - (LABEL_0)
diabetes - (LABEL_1)
mellitus - (LABEL_2)
and - (LABEL_0)
chronic - (LABEL_1)
obstructive - (LABEL_2)
pulmonary - (LABEL_2)
disease - (LABEL_2)
. - (LABEL_0)
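
The per-token labels above can be merged into whole disease mentions with a few lines of post-processing. The sketch below assumes the label mapping given in the notes of this post (LABEL_1 = B-Disease, LABEL_2 = I-Disease) and takes (word, label) pairs shaped like the sample output. `merge_disease_spans` is a hypothetical helper name, not part of the model or of transformers.

```python
# Label mapping as described in the notes of this post.
ID2TAG = {"LABEL_0": "O", "LABEL_1": "B-Disease", "LABEL_2": "I-Disease"}


def merge_disease_spans(tokens):
    """Merge (word, label) pairs into disease mention strings using BIO rules."""
    spans, current = [], []
    for word, label in tokens:
        tag = ID2TAG.get(label, "O")
        if tag == "B-Disease":
            # A new mention starts; close any mention that is still open.
            if current:
                spans.append(" ".join(current))
            current = [word]
        elif tag == "I-Disease" and current:
            # Continue the open mention.
            current.append(word)
        else:
            # Outside tag (or a stray I-Disease with no open mention,
            # which this sketch ignores): close any open mention.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Tokens and labels copied from the sample output above.
tokens = [
    ("the", "LABEL_0"), ("patient", "LABEL_0"), ("has", "LABEL_0"),
    ("signs", "LABEL_0"), ("of", "LABEL_0"),
    ("diabetes", "LABEL_1"), ("mellitus", "LABEL_2"),
    ("and", "LABEL_0"),
    ("chronic", "LABEL_1"), ("obstructive", "LABEL_2"),
    ("pulmonary", "LABEL_2"), ("disease", "LABEL_2"),
    (".", "LABEL_0"),
]

print(merge_disease_spans(tokens))
# ['diabetes mellitus', 'chronic obstructive pulmonary disease']
```

With the pipeline from the usage example, the input pairs could be built as `[(e["word"], e["entity_group"]) for e in results]`, assuming the results have the same shape as the sample output.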


:light_bulb: Notes

  • LABEL_0 = Outside (O)
  • LABEL_1 = Beginning of Disease (B-Disease)
  • LABEL_2 = Inside Disease (I-Disease)
  • :warning: Recommended input length: under about 450 tokens, which leaves headroom below BioBERT’s 512-token limit. For longer documents, split the text into chunks manually before running the pipeline.
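
One way to do the manual chunking is to pack whole sentences greedily into chunks that stay under the token budget. The sketch below is one possible approach, not something shipped with the model: `chunk_text` is a hypothetical helper, the sentence splitter is a naive regex, and the default `count_tokens` counts whitespace-separated words, which underestimates the WordPiece token count.

```python
import re


def chunk_text(text, max_tokens=450, count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens is kept as its own
    oversized chunk; this sketch does not split inside sentences.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "A b c. D e f. G h i."
print(chunk_text(demo, max_tokens=6))
# ['A b c. D e f.', 'G h i.']
```

For a count closer to what the model actually sees, the pipeline's own tokenizer can be passed in, for example `chunk_text(doc, count_tokens=lambda s: len(nlp.tokenizer.tokenize(s)))`, and each chunk is then run through `nlp` separately. Any character offsets in the results are relative to the chunk, not to the full document.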

:package: License

Licensed under the Apache 2.0 License, consistent with the original BioBERT base model.


:folded_hands: Feedback & Collaboration

Would love your thoughts or suggestions!

:white_check_mark: Open to collaboration if you’re interested in:

  • Extending this to drug, gene, or chemical entity recognition
  • Building a spaCy wrapper or LangChain RAG component
  • Benchmarking vs. SciSpacy, Stanza, or other biomedical NER models

Thanks for reading! :man_health_worker::speech_balloon:
