Multi-input tag and multi-label output for token classification using Bert pretrained model

nanapc17 · January 9, 2025, 9:03am

Hi, if I have a Hugging Face dataset where tokens are tagged with POS and possibly other annotations, so that I have (token, pos, lemma, fine_pos, wordnet.word), and I want to produce for each token t labels like t1, t2, is there any tutorial or example of how to do it?
I apologize if this question has been answered before, but I’ve been searching for an answer using transformers and pretrained BERT and have not managed to find one.
Thank you.

Alanturner2 · January 9, 2025, 9:26am

Hey!

It looks like you’re trying to label tokens in a Hugging Face dataset, such as tagging each token with multiple labels like t1, t2, alongside POS tags. You can achieve this using Hugging Face’s transformers library with a pre-trained model like BERT. Here’s an approach to guide you:

Dataset Preparation:
- Make sure your dataset is in a format where each token is associated with its label (e.g., POS tag, other labels).
- You can use Hugging Face’s datasets library to load and manipulate your dataset. If you’re working with token-level labels, your dataset might look like this:
```
{
    'tokens': ['I', 'am', 'happy'],
    'labels': ['PRON', 'VERB', 'ADJ']
}
```
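For the multi-label case from the question (each token carrying several labels such as t1, t2), one possible layout is a list of labels per token, converted to a multi-hot vector before training. This is only a minimal sketch in plain Python: the label names `t1`, `t2`, `t3` and the helper `to_multi_hot` are made up for illustration and are not part of any library.

```python
# Hypothetical label inventory; replace it with the labels in your own dataset.
LABEL_NAMES = ["t1", "t2", "t3"]
LABEL2ID = {name: i for i, name in enumerate(LABEL_NAMES)}


def to_multi_hot(token_labels, label2id=LABEL2ID):
    """Turn per-token label lists into per-token 0/1 vectors."""
    vectors = []
    for labels in token_labels:
        vec = [0.0] * len(label2id)
        for name in labels:
            vec[label2id[name]] = 1.0
        vectors.append(vec)
    return vectors


example = {
    "tokens": ["I", "am", "happy"],
    # A token may carry zero, one, or several labels.
    "labels": [["t1"], ["t1", "t2"], []],
}

multi_hot = to_multi_hot(example["labels"])
print(multi_hot)
```

Each inner vector has one slot per label, so a token with two labels simply has two slots set to 1.0.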

Tokenization:

Use a fast tokenizer such as BertTokenizerFast from the transformers library to split the text into tokens and match each token with its corresponding label. The word_ids() method used in the example is only available on fast tokenizers.
Keep in mind that tokenization can split words into subwords, so you need to handle this by ensuring that each subword receives the correct label.

Here’s an example of tokenizing and aligning labels:

```python
from transformers import BertTokenizerFast

# word_ids() requires a fast tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# The model expects integer label ids, so map the tag strings to ids
label2id = {'PRON': 0, 'VERB': 1, 'ADJ': 2}

def align_labels_with_tokens(tokens, labels):
    encoding = tokenizer(tokens, truncation=True, padding=True, is_split_into_words=True)
    word_ids = encoding.word_ids()  # Maps each subword to its source word (None for special tokens)

    # Special tokens get -100 so the loss ignores them; every subword inherits its word's label id
    aligned_labels = [-100 if word_id is None else label2id[labels[word_id]] for word_id in word_ids]
    return encoding, aligned_labels

tokens = ['I', 'am', 'happy']
labels = ['PRON', 'VERB', 'ADJ']

encoding, aligned_labels = align_labels_with_tokens(tokens, labels)
print(aligned_labels)
```
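The example above gives every subword the label of its word. A common alternative is to label only the first subword of each word and ignore the rest. The sketch below shows both strategies in plain Python; the function name `align_label_ids` and the hand-written `word_ids` list (standing in for what a tokenizer's `word_ids()` might return) are assumptions for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_label_ids(word_ids, label_ids, label_all_subwords=False):
    """Expand word-level label ids to subword level.

    word_ids: one entry per subword, None for special tokens.
    label_ids: one integer label id per original word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] are never scored.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous or label_all_subwords:
            # First subword of a word, or every subword if requested.
            aligned.append(label_ids[word_id])
        else:
            # Continuation subword, ignored in the loss.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hand-written example: [CLS], word 0, word 1, word 2 split into two pieces, [SEP]
word_ids = [None, 0, 1, 2, 2, None]
label_ids = [0, 1, 2]

first_only = align_label_ids(word_ids, label_ids)
all_pieces = align_label_ids(word_ids, label_ids, label_all_subwords=True)
print(first_only)
print(all_pieces)
```

Labeling only the first subword keeps one prediction per original word, which can make evaluation at word level simpler.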

Using Pretrained BERT:

You can use a pre-trained BERT model for token classification (e.g., for named entity recognition or POS tagging) and fine-tune it with your labeled dataset.

Example for token classification:

```python
from datasets import Dataset
from transformers import BertForTokenClassification, Trainer, TrainingArguments

# num_labels must match the size of your label set
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Prepare your dataset for the Trainer: each column holds one entry per sentence,
# so the single example from the alignment step is wrapped in a list
train_dataset = Dataset.from_dict({
    'input_ids': [encoding['input_ids']],
    'attention_mask': [encoding['attention_mask']],
    'labels': [aligned_labels]
})

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()
```
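With more than one sentence, the examples in a batch usually have different lengths, so the inputs and the labels have to be padded together. The transformers library ships DataCollatorForTokenClassification for this; the plain-Python sketch below only illustrates the idea. The helper `pad_batch` and the token ids 7, 8, 9 are made up for the example (101, 102 and 0 are the [CLS], [SEP] and [PAD] ids of bert-base-uncased).

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad a list of tokenized sentences to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions are masked out of attention ...
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # ... and out of the loss, via the ignore index.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [101, 7, 8, 102], "attention_mask": [1, 1, 1, 1], "labels": [-100, 0, 1, -100]},
    {"input_ids": [101, 9, 102], "attention_mask": [1, 1, 1], "labels": [-100, 2, -100]},
]

batch = pad_batch(features)
print(batch["labels"])
```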

Multi-Labeling:
- If you want to produce multiple labels for each token (e.g., t1, t2), store a list of labels per token instead of a single label, and extend the model so that it predicts a score for every label at every token.
- You could use a multi-label classification approach: apply a sigmoid activation to each label's logit instead of a softmax over all labels, and train with a binary cross-entropy loss (e.g., torch.nn.BCEWithLogitsLoss) on multi-hot label vectors. That loss has no ignore index, so special tokens and padding need to be masked out of the loss explicitly.
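
As a rough illustration of the sigmoid approach, here is a minimal PyTorch sketch of a multi-label token classification head. The class name `MultiLabelTokenHead`, the `label_mask` argument, and the random tensors standing in for BERT's hidden states are assumptions for the example, not part of the transformers API.

```python
import torch
from torch import nn


class MultiLabelTokenHead(nn.Module):
    """Linear layer over per-token hidden states with a masked sigmoid/BCE loss."""

    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)
        # reduction="none" keeps per-element losses so they can be masked.
        self.loss_fn = nn.BCEWithLogitsLoss(reduction="none")

    def forward(self, hidden_states, labels=None, label_mask=None):
        # hidden_states: (batch, seq_len, hidden_size)
        logits = self.classifier(hidden_states)  # (batch, seq_len, num_labels)
        if labels is None:
            return logits, None
        per_label = self.loss_fn(logits, labels.float())
        per_token = per_label.mean(dim=-1)  # (batch, seq_len)
        if label_mask is None:
            label_mask = torch.ones_like(per_token)
        label_mask = label_mask.float()
        # Average only over positions that should be scored.
        loss = (per_token * label_mask).sum() / label_mask.sum().clamp(min=1.0)
        return logits, loss


torch.manual_seed(0)
head = MultiLabelTokenHead(hidden_size=16, num_labels=3)
hidden = torch.randn(2, 5, 16)            # stand-in for the encoder output
labels = torch.randint(0, 2, (2, 5, 3))   # multi-hot targets
mask = torch.tensor([[0, 1, 1, 1, 0],     # 1 = real token, 0 = special or padding
                     [0, 1, 1, 0, 0]])

logits, loss = head(hidden, labels, mask)
# A label is predicted wherever its sigmoid score exceeds the threshold.
predictions = (torch.sigmoid(logits) > 0.5).long()
print(loss.item(), predictions.shape)
```

In a real model, `hidden` would be the last hidden state of a BERT encoder (for example from BertModel), and the 0.5 threshold is a starting point that may need tuning per label.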

Resources:
- Hugging Face provides a Token Classification tutorial, which is a good starting point.
- Look into the datasets and evaluate libraries for managing and evaluating your labeled dataset.

With this approach, you can fine-tune a pre-trained BERT model on your specific task so that it outputs multiple labels per token. If you’re still having trouble, let me know, and I can help clarify further!
