Multi-input tags and multi-label output for token classification using a pretrained BERT model

Hi, if I have a Hugging Face dataset where tokens are tagged with POS and possibly other annotations, so that each row has (token, pos, lemma, fine_pos, wordnet.word), and I want to produce several labels for each token t, such as t1, t2, is there any tutorial or example of how to do it?
I apologize if this question has been answered before, but I’ve been searching for an answer using transformers and pretrained BERT and have not managed to find one.
Thank you.

Hey!

It looks like you’re trying to label tokens in a Hugging Face dataset, such as tagging each token with multiple labels like t1, t2, alongside POS tags. You can achieve this using Hugging Face’s transformers library with a pre-trained model like BERT. Here’s an approach to guide you:

  1. Dataset Preparation:

    • Make sure your dataset is in a format where each token is associated with its label (e.g., POS tag, other labels).
    • You can use Hugging Face’s datasets library to load and manipulate your dataset. If you’re working with token-level labels, your dataset might look like this:
      {
          'tokens': ['I', 'am', 'happy'],
          'labels': ['PRON', 'VERB', 'ADJ']
      }
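For a dataset with several tag columns per token, as in the question, each column needs its own mapping from tag strings to integer IDs before training. The following is a minimal sketch under stated assumptions: the example row, the column names `pos` and `fine_pos`, and the helpers `build_label_maps` and `encode_row` are all hypothetical and only illustrate the idea.

```python
# Hypothetical example row mirroring two of the columns named in the question.
example = {
    "tokens": ["I", "am", "happy"],
    "pos": ["PRON", "VERB", "ADJ"],
    "fine_pos": ["PRP", "VBP", "JJ"],
}

def build_label_maps(rows, label_columns):
    """Return {column: {tag: id}} with ids assigned in sorted tag order."""
    maps = {}
    for column in label_columns:
        tags = sorted({tag for row in rows for tag in row[column]})
        maps[column] = {tag: i for i, tag in enumerate(tags)}
    return maps

def encode_row(row, label_maps):
    """Replace each tag string with its integer id, column by column."""
    encoded = {"tokens": row["tokens"]}
    for column, tag2id in label_maps.items():
        encoded[column] = [tag2id[tag] for tag in row[column]]
    return encoded

label_maps = build_label_maps([example], ["pos", "fine_pos"])
encoded = encode_row(example, label_maps)
print(label_maps["pos"])
print(encoded["pos"], encoded["fine_pos"])
```

With a real dataset, the same per-row function could probably be applied through the `map` method of the datasets library, one row at a time.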
      
  2. Tokenization:

    • Use a fast tokenizer such as BertTokenizerFast from the transformers library to split the text into subword tokens and match each one with the label of the word it came from. The word_ids() method used below is only available on fast tokenizers.
    • Keep in mind that tokenization can split words into subwords, so you need to handle this by ensuring that each subword receives the correct label. Special tokens such as [CLS] and [SEP] get the label -100, which the loss function ignores.
    • Here’s an example of tokenizing and aligning labels:
      from transformers import BertTokenizerFast
      
      # word_ids() requires a fast tokenizer
      tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
      
      # The model expects integer class IDs, not tag strings
      label2id = {'PRON': 0, 'VERB': 1, 'ADJ': 2}
      
      def align_labels_with_tokens(tokens, labels):
          encoding = tokenizer(tokens, truncation=True, padding=True, is_split_into_words=True)
          word_ids = encoding.word_ids()  # Maps each subword to its word index (None for special tokens)
      
          # Special tokens get -100; every subword inherits the label ID of its word
          aligned_labels = [-100 if word_id is None else label2id[labels[word_id]] for word_id in word_ids]
          return encoding, aligned_labels
      
      tokens = ['I', 'am', 'happy']
      labels = ['PRON', 'VERB', 'ADJ']
      
      encoding, aligned_labels = align_labels_with_tokens(tokens, labels)
      print(aligned_labels)
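When each token carries several tag columns, the same alignment can be applied to every column at once. Here is a rough sketch, assuming the labels are already integer IDs and that `word_ids` is a list like the one a fast tokenizer's `encoding.word_ids()` returns. The function name `align_label_columns`, the `first_subword_only` switch, and the hand-written `word_ids` list are illustrative assumptions, not part of any library.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_label_columns(word_ids, label_columns, first_subword_only=False):
    """Expand word-level label ids to subword level for every column.

    word_ids: one entry per subword, the index of its word or None for
              special tokens (the shape encoding.word_ids() produces).
    label_columns: {column_name: [label id per word]}.
    """
    aligned = {name: [] for name in label_columns}
    previous = None
    for word_id in word_ids:
        for name, labels in label_columns.items():
            if word_id is None:
                # Special tokens such as [CLS] and [SEP] are never scored.
                aligned[name].append(IGNORE_INDEX)
            elif first_subword_only and word_id == previous:
                # Optionally score only the first piece of each word.
                aligned[name].append(IGNORE_INDEX)
            else:
                aligned[name].append(labels[word_id])
        previous = word_id
    return aligned

# Hand-written word_ids for illustration:
# [CLS], word 0, word 1, two pieces of word 2, [SEP]
word_ids = [None, 0, 1, 2, 2, None]
columns = {"pos": [1, 2, 0], "fine_pos": [1, 2, 0]}

aligned = align_label_columns(word_ids, columns)
print(aligned["pos"])  # [-100, 1, 2, 0, 0, -100]
```

The default mirrors the example above, where every subword inherits the label of its word. Setting `first_subword_only=True` is a common alternative that keeps long words from counting several times in the loss.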
      
  3. Using Pretrained BERT:

    • You can use a pre-trained BERT model for token classification (e.g., for named entity recognition or POS tagging) and fine-tune it with your labeled dataset.
    • Example for token classification:
      from datasets import Dataset
      from transformers import BertForTokenClassification, Trainer, TrainingArguments
      
      # num_labels must match the number of distinct tags (3 in the toy example)
      model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)
      
      # Prepare your dataset for the Trainer: one row per sentence, so each
      # field is wrapped in a list. Labels must be integer class IDs, with
      # -100 at positions the loss should ignore.
      train_dataset = Dataset.from_dict({
          'input_ids': [encoding['input_ids']],
          'attention_mask': [encoding['attention_mask']],
          'labels': [aligned_labels]
      })
      
      training_args = TrainingArguments(
          output_dir='./results',
          num_train_epochs=3,
          per_device_train_batch_size=8,
          logging_dir='./logs',
      )
      
      trainer = Trainer(
          model=model,
          args=training_args,
          train_dataset=train_dataset,
      )
      
      trainer.train()
      
  4. Multi-Labeling:

    • If you want to produce multiple labels for each token (e.g., t1, t2), align one label sequence per tag set and extend the model so that it predicts all of them for each token, for example with one classification head per tag set.
    • If the labels are not mutually exclusive, you could use a multi-label classification approach: apply a sigmoid to each label's score instead of a softmax across labels, and train with a binary cross-entropy loss. This would require modifying the loss function and model architecture slightly.
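To make the sigmoid variant concrete, here is a small framework-free sketch of its three pieces: multi-hot targets, a binary cross-entropy loss computed from raw scores (logits), and threshold-based decoding. The label inventory `LABELS`, the made-up logits, and all function names are assumptions for illustration. In a real model the logits would come from a classification layer on top of BERT's per-token hidden states.

```python
import math

LABELS = ["t1", "t2", "t3"]  # hypothetical label inventory

def to_multi_hot(token_label_sets):
    """Turn one set of labels per token into one 0/1 vector per token."""
    return [[1.0 if label in labels else 0.0 for label in LABELS]
            for labels in token_label_sets]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (token, label) pairs."""
    total, count = 0.0, 0
    for token_logits, token_targets in zip(logits, targets):
        for z, y in zip(token_logits, token_targets):
            # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))]
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

def decode(logits, threshold=0.5):
    """Keep every label whose sigmoid probability reaches the threshold."""
    return [[LABELS[i] for i, z in enumerate(token_logits) if sigmoid(z) >= threshold]
            for token_logits in logits]

# Two tokens: the first carries t1 and t2, the second only t3.
targets = to_multi_hot([{"t1", "t2"}, {"t3"}])
logits = [[2.0, 1.5, -3.0], [-2.0, -1.0, 4.0]]  # made-up model scores

print(targets)         # [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(decode(logits))  # [['t1', 't2'], ['t3']]
print(bce_with_logits(logits, targets))
```

In PyTorch, `torch.nn.BCEWithLogitsLoss` computes this kind of loss directly from logits. Padding and special-token positions would still need to be masked out by hand, since the -100 convention only applies to the cross-entropy loss.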
  5. Resources:

    • Hugging Face provides a Token Classification tutorial which is a good starting point.
    • Look into the datasets and evaluate libraries for managing your labeled dataset and computing metrics on it.

With this approach, you can fine-tune a pre-trained BERT model on your specific task so that it outputs multiple labels per token. If you’re still having trouble, let me know, and I can help clarify further!