Domain adaptation for embeddings - fine tuning on MLM

dreamless-hurler · July 9, 2024, 6:13pm

I would like to build better search functionality for a domain-specific language (DSL).

For this, I’m trying to fine-tune an encoder on the masked language modeling (MLM) objective, as described in "Fine-tuning a masked language model" (Hugging Face NLP Course). This is similar to the question "Domain adaptation of Language Model and Tokenizer", except that I’m leaving the tokenizer as is.

I’ve tried two base models, all-MiniLM-L6-v2 and roberta-base, and fine-tuned each of them on about 30k samples from the DSL with 15% masking.
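
For readers unfamiliar with what "15% masking" means in practice, here is a minimal pure-Python sketch of BERT-style token masking. It assumes the usual 80/10/10 split (mask token / random token / unchanged) and the label value -100 for positions the loss ignores; the thread does not state which split was used, so treat those details as assumptions. The names `mask_tokens`, `mask_id` and `vocab_size` are illustrative, and special tokens are not excluded here as a real collator would do.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence under BERT-style masking."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # assumed 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # assumed 10%: random token
        # assumed remaining 10%: leave the input token unchanged
    return inputs, labels
```

In the Hugging Face stack this step is normally handled by `DataCollatorForLanguageModeling` with `mlm_probability=0.15`, so the sketch is only meant to show what the objective trains on: token-level reconstruction, not sentence-level similarity.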

To evaluate the results, I take a set of semantically equivalent or similar pairs, encode one member of each pair, and check how highly its partner ranks in a line-up of candidates. This is similar to the InfoNCE loss, but with rankings instead of probabilities.
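
A rough sketch of this kind of ranking evaluation, assuming the embeddings are already computed and that `queries[i]` is supposed to retrieve `candidates[i]`. Cosine similarity, mean reciprocal rank and recall@k are assumptions chosen for illustration; the post does not say which similarity or ranking metric was actually used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieval_ranks(queries, candidates):
    """Return the 1-based rank of each query's true partner in the line-up."""
    ranks = []
    for i, q in enumerate(queries):
        scores = [cosine(q, c) for c in candidates]
        # rank = 1 + number of candidates scoring strictly higher than the partner
        ranks.append(1 + sum(s > scores[i] for s in scores))
    return ranks

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    return sum(r <= k for r in ranks) / len(ranks)
```

Running the same line-up through the base model and through each MLM checkpoint makes the comparison described here explicit: a lower MLM loss paired with a lower MRR is exactly the pattern being reported.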

With both models I’ve tried, I find that as the MLM loss decreases (on eval, not just train), the actual metrics I care about get worse. In other words, the unmodified roberta-base is better at retrieving similar DSL pairs than the same model after it has actually seen the DSL.

I realize I should also train on an InfoNCE style loss at some point (e.g. Losses — Sentence Transformers documentation), but there’s not enough data for that at the moment.
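
For reference, an InfoNCE-style loss with in-batch negatives can be sketched as below. It takes a precomputed similarity matrix where `sim[i][j]` is the similarity between anchor i and candidate j, with the matched pairs on the diagonal. The function name and the temperature of 0.05 are illustrative assumptions, not values taken from this thread or from a specific library.

```python
import math

def info_nce(sim, temperature=0.05):
    """Mean cross-entropy of picking the diagonal entry in each row of sim."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # -log softmax probability of the true pair
    return total / len(sim)
```

Unlike MLM, this objective directly rewards placing a pair's two embeddings closer together than any other item in the batch, which is the same property the ranking evaluation measures.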

Shouldn’t there already be some improvement from training on MLM?

Would appreciate any thoughts/pointers/references!

dreamless-hurler · July 12, 2024, 10:55am

Found an answer here: MLM — Sentence Transformers documentation

Note: Only running MLM will not yield good sentence embeddings. But you can first tune your favorite transformer model with MLM on your domain specific data. Then you can fine-tune the model with the labeled data you have or using other data sets like NLI, Paraphrases, or STS.
