What is the best way to compute document similarity?

neo-benjamin · June 21, 2022, 7:26pm

What is the best way to compute document similarity?

I was thinking of using SentenceTransformers to measure document similarity.
https://www.sbert.net/

Is this the best way?

Also, is there a model I could use to apply contrastive learning to document similarity?

NimaBoscarino · June 21, 2022, 8:20pm

Yup, SentenceTransformers can definitely be used for measuring document similarity. Depending on the size of your documents, you might want to choose a model that was tuned for dot-product similarity. (E.g. from the MSMARCO docs: “Models with normalized embeddings will prefer the retrieval of shorter passages, while models tuned for dot-product will prefer the retrieval of longer passages.”) You might also have to split large passages into chunks, otherwise the models will truncate anything past their maximum sequence length.
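
To make the chunking idea concrete, here is a minimal sketch, not a definitive recipe. It splits each document into overlapping word windows, embeds every chunk, averages the chunk embeddings, and compares the averages with cosine similarity. The helper names (`chunk_words`, `mean_vector`, `document_similarity`, `sbert_encoder`), the 200-word window, and the mean pooling are my own assumptions for illustration. Word count is only a rough proxy for the model's token limit, so check `max_seq_length` on whichever model you load.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Encoder = Callable[[List[str]], List[Vector]]


def chunk_words(text: str, max_words: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping word windows of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Average a non-empty list of equal-length vectors, component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def document_similarity(doc_a: str, doc_b: str, encode: Encoder,
                        max_words: int = 200, overlap: int = 50) -> float:
    """Embed each document chunk by chunk, mean-pool, and compare."""
    chunks_a = chunk_words(doc_a, max_words, overlap)
    chunks_b = chunk_words(doc_b, max_words, overlap)
    if not chunks_a or not chunks_b:
        raise ValueError("both documents must contain at least one word")
    return cosine(mean_vector(encode(chunks_a)), mean_vector(encode(chunks_b)))


def sbert_encoder(model_name: str = "sentence-transformers/all-MiniLM-L6-v2") -> Encoder:
    """Wrap a SentenceTransformers model as an Encoder (downloads the model)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts).tolist()
```

With the library installed, usage would look like `document_similarity(text_a, text_b, sbert_encoder())`. Mean pooling is just the simplest choice; comparing chunks pairwise and taking the best match is another option if your documents cover several topics.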

For contrastive learning, I think you could use sentence-transformers/all-MiniLM-L6-v2 with ContrastiveLoss.
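
A rough sketch of what that could look like, assuming you can sort your documents into groups of "similar" items. The grouping scheme and the `build_pairs` / `fine_tune` helpers are hypothetical names of mine, and the hyperparameters are placeholders. The training part uses the classic `model.fit` API with `InputExample` and `losses.ContrastiveLoss`; newer releases of the library also offer a trainer-based API, so check the docs for your version.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]  # (text_a, text_b, label), label 1 = similar, 0 = dissimilar


def build_pairs(groups: Dict[str, List[str]]) -> List[Pair]:
    """Label same-group pairs 1 and cross-group pairs 0 (hypothetical scheme)."""
    items = [(group, doc) for group, docs in groups.items() for doc in docs]
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (group_a, doc_a), (group_b, doc_b) = items[i], items[j]
            pairs.append((doc_a, doc_b, 1 if group_a == group_b else 0))
    return pairs


def fine_tune(pairs: List[Pair],
              model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
              epochs: int = 1, batch_size: int = 16):
    """Fine-tune the model on labeled pairs with ContrastiveLoss."""
    # Imported lazily so build_pairs works without the heavy dependencies.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses

    model = SentenceTransformer(model_name)
    examples = [InputExample(texts=[a, b], label=label) for a, b, label in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.ContrastiveLoss(model=model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=10)
    return model
```

Note that `build_pairs` generates every possible pair, which grows quadratically with the number of documents, so for anything beyond a small dataset you would probably want to sample pairs instead. Long documents would still need chunking before they go into the pairs.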

We’re actually looking at ways to improve the user experience with SentenceTransformers + Hugging Face, so feel free to post here or message me directly if you have any questions or feedback.
