---
pipeline_tag: sentence-similarity
language: fry
license: mit
tags:
- trimmed
library_name: sentence-transformers
base_model: intfloat/multilingual-e5-base
base_model_relation: quantized
datasets:
- lbourdois/fineweb-2-trimming
---

# multilingual-e5-base-fry-16384

This model is a **64.53% smaller** version of [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base), optimized for the **Frisian** language by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.

With a vocabulary of only 16,384 tokens, this trimmed model should perform similarly to the original model on Frisian text while having a much smaller memory footprint. However, it may not perform well on other languages, because tokens not commonly used in the selected language were removed from the vocabulary.

## Model Statistics

| Metric | Original | Trimmed | Reduction |
|--------|----------|---------|-----------|
| **Vocabulary size** | 250,037 tokens | 16,384 tokens | **93.44%** |
| **Model size** | 278,043,648 params | 98,625,024 params | **64.53%** |

![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/me5-base-16384.png)

## Mining Dataset Statistics

- **Number of texts used for mining**: 200,000 texts
- **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)

## Usage

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("alphaedge-ai/multilingual-e5-base-fry-16384")

# Run inference with queries and documents
query = "My query in Frisian"
documents = [
    "Chunk in Frisian",
    "Chunk in Frisian",
    "Chunk in Frisian",
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)

# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```

## Citations

#### Multilingual E5

```
@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}
```

#### Trimming blog post

```
@misc{hf_blogpost_trimming,
  title={Introduction to Trimming},
  author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
  year={2026},
  url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
}
```
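
As a quick sanity check on the Model Statistics table, the reduction percentages can be recomputed from the raw counts. The snippet below is a minimal sketch: the constants are copied from the table, and the helper name `reduction_pct` is hypothetical, introduced only for illustration. The last digit may differ slightly from the table depending on whether values are rounded or truncated.

```python
# Counts copied from the Model Statistics table of this model card.
ORIGINAL_PARAMS = 278_043_648
TRIMMED_PARAMS = 98_625_024
ORIGINAL_VOCAB = 250_037
TRIMMED_VOCAB = 16_384


def reduction_pct(original: int, trimmed: int) -> float:
    """Relative size reduction in percent (hypothetical helper for illustration)."""
    return 100.0 * (original - trimmed) / original


param_reduction = reduction_pct(ORIGINAL_PARAMS, TRIMMED_PARAMS)
vocab_reduction = reduction_pct(ORIGINAL_VOCAB, TRIMMED_VOCAB)

print(f"Parameter reduction:  {param_reduction:.2f}%")
print(f"Vocabulary reduction: {vocab_reduction:.2f}%")
```

The vocabulary shrinks far more than the parameter count does because trimming only removes rows from the token embedding matrix; the transformer layers are left untouched.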