Introduction to Trimming ✂

Community Article Published May 28, 2026

Foreword

Trimming being particularly interesting for multilingualism, this work was an opportunity for collaboration between various Hugging Face Fellows to evaluate this approach on languages other than English. Namely Loïck BOURDOIS (French), Tom AARSEN (English/Dutch), Bram VANROY (Dutch), Christopher AKIKI (Arabic/German), Woojun JUNG (Korean), Manuel ROMERO (Spanish) and Prithiv SAKTHI (Tamil).

We would like to thank the company AlphaEdge having allowed Loïck to carry out this work during professional rather than personal time.

We would like to point out that the length of this blog article is greatly overestimated due to numerous tables of results, references, or examples of texts showing the outputs obtained with the trimmed models (as well as potentially their translation into English when these examples concern another language).

Finally, please note that if you speak French, you can read a French version of this blog post here.

Introduction

In this blog post, we introduce the trimming technique. This simple method, requiring no retraining and running on a simple CPU, produces a lighter model than the original one while maintaining its performance. In the Practice 👨‍💻 section where we conducted experiments, we list the key points in 🧠 Key Takeaways boxes. A summary of all the advantages of this approach can also be found in the conclusion. To support our discussion, we reveal 5,526 models resulting from the application of this technique.



Theory 👨‍🏫

Trimming can be seen as a subset of pruning. Indeed, as with the latter, the goal of the technique is to modify/remove model weights to ultimately make it lighter (whether in terms of number of parameters or memory size). However, the specificity of this method is that we focus exclusively on the parts of the architecture related to vocabulary, whereas in pruning we generally modify the weights/layers of the rest of the architecture (i.e., the backbone). Thus, for trimming, we remove tokens from the model's original vocabulary (and consequently must also update the tokenizer) but also modify the final embedding layer managing the probability distribution of the model's vocabulary (likewise for the input layer if embeddings are tied).


What is trimming used for?

Let us consider a model with a given vocabulary size. This size may not be relevant for two main reasons:

  1. If our model is multilingual; Indeed, not all languages are necessarily of interest to us. It is then possible to remove those that are unnecessary for our use case.
  2. If the vocabulary size is not a multiple of 8 or 64. Indeed, to optimize GPU usage, these multiples are preferred (see notably here and here). This simple recommendation makes it easy to speed up model training (by 25% according to Karpathy's observations). Note that since 2023-2024, this has become common practice and models generally use a multiple of 8 or 64 for vocabulary by default. However, for older models, it may be worth modifying the size (we will see below in the Practice 👨‍💻 section that 7 of the 16 models tested have this problem).

This reduction in vocabulary size (to languages of interest and to a multiple of 8 or 64) allows reducing the model size (both in terms of number of parameters and ultimately memory size). Let us illustrate this point through a simple example: the GPT2-small by RADFORD, WU et al. (2019). This model makes it easy to understand case 2) just described, and is also an example that trimming works on monolingual models as well (whereas in the Practice 👨‍💻 section most examples focus on multilingual models).

Let us then calculate the number of parameters of GPT2-small:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2")

for name, param in model.named_parameters():
    print(f"{name:50s} {str(param.shape):25s} {param.numel():>12,}")
Full output:
# Output:
# transformer.wte.weight                             torch.Size([50257, 768])    38,597,376
# transformer.wpe.weight                             torch.Size([1024, 768])        786,432
# transformer.h.0.ln_1.weight                        torch.Size([768])                  768
# transformer.h.0.ln_1.bias                          torch.Size([768])                  768
# transformer.h.0.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.0.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.0.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.0.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.0.ln_2.weight                        torch.Size([768])                  768
# transformer.h.0.ln_2.bias                          torch.Size([768])                  768
# transformer.h.0.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.0.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.0.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.0.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.1.ln_1.weight                        torch.Size([768])                  768
# transformer.h.1.ln_1.bias                          torch.Size([768])                  768
# transformer.h.1.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.1.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.1.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.1.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.1.ln_2.weight                        torch.Size([768])                  768
# transformer.h.1.ln_2.bias                          torch.Size([768])                  768
# transformer.h.1.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.1.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.1.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.1.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.2.ln_1.weight                        torch.Size([768])                  768
# transformer.h.2.ln_1.bias                          torch.Size([768])                  768
# transformer.h.2.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.2.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.2.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.2.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.2.ln_2.weight                        torch.Size([768])                  768
# transformer.h.2.ln_2.bias                          torch.Size([768])                  768
# transformer.h.2.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.2.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.2.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.2.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.3.ln_1.weight                        torch.Size([768])                  768
# transformer.h.3.ln_1.bias                          torch.Size([768])                  768
# transformer.h.3.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.3.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.3.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.3.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.3.ln_2.weight                        torch.Size([768])                  768
# transformer.h.3.ln_2.bias                          torch.Size([768])                  768
# transformer.h.3.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.3.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.3.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.3.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.4.ln_1.weight                        torch.Size([768])                  768
# transformer.h.4.ln_1.bias                          torch.Size([768])                  768
# transformer.h.4.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.4.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.4.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.4.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.4.ln_2.weight                        torch.Size([768])                  768
# transformer.h.4.ln_2.bias                          torch.Size([768])                  768
# transformer.h.4.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.4.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.4.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.4.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.5.ln_1.weight                        torch.Size([768])                  768
# transformer.h.5.ln_1.bias                          torch.Size([768])                  768
# transformer.h.5.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.5.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.5.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.5.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.5.ln_2.weight                        torch.Size([768])                  768
# transformer.h.5.ln_2.bias                          torch.Size([768])                  768
# transformer.h.5.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.5.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.5.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.5.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.6.ln_1.weight                        torch.Size([768])                  768
# transformer.h.6.ln_1.bias                          torch.Size([768])                  768
# transformer.h.6.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.6.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.6.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.6.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.6.ln_2.weight                        torch.Size([768])                  768
# transformer.h.6.ln_2.bias                          torch.Size([768])                  768
# transformer.h.6.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.6.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.6.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.6.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.7.ln_1.weight                        torch.Size([768])                  768
# transformer.h.7.ln_1.bias                          torch.Size([768])                  768
# transformer.h.7.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.7.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.7.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.7.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.7.ln_2.weight                        torch.Size([768])                  768
# transformer.h.7.ln_2.bias                          torch.Size([768])                  768
# transformer.h.7.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.7.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.7.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.7.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.8.ln_1.weight                        torch.Size([768])                  768
# transformer.h.8.ln_1.bias                          torch.Size([768])                  768
# transformer.h.8.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.8.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.8.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.8.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.8.ln_2.weight                        torch.Size([768])                  768
# transformer.h.8.ln_2.bias                          torch.Size([768])                  768
# transformer.h.8.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.8.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.8.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.8.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.9.ln_1.weight                        torch.Size([768])                  768
# transformer.h.9.ln_1.bias                          torch.Size([768])                  768
# transformer.h.9.attn.c_attn.weight                 torch.Size([768, 2304])      1,769,472
# transformer.h.9.attn.c_attn.bias                   torch.Size([2304])               2,304
# transformer.h.9.attn.c_proj.weight                 torch.Size([768, 768])         589,824
# transformer.h.9.attn.c_proj.bias                   torch.Size([768])                  768
# transformer.h.9.ln_2.weight                        torch.Size([768])                  768
# transformer.h.9.ln_2.bias                          torch.Size([768])                  768
# transformer.h.9.mlp.c_fc.weight                    torch.Size([768, 3072])      2,359,296
# transformer.h.9.mlp.c_fc.bias                      torch.Size([3072])               3,072
# transformer.h.9.mlp.c_proj.weight                  torch.Size([3072, 768])      2,359,296
# transformer.h.9.mlp.c_proj.bias                    torch.Size([768])                  768
# transformer.h.10.ln_1.weight                       torch.Size([768])                  768
# transformer.h.10.ln_1.bias                         torch.Size([768])                  768
# transformer.h.10.attn.c_attn.weight                torch.Size([768, 2304])      1,769,472
# transformer.h.10.attn.c_attn.bias                  torch.Size([2304])               2,304
# transformer.h.10.attn.c_proj.weight                torch.Size([768, 768])         589,824
# transformer.h.10.attn.c_proj.bias                  torch.Size([768])                  768
# transformer.h.10.ln_2.weight                       torch.Size([768])                  768
# transformer.h.10.ln_2.bias                         torch.Size([768])                  768
# transformer.h.10.mlp.c_fc.weight                   torch.Size([768, 3072])      2,359,296
# transformer.h.10.mlp.c_fc.bias                     torch.Size([3072])               3,072
# transformer.h.10.mlp.c_proj.weight                 torch.Size([3072, 768])      2,359,296
# transformer.h.10.mlp.c_proj.bias                   torch.Size([768])                  768
# transformer.h.11.ln_1.weight                       torch.Size([768])                  768
# transformer.h.11.ln_1.bias                         torch.Size([768])                  768
# transformer.h.11.attn.c_attn.weight                torch.Size([768, 2304])      1,769,472
# transformer.h.11.attn.c_attn.bias                  torch.Size([2304])               2,304
# transformer.h.11.attn.c_proj.weight                torch.Size([768, 768])         589,824
# transformer.h.11.attn.c_proj.bias                  torch.Size([768])                  768
# transformer.h.11.ln_2.weight                       torch.Size([768])                  768
# transformer.h.11.ln_2.bias                         torch.Size([768])                  768
# transformer.h.11.mlp.c_fc.weight                   torch.Size([768, 3072])      2,359,296
# transformer.h.11.mlp.c_fc.bias                     torch.Size([3072])               3,072
# transformer.h.11.mlp.c_proj.weight                 torch.Size([3072, 768])      2,359,296
# transformer.h.11.mlp.c_proj.bias                   torch.Size([768])                  768
# transformer.ln_f.weight                            torch.Size([768])                  768
# transformer.ln_f.bias                              torch.Size([768])                  768
# ─────────────────────────────────────────────────────────────────────────────
# Unique total: 124,439,808

Output grouped by component:

# ── Embeddings ──────────────────────────────
#    wte.weight                         38,597,376
#  Sub-total: 39,383,808

# ── Positional encoding ─────────────────────
#    wpe.weight                            786,432
#  Sub-total: 786,432

# ── Transformer blocks (×12) ────────────────
#  Per block:
#    ln_1.weight                               768
#    ln_1.bias                                 768
#    attn.c_attn.weight                  1,769,472
#    attn.c_attn.bias                        2,304
#    attn.c_proj.weight                    589,824
#    attn.c_proj.bias                          768
#    ln_2.weight                               768
#    ln_2.bias                                 768
#    mlp.c_fc.weight                     2,359,296
#    mlp.c_fc.bias                           3,072
#    mlp.c_proj.weight                   2,359,296
#    mlp.c_proj.bias                           768
#  Sub-total 1 block  :  7,087,872
#  Sub-total 12 blocks: 85,054,464

# ── Final LayerNorm ──────────────────────────
#    ln_f.weight                               768
#    ln_f.bias                                 768
#  Sub-total: 1,536

# ══════════════════════════════════════════════════
# Unique total: 124,439,808

Graphical view:

Number of parameters of the original GPT2 vs. a trimmed GPT2
Figure 1: Number of parameters of the original GPT2 vs. a trimmed GPT2

The 85,056,000 parameters (the 12 transformer blocks + the final LayerNorm) are not modified by trimming (pruning would be needed to reduce this part), as are the 786,432 parameters of the positional encoding (wpe.weight). However, with this technique, we can modify the 38,597,376 parameters of the embedding layer (wte.weight). In the detailed output, we can see that this layer has size [50257, 768], i.e., a vocabulary of 50,257 tokens where each token is represented on 768 dimensions.

📝 Note
This information is also available on the Hugging Face Hub without downloading the model. You can consult the "Files info" widget which is available for all Hub models whose weights have been pushed in safetensors format. For example for GPT2, we can consult https://huggingface.co/openai-community/gpt2?show_file_info=model.safetensors. By hovering over the wpe.weight line you will even see the information indicating that this layer represents 28.17% of the total model size.

We can see that the vocabulary size is not a multiple of 64! Applying trimming to this model could be worthwhile. Let us take a theoretical example. If we reduce the vocabulary size from 50,257 to 32,768 (512 × 64), the 38,597,376 parameters are then reduced to 32,768 × 768 = 25,165,824. That is 13,431,552 fewer parameters. The total number of parameters then goes from 124,439,808 to 111,008,256. This is a reduction of 10.79%. We will see in the next section that for some models, this reduction can be much more significant.


A question we can then ask is: what is the impact of this parameter reduction on the model's performance? To answer this, we analyzed 16 models covering different architectures and modalities: text encoders, text encoder-decoders, text decoders, text embedding models, visual embedding model, text/visual encoder-decoders (VLM).

Practice 👨‍💻

How to apply trimming in practice?

Several authors have taken an interest in the problem in recent years, and among those who have shared their code or even proposed a dedicated library on the subject, we can list:

  • smaller-transformers by Geotrend for trimming mBERT, the multilingual version of BERT by DEVLIN et al. (2018). A publication is available here by ABDAOUI et al. (2021), as well as a GitHub repository and some bilingual/n-lingual models on their Hugging Face account. But their approach is limited to this model and vocabularies are of different sizes depending on the language considered, and never a multiple of 64.
  • David DALE in a Medium article (2021) shows how to trim an mT5 to keep only English and Russian. A manual approach for full control, but again, this is limited to a particular model.
  • Aditya SRIVASTAVA developed the hf-trim library (2022) where he claims to support mT5 and mBART but there are several practical limitations. The biggest is that it is not possible to control the desired final vocabulary size. It should be possible to fix this by modifying the source code.
  • lm-vocab-trimmer (2023) by USHIO, ZHOU and CAMACHO-COLLADOS handles mT5, mBART and XLM-RoBERTa. This library is probably the most advanced on the subject but has some weaknesses, such as the fact that the target_vocab_size argument does not give the expected result (if you want a multiple of 64). It is not possible to reduce to n languages (a bilingual model, for example). Finally, all supported models rely on a sentencepiece tokenizer by KUDO and RICHARDSON (2018).
  • Antoine LOUIS via this Space (2024), proposes to trim already fine-tuned embedding models (using a sentencepiece tokenizer) such as mE5, BGE or GTE. It is not possible to select the desired number of tokens in the final vocabulary, and it only supports 6 languages.

All of these tools focus exclusively on models based on a sentencepiece tokenizer, encoders or encoder-decoders, and make it very difficult or simply impossible to manage the desired vocabulary size. Thus, within the scope of this introduction to trimming, we wanted to handle models based on other tokenizers (notably the BPE by SENNRICH, HADDOW and BIRCH (2015)), other architectures, other modalities, and to be able to choose the size of the new vocabulary. In what follows, we test all models on a diversity of languages in order to evaluate the robustness of the method via independent evaluators whose language does not necessarily use the Latin alphabet (Korean, Tamil, Arabic among others).


Text Embedding Models

Let us observe the impact of trimming on the performance of different types of models, starting with text embedding models.

Selected Models and Benchmarks

In this section, we focus in particular on:

Regarding benchmarks, we evaluate the models on English and Dutch via:

  • MTEB by MUENNIGHOFF et al. (2022)
  • MTEB-NL by BANAR, LOTFI et al. (2025). We chose this language because with this benchmark, these authors also introduced the e5-NL models that were built using the trimming technique.

Observed Parameter Reductions

Table 1: Observed parameter reductions for embeddings models
Model Vocabulary size (original) Vocabulary size (trimmed) % Vocabulary reduction Parameters (original) Parameters (trimmed) % Parameter reduction
granite-embedding-107m-multilingual 250,002 16,384 - 93.45% 107.0M 17.3M - 83.84%
granite-embedding-107m-multilingual 250,002 32,768 - 86.89% 107.0M 23.6M - 77.96%
granite-embedding-278m-multilingual 250,002 16,384 - 93.45% 278.0M 98.6M - 64.53%
granite-embedding-278m-multilingual 250,002 32,768 - 86.89% 278.0M 111.2M - 60.00%
multilingual-e5-small 250,037 16,384 - 93.45% 117.7M 27.9M - 76.26%
multilingual-e5-small 250,037 32,768 - 86.89% 117.7M 34.2M - 70.91%
multilingual-e5-base 250,037 16,384 - 93.45% 278.0M 98.6M - 64.53%
multilingual-e5-base 250,037 32,768 - 86.89% 278.0M 111.2M - 60.00%
multilingual-e5-large 250,037 16,384 - 93.45% 559.9M 320.7M - 42.7%
multilingual-e5-large 250,037 32,768 - 86.89% 559.9M 337.4M - 39.7%
embeddinggemma-300m 262,144 16,384 - 93.75% 307.6M 118.8M - 61.36%
embeddinggemma-300m 262,144 32,768 - 87.50% 307.6M 131.4M - 57.27%
bge-m3 250,002 16,384 - 93.45% 567.8M 328.5M - 42.14%
bge-m3 250,002 32,768 - 86.89% 567.8M 345.3M - 39.18%
Qwen3-Embedding-0.6B 151,669 16,384 - 89.20% 595.8M 457.2M - 23.25%
Qwen3-Embedding-0.6B 151,669 32,768 - 78.40% 595.8M 474.0M - 20.44%

📝 Note
The granite-embedding, mE5, BGE-M3 and Qwen3-Embedding models having a vocabulary size that is not a multiple of 64, trimming seems particularly well-suited even in cases where we do not specifically wish to create a monolingual/n-lingual model.

In the table above, and throughout the article, we provide figures for vocabulary sizes of 16,384 and 32,768 tokens. We revisit this choice in the Open Questions❓ section of the article, and more generally the challenge of how many tokens to retain in the trimmed model's vocabulary.

Results

English
Results on MTEB for embedding models in English
Table 2: Results on MTEB for embedding models in English

We can observe that overall trimming maintains the performance of the original model while producing smaller models. In 5 out of 8 cases, the 32,768-token model is even slightly better than the original. Our hypothesis is that removing noisy tokens benefits the model.
One issue is worth noting however: the performance degradation of the embeddinggemma. We initially thought we had made an error in our code, but ruled out this possibility when we looked at gemma-3 (see the sections on decoders and VLMs), which uses the same tokenizer and where we observe no performance drop (it is in fact probably one of the models most robust to trimming of this blogpost).
The explanation we put forward - and we kept this model to illustrate this phenomenon - is that trimming only works if the embedding layer is the last layer of the network. Whereas this embedding layer is indeed the last (excluding pooling) for all other models presented, for the embeddinggemma there are two extra dense layers positioned above it.

Embeddinggemma with 2 dense layers on top
Figure 2: Embeddinggemma original and trimmed (32,68) with 2 dense layers on top

For this same reason, we only present bi-encoder models here. Indeed, all our experiments on rerankers models failed due to the classification layer positioned above the embedding layer in these types of models. With the embeddinggemma, we wanted to illustrate the only limitation we identified in the trimming method.

📝 Note
We nonetheless chose to offer monolingual versions of embeddinggemma for two reasons:

  1. For English, the 32,768-token version remains better than Qwen3-Embedding (original or trimmed) even with this degradation.
  2. The degradation of 2.5 to 5.5 points (depending on the configuration chosen) is probably the most extreme due to the model being primarily trained on English. For other tested languages, the degradation is reduced to 1.7 for French (65.13 with 32,768 vs. 66.87 for the original) and to 0.6 points for Dutch (see the next paragraph).
Dutch

In the table below, models with the -trm suffix correspond to the models trimmed by the authors of e5-NL. They use 50,002 tokens (which is not a multiple of 64). Models with the -trm-nl suffix are the just-mentioned trimmed models on which the authors applied fine-tuning on an additional 950,000 data points (real and synthetic).

Note that the results differ slightly from those available on the leaderboard because we were only able to obtain results for 39 of the 40 datasets making up MTEB-NL. Specifically, it was not possible to evaluate on BelebeleRetrieval due to a language configuration error in the MTEB API. The Ret column below is therefore slightly underestimated compared to the leaderboard. We therefore also re-ran MTEB-NL on the original models so that all models are compared on 39 datasets.

Results on MTEB-NL for embedding models in Dutch
Table 3: Results on MTEB-NL for embedding models in Dutch

As with English, we can observe that trimming matches or even allows a very slight gain (generally less than 0.1 points) in performance when keeping 32,768 tokens.

An even more interesting result is that if more tokens are retained (50,002 by the MTEB-NL authors, though 50,048 might be a more hardware-friendly number), the gain rises to between 0.4 and 0.6 points for the mE5 models. Without any retraining. If we additionally fine-tune the trimmed model, the gain rises to between 2.5 and 3.1 points compared to the original model.

📝 Note
We see that keeping 50,002 tokens seems more interesting than keeping 32,768.
Following these results, we questioned whether or not we should offer models with a vocabulary of 50,048 tokens. We ultimately did not do so. Partly because within the scope of this blog post, our goal is to provide a simple introduction to the trimming method rather than the best possible models to illustrate our points. And partly because the question of what the optimal number of tokens for a tokenizer is remains unclear.
We revisit this last point in a dedicated section in the final section of this blog post on Open Questions❓.

🧠 Key Takeaways
For text embedding models, trimming makes it possible to obtain a lightweight monolingual model from a larger multilingual one.
We can notably observe that performance is maintained or even slightly improved.
If you observe a slight drop, it is advisable to fine-tune the trimmed model to recover or even surpass the original performance.
This performance gain through fine-tuning is quick to obtain because since the trimmed model is smaller than the original, training time is faster. With an equivalent compute budget, this opens the possibility of showing more data to the trimmed model than to an original model being fine-tuned (we run this experiment further below in the encoder-decoder section).
Note that the advantages listed above only apply to models where the embedding layer is the last layer of the network. Trimming will not work if there are layers positioned above the embedding layer.


Collections

We focused on English and Dutch within the scope of this blog post. Nevertheless, for each of the 5 models listed above, we generated all possible monolingual models from the languages supported by each of the original models. For each, we provide them in 16,384-token and 32,768-token versions, as well as in all available sizes. Since some providers do not supply the full list of languages supported by their model, we generated them as best we could from an analysis of the unicode characters found in the original model's tokenizer.

🎁
The collection gathering the 48 models (12 different languages) trimmed from the granite-embedding models is available here.
The collection gathering the 948 models (119 different languages) trimmed from the multilingual-e5 models is available here.
The collection gathering the 243 models (122 different languages) trimmed from embeddinggemma is available here.
The collection gathering the 178 models (89 different languages) trimmed from bge-m3 is available here.
The collection gathering the 166 models (83 different languages) trimmed from qwen3-embedding is available here.


Encoder Models

Let us continue analyzing the impact of trimming on the performance of different types of models, now moving on to encoder models.

Selected Models and Benchmarks

In this section, we focus on the mmBERT by MARONE, WELLER et al. (2025) and more particularly the small and base versions, as well as the 32k version by vLLM Semantic Router Team (2026).

We evaluate all these models on French, and more specifically on the named entity recognition task using the frenchNER_3entities dataset by BOURDOIS (2024), covering Person, Location and Organization entities. The idea here is to focus on a single dataset but to run multiple experiments on it to determine:

  • The impact of trimming order (should it be done before or after fine-tuning?);
  • The impact of the text corpus used for token mining;
  • The impact of the number of texts seen during token mining;
  • A comparison with a French model trained from scratch with the same architecture (i.e., modernBERT by WARNER, CHAFFIN, CLAVIÉ et al. (2024));
  • A comparison of trimming with the distillation method.

We also evaluate this last point on English by comparing a trimmed mmBERT small on English with DistilBERT by SANH et al. (2019) on the GLUE benchmark by WANG et al. (2018). The idea here being, conversely, to focus on a single model but evaluate it on multiple tasks.

Observed Parameter Reductions

Table 4: Observed parameter reductions for encoder models
Model Vocabulary size (original) Vocabulary size (trimmed) % Vocabulary reduction Parameters (original) Parameters (trimmed) % Parameter reduction
mmBERT-small 256,000 16,384 - 93.60% 140.5M 48.5M - 65.49%
mmBERT-small 256,000 32,768 - 87.20% 140.5M 54.8M - 61.01%
mmBERT-base 256,000 16,384 - 93.60% 306.9M 122.9M - 59.96%
mmBERT-base 256,000 32,768 - 87.20% 306.9M 135.5M - 55.86%
mmbert-32k-yarn 256,000 16,384 - 93.60% 306.9M 122.9M - 59.96%
mmbert-32k-yarn 256,000 32,768 - 87.20% 306.9M 135.5M - 55.86%

Results

French

Trimming order:

Trimming order
Table 5: Trimming order

According to our experiments, in the context of mmBERT, it is imperative to trim the original model and then fine-tune the resulting model, rather than fine-tuning the original model and then trimming it. Beyond avoiding a performance collapse, fine-tuning a trimmed model is also faster than fine-tuning an original model because the model is smaller.

📝 Note
We revisit this point about trimming order in a dedicated section in the final section of the article on Open Questions❓.

We can note that, as with text embedding models, keeping 32,768 tokens gives results similar to the original model (a small drop on the Organizations entity to mention), while keeping 16,384 performs slightly worse.

Finally, the YaRN model appears less performant than the untrimmed original model, and than the original model trimmed to 32,768 or 16,384 tokens.


Impact of the token mining text corpus:

Having discovered trimming a few years ago via David DALE's Medium article, we were using, like him, the Leipzig Corpora by GOLDHAHN et al. (2012) for token mining. On the occasion of this blog post, we took an interest in exploring the impact of the mining data on model performance. In this section, we therefore compare Leipzig Corpora with Fineweb 2 by PENEDO et al. (2025) published a few months ago. For Leipzig Corpora we used the first 200,000 entries of the fra_wikipedia_2021 dataset and for Fineweb 2 the first 200,000 entries of the fra_Latn split.

Impact of the token mining text corpus
Table 6: Impact of the token mining text corpus

We can observe similar results between the two corpora, regardless of the number of tokens retained in the vocabulary. For our part, and this applies to all results presented in this article as well as to the 5,526 generated models, we used Fineweb 2. This choice was driven purely by practical considerations. Indeed, Fineweb 2 is available directly on the Hugging Face Hub, grouping all languages and files in a single location that can be queried in an automated fashion. Whereas for Leipzig Corpora, the corpora must be downloaded manually.


Impact of the number of texts during mining:

We have just seen that the source used for token mining did not have a great impact on performance. What about the number of texts used?

Impact of the number of texts during mining
Table 7: Impact of the number of texts during mining

It turns out that the number of texts also does not have a big impact on performance. The number chosen does however have an impact on mining execution time. Indeed, obtaining the trimmed version on 20,000 texts took 2 min 04s on an Intel(R) Core(TM) Ultra 7 255H CPU (2.00 GHz) for the entire process (token mining, modification of vocabulary files and architecture layers, pushing the model to the Hub). The version trimmed on 200,000 texts took 8 min 56s. Finally, the version trimmed on 2,000,000 texts took 2h 26min 21s.

Warning ⚠️
For our part, we chose to mine all models presented in this blog post on 200,000 texts rather than 20,000 (fewer if Fineweb 2 does not offer 200,000 texts for the language in question).
Indeed, we wanted to be safe with respect to two points we were not able to test:

  1. We mined up to 32,768 tokens in the context of this blog post, but would 20,000 texts be sufficient if we wanted to keep, for example, 65,536 tokens?
  2. Although we wanted to test trimming on several languages using different alphabets, we exclusively evaluated resource-rich languages. That is, languages with a sufficient number of tokens in the original model which ultimately impacts the performance of the trimmed model, and which also have evaluation datasets for multiple tasks. We were not able to verify the impact of the number of texts on low-resource languages. An observation that prompted us to write this note is that for mmBERT we were for example able to generate a Chechen model with 16,384 tokens but not with 32,768 due to insufficient tokens (the only language, however, where this phenomenon occurred in our experiments).

📝 Note
The Fineweb 2 dataset being large (540 GB for the 360M French rows for example, and more than 8.67 TB in total), and as we have just seen that the number of rows seen during mining does not influence performance, we created a Fineweb 2 subset for 124 languages available here (only 44 GB). These 124 languages correspond to the intersection of languages offered by mBERT and XLM-RoBERTa.

📝 Note
Here, we focused on generic texts from Fineweb 2. We didn’t have time to conduct the experiment, but the fact that only 20,000 texts are needed for trimming opens the door to mining on specialized data in a given domain (medicine, banking, industry, etc.), which is generally less abundant than general-purpose data. This would allow us to retain only the tokens essential to the domain and ensure they align with the dataset needed to perform domain adaptation through additional pre-training.


Comparison with a model trained from scratch:

For this comparison of a monolingual model trained from scratch vs. a trimmed multilingual model, it seemed important to compare models sharing exactly the same architecture, namely that of modernBERT (rather than a trimmed modernBERT vs. a BERT/RoBERTa/DeBERTa trained from scratch, for example). For French, a from-scratch version of modernBERT exists: ModernCamemBERT by ANTOUN et al. (2025).

Comparison with a model trained from scratch
Table 8: Comparison with a model trained from scratch

We can clearly see that the monolingual model trained from scratch outperforms the original multilingual model as well as its trimmed version. Therefore, when a monolingual version of a model exists in your language, we recommend using it. Otherwise, a trimmed version can be a simple alternative while waiting for a monolingual model trained from scratch. As an added benefit, you'll also enjoy an optimized number of tokens per word (1.42 for moderncamembert vs. 1.58 for the trimmed version).


Comparison with distillation:

To compare a trimmed model with a model obtained by distillation, let us use distilCamemBERT by DELESTRE and AMAR (2022), distilled from CamemBERT base by MARTIN, MULLER, ORTIZ SUÁREZ, et al. (2019). To do so, in addition to the 16,384 and 32,768-token options, we also generated a 67,476-token model to have a trimmed version with the same number of parameters as distilCamemBERT, namely 67.5M.

Comparison with distillation
Table 9: Comparison with distillation

We can note that the distilled model almost matches the original mmBERT model. This aligns with the previous point that a monolingual model trained from scratch is superior to a multilingual model. Another observation is that the model trimmed to 67,476 tokens, having the same number of parameters as the distilled version, performs equivalently to distillation. The fundamental difference is the time required to obtain the two models. For the distilled version, the authors report that the model ran for 18 days on a Titan RTX GPU. On our side, we obtained the trimmed version in 8 min 56s on an Intel(R) Core(TM) Ultra 7 255H CPU (2.00 GHz) for the entire process.

As a bonus, the trimmed model handles a context size 16 times longer.


🧠 Key Takeaways
From these various experiments on mmBERT, we retain that:

  • A model should be trimmed and then fine-tuned, rather than the reverse.
  • A sufficient amount of data is needed for token mining, but it is actually relatively low (20,000 general-purpose texts may suffice for resource-rich languages) and the source of the texts does not have a great impact on performance.
  • A monolingual model trained from scratch will perform better than a trimmed multilingual model, but the latter can be an alternative while waiting for such a model to be developed.
  • Trimming can be more worthwhile than distillation. We recommend preferring it over the latter when the desired parameter reduction is equivalent.

English

Comparison with distillation:

For English, we focus on a single model but make a comparison including multiple tasks (namely 8 tasks from the GLUE benchmark). Let us clarify here that for fine-tuning, training is performed on the train split and testing on the dev split rather than the test split. We do this primarily to be able to compare with DistilBERT, which followed this methodology. We therefore use the figures reported by the authors in their publication.

Table 10: Comparison with distillation (English)
Model # MNLI QQP QNLI SST-2 CoLA STS-B MRPC RTE Average
distilbert-base-uncased 67M 82.2 88.5 89.2 91.3 51.3 85.8 87.5 59.9 79.5
mmBERT-small-eng-32768 54.8M 84.1 88.0 90.1 91.7 50.3 88.0 89.9 69.7 81.5

We can note that except on the QQP and CoLA tasks, the trimmed mmBERT small results are superior to DistilBERT. This is most likely explained by the fact that the untrimmed mmBERT small scores 84.7 on this benchmark.

🎁
For English, the trimmed mmBERT small with 54.8M parameters presents itself as an alternative to DistilBERT while being 18.2% smaller, capable of handling 8,192 tokens instead of 512, and delivering stronger results.

Again, as observed for French, an important advantage of trimming relates to the time required to obtain these smaller versions of the original models (mmBERT small for the trimmed version, and BERT base for DistilBERT). DistilBERT required approximately 90 hours of computation on 8 V100 16GB GPUs, against 9 min 01s on an Intel(R) Core(TM) Ultra 7 255H CPU (2.00 GHz) for the entire process.

🧠 Key Takeaways
Trimming delivers performance comparable to distillation while being vastly less costly, as it runs in a few minutes on a CPU vs. several days on GPU for distillation. For all models presented in this blog post, the execution time to obtain a monolingual model from a multilingual one ranged from 9 to 22 minutes for the entire process. This time difference is explained by the size of the original model used for token mining and by the number of data points on which we mine. For the first case, an mmBERT with 140.5M parameters is for example faster than a Qwen3.5 with 4.539B parameters. In the case of a model family offering multiple sizes, it is therefore advisable to mine tokens on the smallest model of the family and to save the distribution of the most frequent tokens in a cache. This cache can then be used to generate the larger models of this family (which generally share the same tokenizer). In the context of mmBERT, it is thus more efficient to mine tokens on the small size than on the base size. For the second case, we refer you to the experiment above on the impact of the number of texts during mining.

Collections

🎁
A collection gathering 514 models (124 different languages) trimmed from mmBERT is available here.


Encoder-Decoder Models

Let us now turn our attention to encoder-decoder models.

Selected Models and Benchmarks

For encoder-decoder models, we focus on:

  • The mT5 by XUE, CONSTANT, ROBERTS et al. (2021) and more particularly the small and base versions;
  • The mBART by LIU, GU, GOYAL et al. (2020).

We evaluate these models on Korean, and more specifically on the question generation task via the qg_koquad benchmark by USHIO et al. (2022).

We also evaluate them on Arabic and German in the context of a translation task. Indeed, until now we have shown uses of trimming to generate monolingual models from a multilingual model. Our goal is to show that the method can also be applied to generate n-lingual models from a multilingual model. The dataset chosen for translation is the ar-de part of MultiUN by EISELE and CHEN (2010), from which we kept the first 100,000 samples, split into 90% train, 5% eval and 5% test.

Observed Parameter Reductions

Table 11: Observed parameter reductions for encoder-decoder models
Model Vocabulary size (original) Vocabulary size (trimmed) % Vocabulary reduction Parameters (original) Parameters (trimmed) % Parameter reduction
mt5-small 250,112 16,384 - 93.45% 300.2M 60.8M - 79.73%
mt5-small 250,112 32,768 - 86.90% 300.2M 77.6M - 74.14%
mt5-base 250,112 16,384 - 93.45% 582.4M 223.4M - 61.64%
mt5-base 250,112 32,768 - 86.90% 582.4M 248.6M - 57.32%
mbart-large-50 250,054 16,384 - 93.45% 610.9M 371.6M - 39.17%
mbart-large-50 250,054 32,768 - 86.90% 610.9M 388.4M - 36.42%

📝 Note
The mBART having a vocabulary size that is not a multiple of 64, trimming seems particularly well-suited even in cases where we do not specifically wish to create a monolingual/n-lingual model.

📝 Note
We initially planned to also include the large version of mT5 (parameter reduction from 1.230B to 784.5M, i.e. -36.20%) but ultimately decided against it in order to present more recent decoder models instead (see the next section).

Results

Korean
Results on the question generation task via an encoder-decoder
Table 12: Results on the question generation task via an encoder-decoder

For the mT5 small, we can see a slight improvement on the BLEU metric and a slight degradation on the ROUGE-L metric. For the mT5 base, the 32,768-token version gives the best results across all models (original or trimmed) combined.

The mBART performs worse than mT5, as we can observe a degradation in results.


Arabic-German
Results on the translation task via an encoder-decoder
Table 13: Results on the translation task via an encoder-decoder

For the mT5 models, we can observe a slight degradation for the small version (especially with 16,384 tokens) and a match or even a slight improvement for the base version.

For mBART, instead of generating the 16,384-token version, we chose to run a different experiment. Taking advantage of the fact that the trimmed model is smaller than the original and that fine-tuning therefore runs faster, we continued training longer. This allows for a comparison not at an equivalent number of epochs (5 for each model yielding 58.45 vs. 58.77), but at an equivalent execution time (50 min each). We can then observe a gain of 1.76 BLEU points for a similar time budget.

🧠 Key Takeaways
For encoder-decoders, trimming allows achieving results more or less equivalent to the original model. Taking advantage of the fact that trimmed models are smaller to train them longer yields a significant gain over the original model.

Collections

🎁
A collection gathering 392 models (98 different languages) trimmed from mT5 is available here.
A collection gathering 104 models (52 different languages) trimmed from mBART is available here.


Decoder Models

Let us continue our experiments with decoder models.

Selected Models and Benchmarks

We focus here on four decoder families:

  • Granite 4.0 by IBM Research (2025) in its 350m and 1b sizes (in both attention and hybrid versions);
  • Qwen 3 by Qwen Team (2025) in its 0.6B and 1.7B sizes;
  • Gemma 3 by Google DeepMind (2025) in its 270m and 1b sizes;
  • SmolLM3 by Hugging Face (2025).

We evaluate these models on Dutch and Korean. For Dutch, we use the split for that language on the EuroEval leaderboard by NIELSEN et al. (2024). We focus in particular on the generative tasks proposed by this benchmark and set aside classification-type tasks on which encoders are more effective. For Korean, we apply the models to the KoBEST benchmark by KIM et al. (2022) using a 5-shot configuration.

Observed Parameter Reductions

Table 14: Observed parameter reductions for decoder models
Model Vocabulary size (original) Vocabulary size (trimmed) % Vocabulary reduction Parameters (original) Parameters (trimmed) % Parameter reduction
granite-4.0-350m 100,352 16,384 - 83.67% 352.4M 266.4M - 24.40%
granite-4.0-350m 100,352 32,768 - 67.35% 352.4M 283.2M - 19.64%
granite-4.0-h-350m 100,352 16,384 - 83.67% 340.3M 275.8M - 18.95%
granite-4.0-h-350m 100,352 32,768 - 67.35% 340.3M 288.4M - 15.25%
granite-4.0-1b 100,352 16,384 - 83.67% 1.632B 1.460B - 10.54%
granite-4.0-1b 100,352 32,768 - 67.35% 1.632B 1.493B - 8.48%
granite-4.0-h-1b 100,352 16,384 - 83.67% 1.462B 1.333B - 8.82%
granite-4.0-h-1b 100,352 32,768 - 67.35% 1.462B 1.358B - 7.10%
Qwen3-0.6B 151,936 16,384 - 89.22% 751.6M 507.6M - 36.93%
Qwen3-0.6B 151,936 32,768 - 78.43% 751.6M 474.0M - 32.47%
Qwen3-1.7B 151,936 16,384 - 89.22% 2.032B 1.543B - 27.33%
Qwen3-1.7B 151,936 32,768 - 78.43% 2.032B 1.476B - 24.02%
gemma-3-270m-it 262,144 16,384 - 93.75% 268.1M 60.8M - 58.67%
gemma-3-270m-it 262,144 32,768 - 87.50% 268.1M 77.6M - 54.76%
gemma-3-1b-it 262,144 16,384 - 93.75% 999.9M 716.7M - 28.31%
gemma-3-1b-it 262,144 32,768 - 87.50% 999.9M 735.6M - 26.43%
SmolLM3-3B 128,256 16,384 - 87.23% 3.075B 2.845B - 7.45%
SmolLM3-3B 128,256 32,768 - 74.45% 3.075B 2.880B - 6.36%

Results

Dutch

EuroEval providing confidence intervals, in the table below, averages in blue indicate that the intervals of the original/trimmed models overlap. If they do not overlap, the better model is shown in green and the worse one in orange. We also provide the figures from runs performed by the EuroEval teams where available.

Results on the Dutch sample of EuroEval
Table 15: Results on the Dutch sample of EuroEval

We can observe that the trimmed models perform slightly worse than the originals but remain within the confidence interval, with the exception of the Qwen3 models where the trimmed model systematically performs worse. Looking more closely, this appears to be due to the MMLU task where we observe a degradation, unlike the other tasks where the original and trimmed models continue to overlap.

We can also observe that our runs appear significantly better than those reported on the leaderboard. This illustrates the problem of replicating evaluations of generative models.

Note that to verify the Qwen3 results, we did some vibechecking to get a sense of the qualitative capabilities of the models. We took the opportunity to do this for all models. You can expand below the outputs we obtain with the prompt Geef me een korte introductie over grote taalmodellen. (Give me a brief introduction to language models.):

granite-4.0-350m: See the original / trimmed comparison

Original Trimmed
Graag! Hier is een kort introductie over grote taalmodellen:

Grote taalmodellen zijn machine learning modelle met grote hoeveelheden data die ze kunnen leren om taal te begrijpen en te genereren. Ze zijn ontworpen door grote datasets om taalvaardigheid te verbeteren.

Sure! Here is a brief introduction to large language models:

Large language models are machine learning models with large amounts of data that allow them to learn to understand and generate language. They are designed using large datasets to improve language proficiency.

Het is een grote taalmodell, die ontworpen werd door een onderzoeker bij een onderzoeksinstelling. Het is een computerprogramma dat kan begrijpen en schrijven in een taal die door mensen kan worden geïntroduceerd. Het is gebaseerd op de principes van natuurlijke taalverwerking en machine learning. Het is een krachtig hulpmiddel dat kan helpen bij het genereren van textuele inhoud, zoals boeken, artikelen en andere documenten. Het is een belangrijk onderdeel van de toekomst van de technologie en kan een veelzijdigere manier bieden om taal te begrijpen en te genereren.

It is a large language model, designed by a researcher at a research institute. It is a computer program capable of understanding and writing in a language that can be introduced by humans. It is based on the principles of natural language processing and machine learning. It is a powerful tool that can help generate textual content, such as books, articles and other documents. It is an important part of the future of technology and can offer a more versatile way to understand and generate language.
granite-4.0-h-350m: See the original / trimmed comparison
Original Trimmed
Grote taalmodellen, ook bekend als machine learning modellen, zijn een klasse van kunstmatige intelligentie die zich specialiseert in het proces van leren van grote hoeveelheden tekstgegevens. Deze modellen worden ontworpen om patronen, contexten en betekenis in de taal te herkennen en te begrijpen, wat leidt tot verbeterde natuurlijke taalverwerking, zoals tekstgeneratie, tekstanalyse en tekstcompressie. Ze zijn gebruikt in verschillende toepassingen, zoals chatbots, virtuele assistenten, tekstgeneratie, natuurlijke taalverwerking en zelfs in de AI zelf.

Large language models, also known as machine learning models, are a class of artificial intelligence that specializes in learning from large amounts of textual data. These models are designed to recognize and understand patterns, contexts and meaning in language, leading to improved natural language processing, including text generation, text analysis and text compression. They are used in various applications, such as chatbots, virtual assistants, text generation, natural language processing and even AI itself.
Grote taalmodellen zijn een klasse van kunstmatige intelligentie die zich specialiseert in het procesen van menselijke taal. Ze zijn ontworpen om de complexiteiten en nuances van de taal te begrijpen en te genereren, waardoor ze in staat zijn om menselijke taal te verwerken en te interpreteren. Deze modellen worden vaak gebruikt in verschillende toepassingen, zoals chatbots, virtuele assistenten, tekstgeneratie, natuurlijke taalverwerking en meer.

Large language models are a class of artificial intelligence that specializes in processing human language. They are designed to understand and reproduce the complexities and nuances of language, enabling them to process and interpret human language. These models are often used in various applications, such as chatbots, virtual assistants, text generation, natural language processing and more.
granite-4.0-1b: See the original / trimmed comparison

Original Trimmed
Grote taalmodellen, ook wel bekend als GPT (Generative Pre-trained Transformer) modellen, zijn een type van machine learning model dat is ontworpen om menselijke taal te begrijpen en te genereren. Deze modellen worden getraind op enorme hoeveelheden tekstgegevens uit verschillende bronnen, waardoor ze een breed scala aan taalstijlen, onderwerpen en contexten kunnen begrijpen en reproduceren.

De belangrijkste kenmerken van grote taalmodellen zijn:

  1. Grootte: Deze modellen bestaan uit miljoenen van parameters, waardoor ze zeer geavanceerde taalverwerking kunnen uitvoeren. Hun grootte maakt het mogelijk om complexe taalpatronen en contexten te begrijpen en te genereren.

  2. Pre-training: Vooraf getraind op grote hoeveelheden tekstgegevens, kunnen deze modellen een breed scala aan informatie en taalstijlen leren. Dit maakt ze zeer flexibel en aanpasbaar aan verschillende taakken.

  3. Generatie: Met behulp van hun getrainde kennis, kunnen grote taalmodellen tekst genereren op basis van de invoer die ze ontvangen. Dit maakt ze nuttig voor een breed scala aan toepassingen, zoals tekstgeneratie, vertaling, antwoorden op vragen en meer.

  4. Contextbegrip: Deze modellen zijn uitstekend in het begrijpen van context en het genereren van geschikte en relevante antwoorden of tekst op basis van de gegeven context.

Grote taalmodellen hebben een significante invloed gehad op de manier waarop we interageren met technologie en hebben het potentieel om de manier waarop we communiceren en informatie consumeren te revolutioneren. Ze worden gebruikt in verschillende toepassingen, van chatbots en virtuele assistenten tot contentgeneratie en taalvertaling.

Large language models, also known as GPT (Generative Pre-trained Transformer) models, are a type of machine learning model designed to understand and generate human language. These models are trained on enormous amounts of textual data from various sources, allowing them to understand and reproduce a wide range of language styles, topics and contexts.

The main characteristics of large language models are:

  1. Size: These models consist of millions of parameters, allowing them to perform very sophisticated language processing. Their size makes it possible to understand and generate complex language patterns and contexts.

  2. Pre-training: Pre-trained on large amounts of textual data, these models can learn a wide range of information and language styles. This makes them very flexible and adaptable to different tasks.

  3. Generation: Using their trained knowledge, large language models can generate text based on the input they receive. This makes them useful for a wide range of applications, such as text generation, translation, question answering and more.

  4. Context understanding: These models excel at understanding context and generating appropriate and relevant responses or text based on the given context.

Large language models have had a significant impact on the way we interact with technology and have the potential to revolutionize the way we communicate and consume information. They are used in various applications, from chatbots and virtual assistants to content generation and language translation.

Grote taalmodellen zijn een type kunstmatige intelligentie die is ontworpen om menselijke taal te begrijpen en te genereren. Ze zijn ontwikkeld met behulp van machine learning technieken, met name die gebaseerd op diepe neurale netwerken. Deze modellen zijn getraind op enorme hoeveelheden tekstdata, waardoor ze een breed scala aan woorden, grammatica en context kunnen begrijpen en gebruiken. Ze kunnen vragen beantwoorden, verhalen genereren, tekst corrigeren en zelfs helpen bij het schrijven van artikelen of het begrijpen van complexe concepten.

Large language models are a type of artificial intelligence designed to understand and generate human language. They have been developed using machine learning techniques, in particular those based on deep neural networks. These models have been trained on enormous amounts of textual data, allowing them to understand and use a wide range of words, grammar and contexts. They can answer questions, generate stories, correct texts and even help write articles or understand complex concepts.
granite-4.0-h-1b: See the original / trimmed comparison
Original Trimmed
Grote taalmodellen, ook wel bekend als transformatormodellen, zijn een soort kunstmatige intelligentie die is ontwikkeld om menselijke taal te begrijpen, genereren en te manipuleren. Deze modellen zijn gebaseerd op een type neurale netwerk dat in staat is om grote hoeveelheden tekstgegevens te verwerken en patronen te herkennen. Ze zijn in staat om complexe taalbehoudingen te begrijpen, zoals context, betekenis en syntaxis, en kunnen daarmee een breed scala aan taakken uitvoeren, zoals tekstgeneratie, vertaling, antwoordgeven op vragen en taalvertaling. Grote taalmodellen zijn steeds meer geavanceerd en worden steeds vaker gebruikt in verschillende toepassingen, zoals chatbots, virtuele assistenten, contentgeneratie en taalonderwijs.

Large language models, also known as transformer models, are a type of artificial intelligence designed to understand, generate and manipulate human language. These models are based on a type of neural network capable of processing large amounts of textual data and recognizing patterns. They are capable of understanding complex linguistic properties, such as context, meaning and syntax, and can thus perform a wide range of tasks, such as text generation, translation, question answering and language translation. Large language models are increasingly sophisticated and are increasingly used in various applications, such as chatbots, virtual assistants, content generation and language education.
Grote taalmodellen zijn geavanceerde kunstmatige intelligentie systemen die zijn ontworpen om menselijke taal te begrijpen en te genereren. Ze zijn gebaseerd op diepgaande leermodellen, zoals die gebruikt worden in transformator-architecturen, die in staat zijn om patronen in grote hoeveelheden tekstgegevens te leren. Deze modellen kunnen een breed scala aan taakken uitvoeren, zoals tekstgeneratie, vertaling, antwoorden op vragen en antwoordgeneratie. Ze zijn in staat om complexe contexten te begrijpen en passende antwoorden te genereren op basis van de gegeven context. Grote taalmodellen hebben een groot aantal parameters, vaak in de miljoenen tot miljarden bereik, wat hen in staat stelt om een zeer grote hoeveelheid informatie te verwerken en te genereren. Ze zijn steeds meer geavanceerder en worden steeds vaker gebruikt in verschillende toepassingen, zoals chatbots, virtuele assistenten en contentgeneratie.

Large language models are advanced artificial intelligence systems designed to understand and generate human language. They are based on deep learning models, such as those used in transformer architectures, capable of learning patterns from large amounts of textual data. These models can perform a wide range of tasks, such as text generation, translation, question answering and response generation. They are capable of understanding complex contexts and generating appropriate responses based on the given context. Large language models have a large number of parameters, often in the millions to billions range, enabling them to process and generate a very large amount of information. They are increasingly sophisticated and increasingly used in various applications, such as chatbots, virtual assistants and content generation.
Qwen3-0.6B: See the original / trimmed comparison
Original Trimmed
Grote taalmodellen zijn grote, complexe modelle die grote hoeveelheden tekst aanpassen, zoals in de Nederlandse taal. Ze zijn ontwikkeld met de behulp van grote datasetten, zoals millions of woorden, en worden geschikt voor verschillende taalgebruiktes, zoals het antwoorden van vragen of het schrijven van teksten. Ze kunnen ook complexe taalfuncties uitvoeren, zoals het onderluiden van stijl of het ontwerpen van stories.

Large language models are large, complex models that process large amounts of text, such as in the Dutch language. They are developed using large datasets, containing millions of words, and are suited for various language use cases, such as answering questions or writing texts. They can also perform complex language functions, such as adapting style or designing stories.
**Intro over grote taalmodellen** – grote taalmodellen zijn AI-technologieën die kunnen leren te weten te spelen met taal, zowel in tekst als in andere medium. Ze zijn ontwikkeld met grote gegevens en kunnen kwalitatief en kwalificeel onderzoek, samen met het verwerken van tekst, uitvoeren. Deze modellen zijn gebruikt in verschillende sectoren om interactieve en luidere communicatie te versterken, samen met hun capaciteit om complexe taalgebruik te ondersteunen.

**Introduction to large language models** – Large language models are AI technologies that can learn to play with language, both in text and in other media. Developed with large data, they can perform qualitative and quantitative research, along with text processing. These models have been used in various sectors to enhance interactive and richer communication, along with their capacity to support complex language use.
Qwen3-1.7B: See the original / trimmed comparison
Original Trimmed
Grote taalmodellen zijn machine learning-roboten die op basis van veel tekstgegevens worden gemaakt. Ze kunnen vaak honderden van duizenden boeken, artikelen of internetinformatie oefenen. Deze modellen gebruiken een specifieke architectuur, zoals de transformer, om lange teksten en context te begrijpen. Ze worden gebruikt voor taalverwijzing, coderen, en inhoudsgeneratie. Hoewel ze veel potentieel bieden, hebben ze ook problemen met bias, ethieken, en het gebruik van rijke resourcebenodigingen.

Large language models are machine learning systems built from large amounts of textual data. They can often be trained on hundreds of thousands of books, articles or internet information. These models use a specific architecture, such as the transformer, to understand long texts and context. They are used for language reference, coding and content generation. Although they offer great potential, they also have issues with bias, ethics and the use of resource-intensive requirements.
Grote taalmodellen (zoals GPT, BERT, en andere modellen) zijn AI-modellen met een grote hoeveelheid parameters (parameters zijn de hoeveelheid informatie die het model leert). Ze worden geïntraineerd op miljarden van de tekstopzoeken om te leren zinnetjes te genereren, te begrijpen en te converseren. Deze modellen zijn van toepassing op veel NLP-taken, zoals het genereren van tekst, het analyseren van taal, en het beantigen van vragen. Zij zijn een belangrijk stap voorwaarts in de AI-ontwikkeling, omdat ze de mogelijkheid bieden om complexe taalproblemen op te lossen en gebruikers ervaring te geven die duidelijk en relevant is.

Large language models (such as GPT, BERT and other models) are AI models with a large number of parameters (parameters being the amount of information the model learns). They are trained on billions of text queries to learn to generate sentences, understand and converse. These models apply to many NLP tasks, such as text generation, language analysis and question answering. They represent an important step forward in AI development, as they offer the ability to solve complex language problems and provide users with a clear and relevant experience.
gemma-3-270m-it: See the original / trimmed comparison

Original Trimmed
Een grote taalmodel (LLM) is een objectief model dat de taal kan begrijpen en genereren. Het is ontworpen om een brede kennis van de wereld te behouden en te verwerken, wat resulteert in een betere en meer intelligentere menselijke interactie met de taal.

A large language model (LLM) is a model capable of understanding and generating language. It is designed to retain and process broad knowledge of the world, resulting in better and more intelligent human interaction with language.
Zeker! Hier is een korte introductie over grote taalmodellen:

Grote taalmodellen (LLM's) zijn een vorm van machine learning die toegang heeft tot enorme hoeveelheden tekstdata. Ze kunnen een breed scala aan taken uitvoeren, zoals tekstgeneratie, verwerking, vraag-antwoord en zelfs tekst-analyse. Deze modellen worden vaak gebruikt voor verschillende toepassingen, van spraakherkenning tot het genereren van content.

Sure! Here is a brief introduction to large language models:

Large language models (LLMs) are a form of machine learning that has access to enormous amounts of textual data. They can perform a wide range of tasks, such as text generation, processing, question answering and even text analysis. These models are often used for various applications, from speech recognition to content generation.

gemma-3-1b-it: See the original / trimmed comparison

Original Trimmed
Oké, hier is een korte introductie over grote taalmodellen (LLM's):

Grote taalmodellen (LLM's) zijn een nieuw soort AI-technologie die zich steeds meer richten op het begrijpen en genereren van menselijke taal. Het zijn complexe computersystemen die zijn getraind op enorme hoeveelheden tekst en code, waardoor ze in staat zijn om:

  • Tekst te begrijpen: Ze kunnen complexe vragen beantwoorden, samenvattingen maken en de betekenis van tekst analyseren.
  • Tekst te genereren: Ze kunnen nieuwsartikelen, gedichten, code, scripts, muziekstukken, e-mails, brieven, enzovoort, op basis van een prompt.
  • Tekst te vertalen: Ze kunnen teksten van de ene taal naar de andere vertalen.
  • Chatten: Ze kunnen interactieve gesprekken voeren.

Denk aan ze als enorme 'boeken' die geleerd hebben hoe taal te gebruiken. Ze zijn niet echt 'denkend' zoals mensen, maar ze simuleren intelligente redenering op basis van de patronen die ze in hun training hebben opgevangen.

Voorbeelden van populaire LLM's zijn: GPT-3, LaMDA, Bard, en Llama 2.

Wil je dat ik je verder uitleg, of specifiekere informatie geeft over een bepaald aspect van LLM's? Bijvoorbeeld, wil je weten hoe ze werken, hoe ze worden gebruikt, of wat hun beperkingen zijn?

Ok, here is a brief introduction to large language models (LLMs):

Large language models (LLMs) are a new type of AI technology that increasingly focus on understanding and generating human language. They are complex computer systems trained on enormous amounts of text and code, enabling them to:

  • Understand text: They can answer complex questions, make summaries and analyze the meaning of text.
  • Generate text: They can produce news articles, poems, code, scripts, music pieces, emails, letters, etc., based on a prompt.
  • Translate text: They can translate texts from one language to another.
  • Chat: They can hold interactive conversations.

Think of them as enormous 'books' that have learned how to use language. They are not really 'thinking' like humans, but they simulate intelligent reasoning based on the patterns they picked up during training.

Examples of popular LLMs are: GPT-3, LaMDA, Bard, and Llama 2.

Would you like me to explain further, or give more specific information about a particular aspect of LLMs? For example, would you like to know how they work, how they are used, or what their limitations are?

Oké, hier is een korte introductie over grote taalmodellen (LLM's):

Grote Taalmodellen (LLM's) zijn een nieuwe klasse van AI-modellen die op enorme hoeveelheden tekstgegevens zijn getraind. Ze kunnen bijvoorbeeld vragen beantwoorden, teksten genereren, vertalen, samenvattingen maken en zelfs code schrijven!

Hier zijn de belangrijkste punten om het te begrijpen, zonder te ingewikkeld te worden (maar het is toch belangrijk):

  • Getraind op enorme datasets van tekst en code van het internet, boeken en meer… Ze eten heel veel tekst!
  • Gebruiken "neurale netwerken" (een soort intelligentie-circuit) om patronen in de data te leren en te voorspellen wat er volgend zal komen in een tekst.
  • Ze kunnen "attenderen" op specifieke delen van een tekst om te beantwoorden, of te genereren, en te reageren op je input. Dit noemen we "context begrijpen".

Voorbeelden van bekende LLM's zijn Google's LaMDA, Meta's LLaMA, en de modellen van Microsoft (bijv. Copype).

Wil je dat ik je iets specifieks over LLM's uitleg? Misschien wil je weten hoe ze werken, of welke toepassingen ze hebben?.

Ok, here is a brief introduction to large language models (LLMs):

Large Language Models (LLMs) are a new class of AI models trained on enormous amounts of textual data. They can for example answer questions, generate texts, translate, make summaries and even write code!

Here are the key points to understand, without getting too complicated (but it is still important):

  • Trained on enormous datasets of text and code from the internet, books and more… They consume huge amounts of text!
  • They use "neural networks" (a kind of intelligence circuit) to learn patterns in the data and predict what will come next in a text.
  • They can "attend" to specific parts of a text to answer, generate, or react to your input. This is what we call "understanding context".

Examples of well-known LLMs are Google's LaMDA, Meta's LLaMA, and Microsoft's models (e.g. Copype).

Would you like me to explain something specific about LLMs? Perhaps you would like to know how they work, or what applications they have?

In terms of form, the outputs from the trimmed models never went off-topic and are therefore similar to the originals.
In terms of substance, whether trimmed or original models, the generated outputs tend to translate technical terms related to LLMs/AI from English into Dutch but do so poorly. We also observed that the models struggle with the plural form of certain words. Concretely, our evaluators indicate that the texts from original and trimmed models alike resemble children's compositions that occasionally use a few sophisticated terms here and there.
On this question of substance, note that the models shown here serve to illustrate the principle of trimming. In practice, these various models are now somewhat dated and are certainly not the most useful. At the time of writing, Granite 4.1, Gemma 4 and Qwen 3.6 have just been released and will probably be better (in the next section we can already see that 3.5 is better than 3).

Warning ⚠️
Our vibechecking experiments also covered reasoning traces and system prompts.
We observe that when a model drops below 300M parameters once trimmed, it tends to respond in English by default (the untrimmed Gemma 3 270M also exhibits this behavior) or else not in the correct language. It is then necessary to specify in the system prompt the language it should use to force generation in that language. Whereas for models larger than 300M parameters, it is not necessary to modify their original system prompt. Regarding reasoning traces, for the trimmed models where it is possible to enable/disable thinking mode, namely Qwen 3 and SmolLM3, behaviors are variable. SmolLM3 will systematically generate its traces in English. For Qwen 3, we observe variable behavior. The model may no longer generate traces, generate them in English, or generate them in the trimmed language. This last point seemed interesting to us as it could allow, for example, generating reasoning traces directly in a target language (instead of having to generate them in English and then translate them). However, this behavior does not seem to occur for all languages. We can observe it for example with French, which is close to English, but we were unable to observe this phenomenon with the non-Latin alphabets tested such as Korean, Arabic or Tamil. It also seems to depend on the difficulty of the task.


Korean
Results on the KoBEST benchmark
Table 16: Results on the KoBEST benchmark

For Korean, we observe that we obtain results similar to the original models with one exception: granite 4.0-h, where we have a 3-point degradation (which we do not have on Dutch). We can see that this is due to the boolq dataset on which we lose 12 points. Analyzing the results, the trimmed model collapses on this task without our being able to modify this behavior (via the prompt, a change of labels to find, or the number of shots).
Out of all our experiments (approximately 90 trimmed models tested), this is the only model where we observe an unexpected degradation. It is therefore an isolated case but one that can illustrate an edge case of trimming.

🧠 Key Takeaways
Trimming also works on decoders, but we invite users to be vigilant regarding potential rare edge cases.

Collections

🎁
A collection gathering 96 models (12 different languages) trimmed from Granite 4.0 is available here.
A collection gathering 332 models (83 different languages) trimmed from Qwen 3 is available here.
A collection gathering 486 models (122 different languages) trimmed from Gemma 3 is available here.
A collection gathering 16 models (8 different languages) trimmed from SmolLM3 is available here.


Multimodal Models (VLM)

Let us now observe the impact of trimming on the performance of VLMs.

Selected Models and Benchmarks

We focused on two VLMs for this section:

  • Qwen 3.5 by Qwen Team (2026) in sizes 0.8B, 2B and 4B;
  • Gemma 3 4B by Google DeepMind (2025).

Not having benchmarks to evaluate these models in languages other than English, we limited ourselves in this section to vibechecking by submitting images to the model and asking it to describe them.
For space reasons, we present one example for French and one for Spanish.

Observed Parameter Reductions

Table 17: Observed parameter reductions for VLM models
Model Vocabulary size (original) Vocabulary size (trimmed) % Vocabulary reduction Parameters (original) Parameters (trimmed) % Parameter reduction
Qwen3.5-0.8B 248,320 16,384 - 93.40% 853.0M 615.5M - 27.84%
Qwen3.5-0.8B 248,320 32,768 - 86.80% 853.0M 632.3M - 25.88%
Qwen3.5-2B 248,320 16,384 - 93.40% 2.213B 1.738B - 21.46%
Qwen3.5-2B 248,320 32,768 - 86.80% 2.213B 1.772B - 19.95%
Qwen3.5-4B 248,320 16,384 - 93.40% 4.539B 3.946B - 13.08%
Qwen3.5-4B 248,320 32,768 - 86.80% 4.539B 3.987B - 12.16%
gemma-3-4b-it 262,208 16,384 - 93.75% 4.300B 3.671B - 14.63%
gemma-3-4b-it 262,208 32,768 - 87.50% 4.300B 3.713B - 13.66%

Results

French
Photo by Merve Noyan
Figure 3: Photo by Merve NOYAN

Prompt used for all models: "Décris cette image en détail." (Describe this image in detail.)

Qwen3.5 0.8B: See the original / trimmed comparison
Original Trimmed
Cette image montre une collection de fleurs de cosmos en rose, avec une bumblebee au centre de la composition. La bumblebee, avec ses taches noires et blanches, est assise sur la fleur principale, qui est l'objet central de l'image. Les autres fleurs de cosmos, de différentes couleurs (rose, rouge, blanche), sont dispersées autour de la fleur principale, certaines étant en fleurs et d'autres en fleurs mortes. Le fond est composé de feuilles vertes et de fleurs de cosmos en fleurs mortes, créant un arrière-plan naturel et coloré.

This image shows a collection of pink cosmos flowers, with a bumblebee at the center of the composition. The bumblebee, with its black and white markings, is sitting on the main flower, which is the central subject of the image. The other cosmos flowers, in different colors (pink, red, white), are scattered around the main flower, some in bloom and others dead. The background is composed of green leaves and dead cosmos flowers, creating a natural and colorful backdrop.
L'image montre une scène de nature avec des fleurs de cosmos en violet pâle, une bumblebee au centre de la composition, et d'autres fleurs de cosmos et de gerberas en arrière-plan. La bumblebee, avec ses taches noires et blanches, est assise sur la fleur principale, probablement en quête de nourriture. Les autres fleurs, certaines encore en fleurs et d'autres déjà séchées, ajoutent de la couleur et de la texture à l'arrière-plan. Le fond est rempli de feuilles vertes et de branches, créant un cadre naturel pour cette image de fleurs et d'insectes.

The image shows a nature scene with pale purple cosmos flowers, a bumblebee at the center of the composition, and other cosmos flowers and gerberas in the background. The bumblebee, with its black and white markings, is sitting on the main flower, probably in search of food. The other flowers, some still in bloom and others already dried, add color and texture to the background. The background is filled with green leaves and branches, creating a natural frame for this image of flowers and insects.
Qwen3.5 2B: See the original / trimmed comparison
Original Trimmed
Cette image capture une scène de jardin florale vibrante, centrée sur une grande fleur rose pâle en pleine floraison. Une abeille noire et jaune est visible au centre de la fleur, collectant du pollen. Autour d'elle, d'autres fleurs colorées — rouges, violettes et blanches — s'ouvrent également, tandis que des tiges et des feuilles vertes forment un arrière-plan naturel et luxuriant. L'ensemble de l'image évoque la vie et la beauté du jardin, avec une composition équilibrée et une ambiance apaisante.

This image captures a vibrant floral garden scene, centered on a large pale pink flower in full bloom. A black and yellow bee is visible at the center of the flower, collecting pollen. Around it, other colorful flowers — red, purple and white — are also opening, while green stems and leaves form a natural and lush background. The overall image evokes the life and beauty of the garden, with a balanced composition and a soothing atmosphere.
Cette image capture une scène de jardin en pleine vie, centrée sur une fleur de cosmos rose vif qui attire l'attention. Au centre, une abeille noire et jaune est en train de collecter du pollen sur le centre de la fleur, ses ailes translucides légèrement déployées. La fleur principale, avec ses pétales larges et douces, contraste avec d'autres fleurs autour : des roses plus pâles, des fleurs rouges éclatantes en arrière-plan, et des tiges de plantes séchées ou en fleurs blanches, suggérant une saison printanière ou estivale. Le fond est flou, mettant en valeur la profondeur de champ et la netteté du sujet. Des feuilles vertes et des tiges fines complètent le cadre naturel, créant une ambiance apaisante et vivante.

This image captures a vibrant garden scene, centered on a bright pink cosmos flower that draws the eye. At the center, a black and yellow bee is collecting pollen from the heart of the flower, its translucent wings slightly spread. The main flower, with its broad soft petals, contrasts with the surrounding flowers: paler roses, bright red flowers in the background, and dried plant stems or white flowers, suggesting a spring or summer season. The background is blurred, highlighting the depth of field and the sharpness of the subject. Green leaves and fine stems complete the natural frame, creating a soothing and vibrant atmosphere.
Qwen3.5 4B: See the original / trimmed comparison
Original Trimmed
Cette image montre une fleur rose vif, probablement un cosmos, au centre de l'encadrement, avec un bourdon noir et blanc en train de butiner son pollen jaune. Autour d'elle, d'autres fleurs roses sont visibles — certaines encore fraîches, d'autres déjà fanées ou séchées, montrant des tiges brunes et des pétales effilochés. En bas à droite, une petite fleur rouge vive contraste avec les tons roses dominants. Le fond est flou, mais on devine des feuilles vertes et d'autres éléments floraux, créant une ambiance naturelle et paisible. L'ensemble évoque un jardin en pleine saison, où la vie et la décomposition coexistent harmonieusement.

This image shows a bright pink flower, probably a cosmos, at the center of the frame, with a black and white bumblebee collecting its yellow pollen. Around it, other pink flowers are visible — some still fresh, others already faded or dried, showing brown stems and frayed petals. In the bottom right, a small bright red flower contrasts with the dominant pink tones. The background is blurred, but green leaves and other floral elements can be made out, creating a natural and peaceful atmosphere. The overall scene evokes a garden in full season, where life and decomposition coexist harmoniously.
Cette image présente une fleur rose vif, probablement un cosmos, au centre de l'encadrement, avec un grand bourdon noir et blanc en train de butiner son pollen jaune. Autour d'elle, d'autres fleurs roses sont visibles — certaines encore fraîches, d'autres déjà fanées et séchées, montrant le cycle naturel de la vie végétale. En bas à droite, une petite fleur rouge écarlate ajoute un contraste vibrant, tandis que des feuilles vertes floues forment un fond doux qui met en valeur les sujets principaux. La composition évoque une scène paisible dans un jardin ou un massif floral, où la nature se déploie avec harmonie et simplicité.

This image presents a bright pink flower, probably a cosmos, at the center of the frame, with a large black and white bumblebee collecting its yellow pollen. Around it, other pink flowers are visible — some still fresh, others already faded and dried, showing the natural cycle of plant life. In the bottom right, a small scarlet red flower adds a vibrant contrast, while blurred green leaves form a soft background that highlights the main subjects. The composition evokes a peaceful scene in a garden or flower bed, where nature unfolds with harmony and simplicity.
Qwen3.5 4B (thinking): See the original / trimmed comparison

Original Trimmed
<think>
The user wants a detailed description of the image in French.

Image Analysis:

  • Main Subject: A large, light pink cosmos flower (Cosmos bipinnatus) in the center.
  • Action: A bumblebee is on the center of the flower, collecting pollen.
  • Surroundings:
    • To the left: Another pink cosmos flower, slightly out of focus.
    • Background: Green foliage, some dried brown flower heads (seed pods).
    • Bottom right: A small red geranium flower.
    • Top right: Some wilting pink flowers.
  • Lighting/Color: Natural daylight, soft focus background (bokeh), vibrant pinks and greens.

Drafting the description (internal monologue/rough notes):

  • Une grande fleur rose, un cosmos, est au centre.
  • Une abeille bourdonne dessus, elle est noire et blanche.
  • Il y a d'autres fleurs roses autour, mais floues.
  • En bas à droite, une petite fleur rouge.
  • Il y a des tiges sèches et des feuilles vertes en arrière-plan.

Refining into French:

  1. Au centre de l'image, une grande fleur cosmos rose pâle s'épanouit avec élégance.
  2. Une abeille bourdonne activement au cœur de cette fleur, recueillant le pollen jaune.
  3. À gauche, une autre fleur rose est visible, bien que légèrement floue, tandis que des tiges séchées et des graines brunes parsèment le champ.
  4. En arrière-plan, on distingue des feuilles vertes et quelques fleurs roses fanées qui ajoutent de la texture à la scène.
  5. En bas à droite, une petite fleur rouge vif contraste avec les tons dominants de rose et de vert.
  6. L'ensemble de la composition capture un instant paisible de la nature, mettant en valeur l'interaction entre l'insecte et la plante.

Final Polish: Au centre de l'image, une grande fleur cosmos rose pâle s'épanouit avec élégance, attirant l'attention par sa couleur douce. Une abeille bourdonne activement au cœur de cette fleur, recueillant le pollen jaune qui contraste avec ses ailes noires et blanches. À gauche, une autre fleur rose est visible, bien que légèrement floue, tandis que des tiges séchées et des graines brunes parsèment le champ. En arrière-plan, on distingue des feuilles vertes et quelques fleurs roses fanées qui ajoutent de la texture à la scène. En bas à droite, une petite fleur rouge vif contraste avec les tons dominants de rose et de vert. L'ensemble de la composition capture un instant paisible de la nature, mettant en valeur l'interaction entre l'insecte et la plante.

Wait, let's make it slightly tighter to ensure it feels natural. Let's go with a 5-sentence version to be safe on the "detailed" part.

  1. Au centre de l'image, une grande fleur cosmos rose pâle s'épanouit avec élégance, attirant l'attention par sa couleur douce.
  2. Une abeille bourdonne activement au cœur de cette fleur, recueillant le pollen jaune qui contraste avec ses ailes noires et blanches.
  3. À gauche, une autre fleur rose est visible, bien que légèrement floue, tandis que des tiges séchées et des graines brunes parsèment le champ.
  4. En arrière-plan, on distingue des feuilles vertes et quelques fleurs roses fanées qui ajoutent de la texture à la scène.
  5. En bas à droite, une petite fleur rouge vif contraste avec les tons dominants de rose et de vert.

This is good. I will use this.

</think>

Au centre de l'image, une grande fleur cosmos rose pâle s'épanouit avec élégance, attirant l'attention par sa couleur douce. Une abeille bourdonne activement au cœur de cette fleur, recueillant le pollen jaune qui contraste avec ses ailes noires et blanches. À gauche, une autre fleur rose est visible, bien que légèrement floue, tandis que des tiges séchées et des graines brunes parsèment le champ. En arrière-plan, on distingue des feuilles vertes et quelques fleurs roses fanées qui ajoutent de la texture à la scène. En bas à droite, une petite fleur rouge vif contraste avec les tons dominants de rose et de vert.

At the center of the image, a large pale pink cosmos flower blooms elegantly, drawing attention with its soft color. A bee buzzes actively at the heart of this flower, collecting the yellow pollen that contrasts with its black and white wings. To the left, another pink flower is visible, though slightly blurred, while dried stems and brown seeds dot the field. In the background, green leaves and a few faded pink flowers can be made out, adding texture to the scene. In the bottom right, a small bright red flower contrasts with the dominant tones of pink and green.

<think>
The user has provided an image and asked for a description in French. The image shows a close-up of a garden scene. Key elements:

  • A large, vibrant light-pink cosmos (or similar dahlia-like) in the center.
  • A bumblebee is on the center of this main bloom, pollination is visible.
  • To the left, another similar light-pink cosmos is visible, but cut off.
  • There are several withered, brown, dead florets around the main bloom.
  • In the background, there are other plants: a large green lily pad or similar large green shape at the top, and some red geraniums (pelargoniums) at the bottom right.
  • There are also some wilting, purplish-pink cosmos in the background.
  • The focus is very deep on the central bloom and the bumblebee.

Structure of the description (4-6 sentences):

  1. Start with the main subject: the central light-pink cosmos and the bumblebee.
  2. Mention the action: the bumblebee is pollination.
  3. Mention the context/surrouding: other cosmos, dead florets.
  4. Mention the background elements: red geraniums, greenery.
  5. Convey the general feeling: natural, garden scene.

Let's draft the French sentences:

  1. Cette image présente un plan rapproché d'une fleur rose pâle, probablement un cosmos, au centre de l'encadrement.
  2. Une bourdon, aux couleurs noires et blanches, est perche sur le cœur jaune de la fleur, en train de butiner le pollen.
  3. Autour de cette floraison principale, on observe plusieurs autres tiges portant des fleurs roses, certaines en pleine fleur et d'autres déjà fanées et brunes.
  4. En arrière-plan, on devine des feuilles vertes floues et quelques fleurs rouges, semblables à des géraniums, qui ajoutent de la couleur à la scène.
  5. L'ensemble dépeint une scène de jardin naturelle et vivante, mettant en valeur l'interaction entre l'insecte et la plante.

Refining for flow and detail:

  • "Cette image capture une scène de jardin en gros plan, centrée sur une magnifique fleur rose pâle, vraisemblablement un cosmos."
  • "Une bourdon, avec son corps noir et blanc caractéristique, est posée au centre de la fleur, occupant le cœur jaune et récoltant le pollen."
  • "Autour de cette floraison principale, plusieurs autres tiges portent des fleurs roses, certaines en pleine beauté tandis que d'autres sont déjà fanées et brunes."
  • "En arrière-plan, on distingue des feuilles vertes floues et quelques fleurs rouges, probablement des géraniums, qui encadrent la composition."
  • "L'ensemble offre une vue détaillée et apaisante de la nature, mettant en évidence l'activité de pollinisation dans un environnement floral varié."

This looks good. It's descriptive, in French, and within the sentence limit.

</think>

Cette image capture une scène de jardin en gros plan, centrée sur une magnifique fleur rose pâle, vraisemblablement un cosmos. Une bourdon, avec son corps noir et blanc caractéristique, est posée au centre de la fleur, occupant le cœur jaune et récoltant le pollen. Autour de cette floraison principale, plusieurs autres tiges portent des fleurs roses, certaines en pleine beauté tandis que d'autres sont déjà fanées et brunes. En arrière-plan, on distingue des feuilles vertes floues et quelques fleurs rouges, probablement des géraniums, qui encadrent la composition. L'ensemble offre une vue détaillée et apaisante de la nature, mettant en évidence l'activité de pollinisation dans un environnement floral varié.

This image captures a close-up garden scene, centered on a magnificent pale pink flower, presumably a cosmos. A bumblebee, with its characteristic black and white body, is perched at the center of the flower, occupying the yellow heart and collecting pollen. Around this main bloom, several other stems bear pink flowers, some in full beauty while others are already faded and brown. In the background, blurred green leaves and a few red flowers, probably geraniums, can be made out, framing the composition. The overall image offers a detailed and soothing view of nature, highlighting pollination activity in a varied floral environment.

Gemma3 4B: See the original / trimmed comparison
Original Trimmed
L'image présente une scène de jardin fleuri, dominée par des cosmos roses vifs. Une grosse abeille noire et jaune est en train de se nourrir sur l'une des fleurs, illustrant l'importance des pollinisateurs dans cet écosystème. Au premier plan, un cosmos particulièrement beau attire l'attention grâce à sa couleur saturée et sa forme élégante. En arrière-plan, on aperçoit d'autres fleurs, dont une rouge, ainsi que des feuilles vertes luxuriantes, créant un tableau coloré et naturel. L'ensemble dégage une impression de calme et de vitalité, typique d'un jardin en pleine floraison.

The image presents a flowering garden scene, dominated by bright pink cosmos flowers. A large black and yellow bee is feeding on one of the flowers, illustrating the importance of pollinators in this ecosystem. In the foreground, a particularly beautiful cosmos draws attention thanks to its saturated color and elegant form. In the background, other flowers can be seen, including a red one, as well as lush green leaves, creating a colorful and natural tableau. The whole conveys an impression of calm and vitality, typical of a garden in full bloom.
L'image montre un champ de cosmos roses en pleine floraison. Une abeille noire et jaune est activement en train de butiner une des fleurs centrales. On observe également quelques fleurs fanées environnantes, de couleur rose et rouge, et des feuilles vertes en arrière-plan. Le tout est entouré d'un fond de végétation dense et luxuriante. La lumière semble douce et diffuse, créant une atmosphère paisible et naturelle.

The image shows a field of pink cosmos in full bloom. A black and yellow bee is actively foraging on one of the central flowers. A few surrounding wilted flowers, pink and red in color, and green leaves in the background can also be observed. The whole is surrounded by a backdrop of dense and lush vegetation. The light appears soft and diffuse, creating a peaceful and natural atmosphere.

We can see that the trimmed models' responses remain similar to the original models.
For the 0.8B model, the output correctly evokes the cosmos in both cases (one pink, the other pale purple), the bumblebee insect (the expected correct answer being a "bourdon" (in French, while in the response it use the English term "bumblebee"), giving us consistency between the two models) and the flowers (dead for one, dried for the other) in the background.
For the 2B, we again have the cosmos (pale pink for one, bright pink for the other), a bee instead of a bumblebee. The trimmed model is slightly more verbose.
For the 4B (non-thinking), we correctly have the cosmos (bright pink for both), the bumblebee (the right animal for both and using the French term and not the English one) and a background description that varies slightly between the two models but converges on the essentials.
For the 4B (thinking), we can observe that the reasoning trace (in English in both cases) of the original model is longer. This does not necessarily indicate better results since the original model refers to a bee rather than a bumblebee (the trimmed model, however, uses the incorrect gender agreement "une" bourdon instead of "un" bourdon).
For gemma-3, the original model is slightly more verbose (adding more adjectives than the trimmed model, which mostly reports the essentials). Both models refer to bees rather than a bumblebee.


Spanish
Photo by Merve Noyan
Figure 4: Photo by Merve NOYAN

Prompt used for all models: "Esto es lo que tengo en mi mano. Si como un caramelo de cada color, ¿cuántos caramelos me quedarán al final?" (Here is what I have in my hand. If I eat one candy of each color, how many candies will I have left at the end?)

Qwen3.5 0.8B: See the original / trimmed comparison
Original Trimmed
Tengo 4 caramelos, 2 de cada color. Si me los da un caramelo de cada color, me quedarán 2 caramelos.

I have 4 candies, 2 of each color. If I am given one of each color, I will have 2 left.
Tengo 4 caramelos de colores diferentes. Si me los come uno por uno, me quedarán 3 caramelos.

I have 4 candies of different colors. If I eat them one by one, I will have 3 left.
Qwen3.5 2B: See the original / trimmed comparison
Original Trimmed
Tienes 4 caramelos en total: 2 verdes, 2 naranjas y 1 verde. Si los comes todos, quedarán 0 caramelos.

You have 4 candies in total: 2 green, 2 orange and 1 green. If you eat them all, there will be no candies left.
Si comes un caramelo de cada color, tendrás 4 caramelos en total. Los caramelos de colores verde, naranja y azul están en tu mano, y el naranja también está visible. Al comer uno de cada color, quedarán 4 caramelos en tu mano.

If you eat one candy of each color, you will have 4 candies in total. The green, orange and blue candies are in your hand, and the orange candy is also visible. By eating one of each color, you will have 4 candies left in your hand.
Qwen3.5 4B: See the original / trimmed comparison
Original Trimmed
Tienes 5 caramelos en total: 2 azules, 2 naranjas y 1 verde. Si comes uno de cada color, comerás 3 caramelos. 5 menos 3 es igual a 2. Por lo tanto, te quedarán 2 caramelos al final.

You have 5 candies in total: 2 blue, 2 orange and 1 green. If you eat one candy of each color, you will have eaten 3. 5 minus 3 equals 2. You will therefore have 2 candies left at the end.
Tienes 5 caramelos en total: 2 azules, 2 naranjas y 1 verde. Si comes uno de cada color, comerás 3 caramelos (uno azul, uno naranja y uno verde). 5 menos 3 es igual a 2. Al final te quedarán 2 caramelos.

You have 5 candies in total: 2 blue, 2 orange and 1 green. If you eat one candy of each color, you will have eaten 3 (one blue, one orange and one green). 5 minus 3 equals 2. You will therefore have 2 candies left at the end.
Qwen3.5 4B (thinking): See the original / trimmed comparison

Original Trimmed
<think>
The user wants to know how many candies are left if they eat one of each color.
  1. Count the candies: I need to look at the image and count the total number of candies.

    • There's a teal one at the top.
    • There's a teal one below it with a leaf symbol.
    • There's an orange one below that.
    • There's an orange one with a symbol.
    • There's a green one at the bottom.
    • Total count: 5 candies.
  2. Identify the colors:

    • Teal (2 candies)
    • Orange (2 candies)
    • Green (1 candy)
  3. Analyze the condition: "Si como un caramelo de cada color" (If I eat one candy of each color).

    • Colors present: Teal, Orange, Green.
    • Action: Eat one of each.
  4. Calculate the remaining candies:

    • Start with 5.
    • Eat 1 Teal -> 4 left.
    • Eat 1 Orange -> 3 left.
    • Eat 1 Green -> 2 left.
    • Alternatively: Total (5) - Colors (3) = 2 remaining.
  5. Formulate the answer:

    • State the total number of candies (5).
    • State the number of colors (3: teal, orange, green).
    • Explain the subtraction (5 - 3 = 2).
    • Conclude with the final number.
  6. Refine the Spanish response:

    • "En la imagen hay un total de 5 caramelos."
    • "Están en tres colores distintos: dos azules turquesa, dos naranjas y uno verde."
    • "Si comes uno de cada color (azul, naranja y verde), restas 3 caramelos del total."
    • "Por lo tanto, te quedarán 2 caramelos al final."
  7. Final Polish: Ensure natural flow. "En la imagen se pueden contar un total de 5 caramelos en la palma de la mano. Estos caramelos están distribuidos en tres colores diferentes: dos de color turquesa, dos de color naranja y uno de color verde. Si decides comer un caramelo de cada uno de los tres colores presentes, estarás comiendo 3 caramelos en total. Al restar esos 3 caramelos del total inicial de 5, te quedarán exactamente 2 dulces."

    Wait, let me double check the count.

    • Top teal
    • Teal with leaf
    • Orange
    • Orange with symbol
    • Green Yes, 5 total. 3 colors. 5 - 3 = 2. Correct.

    Let's try to make it slightly more conversational but still formal enough. "En la foto se observan un total de 5 caramelos sostenidos en la mano. Estos se dividen en tres colores distintos: dos son de color turquesa, dos son naranjas y uno es verde. Si comes un caramelo de cada color, estarás consumiendo 3 dulces en total. Por lo tanto, al restar esos 3 del total inicial, te quedarán 2 caramelos al final."

    This looks good. 4 sentences.

</think>

En la foto se observan un total de 5 caramelos sostenidos en la mano. Estos se dividen en tres colores distintos: dos son de color turquesa, dos son naranjas y uno es verde. Si comes un caramelo de cada color, estarás consumiendo 3 dulces en total. Por lo tanto, al restar esos 3 del total inicial, te quedarán 2 caramelos al final.

In the photo, a total of 5 candies can be seen held in the hand. They are divided into three different colors: two are teal, two are orange and one is green. If you eat one candy of each color, you will have consumed 3 in total. Consequently, by subtracting those 3 candies from the initial number, you will have 2 left at the end.

<think>
El usuario quiere saber cuántos caramelos le quedan si come uno de cada color. Primero, debo contar los caramelos en la imagen. Veo:

  • 2 caramelos azules/verdes (turquesa).
  • 2 caramelos naranjas.
  • 1 caramelo verde brillante. Total: 2 + 2 + 1 = 5 caramelos.

Ahora, la pregunta es: "Si me como un caramelo de cada color, ¿cuántos caramelos me quedarán al final?" Esto es un poco ambiguo. ¿Significa que come un caramelo de cada color disponible (azul, naranja, verde)? Si come un caramelo de cada color:

  • Come 1 azul.
  • Come 1 naranja.
  • Come 1 verde. Total comido: 3 caramelos. Caramelos restantes: 5 (totales) - 3 (comidos) = 2 caramelos.

Otra interpretación: ¿Significa que come un caramelo de cada color que ve? Sí, eso es lo mismo. Hay 3 colores: azul, naranja, verde. Si come uno de cada uno, come 3 caramelos. Le quedan 2.

Voy a responder directamente.

  1. Contar los caramelos: Hay 5 caramelos en total (2 azules, 2 naranjas y 1 verde).
  2. Calcular el consumo: Si comes un caramelo de cada color (azul, naranja y verde), comerás 3 caramelos.
  3. Calcular el resto: 5 menos 3 son 2.
  4. Especificar qué te quedaría: Te quedarían 2 caramelos (probablemente uno azul y uno naranja, o dos naranjas si el azul se come primero, pero la cantidad es 2).

Respuesta final: Tienes 5 caramelos en total (2 azules, 2 naranjas y 1 verde). Si comes un caramelo de cada color, comerás 3 caramelos en total. Por lo tanto, te quedarán 2 caramelos al final.
</think>

Tienes 5 caramelos en total (2 azules, 2 naranjas y 1 verde). Si comes un caramelo de cada color, comerás 3 caramelos en total. Por lo tanto, te quedarán 2 caramelos al final.

You have 5 candies in total (2 blue, 2 orange and 1 green). If you eat one candy of each color, you will have eaten 3 in total. You will therefore have 2 left at the end.

Gemma3 4B: See the original / trimmed comparison
Original Trimmed
En tu mano hay tres caramelos: uno naranja, uno azul y uno verde. Si comes uno de cada color, consumirás tres caramelos en total. Después de comerlos, te quedará cero caramelos en tu mano. ¡Te los has comido todos!

You have three candies in your hand: one orange, one blue and one green. If you eat one of each color, you will have consumed three candies in total. After eating them, you will have zero candies left in your hand. You have eaten them all!
Hay tres caramelos en tu mano: uno naranja, uno turquesa y uno verde. Si comes un caramelo de cada color, consumirás tres caramelos en total. Al final, te quedará cero caramelos. ¡Todo se ha ido!

You have three candies in your hand: one orange, one teal and one green. If you eat one candy of each color, you will have consumed three in total. At the end, you will have zero candies left. It has all gone!

As with French, we can observe that the trimmed models' responses remain similar to the original models.
On the substance of the problem to be solved, Qwen 0.8B and 2B, whether trimmed or not, fail to solve it correctly. The same is true of Gemma3 4B.

The Qwen3.5 4B is capable of giving the correct answer whether or not we activate thinking mode.
An interesting observation concerns the reasoning trace. That of the trimmed model is shorter, but most notably was generated in Spanish rather than English.
This is the phenomenon we described in the previous section for Qwen3.
Note that the reasoning trace from the French Qwen3.5 4B example was generated in English and not in French. Thus, as with decoder models, it appears possible to generate reasoning traces in the language of the trimmed model, but the consistency of this generation is not guaranteed. We would have liked to run more in-depth experiments on this subject by varying languages, question types and model sizes, to determine whether, for example, beyond a certain size, this phenomenon could be achieved systematically.

Warning ⚠️
We are uncomfortable not having a benchmark for this section and thus being limited to vibechecking. We strongly invite the reader to run their own tests.
In particular, to be vigilant regarding low-resource languages or those using a non-Latin alphabet, on which we were not able to run any tests.

🧠 Key Takeaways
It would appear that in the context of VLMs, trimming makes it possible to maintain outputs equivalent to the original models (at minimum for languages using the Latin alphabet and that are well-resourced).
As with decoder models, it is possible to generate reasoning traces in the target language, but how to achieve this methodically remains uncertain.

Collections

🎁
A collection gathering 498 models (83 different languages) trimmed from Qwen 3.5 is available here.
A collection gathering 243 models (122 different languages) trimmed from multimodal Gemma 3 is available here.


Visual Embedding Models (CLIP)

Let us conclude our experiments by looking at visual embedding models.

Selected Models and Benchmarks

In this section of the article, we analyze the following models:

Regarding evaluation, non-English benchmarks being rare, we chose to offer usage notebooks for the models with the Sentence Transformers library, as version 5.4 released in April 2026 makes it easy to use this type of model. We focus in particular on image search and 0-shot image classification.

Observed Parameter Reductions

Table 18: Observed parameter reductions for visual embeddings (CLIP) models
Model Vocabulary size (original) Vocabulary size (trimmed) % Vocabulary reduction Parameters (original) Parameters (trimmed) % Parameter reduction
clip-ViT-B-32-multilingual-v1 (visual encoder included) 119,547 16,384 - 86.30% 223.0M 143.7M - 35.53%
clip-ViT-B-32-multilingual-v1 (visual encoder included) 119,547 32,768 - 72.59% 223.0M 156.3M - 29.89%
clip-ViT-B-32-multilingual-v1 (visual encoder excluded) 119,547 16,384 - 86.30% 135.1M 55.9M - 58.63%
clip-ViT-B-32-multilingual-v1 (visual encoder excluded) 119,547 32,768 - 72.59% 135.1M 68.5M - 49.32%
metaclip-2-worldwide-s16-384 901,629 16,384 - 98.18% 389.7M 49.7M - 87.24%
metaclip-2-worldwide-s16-384 901,629 32,768 - 96.37% 389.7M 56.0M - 85.62%
metaclip-2-worldwide-m16-384 901,629 16,384 - 98.18% 538.5M 85.3M - 84.16%
metaclip-2-worldwide-m16-384 901,629 32,768 - 96.37% 538.5M 93.7M - 82.60%
metaclip-2-worldwide-b16-384 901,629 16,384 - 98.18% 586.2M 133.0M - 77.31%
metaclip-2-worldwide-b16-384 901,629 32,768 - 96.37% 586.2M 141.4M - 75.88%
metaclip-2-worldwide-l14 901,629 16,384 - 98.18% 1.082B 402.3M - 62.83%
metaclip-2-worldwide-l14 901,629 32,768 - 96.37% 1.082B 414.8M - 61.66%
metaclip-2-worldwide-huge-378 901,629 16,384 - 98.18% 1.859B 952.9M - 48.75%
metaclip-2-worldwide-huge-378 901,629 32,768 - 96.37% 1.859B 969.7M - 47.85%
metaclip-2-worldwide-giant-378 901,629 16,384 - 98.18% 3.631B 2.519B - 31.20%
metaclip-2-worldwide-giant-378 901,629 32,768 - 96.37% 3.631B 2.498B - 30.63%

📝 Note
The clip-ViT-B-32-multilingual-v1 and MetaCLIP-2 having a vocabulary size that is not a multiple of 64, trimming appears particularly suitable even in cases where we do not specifically wish to create a monolingual/n-lingual model.

Results

Image Search

For the image search task, you can refer to the following two notebooks created for the Sentence Transformers documentation: the English monolingual version and the multilingual version.
In the first, we compare a search performed using CLIP by RADFORD, KIM et al. (2021) with an original and trimmed MetaCLIP-2. We can observe no loss between the two versions (the results are identical).
In the second, we apply the clip-ViT-B-32-multilingual-v1.


0-Shot Image Classification

For this task, we invite you to consult the following notebook.
An alternative is to use this Space which allows comparing, for all the languages used in this blog post, an original model against its trimmed versions at 32,768 and 16,384 tokens.

The notebook focusing mainly on how to perform 0-shot image classification rather than on trimmed models, we did not list in it two interesting observations that we therefore detail below.

The first is that even if a model is trimmed on a given language, it can retain capabilities in other languages. This is particularly the case for closely related languages sharing the same alphabet.

For example, if we use the clip-ViT-B-32-multilingual-v1 with the classes "Hund" (dog in German), "gato" (cat in Spanish), "Paris de jour" and "Paris at night" on 4 different images, we obtain:

Multilingual 0-shot image classification results
Figure 5: Multilingual 0-shot image classification with clip-ViT-B-32-multilingual-v1

And if we redo the same test but using instead the trimmed version of the model on a Spanish corpus to be optimized for that language, we find that the results are identical.

Multilingual 0-shot image classification results
Figure 6: Multilingual 0-shot image classification with clip-ViT-B-32-multilingual-v1-spa-32768

Let us redo the same experiment on languages using a different alphabet, still with clip-ViT-B-32-multilingual-v1 but with the classes "Hund" (dog in German), "gato" (cat in Spanish), "파리" (Paris in Korean) and "باريس ليلا" (Paris at night in Arabic) on the same 4 images, we obtain:

Multilingual (Latin and non-Latin alphabets) 0-shot image classification with clip-ViT-B-32-multilingual-v1
Figure 7: Multilingual (Latin and non-Latin alphabets) 0-shot image classification with clip-ViT-B-32-multilingual-v1

By applying the trimmed Spanish model again, we can see that performance degrades significantly for languages not using the Latin alphabet (- 18% for the first image, and the answer for the second becomes wrong):

Multilingual (Latin and non-Latin alphabets) 0-shot image classification with clip-ViT-B-32-multilingual-v1-spa-32768
Figure 8: Multilingual (Latin and non-Latin alphabets) 0-shot image classification with clip-ViT-B-32-multilingual-v1-spa-32768

The second observation concerns quantization. Indeed, we saw above that trimming can be an alternative to distillation. What about quantization?

Let us take an example consisting of performing 0-shot classification on the following image:

Image from COCO (Val 2017) by LIN et al. (2014)
Figure 9: Image from COCO (Val 2017) by LIN et al. (2014)

Let us run a test on Tamil with the classes பூனைகள் (Cats), சிங்கங்கள் (Lions) and புலிகள் (Tigers), and the model metaclip-2-worldwide-s16-384. The results are as follows:

Table 19: metaclip-2-worldwide-s16-384 on Tamil
 Configuration  பூனைகள் (Cats) சிங்கங்கள் (Lions) புலிகள் (Tigers) Size
Original FP32  0.992750883102417   0.005200702231377363     0.002048447960987687   1486.42 MB 
Original BF16  0.9921875   0.005218505859375   0.002166748046875   743.21 MB 
Trimming 32768 FP32  0.992750883102417   0.005200702231377363   0.002048447960987687   213.68 MB 
Trimming 32768 BF16  0.9921875   0.005218505859375   0.002166748046875   106.84 MB 
Trimming 16384 FP32  0.992750883102417   0.005200702231377363   0.002048447960987687   189.68 MB 
Trimming 16384 BF16  0.9921875   0.005218505859375   0.002166748046875   94.84 MB 

We deliberately left all the figures to illustrate that trimming is extremely robust on visual embeddings. We can see that there is no degradation between the original model (FP32) and its trimmed variants at 32,768 and 16,384 tokens. The switch to BF16 very slightly degrades the results but not enough to call into question the classification conclusions.
Comparing trimming to quantization, it appears that for these types of models, trimming is more beneficial. Indeed, we go from 1486.42 MB to 189.68 MB with trimming, a reduction of 87.24% in memory size, versus a 50% reduction for BF16 (going from 1486.42 MB to 743.21 MB).
But by combining both approaches, trimming and quantization make it possible to reduce to only 94.84 MB, a 93.62% reduction!

This example suggests that trimming on CLIP models yields results identical to the original model. This is an extreme case with Tamil, which requires few mined tokens for this task (see the Number of tokens to retain section below). Repeating the same thing but this time with French, we obtain:

Table 20: metaclip-2-worldwide-s16-384 on French
Configuration Cats Lions Tigers Size
Original FP32 0.9781160354614258 0.0006950251990929246 0.021188927814364433 1486.42 MB
Original BF16 0.98046875 0.000789642333984375 0.020263671875 743.21 MB
Trimming 32768 FP32 0.9983513355255127 0.0007094039465300739 0.0009392466163262725 213.68 MB
Trimming 32768 BF16 0.98046875 0.000789642333984375 0.020263671875 106.84 MB
Trimming 16384 FP32 0.9984958171844482 0.0013454135041683912 0.00015876817633397877 189.68 MB
Trimming 16384 BF16 0.998419761657715 0.001411437988281 0.000168800354004 94.84 MB

This time we have trimmed models performing better than the original, and the BF16 version performing better than FP32 for the original model. Once again, the combination of trimming and quantization appears to be very relevant.

📝 Note
Two other notebooks are available. One on exact or near-duplicate image deduplication and the other on image clustering.

🧠 Key Takeaways
On CLIP-type visual embedding models, trimming yields performance identical to that of the original model to more than 15 decimal places.
The trimmed models obtained are very lightweight (it is on them that we observe the largest parameter count reductions of all models presented in this article) and remain more performant than models obtained by BF16 quantization. The combination of these two methods appears very competitive.
We can also observe that even if we transform a multilingual model into a monolingual model, it is possible to retain multilingual capabilities, particularly for languages sharing the same alphabet.

Collections

🎁
A collection gathering 98 models (49 different languages) trimmed from clip-ViT-B-32-multilingual-v1 is available here.
A collection gathering 1,152 models (97 different languages) trimmed from MetaCLIP2 is available here.



Open Questions ❓

Number of tokens to retain

The question arose as to what the optimal model size is for the vocabulary of the trimmed model. It turns out that the more we delved into the topic, the more we began to question our understanding of tokenizers. Indeed, it appears that the size to retain depends on many factors:

  1. The original model
Number of tokens to cover a given percentage of the mining dataset for the mmBERT-small model on English
Figure 10: Number of tokens to cover a given percentage of the mining dataset for the mmBERT-small model on English

On this graph, we can see that out of the 200,000 texts from Fineweb-2 used for mining (but also for training mmBERT), 25,908 tokens are sufficient to cover 95% of the tokens seen in the mining database. That is 66,002 tokens for 99% and 154,270 tokens for 100%. Out of the original 256,000 tokens, we can eliminate 101,730 tokens that never appear in English texts, representing a guaranteed reduction of 39M parameters (-27.8%). But do we really need to keep the long tail of 88,268 tokens representing the last percent, which would be a potential additional reduction of 33.9M parameters (- 24.13%)? And why stop at 99% rather than 98% or 97%?

Things get more complicated when we analyze the 14 other architectures:

Number of tokens to cover a given percentage of the mining dataset depending on the architecture considered for English
Table 21: Number of tokens to cover a given percentage of the mining dataset depending on the architecture considered for English

We can see that for a given language there is therefore no universal number of tokens to retain that would work for all architectures. However, you can see that 32,768 always falls between 95% and 99%. Note that for the other Latin-alphabet languages evaluated in this article, we find that with the exception of Gemma models, 32,768 tokens are sufficient to be between 99% and 100%. Our hypothesis being that English is the boundary case since it is the most well-resourced language.

  1. The language

When looking at languages that do not use the Latin alphabet, we can observe significantly lower vocabulary sizes. For example, if we take mmBERT but this time focusing on Korean rather than English:

Number of tokens to cover a given percentage of the mining dataset for the mmBERT-small model on Korean
Figure 11: Number of tokens to cover a given percentage of the mining dataset for the mmBERT-small model on Korean

The number of tokens to cover 99% of the database is 5,859 (3,956 for Arabic and barely 471 for Tamil). However, the tail of the last percent is very large, with 143,161 additional tokens needed to reach 100% (108,697 for Arabic and 113,391 for Tamil).

  1. The task

Beyond the number of tokens in the vocabulary needed to cover the mining database, it is the final downstream task that should be the most determining factor. In section 3.1.4 of the SAM3-LiteText paper by ZENG et al. (2026), the authors observe that only 17,300 tokens from the text encoder vocabulary are actually relevant, since the others never appear in common object segmentation datasets. It is this observation that led us to also offer model versions with 16,384 tokens.
This language / model / downstream task combination appears feasible for a specific use case, in which we can have a model with a customized vocabulary. If one wants a general-purpose model, this seems more complicated since we do not necessarily know all downstream tasks in advance.
In the case of n-lingual rather than monolingual models, the interaction between the different languages can also be tricky. One must ensure a balance between languages so that one does not dominate the others, anticipate that tokens will be shared across n languages, etc.

Finally, to conclude this point, one path we would like to explore is the tokenizers of ASR models. Indeed, Whisper by RADFORD, KIM et al. (2022) handles 99 languages with a tokenizer of 51,866 tokens, averaging 524 tokens per language. The authors of vocabtrimmer observe (and we were able to replicate this) that below 10,000 tokens, model performance on downstream tasks collapses.
How is it that ASR models work with so few tokens while text models do not, when text is simply a written form of speech?


Models including code

In this article we focused exclusively on models trimmed for languages.
Since code generation is a frequent use case for decoders/VLMs, it would be necessary to verify that trimming works well with programming languages in order to build user language / specific programming language(s) models tailored to a user working in IT in a given country.


Order of trimming

In the section on encoders we find that it is better to perform VT (vocabulary trimming) then FT (finetuning) to avoid a performance collapse (i.e. models that no longer work at all). It turns out that we observed the same results for encoder-decoders, decoders and VLMs (the performance not necessarily collapsing to the point where the model is no longer usable, but with significant performance losses of around 10 to 20 points depending on the task).
The performance gains of the MTEB-NL authors on trimmed then finetuned embedding models, which we also reproduced, also seem to support adopting this order of operations.
Finally, considering the fact that a trimmed model finetunes faster than the original model because it is smaller, we recommend, and this is our personal view only, to always trim a model first and then finetune it.

Warning ⚠️
Nevertheless, we must note that the authors of vocabtrimmer (in tables 1 and 2 of their paper) find contradictory results depending on the task or language considered, but also on the model (tests on mT5, mBART and XLM-RoBERTa).
Note however that all these comparisons are not made with the same number of tokens per language and per model.


Impact of trimming on biases

In their paper, the authors of vocabtrimmer (see section 6 in particular) find that an English model derived from a trimmed multilingual model appears to exhibit fewer social biases (particularly racial ones) than an English model trained from scratch. This result is in line with earlier work (AHN and OH (2021)). The authors' intuition is that a model trained with more languages is exposed to greater cultural diversity, and that divergent viewpoints thus balance each other out.

Note that this observation has not been verified for languages other than English, and concerns only one type of model, namely an encoder (a trimmed XLM-RoBERTa vs. a RoBERTa).
This hypothesis would then need to be tested on other languages (the question of the existence of benchmarks then arises) and extended to include more models.


The code for trimming

In this blog post we presented more than 16 different models. We offer monolingual versions of each of them through 5,526 models available here. Nevertheless, what if you wish to build your own trimmed model covering l languages of your choice retaining t tokens of your choice?
We are currently working on the design of a library to make this easy. A prerequisite to resolve is the management of these models on Hugging Face and via the transformers library. Indeed, we currently offer 5,526 different models representing a total of approximately 10 TB of storage. This approach is absolutely not efficient. It unnecessarily duplicates the backbone weights, whereas between the original and the trimmed model, only the output (and possibly input) embedding layers are modified, along with the vocabulary files.
To limit storage, it would suffice to load the backbone of the original model on one side and the config files for applying the desired trimming on the other.
It would also probably be necessary to revise our current code somewhat in order to offer, in addition to the option of a user-defined number of tokens, the option of automatically finding the nearest power of 64 to a coverage threshold set by the user. As mentioned earlier in this section, the coverage threshold (for example 95% or 99%) being the number of tokens in the vocabulary needed to cover x% of the tokens in the mining database. This would allow for vocabulary sizes more specific to each language.
While we converge towards a solution, you can make specific requests here where, following the approach taken for quantization (with mradermacher for example), we will be able to respond to popular requests.



Conclusion

We have provided an overview of the trimming method, which allows very simply reducing the size of a model in terms of both parameter count and memory footprint.
This method, which only works if the embedding layer is the last in the network, has many advantages:

  • It does not require retraining and therefore no GPU. All the trimmed models we obtained were produced on CPU.
  • We observe equivalent or even improved performance, possibly due to the removal of spurious tokens.
  • In the case of a slight degradation of results, it is sufficient to finetune the trimmed model.
  • At an equivalent compute budget, it is even better to trim a model and then finetune it in order to benefit from the training acceleration obtained from the model reduction.
  • The method is competitive with distillation to the point that we recommend preferring trimming over the latter when the desired parameter reduction is equivalent.
  • In the same way, trimming can be more relevant than quantization, and the two methods complement each other well.
  • An interesting behavior we observe but that we are currently unable to replicate automatically is that a model trimmed on a given language can apparently generate reasoning traces in that target language, thus avoiding the need to generate them in English and then translate them.
  • Finally, although we trim a model on a given language, depending on the final downstream task, the model can still work in other languages.

Open questions remain, such as the number of tokens to retain for a given language / modality / task, the order of trimming, and the impact on biases.

We are working on making this method easier to use via a tool. In the meantime, we invite you to explore the 5,526 models available via this Space.



References



Citation

@misc{hf_blogpost_trimming,
      title={Introduction to Trimming}, 
      author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
      year={2026},
      url={https://huggingface.co/blog/lbourdois/introduction-to-trimming}, 
}

Models mentioned in this article 10

Collections mentioned in this article 9

Community

wow 🤯

Sign up or log in to comment

Models mentioned in this article 10

Collections mentioned in this article 9