Tokenizer: How to suppress preceding whitespace when giving pre-tokenized list[list[str]] input

saarlandi · October 2, 2025, 8:12am

I am passing pre-tokenized input into a GPT2 tokenizer, in the form of a list of lists of strings. The reason for this is that I have set token boundaries that I will need to map back to later, and Japanese has no whitespace and ambiguous word boundaries.

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/mGPT", clean_up_tokenization_spaces=True)
tokenizer.add_bos_token = True

sents = [
    ["犯人", "は", "検挙", "さ", "れ", "て", "おら", "ず" "、" "2012", "年", "8", "月", "現在", "未", "解決"],
    ["まず", "は", "下記", "を", "ご覧", "下さい", "。"],
]

toks = tokenizer(sents, return_tensors="pt", is_split_into_words=True, padding=True)

However, what I notice is that it is systematically inserting a whitespace character in front of the tokens, which, with Japanese, are then treated as separate sub-tokens.

>>>print(*(tokenizer.decode(tok) for tok in toks["input_ids"][0]), sep="|")

<s>| |犯|人| は| |検|挙| |さ| |れ| |て| |お|ら| |ず|、|2012| |年| 8| |月| |現在| |未| |解決

This is a problem because whitespaces are usually not there at all in Japanese, and they will mess up the predictions. At the same time, Japanese is not the only language I’m working with, and others, such as English, will need whitespace separation on some, but not all, tokens.

So I’m trying to get the tokenizer to do two things:

Suppress the whitespace insertion behaviour with pre-given tokens. The start of a token should just be the first character, unless (2)
Represent the whitespace if and only if it is there at the beginning of the token.

I’ve been trawling the documentation but I haven’t found a way to do this, and I feel like it should be simple. Any help would be much appreciated.

John6666 · October 2, 2025, 12:21pm

I thought there might be a simple way, but it seems like there really isn’t…
It seems to be a constraint in models using Transformers’ byte-level BPE tokenizers.

saarlandi · October 2, 2025, 2:31pm

I feared this might be the case. I’ve been trying to track down in the codebase exactly where the extra whitespace is added in is_split_into_tokens mode, but I can’t find it.

For the time being I will just use the Japanese with the whitespace inserted and see what the results are like, but this solution might be a good enough workaround. I’m just worried about aligning back to the original values because the two tokenizations may well cross boundaries.

Thanks for your help!

Topic		Replies	Views
Tokenizer ignores repeated whitespaces 🤗Tokenizers	3	3532	May 19, 2022
How to avoid PreTrainedTokenizerFast.decode to add space between tokens 🤗Transformers	3	128	April 22, 2025
T5Tokenizer add a whitespace token after added special tokens 🤗Tokenizers	0	359	November 22, 2023
Pre_tokenization 🤗Transformers	0	358	April 13, 2023
Training a tokenizer clarification question 🤗Tokenizers	3	61	December 4, 2025

Tokenizer: How to suppress preceding whitespace when giving pre-tokenized list[list[str]] input

Related topics