Tokenizer: How to suppress preceding whitespace when giving pre-tokenized list[list[str]] input

I am passing pre-tokenized input into a GPT2 tokenizer, in the form of a list of lists of strings. The reason for this is that I have set token boundaries that I will need to map back to later, and Japanese has no whitespace and ambiguous word boundaries.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/mGPT", clean_up_tokenization_spaces=True)
tokenizer.add_bos_token = True

sents = [
    ["犯人", "は", "検挙", "さ", "れ", "て", "おら", "ず" "、" "2012", "年", "8", "月", "現在", "未", "解決"],
    ["まず", "は", "下記", "を", "ご覧", "下さい", "。"],
]

toks = tokenizer(sents, return_tensors="pt", is_split_into_words=True, padding=True)

However, I notice that the tokenizer systematically inserts a whitespace character in front of each pre-given token. With Japanese, that whitespace then ends up encoded as a separate sub-token.

>>> print(*(tokenizer.decode(tok) for tok in toks["input_ids"][0]), sep="|")

<s>| |犯|人| は| |検|挙| |さ| |れ| |て| |お|ら| |ず|、|2012| |年| 8| |月| |現在| |未| |解決

This is a problem because whitespaces are usually not there at all in Japanese, and they will mess up the predictions. At the same time, Japanese is not the only language I’m working with, and others, such as English, will need whitespace separation on some, but not all, tokens.

So I’m trying to get the tokenizer to do two things:

  1. Suppress the whitespace insertion behaviour with pre-given tokens. The start of a token should just be the first character, unless (2)
  2. Represent the whitespace if and only if it is there at the beginning of the token.

I’ve been trawling the documentation but I haven’t found a way to do this, and I feel like it should be simple. Any help would be much appreciated.

I thought there might be a simple way, but it seems like there really isn’t one.
It seems to be a constraint of models that use Transformers’ byte-level BPE tokenizers.
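
One possible workaround is to skip is_split_into_words entirely, encode each pre-given token on its own, and concatenate the ids yourself. The sketch below is only an illustration: encode_pretokenized, pad_batch and toy_encode are hypothetical names, and toy_encode is a stand-in so the code runs without downloading a model. With the real tokenizer you would presumably pass something like lambda w: tokenizer.encode(w, add_special_tokens=False), on the assumption that the tokenizer is left at its default of not adding a prefix space to plain strings.

```python
from typing import Callable, List, Sequence, Tuple


def encode_pretokenized(
    words: Sequence[str],
    encode_word: Callable[[str], List[int]],
) -> Tuple[List[int], List[int]]:
    """Encode each word independently and concatenate the ids.

    Returns the flat id list plus, for every id, the index of the
    word it came from, so predictions can be mapped back later.
    """
    input_ids: List[int] = []
    word_index: List[int] = []
    for i, word in enumerate(words):
        ids = encode_word(word)  # no space is prepended here
        input_ids.extend(ids)
        word_index.extend([i] * len(ids))
    return input_ids, word_index


def pad_batch(batch: Sequence[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad id lists to equal length and build an attention mask."""
    width = max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (width - len(ids)) for ids in batch]
    mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return padded, mask


def toy_encode(word: str) -> List[int]:
    """Stand-in encoder (assumption): one id per character."""
    return [ord(ch) for ch in word]


# A leading space survives only where the token itself carries one.
ids, word_index = encode_pretokenized(["まず", "は", " world"], toy_encode)
print(word_index)
```

Because every word is encoded separately, sub-tokens can never cross your boundaries, and a leading space is represented only when it is actually in the token. The trade-off is that special tokens such as the BOS id, padding and tensor conversion have to be handled by hand.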

I feared this might be the case. I’ve been trying to track down in the codebase exactly where the extra whitespace is added in is_split_into_words mode, but I can’t find it.

For the time being I will just use the Japanese with the whitespace inserted and see what the results are like, but this solution might be a good enough workaround. I’m just worried about aligning back to the original values because the two tokenizations may well cross boundaries.
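
For the alignment worry, a rough sketch of one way to check it: compare character spans of the two segmentations and list, for each sub-token, every original token it overlaps. The function names are hypothetical, and the sketch assumes both segmentations concatenate to exactly the same string, which will not hold if a byte-level sub-token splits a single character. In that case the character offsets from a fast tokenizer (return_offsets_mapping=True) are probably a better source for the sub-token spans.

```python
from typing import List, Sequence, Tuple


def char_spans(pieces: Sequence[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of consecutive pieces."""
    spans, pos = [], 0
    for piece in pieces:
        spans.append((pos, pos + len(piece)))
        pos += len(piece)
    return spans


def align_subtokens(words: Sequence[str], subtokens: Sequence[str]) -> List[List[int]]:
    """For each sub-token, list the indices of the words it overlaps.

    A result with more than one index means that sub-token crosses
    one of the original token boundaries.
    """
    if "".join(words) != "".join(subtokens):
        raise ValueError("segmentations do not cover the same string")
    word_spans = char_spans(words)
    alignment = []
    for s_start, s_end in char_spans(subtokens):
        overlapping = [
            i
            for i, (w_start, w_end) in enumerate(word_spans)
            if w_start < s_end and s_start < w_end  # half-open interval overlap
        ]
        alignment.append(overlapping)
    return alignment


# Made-up segmentation in which the second sub-token crosses a boundary.
print(align_subtokens(["まず", "は", "下記"], ["ま", "ずは", "下記"]))
```

Any sub-token that maps to two or more word indices is one where the tokenizations disagree, so those are the positions that need a policy (for example, assigning the prediction to the first overlapping word).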

Thanks for your help!