I am passing pre-tokenized input into a GPT2 tokenizer, in the form of a list of lists of strings. The reason for this is that I have set token boundaries that I will need to map back to later, and Japanese has no whitespace and ambiguous word boundaries.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/mGPT", clean_up_tokenization_spaces=True)
tokenizer.add_bos_token = True
sents = [
["犯人", "は", "検挙", "さ", "れ", "て", "おら", "ず", "、", "2012", "年", "8", "月", "現在", "未", "解決"],
["まず", "は", "下記", "を", "ご覧", "下さい", "。"],
]
toks = tokenizer(sents, return_tensors="pt", is_split_into_words=True, padding=True)
However, the tokenizer systematically inserts a space in front of each pre-split word, and with Japanese that space usually ends up as a separate sub-token of its own:
>>> print(*(tokenizer.decode(tok) for tok in toks["input_ids"][0]), sep="|")
<s>| |犯|人| は| |検|挙| |さ| |れ| |て| |お|ら| |ず|、|2012| |年| 8| |月| |現在| |未| |解決
This is a problem because Japanese text normally contains no whitespace at all, so the extra spaces will distort the predictions. At the same time, Japanese is not the only language I’m working with, and others, such as English, need a leading space on some, but not all, tokens.
So I’m trying to get the tokenizer to do two things:
1. Suppress the whitespace insertion for pre-tokenized input. A token should start with its own first character, unless (2) applies.
2. Represent a leading whitespace if and only if it is actually present at the beginning of the token.
I’ve been trawling the documentation but I haven’t found a way to do this, and I feel like it should be simple. Any help would be much appreciated.
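To make the target behaviour concrete, here is a rough sketch of what I mean, done by hand: encode each pre-split word on its own and concatenate the ids, keeping a word index so I can map sub-tokens back to my boundaries. The names `encode_pretokenized`, `pad_batch` and `toy_encode` are made up for this illustration, and `toy_encode` is only a byte-level stand-in so the snippet runs without downloading a model.

```python
from typing import Callable, List, Optional, Tuple


def encode_pretokenized(
    words: List[str],
    encode_word: Callable[[str], List[int]],
    bos_id: Optional[int] = None,
) -> Tuple[List[int], List[Optional[int]]]:
    """Encode each word separately and concatenate the ids.

    Returns (input_ids, word_index), where word_index[i] is the position in
    `words` that produced input_ids[i], or None for the BOS token.
    """
    input_ids: List[int] = []
    word_index: List[Optional[int]] = []
    if bos_id is not None:
        input_ids.append(bos_id)
        word_index.append(None)
    for i, word in enumerate(words):
        # Nothing is prepended here, so a leading space is encoded
        # only when `word` itself starts with one.
        piece = encode_word(word)
        input_ids.extend(piece)
        word_index.extend([i] * len(piece))
    return input_ids, word_index


def pad_batch(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad id lists to equal length; return (padded_ids, attention_mask)."""
    width = max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (width - len(ids)) for ids in batch]
    mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return padded, mask


def toy_encode(word: str) -> List[int]:
    """Hypothetical stand-in encoder: one id per UTF-8 byte."""
    return list(word.encode("utf-8"))


sents = [["犯人", "は"], ["the", " cat"]]
encoded = [encode_pretokenized(s, toy_encode, bos_id=256) for s in sents]
padded, mask = pad_batch([ids for ids, _ in encoded], pad_id=257)
```

With the real tokenizer I would presumably pass something like `lambda w: tokenizer.encode(w, add_special_tokens=False)` as `encode_word`, with `tokenizer.bos_token_id` and `tokenizer.pad_token_id` for the special ids. That assumes the plain-string path does not add a prefix space for this model, which I have not verified. It also means sub-word merges can never cross my word boundaries, which is what I want here, but I would still prefer a built-in option if one exists.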