Tokenization

OpenNMT provides generic tokenization utilities to quickly process new training data. The goal of tokenization is to convert raw sentences into sequences of tokens. In that process, two main operations are performed in sequence:

  • normalization - applies uniform transformations to the source sentences, for instance to identify and protect specific sequences (such as URLs), to normalize characters (all types of quotes, Unicode variants), or to normalize variable expressions (such as dates) into a unique representation that is simpler for the translation process
  • the tokenization itself - transforms each normalized sentence into a sequence of space-separated tokens, possibly with additional features (such as case).

Normalization

Normalization is performed by a user-provided command-line tool that works in "pipeline" mode: sentences are read from standard input, normalized, and written to standard output. For instance, the following Python script normalizes the Unicode representation (to the NFC form), turns French quotes «» into English quotes “”, and protects "hashtag" sequences:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re
import sys
import unicodedata

for line in sys.stdin:
    line = line.strip()
    # Normalize the Unicode representation to the NFC form.
    line = unicodedata.normalize('NFC', line)
    # Turn French quotes into English quotes.
    line = line.replace("«", "“").replace("»", "”")
    # Protect hashtags (at the start of the line or preceded by whitespace)
    # by wrapping them in ⦅...⦆ so they survive tokenization.
    line = re.sub(r'(^|\s)#([A-Za-z0-9_]+)', '\\1⦅#\\2⦆', line)
    print(line)

The normalization script is invoked as part of the tokenization by adding the option -normalize_cmd "normalize.py".
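
For instance (file names are illustrative), the script above can be plugged into the tokenization like this:

th tools/tokenize.lua -normalize_cmd "normalize.py" < file > file.tok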

Tokenization

To tokenize a corpus:

th tools/tokenize.lua OPTIONS < file > file.tok

Available tokenization modes (such as conservative, the default, and aggressive) are described in the tokenizer options.
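
To illustrate the difference between these two modes (indicative output; the exact splits depend on the tokenizer version and the other options), the conservative mode keeps numbers with separators together while the aggressive mode splits on every punctuation character:

It costs 2,000 dollars. --> It costs 2,000 dollars .     (-mode conservative)
It costs 2,000 dollars. --> It costs 2 , 000 dollars .   (-mode aggressive)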

To make detokenization possible, the tokenization can introduce a joiner annotation mark ￭ (see Special characters below).

Detokenization

If you activate the -joiner_annotate option, the tokenization is reversible. Just use:

th tools/detokenize.lua OPTIONS < file.tok > file.detok
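
For instance (indicative output with the default options), a sentence goes through the round trip unchanged:

Hello, world! --> Hello ￭, world ￭! --> Hello, world!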

Special characters

  • ￨ (U+FFE8) is the feature separator symbol. If this character occurs in the source text, it is replaced by its non-presentation form │ (U+2502).
  • ￭ (U+FFED) is the default joiner marker (generated in -joiner_annotate mode). If this character occurs in the source text, it is replaced by its non-presentation form ■ (U+25A0).
  • ⦅...⦆ (U+FF5F, U+FF60) mark a sequence as protected: it will not be tokenized and its case feature is N. Protected sequences can be used to encode placeholders - typically document format tags - and may have additional fields:
    • if the protected sequence contains the character ： (U+FF1A), the first part of the protected sequence is a placeholder and will be used as the vocabulary entry during translation. For instance, in ⦅URL：http://www.opennmt.net⦆, the protected sequence will be seen as ⦅URL⦆, which is a placeholder for the complete URL (a fuller example follows this list).
    • if the protected sequence contains the character ＃ (U+FF03), the sequence is considered unique, which is particularly useful to enforce the translation of tags using GBS.
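
To make the placeholder mechanism concrete (a sketch only; the exact tokenized form depends on the other options), consider the input:

Visit ⦅URL：http://www.opennmt.net⦆ today.

The protected span is kept as a single token with case feature N, and on the model side only the placeholder ⦅URL⦆ needs to be part of the vocabulary, standing in for the complete URL.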

Mixed casing words

The -segment_case option enables the tokenizer to segment words into subwords, each with one of three casing types (truecase 'House', uppercase 'HOUSE', or lowercase 'house'), which helps restore the right casing during detokenization. This feature is especially useful for texts with a significant number of mixed-case words ('WiFi' -> 'Wi' + 'Fi'), as in the examples and the command sketch below.

WiFi --> wi│C fi│C
TVs --> tv│U s│L
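
A command sketch for producing this kind of output (assuming -segment_case is combined with -case_feature so that the casing is carried as a token feature; file names are illustrative):

th tools/tokenize.lua -segment_case -case_feature < file > file.tok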

Alphabet Segmentation

Two options provide alphabet-specific tokenization:

  • -segment_alphabet_change: tokenize between two letters when their alphabets differ - for instance between a Latin character and a Han character.
  • -segment_alphabet Alphabet: segment all words of the indicated alphabet into characters - for instance, to split a Chinese sentence into characters, use -segment_alphabet Han (a command sketch follows the example):
君子之心不胜其小,而气量涵盖一世。 --> 君 子 之 心 不 胜 其 小 , 而 气 量 涵 盖 一 世 。
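
The corresponding command could be (file names are illustrative):

th tools/tokenize.lua -segment_alphabet Han < file.zh > file.zh.tok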

Number Segmentation

The option -segment_number tokenizes numbers into digits. This option is useful for letting the neural network fully handle the conversion/translation of numeric entities.

1984 --> 1 9 8 4
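
These segmentation options can be combined on a single command line; for instance (an illustrative combination, not a recommendation):

th tools/tokenize.lua -segment_number -segment_alphabet_change -joiner_annotate < file > file.tok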

BPE

OpenNMT's BPE module fully supports the original BPE as its default mode:

tools/learn_bpe.lua -size 30000 -save_bpe codes < input_tokenized
tools/tokenize.lua -bpe_model codes < input_tokenized

with three additional features:

1. Accept raw text as input and use OpenNMT's tokenizer for pre-tokenization before BPE training

tools/learn_bpe.lua -size 30000 -save_bpe codes -tok_mode aggressive -tok_segment_alphabet_change [ OTHER_TOK_OPTIONS ] [ OTHER_BPE_TRAINING_OPTIONS ] < input_raw
tools/tokenize.lua -bpe_model codes -mode aggressive -segment_alphabet_change [ SAME_TOK_OPTIONS ] < input_raw

Note

All TOK_OPTIONS for learn_bpe.lua have their equivalent for tokenize.lua without the tok_ prefix.

Warning

When applying BPE to any data set, the same TOK_OPTIONS should be used for learn_bpe.lua and tokenize.lua.

2. Add the BPE_TRAINING_OPTION -bpe_mode for different modes of handling prefixes and/or suffixes (an example follows the list):

  • suffix: BPE merge operations are learnt to distinguish sub-tokens like "ent" in the middle of a word from "ent</w>" at the end of a word. "</w>" is an artificial marker appended to the end of each input token and treated as a single unit before computing statistics on bigrams. This is the default mode and is suitable for most languages.
  • prefix: BPE merge operations are learnt to distinguish sub-tokens like "ent" in the middle of a word from "<w>ent" at the beginning of a word. "<w>" is an artificial marker prepended to the beginning of each input token and treated as a single unit before computing statistics on bigrams.
  • both: suffix + prefix.
  • none: no artificial marker is added to input tokens; a sub-token is treated the same whether it occurs in the middle, at the beginning, or at the end of a token.
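
For instance (a sketch; codes_prefix is an arbitrary output name), learning and applying a prefix-mode model follows the same pattern as the default commands above:

tools/learn_bpe.lua -size 30000 -bpe_mode prefix -save_bpe codes_prefix < input_tokenized
tools/tokenize.lua -bpe_model codes_prefix < input_tokenized

The chosen -bpe_mode is stored in the model's header, so tokenize.lua picks it up automatically (see the warning below).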

Warning

When a BPE_TRAINING_OPTION is used for learn_bpe.lua, the resulting BPE model already contains all the necessary BPE_INFERENCE_OPTIONS, which tokenize.lua automatically reads from the model's header (BPE_INFERENCE_OPTIONS are the tokenizer options with the bpe_ prefix). Therefore, if these options are passed on the command line of tokenize.lua, they will be OVERRIDDEN by those contained in the BPE model's header.

3. Add case-insensitive BPE in combination with the case feature

OpenNMT's tokenization flow first applies BPE, then adds the case feature for each input token. With standard BPE, "Constitution" and "constitution" may result in different sequences of sub-tokens:

Constitution --> con│C sti│l tu│l tion│l
constitution --> consti│l tu│l tion│l

If you want a case-insensitive split so that you can take full advantage of the case feature, you can achieve that with the following command lines:

# We don't need BPE to care about case
tools/learn_bpe.lua -size 30000 -save_bpe codes_lc -tok_case_feature [ OTHER_TOK_OPTIONS ] [ OTHER_BPE_TRAINING_OPTIONS ] < input_raw

# The case information is preserved in the true case input
tools/tokenize.lua -bpe_model codes_lc -case_feature [ SAME_TOK_OPTIONS ] < input_raw

The output of the previous example would be:

Constitution --> con│C sti│l tu│l tion│l
constitution --> con│l sti│l tu│l tion│l

Warning

The BPE_INFERENCE_OPTION '-bpe_case_insensitive' in tokenize.lua exists only for cases where the BPE model was not trained by learn_bpe.lua (typically it was trained by the original learn_bpe.py) but one still wants to use it for tokenization with the case feature. This is not recommended, because it is tricky to handle 'manually' the compatibility between the tokenization used to learn the BPE model and the tokenization on top of which the learnt model is applied.

LEAVE WHAT IS BPE'S TO BPE!