OpenNMT provides generic tokenization utilities to quickly process new training data.
For LuaJIT users, tokenization tools require the
To tokenize a corpus:
th tools/tokenize.lua OPTIONS < file > file.tok
If you activate the -joiner_annotate marker, the tokenization is reversible. Just use:
th tools/detokenize.lua OPTIONS < file.tok > file.detok
￨ is the feature separator symbol. If this character appears in the source text, it is replaced by its non-presentation form
￭ is the default joiner marker (generated in -joiner_annotate marker mode). If this character appears in the source text, it is replaced by its non-presentation form
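To illustrate why joiner annotation makes tokenization reversible, here is a minimal Python sketch. It is not OpenNMT's implementation: the function names and the simple "split punctuation from words" rule are assumptions made for illustration, and only the ￭ marker itself comes from the text above.

```python
import re

JOINER = "￭"  # the default joiner marker described above

def tokenize(text):
    """Split punctuation from words and mark every place where no space existed."""
    tokens = []
    for chunk in text.split():
        # Pieces inside one chunk were adjacent (no space) in the source text.
        pieces = re.findall(r"\w+|[^\w\s]", chunk)
        marked = list(pieces)
        for i in range(1, len(pieces)):
            if re.match(r"\w", pieces[i]) is None:
                # Punctuation glued to what precedes it: mark its left side.
                marked[i] = JOINER + marked[i]
            else:
                # A word glued to preceding punctuation: mark that punctuation's right side.
                marked[i - 1] = marked[i - 1] + JOINER
        tokens.extend(marked)
    return tokens

def detokenize(tokens):
    """Undo tokenize() by removing each joiner together with the adjacent space."""
    text = " ".join(tokens)
    return text.replace(" " + JOINER, "").replace(JOINER + " ", "")
```

For example, `tokenize("Hello, world!")` yields the tokens `Hello`, `￭,`, `world`, `￭!`, and `detokenize` restores the original string. The round trip in this sketch is exact only when the input uses single spaces and does not itself contain the joiner character.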
Mixed casing words
The -segment_case feature enables the tokenizer to segment words into subwords, each with one of 3 casing types (truecase ('House'), uppercase ('HOUSE') or lowercase ('house')), which helps restore the right casing during detokenization. This feature is especially useful for texts with a significant number of mixed-case words ('WiFi' -> 'Wi' and 'Fi').
WiFi --> wi￨C fi￨C
TVs --> tv￨U s￨L
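The segmentation above can be approximated with a short Python sketch. The splitting rule (a regular expression over ASCII letters) and the function names are assumptions for illustration, not the tokenizer's actual logic; the feature labels C, U and L follow the example above.

```python
import re

SEP = "￨"  # feature separator symbol

# Assumed rule: a run of 2+ capitals, a capital followed by lowercase letters,
# a lone capital, or a run of lowercase letters. ASCII letters only.
_SEGMENT = re.compile(r"[A-Z]{2,}|[A-Z][a-z]+|[A-Z]|[a-z]+")

def case_feature(segment):
    """Return L (lowercase), U (uppercase) or C (capitalized) for one segment."""
    if segment.islower():
        return "L"
    if len(segment) > 1 and segment.isupper():
        return "U"
    return "C"  # first letter capitalized; a lone capital is also labelled C here

def segment_case(word):
    """Split a word into case-uniform segments and attach the case feature."""
    parts = _SEGMENT.findall(word)
    return " ".join(part.lower() + SEP + case_feature(part) for part in parts)
```

With this rule, `segment_case("WiFi")` gives `wi￨C fi￨C` and `segment_case("TVs")` gives `tv￨U s￨L`, matching the two examples. Characters other than ASCII letters are ignored by this sketch.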
Two options provide specific tokenization depending on alphabet:
-segment_alphabet_change: tokenize a sequence between two letters when their alphabets differ - for instance between a Latin alphabet character and a Han character.
-segment_alphabet Alphabet: tokenize all words of the indicated alphabet into characters - for instance to split a Chinese sentence into characters, use
君子之心不胜其小，而气量涵盖一世。 --> 君 子 之 心 不 胜 其 小 ， 而 气 量 涵 盖 一 世 。
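As a rough illustration of alphabet-based splitting, the following Python sketch separates every Han character into its own token and also breaks a token where a Latin run meets a Han character. It is an approximation under stated assumptions: "Han" is reduced to the CJK Unified Ideographs block (U+4E00 to U+9FFF), and the function name is hypothetical.

```python
import re

# Assumption: the Han alphabet is approximated by one Unicode block.
HAN = "\u4e00-\u9fff"

_TOKEN = re.compile(
    rf"[{HAN}]"        # each Han character is its own token
    rf"|[^\W{HAN}]+"   # a run of non-Han word characters stays together
    r"|[^\w\s]"        # each punctuation mark is its own token
)

def segment_han(text):
    """Return the text with Han characters and punctuation separated by spaces."""
    return " ".join(_TOKEN.findall(text))
```

Applied to the sentence above, `segment_han` reproduces the character-by-character output shown, and `segment_han("abc君子")` returns `abc 君 子`, which shows the split at the Latin/Han boundary.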
OpenNMT's BPE module fully supports the original BPE as default mode:
tools/learn_bpe.lua -size 30000 -save_bpe codes < input
tools/tokenize.lua -bpe_model codes < input
with two additional features:
1. Add support for different modes of handling prefixes and/or suffixes:
suffix: BPE merge operations are learnt to distinguish sub-tokens like "ent" in the middle of a word and "ent<\w>" at the end of a word. "<\w>" is an artificial marker appended to the end of each input token and treated as a single unit before doing statistics on bigrams. This is the default mode, which is useful for most languages.
prefix: BPE merge operations are learnt to distinguish sub-tokens like "ent" in the middle of a word and "<w>ent" at the beginning of a word. "<w>" is an artificial marker prepended to the beginning of each input token and treated as a single unit before doing statistics on bigrams.
none: No artificial marker is added to input tokens; a sub-token is treated the same whether it appears at the beginning, in the middle, or at the end of a token.
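The effect of the three modes on the bigram statistics can be sketched in Python as follows. This is an illustration only, not the code of learn_bpe.lua: the function names and the `mode` parameter are hypothetical, and the markers are written exactly as in the descriptions above.

```python
from collections import Counter

EOT = "<\\w>"  # end-of-token marker, spelled as in the text above
BOT = "<w>"    # beginning-of-token marker

def initial_symbols(token, mode="suffix"):
    """Split a token into characters, adding the marker as one single symbol."""
    symbols = list(token)
    if mode == "suffix":
        symbols.append(EOT)
    elif mode == "prefix":
        symbols.insert(0, BOT)
    elif mode != "none":
        raise ValueError(f"unknown mode: {mode}")
    return symbols

def pair_counts(tokens, mode="suffix"):
    """Count adjacent symbol pairs, the statistic BPE uses to choose merges."""
    counts = Counter()
    for token in tokens:
        symbols = initial_symbols(token, mode)
        counts.update(zip(symbols, symbols[1:]))
    return counts
```

For the tokens "parent" and "entry", the pair ("e", "n") is counted twice in every mode, but only the suffix mode also counts ("t", "<\w>"), which is what lets later merges tell a word-final "ent" from a word-internal one.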
2. Add support for BPE in addition to the case feature:
OpenNMT's tokenization flow first applies BPE, then adds the case feature for each input token. With standard BPE, "Constitution" and "constitution" may result in different sequences of sub-tokens:
Constitution --> con￨C sti￨l tu￨l tion￨l
constitution --> consti￨l tu￨l tion￨l
If you want a caseless split, so that you get the most out of the case feature, you can achieve that with the following command lines:
# We don't need BPE to care about case
tools/learn_bpe.lua -size 30000 -save_bpe codes_lc < input_lowercased
# The case information is preserved in the true case input
tools/tokenize.lua -bpe_model codes_lc -bpe_case_insensitive < input
The output of the previous example would be:
Constitution --> con￨C sti￨l tu￨l tion￨l
constitution --> con￨l sti￨l tu￨l tion￨l
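The idea behind the caseless split can be sketched in Python: choose the split points on the lowercased token, then read the case feature from the matching span of the original token. Everything here is an assumption for illustration, in particular `toy_split`, which stands in for a BPE model learnt on lowercased data; the labels C, U and l follow the example above.

```python
SEP = "￨"  # feature separator symbol

def case_feature(piece):
    """Return l (lowercase), U (uppercase) or C (capitalized), as in the example."""
    if piece.islower():
        return "l"
    if len(piece) > 1 and piece.isupper():
        return "U"
    return "C"

def caseless_segment(token, split_lowercased):
    """Split `token` at the boundaries chosen for its lowercased form."""
    # Assumes lowercasing does not change the token length (true for ASCII).
    pieces = split_lowercased(token.lower())
    out, start = [], 0
    for piece in pieces:
        original = token[start:start + len(piece)]  # same span, original casing
        out.append(piece + SEP + case_feature(original))
        start += len(piece)
    return " ".join(out)

def toy_split(lowercased):
    """Hypothetical stand-in for applying BPE codes learnt on lowercased text."""
    table = {"constitution": ["con", "sti", "tu", "tion"]}
    return table.get(lowercased, [lowercased])
```

Because the split is decided on the lowercased form, `caseless_segment("Constitution", toy_split)` and `caseless_segment("constitution", toy_split)` produce the same four sub-tokens and differ only in the case feature of the first one.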
Use Lua 5.2 if you encounter any memory issue while using
-size is too big). Otherwise, stay with LuaJIT for better efficiency.