tools/tokenize.lua
tokenize.lua options:
-h [<boolean>] (default: false)
This help.
-md [<boolean>] (default: false)
Dump help in Markdown format.
-config <string> (default: '')
Load options from this file.
-save_config <string> (default: '')
Save options to this file.
Tokenizer options
-mode <string> (accepted: space, conservative, aggressive; default: conservative)
Define how aggressive the tokenization should be. 'aggressive' only keeps sequences of letters or numbers; 'conservative' allows a mix of alphanumeric characters, as in "2,000", "E65", "soft-landing", etc.; 'space' only splits on spaces.
-joiner_annotate [<boolean>] (default: false)
Include joiner annotation using the -joiner character.
-joiner <string> (default: ■)
Character used to annotate joiners.
-joiner_new [<boolean>] (default: false)
In -joiner_annotate mode, -joiner is an independent token.
-case_feature [<boolean>] (default: false)
Generate case feature.
-segment_case [<boolean>] (default: false)
Segment case feature: splits "AbC" into "Ab C" so that case can be restored.
-segment_alphabet <table> (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
Segment all letters from the indicated alphabet.
-segment_numbers [<boolean>] (default: false)
Segment numbers into single digits.
-segment_alphabet_change [<boolean>] (default: false)
Segment if the alphabet changes between 2 letters.
-bpe_model <string> (default: '')
Apply Byte Pair Encoding if the BPE model path is given. If this option is used, the BPE-related options are overridden/set automatically when the BPE model specified by -bpe_model was learnt using learn_bpe.lua.
-bpe_EOT_marker <string> (default: </w>)
Marker used to mark the End of Token while applying BPE in mode 'suffix' or 'both'.
-bpe_BOT_marker <string> (default: <w>)
Marker used to mark the Beginning of Token while applying BPE in mode 'prefix' or 'both'.
-bpe_case_insensitive [<boolean>] (default: false)
Apply BPE internally in lowercase, but still output the truecase units. This option is overridden/set automatically when the BPE model specified by -bpe_model was learnt using learn_bpe.lua.
-bpe_mode <string> (accepted: suffix, prefix, both, none; default: suffix)
Define the BPE mode. This option is overridden/set automatically when the BPE model specified by -bpe_model was learnt using learn_bpe.lua.
'prefix': prepend -bpe_BOT_marker to the beginning of each word to learn prefix-oriented pair statistics.
'suffix': append -bpe_EOT_marker to the end of each word to learn suffix-oriented pair statistics, as in the original Python script.
'both': apply both 'suffix' and 'prefix'.
'none': apply neither 'suffix' nor 'prefix'.
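
To make the joiner options above more concrete, here is a minimal Python sketch of how joiner annotation could be reversed to recover the original spacing. It is not the tool's own implementation. The function name detokenize and the sample token strings are assumptions made for illustration; only the default joiner character ■ and the idea that -joiner_new emits the joiner as a separate token come from the option descriptions.

```python
JOINER = "■"  # default value of -joiner


def detokenize(line, joiner=JOINER):
    """Merge tokens back together wherever a joiner marks a removed boundary.

    Hypothetical helper: handles joiners attached to a token ("■," or "£■")
    as well as standalone joiners (the -joiner_new style).
    """
    out = []
    glue_next = False  # True when the next token must attach to the previous one
    for token in line.split():
        if token == joiner:
            # Standalone joiner: glue its two neighbours together.
            glue_next = True
            continue
        glue_prev = glue_next
        if token.startswith(joiner):
            # Joiner on the left: attach this token to the previous one.
            token = token[len(joiner):]
            glue_prev = True
        # Joiner on the right: the following token attaches to this one.
        glue_next = token.endswith(joiner)
        if glue_next:
            token = token[:-len(joiner)]
        if glue_prev and out:
            out[-1] += token
        else:
            out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    # Assumed examples of annotated output, not captured from the tool.
    print(detokenize("Hello ■, world ■!"))
    print(detokenize("Hello ■ , world ■ !"))
```

Without joiner annotation there is no marker telling which spaces were introduced by the tokenizer, so a sketch like this would have nothing to work from.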
Other options
-nparallel <number> (default: 1)
Number of parallel threads used to run the tokenization.
-batchsize <number> (default: 1000)
Size of each parallel batch. You should not change this unless memory is low.
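
The interaction between -nparallel and -batchsize can be pictured with a small Python sketch. This only illustrates the general batching idea and is not how tokenize.lua is implemented: the function names batches, tokenize_batch, and tokenize_parallel are hypothetical, and the whitespace normalization stands in for real tokenization.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(lines, batchsize=1000):
    """Yield consecutive slices of at most `batchsize` lines."""
    for start in range(0, len(lines), batchsize):
        yield lines[start:start + batchsize]


def tokenize_batch(batch):
    """Placeholder tokenizer (assumption): collapse runs of whitespace."""
    return [" ".join(line.split()) for line in batch]


def tokenize_parallel(lines, nparallel=1, batchsize=1000):
    """Process batches on `nparallel` worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=nparallel) as pool:
        # pool.map returns results in the order the batches were submitted.
        results = pool.map(tokenize_batch, batches(lines, batchsize))
        return [line for batch in results for line in batch]


if __name__ == "__main__":
    sample = ["a  b   c", "d e", "  f  "]
    print(tokenize_parallel(sample, nparallel=2, batchsize=2))
```

In a scheme like this, a smaller batch size means each worker handles fewer lines at a time, which is the usual reason to lower it when memory is tight.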
HookManager options
-hook_file <string> (default: '')
Path to a Lua file registering hooks for the current process.