tools/tokenize.lua
tokenize.lua
options:
-h [<boolean>]
(default: false)
This help.
-md [<boolean>]
(default: false)
Dump help in Markdown format.
-config <string>
(default: '')
Load options from this file.
-save_config <string>
(default: '')
Save options to this file.
Tokenizer options
-mode <string>
(accepted: space, conservative, aggressive; default: conservative)
Defines how aggressive the tokenization should be. aggressive only keeps sequences of letters/numbers, conservative allows mixed alphanumeric tokens such as "2,000", "E65" and "soft-landing", and space only splits on spaces.
-joiner_annotate [<boolean>]
(default: false)
Include joiner annotation using the -joiner character.
-joiner <string>
(default: ■)
Character used to annotate joiners.
-joiner_new [<boolean>]
(default: false)
In -joiner_annotate mode, -joiner is an independent token.
-case_feature [<boolean>]
(default: false)
Generate case feature.
-segment_case [<boolean>]
(default: false)
Segment case feature: splits AbC into Ab C so that case can be restored.
-segment_alphabet <table>
(accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
Segment all letters from the indicated alphabet.
-segment_numbers [<boolean>]
(default: false)
Segment numbers into single digits.
-segment_alphabet_change [<boolean>]
(default: false)
Segment when the alphabet changes between two letters.
-bpe_model <string>
(default: '')
Apply Byte Pair Encoding if the BPE model path is given. If this option is used, BPE-related options will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
-bpe_EOT_marker <string>
(default: </w>)
Marker used to mark the End of Token while applying BPE in mode 'suffix' or 'both'.
-bpe_BOT_marker <string>
(default: <w>)
Marker used to mark the Beginning of Token while applying BPE in mode 'prefix' or 'both'.
-bpe_case_insensitive [<boolean>]
(default: false)
Apply BPE internally in lowercase, but still output the truecase units. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
-bpe_mode <string>
(accepted: suffix, prefix, both, none; default: suffix)
Define the BPE mode. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
prefix: prepend -bpe_BOT_marker to the beginning of each word to learn prefix-oriented pair statistics;
suffix: append -bpe_EOT_marker to the end of each word to learn suffix-oriented pair statistics, as in the original Python script;
both: suffix and prefix;
none: neither suffix nor prefix.
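The -bpe_mode values differ only in which boundary markers are attached to a word before pair statistics are computed. The Python sketch below illustrates that single step, using the default values of -bpe_BOT_marker and -bpe_EOT_marker. The function name mark_word is hypothetical, and the code is an illustration of the option description, not the implementation used by tokenize.lua or learn_bpe.lua.

```python
def mark_word(word, mode="suffix", bot="<w>", eot="</w>"):
    """Return the symbol sequence for `word` with BPE boundary markers added.

    Hypothetical helper for illustration only: it follows the -bpe_mode
    description, not the actual Lua code.
    """
    if mode not in ("suffix", "prefix", "both", "none"):
        raise ValueError("unknown BPE mode: " + mode)
    symbols = list(word)  # start from individual characters
    if mode in ("prefix", "both"):
        symbols.insert(0, bot)  # beginning-of-token marker goes first
    if mode in ("suffix", "both"):
        symbols.append(eot)  # end-of-token marker goes last
    return symbols


if __name__ == "__main__":
    for mode in ("suffix", "prefix", "both", "none"):
        print(mode, mark_word("low", mode))
```

With mode "none" the word is left as a plain character sequence, so no boundary information is available to the pair statistics.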
Other options
-nparallel <number>
(default: 1)
Number of parallel threads used to run the tokenization.
-batchsize <number>
(default: 1000)
Size of each parallel batch. You should not change it unless memory is low.
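One way to picture how -nparallel and -batchsize relate: input lines are grouped into batches of at most -batchsize lines, and up to -nparallel workers process those batches. The Python sketch below shows this pattern under stated assumptions. The names batches, tokenize_batch and tokenize_parallel are hypothetical, whitespace normalization stands in for the real tokenizer, and preserving input order is an assumption of the sketch rather than something this page states.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(lines, batchsize):
    """Yield successive lists of at most `batchsize` lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, batchsize))
        if not batch:
            return
        yield batch


def tokenize_batch(batch):
    # Placeholder tokenizer (assumption): collapse runs of whitespace.
    return [" ".join(line.split()) for line in batch]


def tokenize_parallel(lines, nparallel=1, batchsize=1000):
    """Process `lines` in batches using up to `nparallel` worker threads."""
    with ThreadPoolExecutor(max_workers=nparallel) as pool:
        # Executor.map yields results in submission order, so the output
        # keeps the input order. It also submits every batch up front, so
        # this sketch does not bound total memory use.
        results = pool.map(tokenize_batch, batches(lines, batchsize))
        return [line for batch in results for line in batch]


if __name__ == "__main__":
    sample = ["Hello   world", "a  b  c", "single"]
    print(tokenize_parallel(sample, nparallel=2, batchsize=2))
```

In this picture a smaller batch size means each worker holds fewer lines at a time, which is consistent with the advice to lower it only when memory is tight.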
HookManager options
-hook_file <string>
(default: '')
Pointer to a Lua file registering hooks for the current process.