tools/learn_bpe.lua
learn_bpe.lua options:
-h [<boolean>] (default: false)
This help.
-md [<boolean>] (default: false)
Dump help in Markdown format.
-config <string> (default: '')
Load options from this file.
-save_config <string> (default: '')
Save options to this file.
BPE options
-size <string> (default: 30000)
The number of merge operations to learn.
-bpe_mode <string> (accepted: suffix, prefix, both, none; default: suffix)
Define the BPE mode.
  prefix: prepend <w> to the beginning of each word to learn prefix-oriented pair statistics;
  suffix: append </w> to the end of each word to learn suffix-oriented pair statistics, as in the original Python script;
  both: suffix and prefix;
  none: neither suffix nor prefix.
-bpe_EOT_marker <string> (default: </w>)
Marker used to mark the End of Token while applying BPE in mode 'suffix' or 'both'.
-bpe_BOT_marker <string> (default: <w>)
Marker used to mark the Beginning of Token while applying BPE in mode 'prefix' or 'both'.
-save_bpe <string> (required)
Path to save the output model.
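The interaction between -size and -bpe_mode can be illustrated with a minimal Python sketch of BPE merge learning. It is not the code of learn_bpe.lua: the function names (word_to_symbols, merge_pair, learn_bpe) and the tie-breaking rule are assumptions made for this example, and the input is assumed to be an already tokenized list of words.

```python
from collections import Counter

def word_to_symbols(word, mode="suffix", bot="<w>", eot="</w>"):
    """Split a word into characters and add markers according to the mode."""
    symbols = list(word)
    if mode in ("prefix", "both"):
        symbols.insert(0, bot)   # beginning-of-token marker
    if mode in ("suffix", "both"):
        symbols.append(eot)      # end-of-token marker
    return tuple(symbols)

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, size, mode="suffix"):
    """Return up to `size` merge operations learned from an iterable of words."""
    vocab = Counter(word_to_symbols(w, mode) for w in words)
    merges = []
    for _ in range(size):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

if __name__ == "__main__":
    corpus = "low low lower newest newest newest".split()
    for left, right in learn_bpe(corpus, size=5):
        print(left, right)
```

In this sketch, size bounds the number of merges rather than guaranteeing it: learning stops early once no adjacent pairs remain.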
Tokenizer options
-tok_mode <string> (accepted: space, conservative, aggressive; default: space)
Define how aggressive the tokenization should be.
  aggressive: only keeps sequences of letters/numbers;
  conservative: allows a mix of alphanumeric characters, as in "2,000", "E65", "soft-landing", etc.;
  space: splits on spaces only.
-tok_joiner_annotate [<boolean>] (default: false)
Include joiner annotation using the -tok_joiner character.
-tok_joiner <string> (default: ■)
Character used to annotate joiners.
-tok_joiner_new [<boolean>] (default: false)
In -tok_joiner_annotate mode, the -tok_joiner character is an independent token.
-tok_case_feature [<boolean>] (default: false)
Generate case feature.
-tok_segment_case [<boolean>] (default: false)
Segment case feature: splits AbC into Ab C so that case can be restored.
-tok_segment_alphabet <table> (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
Segment all letters from the indicated alphabet.
-tok_segment_numbers [<boolean>] (default: false)
Segment numbers into single digits.
-tok_segment_alphabet_change [<boolean>] (default: false)
Segment if the alphabet changes between two letters.
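The effect of the case and number segmentation options can be approximated with a short Python sketch. This is only an illustration under stated assumptions, not the tokenizer's implementation: the function names are hypothetical, only ASCII letters are handled, and the exact splitting rules of the real tool may differ.

```python
import re

def segment_case(token):
    """Split where a lowercase letter is directly followed by an uppercase one."""
    # Insert a space at each lowercase-to-uppercase boundary, then split on it.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()

def segment_numbers(token):
    """Emit every digit as its own piece; keep runs of non-digits together."""
    return re.findall(r"\d|\D+", token)

if __name__ == "__main__":
    print(segment_case("AbC"))
    print(segment_numbers("v2019"))
```

Splitting "AbC" into "Ab" and "C" leaves each piece with a single case pattern (capitalized, uppercase), which is what makes the case recoverable from a per-token feature.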
HookManager options
-hook_file <string> (default: '')
Path to a Lua file that registers hooks for the current process.
Logger options
-log_file <string> (default: '')
Output logs to a file under this path instead of stdout. If the file name ends with json, output structured JSON.
-disable_logs [<boolean>] (default: false)
If set, output nothing.
-log_level <string> (accepted: DEBUG, INFO, WARNING, ERROR, NONE; default: INFO)
Output logs at this level and above.
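The "this level and above" rule can be read as a threshold over the ordered list of accepted values. The Python sketch below shows one plausible reading; the ordering follows the accepted list above, while the should_log helper and the treatment of NONE as "suppress everything" are assumptions, not documented behavior.

```python
# Accepted values of -log_level, from least to most severe.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "NONE"]

def should_log(message_level, log_level="INFO"):
    """Return True when a message at `message_level` passes the threshold."""
    if log_level == "NONE":
        return False  # assumption: NONE disables all output
    return LEVELS.index(message_level) >= LEVELS.index(log_level)

if __name__ == "__main__":
    for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
        print(level, should_log(level, "WARNING"))
```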