tools/learn_bpe.lua
learn_bpe.lua options:
-h [<boolean>] (default: false)
This help.

-md [<boolean>] (default: false)
Dump help in Markdown format.

-config <string> (default: '')
Load options from this file.

-save_config <string> (default: '')
Save options to this file.
BPE options
-size <string> (default: 30000)
The number of merge operations to learn.

-bpe_mode <string> (accepted: suffix, prefix, both, none; default: suffix)
Define the BPE mode.
prefix: prepend <w> to the beginning of each word to learn prefix-oriented pair statistics.
suffix: append </w> to the end of each word to learn suffix-oriented pair statistics, as in the original Python script.
both: suffix and prefix.
none: neither suffix nor prefix.

-bpe_EOT_marker <string> (default: </w>)
Marker used to mark the End of Token while applying BPE in mode 'suffix' or 'both'.

-bpe_BOT_marker <string> (default: <w>)
Marker used to mark the Beginning of Token while applying BPE in mode 'prefix' or 'both'.

-save_bpe <string> (required)
Path to save the output model.
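
To make the interaction between -size and -bpe_mode concrete, here is a minimal Python sketch of BPE merge learning. It is not the code of learn_bpe.lua: the function name learn_bpe, its parameters, and the tie-breaking rule are assumptions made for illustration, and the markers are treated as separate symbols, which is a simplification.

```python
from collections import Counter


def learn_bpe(words, num_merges, mode="suffix", eot="</w>", bot="<w>"):
    """Learn up to num_merges merge operations (illustrative sketch only)."""
    if mode not in ("suffix", "prefix", "both", "none"):
        raise ValueError("unknown mode: " + mode)

    # Represent each word as a tuple of symbols, with optional markers.
    vocab = Counter()
    for word in words:
        symbols = list(word)
        if mode in ("prefix", "both"):
            symbols.insert(0, bot)   # beginning-of-token marker
        if mode in ("suffix", "both"):
            symbols.append(eot)      # end-of-token marker
        vocab[tuple(symbols)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol

        # Most frequent pair; ties broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)

        # Apply the merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    corpus = ["low", "low", "lower", "newest", "newest", "widest"]
    for left, right in learn_bpe(corpus, num_merges=5, mode="suffix"):
        print(left, right)
```

In this sketch, num_merges plays the role of -size: learning stops early when no pairs are left to merge, so a small corpus can yield fewer merges than requested.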
Tokenizer options
-tok_mode <string> (accepted: space, conservative, aggressive; default: space)
Define how aggressive the tokenization should be.
aggressive only keeps sequences of letters/numbers.
conservative allows a mix of alphanumeric characters, as in "2,000", "E65", "soft-landing", etc.
space does space tokenization.

-tok_joiner_annotate [<boolean>] (default: false)
Include joiner annotation using the -joiner character.

-tok_joiner <string> (default: ■)
Character used to annotate joiners.

-tok_joiner_new [<boolean>] (default: false)
In -joiner_annotate mode, -joiner is an independent token.

-tok_case_feature [<boolean>] (default: false)
Generate case feature.

-tok_segment_case [<boolean>] (default: false)
Segment case feature: splits AbC into Ab C so that case can be restored.

-tok_segment_alphabet <table> (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
Segment all letters from the indicated alphabet.

-tok_segment_numbers [<boolean>] (default: false)
Segment numbers into single digits.

-tok_segment_alphabet_change [<boolean>] (default: false)
Segment when the alphabet changes between two letters.
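
The effect of the segmentation switches is easier to see on concrete tokens. The Python sketch below mimics what -tok_segment_numbers and -tok_segment_case are described as doing. The helper names segment_numbers and segment_case are hypothetical, and the case rule used here (split wherever a lowercase letter is followed by an uppercase one) is an assumption that reproduces the AbC example, not necessarily the tokenizer's exact rule.

```python
import re


def segment_numbers(token):
    """Split each digit into its own piece and keep non-digit runs intact."""
    return re.findall(r"\d|\D+", token)


def segment_case(token):
    """Split where a lowercase letter is followed by an uppercase letter."""
    pieces, current = [], ""
    for ch in token:
        if current and current[-1].islower() and ch.isupper():
            pieces.append(current)
            current = ch
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    print(segment_numbers("2017"))
    print(segment_numbers("E65"))
    print(segment_case("AbC"))
```

Each resulting piece has a simple case pattern (for example, capitalized or all lowercase), which is what allows the original casing to be restored after lowercasing.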
HookManager options
-hook_file <string> (default: '')
Pointer to a Lua file registering hooks for the current process.
Logger options
-log_file <string> (default: '')
Output logs to a file under this path instead of stdout. If the file name ends with json, output structured JSON.

-disable_logs [<boolean>] (default: false)
If set, output nothing.

-log_level <string> (accepted: DEBUG, INFO, WARNING, ERROR, NONE; default: INFO)
Output logs at this level and above.
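
As a rough illustration of how these options combine, the Python sketch below assembles an argument list for the script using only flags listed on this page. The helper build_learn_bpe_command is hypothetical, the th launcher is an assumption about how Lua Torch scripts are run, and codes.bpe is a placeholder path. How the training text is supplied to the script is not covered here, so the sketch only builds the command and does not execute it.

```python
import shlex

BPE_MODES = ("suffix", "prefix", "both", "none")
TOK_MODES = ("space", "conservative", "aggressive")
LOG_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "NONE")


def build_learn_bpe_command(save_bpe, size=30000, bpe_mode="suffix",
                            tok_mode="space", log_level="INFO"):
    """Return an argument list; defaults mirror those documented above."""
    if bpe_mode not in BPE_MODES:
        raise ValueError("unsupported -bpe_mode: " + bpe_mode)
    if tok_mode not in TOK_MODES:
        raise ValueError("unsupported -tok_mode: " + tok_mode)
    if log_level not in LOG_LEVELS:
        raise ValueError("unsupported -log_level: " + log_level)
    return [
        "th", "tools/learn_bpe.lua",   # launcher name is an assumption
        "-size", str(size),
        "-bpe_mode", bpe_mode,
        "-tok_mode", tok_mode,
        "-log_level", log_level,
        "-save_bpe", save_bpe,         # the only required option
    ]


if __name__ == "__main__":
    cmd = build_learn_bpe_command("codes.bpe", size=10000, bpe_mode="both")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Validating the accepted values before building the command catches typos such as a misspelled mode early, instead of leaving them to the script's own option parser.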