preprocess.lua
preprocess.lua options:
-h [<boolean>] (default: false)
This help.
-md [<boolean>] (default: false)
Dump help in Markdown format.
-config <string> (default: '')
Load options from this file.
-save_config <string> (default: '')
Save options to this file.
Preprocess options
-data_type <string> (accepted: bitext, monotext, feattext; default: bitext)
Type of data to preprocess. Use 'monotext' for monolingual data. This option affects the choices available for all other options.
-dry_run [<boolean>] (default: false)
If set, this will only prepare the preprocessor. Useful when using file sampling to test distribution rules.
-save_data <string> (default: '')
Output file for the prepared data.
Data options
-train_dir <string> (default: '')
Path to training files directory.
-train_src <string> (default: '')
Path to the training source data.
-train_tgt <string> (default: '')
Path to the training target data.
-valid_src <string> (default: '')
Path to the validation source data.
-valid_tgt <string> (default: '')
Path to the validation target data.
-src_vocab <string> (default: '')
Path to an existing source vocabulary.
-src_suffix <string> (default: .src)
Suffix for source files in train/valid directories.
-src_vocab_size <table> (default: 50000)
List of source vocabulary sizes: word[ feat1[ feat2[ ...] ] ]. If = 0, vocabularies are not pruned.
-src_words_min_frequency <table> (default: 0)
List of source word minimum frequencies: word[ feat1[ feat2[ ...] ] ]. If = 0, vocabularies are pruned by size.
-tgt_vocab <string> (default: '')
Path to an existing target vocabulary.
-tgt_suffix <string> (default: .tgt)
Suffix for target files in train/valid directories.
-tgt_vocab_size <table> (default: 50000)
List of target vocabulary sizes: word[ feat1[ feat2[ ...] ] ]. If = 0, vocabularies are not pruned.
-tgt_words_min_frequency <table> (default: 0)
List of target word minimum frequencies: word[ feat1[ feat2[ ...] ] ]. If = 0, vocabularies are pruned by size.
-src_seq_length <number> (default: 50)
Maximum source sequence length.
-tgt_seq_length <number> (default: 50)
Maximum target sequence length.
-check_plength [<boolean>] (default: false)
Check that source and target have the same length (for sequence tagging).
-features_vocabs_prefix <string> (default: '')
Path prefix to existing features vocabularies.
-time_shift_feature [<boolean>] (default: true)
Time shift features on the decoder side.
-keep_frequency [<boolean>] (default: false)
Keep frequency of words in dictionary.
-gsample <number> (default: 0)
If not zero, extract a new sample from the corpus. In training mode, file sampling is done at each epoch. Values between 0 and 1 indicate a ratio, values higher than 1 indicate a data size.
-gsample_dist <string> (default: '')
Configuration file with data class distribution to use for sampling training corpus. If not set, sampling is uniform.
-sort [<boolean>] (default: true)
If set, sort the sequences by size to build batches without source padding.
-shuffle [<boolean>] (default: true)
If set, shuffle the data (prior to sorting).
-idx_files [<boolean>] (default: false)
If set, source and target files are in 'key value' format, with keys matching between source and target.
-report_progress_every <number> (default: 100000)
Report status every this many sentences.
-preprocess_pthreads <number> (default: 4)
Number of parallel threads for preprocessing.
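
The interaction between the vocabulary size and minimum frequency options can be sketched in a few lines of Python. This is only an illustration of the behaviour the descriptions above suggest, not the code preprocess.lua runs: the function name `prune_vocab`, the precedence of a non-zero minimum frequency over the size limit, and the inclusive `>=` threshold are all assumptions, and special tokens are ignored.

```python
from collections import Counter


def prune_vocab(counts, vocab_size=50000, min_frequency=0):
    """Return the kept words, most frequent first (hypothetical helper).

    Assumed precedence, inferred from the option descriptions:
    - min_frequency > 0: keep words seen at least that many times;
    - otherwise vocab_size > 0: keep the vocab_size most frequent words;
    - both 0: keep everything.
    """
    # Sort by descending count, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if min_frequency > 0:
        return [word for word, count in ranked if count >= min_frequency]
    if vocab_size > 0:
        return [word for word, _ in ranked[:vocab_size]]
    return [word for word, _ in ranked]


counts = Counter("the cat sat on the mat the cat".split())
print(prune_vocab(counts, vocab_size=2))      # pruned by size
print(prune_vocab(counts, min_frequency=2))   # pruned by frequency
print(prune_vocab(counts, vocab_size=0))      # not pruned
```

With the toy corpus above, both pruning calls keep only `the` and `cat`, while a size of 0 keeps all five words.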
Tokenizer options
-tok_{src,tgt}_mode <string> (accepted: conservative, aggressive, space; default: space)
Define how aggressive the tokenization should be. space is space-tokenization.
-tok_{src,tgt}_joiner_annotate [<boolean>] (default: false)
Include joiner annotation using the -joiner character.
-tok_{src,tgt}_joiner <string> (default: ■)
Character used to annotate joiners.
-tok_{src,tgt}_joiner_new [<boolean>] (default: false)
In -joiner_annotate mode, -joiner is an independent token.
-tok_{src,tgt}_case_feature [<boolean>] (default: false)
Generate case feature.
-tok_{src,tgt}_segment_case [<boolean>] (default: false)
Segment case feature: splits AbC into Ab C so that case can be restored.
-tok_{src,tgt}_segment_alphabet <table> (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
Segment all letters from indicated alphabet.
-tok_{src,tgt}_segment_numbers [<boolean>] (default: false)
Segment numbers into single digits.
-tok_{src,tgt}_segment_alphabet_change [<boolean>] (default: false)
Segment if the alphabet changes between 2 letters.
-tok_{src,tgt}_bpe_model <string> (default: '')
Apply Byte Pair Encoding if the BPE model path is given. If the option is used, BPE related options will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
-tok_{src,tgt}_bpe_EOT_marker <string> (default: </w>)
Marker used to mark the End of Token while applying BPE in mode 'suffix' or 'both'.
-tok_{src,tgt}_bpe_BOT_marker <string> (default: <w>)
Marker used to mark the Beginning of Token while applying BPE in mode 'prefix' or 'both'.
-tok_{src,tgt}_bpe_case_insensitive [<boolean>] (default: false)
Apply BPE internally in lowercase, but still output the truecase units. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
-tok_{src,tgt}_bpe_mode <string> (accepted: suffix, prefix, both, none; default: suffix)
Define the BPE mode. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua. prefix: prepend -bpe_BOT_marker to the beginning of each word to learn prefix-oriented pair statistics; suffix: append -bpe_EOT_marker to the end of each word to learn suffix-oriented pair statistics, as in the original Python script; both: suffix and prefix; none: neither suffix nor prefix.
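
To make the four BPE modes concrete, the sketch below shows the symbol sequence a word might start from before any merges are applied. It is an illustration under assumptions, not the tokenizer's implementation: `initial_symbols` is a hypothetical name, and treating each marker as a separate symbol is a simplification.

```python
def initial_symbols(word, mode="suffix", bot="<w>", eot="</w>"):
    """Split a word into characters and add the BOT/EOT markers
    required by the given BPE mode (hypothetical helper)."""
    if mode not in ("suffix", "prefix", "both", "none"):
        raise ValueError("unknown BPE mode: " + mode)
    symbols = list(word)
    # prefix and both: mark the beginning of the token.
    if mode in ("prefix", "both"):
        symbols.insert(0, bot)
    # suffix and both: mark the end of the token.
    if mode in ("suffix", "both"):
        symbols.append(eot)
    return symbols


for mode in ("suffix", "prefix", "both", "none"):
    print(mode, initial_symbols("low", mode))
```

In `suffix` mode the word `low` becomes `l o w </w>`, in `prefix` mode `<w> l o w`, in `both` mode it carries both markers, and in `none` mode it is left as bare characters.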
HookManager options
-hook_file <string> (default: '')
Path to a Lua file registering hooks for the current process.
Logger options
-log_file <string> (default: '')
Output logs to a file under this path instead of stdout. If the file name ends with json, output structured JSON.
-disable_logs [<boolean>] (default: false)
If set, output nothing.
-log_level <string> (accepted: DEBUG, INFO, WARNING, ERROR, NONE; default: INFO)
Output logs at this level and above.
Other options
-seed <number> (default: 3425)
Random seed.
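
As a rough illustration of how these options combine, the following Python sketch assembles a command line from a dictionary of option names and values. The file paths are hypothetical placeholders, `build_command` is an illustrative helper rather than part of the tool, and the script is assumed to be launched with Torch's `th` command. Only string and number options are shown, since the value syntax for boolean options is not covered here.

```python
import shlex

# Hypothetical paths; replace them with your own data files.
options = {
    "train_src": "data/src-train.txt",
    "train_tgt": "data/tgt-train.txt",
    "valid_src": "data/src-val.txt",
    "valid_tgt": "data/tgt-val.txt",
    "save_data": "data/demo",
    "src_seq_length": 50,
    "tgt_seq_length": 50,
}


def build_command(options, script="preprocess.lua"):
    """Turn {name: value} pairs into an argument list of the form
    th preprocess.lua -name value ... (illustrative helper)."""
    argv = ["th", script]
    for name, value in options.items():
        argv += ["-" + name, str(value)]
    return argv


# Print the command instead of running it, so the sketch works
# without a Torch installation.
print(shlex.join(build_command(options)))
```

The resulting list can be passed to `subprocess.run` once Torch and the data files are in place.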