preprocess.lua

Options:

-h [<boolean>] (default: false)
    This help.

-md [<boolean>] (default: false)
    Dump help in Markdown format.

-config <string> (default: '')
    Load options from this file.

-save_config <string> (default: '')
    Save options to this file.
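
For example, a full set of options can be written out once with -save_config and reloaded on a later run with -config; the file names below are only illustrative:

    th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt -save_config preprocess.conf
    th preprocess.lua -config preprocess.conf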

Preprocess options

-data_type <string> (accepted: bitext, monotext, feattext; default: bitext)
    Type of data to preprocess. Use 'monotext' for monolingual data. This option impacts all other option choices.

-dry_run [<boolean>] (default: false)
    If set, this will only prepare the preprocessor. Useful when using file sampling to test distribution rules.

-save_data <string> (default: '')
    Output file for the prepared data.
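
A typical bitext run points the script at parallel training and validation files (see Data options below) and an output prefix; the paths are illustrative:

    th preprocess.lua -data_type bitext \
        -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
        -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
        -save_data data/demo

Adding -dry_run to the same command only prepares the preprocessor, which is handy for checking sampling rules before writing anything.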

Data options

-train_dir <string> (default: '')
    Path to training files directory.

-train_src <string> (default: '')
    Path to the training source data.

-train_tgt <string> (default: '')
    Path to the training target data.

-valid_src <string> (default: '')
    Path to the validation source data.

-valid_tgt <string> (default: '')
    Path to the validation target data.

-src_vocab <string> (default: '')
    Path to an existing source vocabulary.

-src_suffix <string> (default: .src)
    Suffix for source files in train/valid directories.

-src_vocab_size <table> (default: 50000)
    List of source vocabulary sizes: word[ feat1[ feat2[ ...] ] ]. If = 0, vocabularies are not pruned.

-src_words_min_frequency <table> (default: 0)
    List of source words minimum frequency: word[ feat1[ feat2[ ...] ] ]. If = 0, vocabularies are pruned by size.

-tgt_vocab <string> (default: '')
    Path to an existing target vocabulary.

-tgt_suffix <string> (default: .tgt)
    Suffix for target files in train/valid directories.

-tgt_vocab_size <table> (default: 50000)
    List of target vocabulary sizes: word[ feat1[ feat2[ ...] ] ]. If = 0, vocabularies are not pruned.

-tgt_words_min_frequency <table> (default: 0)
    List of target words minimum frequency: word[ feat1[ feat2[ ...] ] ]. If = 0, vocabularies are pruned by size.
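
The size and frequency options interact: a non-zero -src_vocab_size keeps only the most frequent words, while setting it to 0 together with a minimum frequency prunes by frequency instead. For example (flags shown in isolation, to be added to a full command):

    # keep the 30,000 most frequent words on each side
    -src_vocab_size 30000 -tgt_vocab_size 30000

    # keep every word seen at least 5 times instead of pruning by size
    -src_vocab_size 0 -src_words_min_frequency 5 -tgt_vocab_size 0 -tgt_words_min_frequency 5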

-src_seq_length <number> (default: 50)
    Maximum source sequence length.

-tgt_seq_length <number> (default: 50)
    Maximum target sequence length.

-check_plength [<boolean>] (default: false)
    Check that source and target have the same length (for sequence tagging).

-features_vocabs_prefix <string> (default: '')
    Path prefix to existing features vocabularies.

-time_shift_feature [<boolean>] (default: true)
    Time-shift features on the decoder side.

-keep_frequency [<boolean>] (default: false)
    Keep the frequency of words in the dictionary.

-gsample <number> (default: 0)
    If not zero, extract a new sample from the corpus. In training mode, file sampling is done at each epoch. Values between 0 and 1 indicate a ratio; values higher than 1 indicate an absolute data size.

-gsample_dist <string> (default: '')
    Configuration file with the data class distribution to use for sampling the training corpus. If not set, sampling is uniform.
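
Because -gsample values below 1 are ratios, drawing roughly 10% of a corpus and checking the result without producing output can be sketched as follows (assuming the files in data/corpus use the default .src/.tgt suffixes):

    th preprocess.lua -train_dir data/corpus \
        -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
        -save_data data/sample-test \
        -gsample 0.1 -dry_run

A -gsample_dist file can be added on top of this to make the sampling follow a given data class distribution instead of being uniform.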

-sort [<boolean>] (default: true)
    If set, sort the sequences by size to build batches without source padding.

-shuffle [<boolean>] (default: true)
    If set, shuffle the data (prior to sorting).

-idx_files [<boolean>] (default: false)
    If set, source and target files are in 'key value' format, with keys matched between source and target.
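
With -idx_files enabled, every line starts with a key that is matched between the source and target files; a hypothetical pair of files:

    # source file                    # target file
    ex1 a short test sentence .      ex1 une courte phrase de test .
    ex2 another example .            ex2 un autre exemple .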

-report_progress_every <number> (default: 100000)
    Report status every this many sentences.

-preprocess_pthreads <number> (default: 4)
    Number of parallel threads for preprocessing.

Tokenizer options

-tok_{src,tgt}_mode <string> (accepted: conservative, aggressive, space; default: space)
    Define how aggressive the tokenization should be. 'space' is space-tokenization.

-tok_{src,tgt}_joiner_annotate [<boolean>] (default: false)
    Include joiner annotation using the -joiner character.

-tok_{src,tgt}_joiner <string> (default: ■)
    Character used to annotate joiners.

-tok_{src,tgt}_joiner_new [<boolean>] (default: false)
    In -joiner_annotate mode, the -joiner is an independent token.
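
For example, to tokenize both sides aggressively while keeping the information needed to restore the original spacing later, joiner annotation can be switched on per side (other options elided):

    th preprocess.lua ... \
        -tok_src_mode aggressive -tok_src_joiner_annotate \
        -tok_tgt_mode aggressive -tok_tgt_joiner_annotate

The ■ joiner (configurable via -tok_{src,tgt}_joiner) then marks the positions where tokens were split apart.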

-tok_{src,tgt}_case_feature [<boolean>] (default: false)
    Generate the case feature.

-tok_{src,tgt}_segment_case [<boolean>] (default: false)
    Segment the case feature: splits 'AbC' into 'Ab C' so that case can be restored.

-tok_{src,tgt}_segment_alphabet <table> (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
    Segment all letters from the indicated alphabet.

-tok_{src,tgt}_segment_numbers [<boolean>] (default: false)
    Segment numbers into single digits.

-tok_{src,tgt}_segment_alphabet_change [<boolean>] (default: false)
    Segment if the alphabet changes between two letters.
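
For scripts written without spaces or for digit-heavy data, the segmentation switches can be combined; for instance, splitting every Han character and every digit on the source side:

    -tok_src_segment_alphabet Han -tok_src_segment_numbers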

-tok_{src,tgt}_bpe_model <string> (default: '')
    Apply Byte Pair Encoding if the BPE model path is given. If this option is used, BPE-related options will be overridden/set automatically if the BPE model specified by -bpe_model was learnt using learn_bpe.lua.

-tok_{src,tgt}_bpe_EOT_marker <string> (default: </w>)
    Marker used to mark the End of Token while applying BPE in mode 'prefix' or 'both'.

-tok_{src,tgt}_bpe_BOT_marker <string> (default: <w>)
    Marker used to mark the Beginning of Token while applying BPE in mode 'suffix' or 'both'.

-tok_{src,tgt}_bpe_case_insensitive [<boolean>] (default: false)
    Apply BPE internally in lowercase, but still output the truecase units. This option will be overridden/set automatically if the BPE model specified by -bpe_model was learnt using learn_bpe.lua.

-tok_{src,tgt}_bpe_mode <string> (accepted: suffix, prefix, both, none; default: suffix)
    Define the BPE mode. This option will be overridden/set automatically if the BPE model specified by -bpe_model was learnt using learn_bpe.lua. prefix: append -bpe_BOT_marker to the beginning of each word to learn prefix-oriented pair statistics; suffix: append -bpe_EOT_marker to the end of each word to learn suffix-oriented pair statistics, as in the original Python script; both: suffix and prefix; none: neither suffix nor prefix.
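
When a BPE model produced by learn_bpe.lua is available, pointing the corresponding option at it is enough, since mode and markers are then taken over from the model as described above; the paths below are only placeholders:

    th preprocess.lua ... \
        -tok_src_bpe_model data/bpe-src.codes \
        -tok_tgt_bpe_model data/bpe-tgt.codes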

HookManager options

-hook_file <string> (default: '')
    Pointer to a Lua file registering hooks for the current process.

Logger options

-log_file <string> (default: '')
    Output logs to a file under this path instead of stdout. If the file name ends with 'json', the output is structured JSON.

-disable_logs [<boolean>] (default: false)
    If set, output nothing.

-log_level <string> (accepted: DEBUG, INFO, WARNING, ERROR, NONE; default: INFO)
    Output logs at this level and above.
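
For example, to keep only warnings and errors and write them to a file instead of stdout (the file name is arbitrary):

    th preprocess.lua ... -log_file preprocess.log -log_level WARNING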

Other options

-seed <number> (default: 3425)
    Random seed.