

translate.lua

translate.lua options:

-h [<boolean>] (default: false)
This help.
-md [<boolean>] (default: false)
Dump help in Markdown format.
-config <string> (default: '')
Load options from this file.
-save_config <string> (default: '')
Save options to this file.

Data options

-src <string> (required)
Source sequences to translate.
-tgt <string> (default: '')
Optional true target sequences.
-output <string> (default: pred.txt)
Output file.
-save_attention <string> (default: '')
Optional attention output file.
-batch_size <number> (default: 30)
Batch size.
-idx_files [<boolean>] (default: false)
If set, source and target files contain 'key value' lines, and keys are used to match source entries with target entries.
-detokenize_output [<boolean>] (default: false)
Detokenize output.
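
As a rough illustration of the 'key value' layout that -idx_files expects, the Python sketch below reads such lines and pairs source and target entries by key. The helper names (parse_idx_lines, pair_by_key) and the choice to split on the first space are assumptions made for illustration; this is not translate.lua's actual implementation.

```python
def parse_idx_lines(lines):
    """Map each key to its sequence, splitting on the first space (assumed format)."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(" ")
        entries[key] = value.strip()
    return entries


def pair_by_key(src_lines, tgt_lines):
    """Return (key, source, target) triples for keys present in both inputs."""
    src = parse_idx_lines(src_lines)
    tgt = parse_idx_lines(tgt_lines)
    # Keep the source order; drop keys that have no target counterpart.
    return [(key, src[key], tgt[key]) for key in src if key in tgt]
```

With this layout, the line order of the two files no longer has to agree, because entries are aligned by key rather than by position.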

Translator options

-model <string> (default: '')
Path to the serialized model file.
-lm_model <string> (default: '')
Path to serialized language model file.
-lm_weight <number> (default: 0.1)
Relative weight of language model.
-beam_size <number> (default: 5)
Beam size.
-max_sent_length <number> (default: 250)
Maximum output sentence length.
-replace_unk [<boolean>] (default: false)
Replace the generated <unk> tokens with the source token that has the highest attention weight. If -phrase_table is provided, it will look up the identified source token and give the corresponding target token. If it is not provided (or the identified source token does not exist in the table), it will copy the source token.
-replace_unk_tagged [<boolean>] (default: false)
The same as -replace_unk, but wrap the replaced token in ｟unk:xxxxx｠ if it is not found in the phrase table.
-lexical_constraints [<boolean>] (default: false)
Force the beam search to apply the translations from the phrase table.
-limit_lexical_constraints [<boolean>] (default: false)
Prevents producing each lexical constraint more than required.
-placeholder_constraints [<boolean>] (default: false)
Force the beam search to reproduce placeholders in the translation.
-phrase_table <string> (default: '')
Path to source-target dictionary to replace <unk> tokens.
-n_best <number> (default: 1)
If > 1, it will also output an n-best list of decoded sentences.
-max_num_unks <number> (default: inf)
All sequences with more <unk>s than this will be ignored during beam search.
-target_subdict <string> (default: '')
Path to target words dictionary corresponding to the source.
-pre_filter_factor <number> (default: 1)
Optional; set this only if a filter is being used. Before applying filters, the hypotheses with the top beam_size * pre_filter_factor scores will be considered. If the returned hypotheses violate filters, set this to a larger value to consider more.
-length_norm <number> (default: 0)
Length normalization coefficient (alpha). If set to 0, no length normalization.
-coverage_norm <number> (default: 0)
Coverage normalization coefficient (beta). An extra coverage term multiplied by beta is added to hypotheses scores. If set to 0, no coverage normalization.
-eos_norm <number> (default: 0)
End of sentence normalization coefficient (gamma). If set to 0, no EOS normalization.
-dump_input_encoding [<boolean>] (default: false)
Instead of generating target tokens conditional on the source tokens, we print the representation (encoding/embedding) of the input.
-save_beam_to <string> (default: '')
Path to a file where the beam search exploration will be saved in a JSON format. Requires the dkjson package.
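
The replacement rule described for -replace_unk and -replace_unk_tagged can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code translate.lua runs: the function name, the attention layout (one row of source weights per generated token), and the use of a plain dict as the phrase table are all hypothetical.

```python
def replace_unk(tokens, source, attention, phrase_table=None, tagged=False):
    """Replace each <unk> in tokens using the most-attended source token.

    attention[i][j] is assumed to be the weight given to source token j
    when target token i was generated.
    """
    phrase_table = phrase_table or {}
    output = []
    for i, token in enumerate(tokens):
        if token != "<unk>":
            output.append(token)
            continue
        weights = attention[i]
        # Index of the source token with the highest attention weight.
        j = max(range(len(source)), key=lambda k: weights[k])
        src_token = source[j]
        if src_token in phrase_table:
            # Phrase table hit: emit the corresponding target token.
            output.append(phrase_table[src_token])
        elif tagged:
            # -replace_unk_tagged behavior: wrap the copied source token.
            output.append("｟unk:" + src_token + "｠")
        else:
            # -replace_unk behavior: copy the source token as-is.
            output.append(src_token)
    return output
```

The tagged variant makes copied tokens easy to spot in the output, which can help when checking how often the phrase table misses.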

Tokenizer options

-tok_{src,tgt}_mode <string> (accepted: space, conservative, aggressive; default: conservative)
Define how aggressive the tokenization should be. aggressive only keeps sequences of letters/numbers; conservative allows a mix of alphanumeric characters, as in "2,000", "E65", "soft-landing", etc.; space splits on spaces only.
-tok_{src,tgt}_joiner_annotate [<boolean>] (default: false)
Include joiner annotation using -joiner character.
-tok_{src,tgt}_joiner <string> (default: ￭)
Character used to annotate joiners.
-tok_{src,tgt}_joiner_new [<boolean>] (default: false)
In -joiner_annotate mode, -joiner is an independent token.
-tok_{src,tgt}_case_feature [<boolean>] (default: false)
Generate case feature.
-tok_{src,tgt}_segment_case [<boolean>] (default: false)
Segment case feature: splits AbC into Ab C so that case can be restored.
-tok_{src,tgt}_segment_alphabet <table> (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
Segment all letters from indicated alphabet.
-tok_{src,tgt}_segment_numbers [<boolean>] (default: false)
Segment numbers into single digits.
-tok_{src,tgt}_segment_alphabet_change [<boolean>] (default: false)
Segment when the alphabet changes between two letters.
-tok_{src,tgt}_bpe_model <string> (default: '')
Apply Byte Pair Encoding if the BPE model path is given. If the BPE model specified by -bpe_model was learnt using learn_bpe.lua, the BPE-related options will be overridden/set automatically.
-tok_{src,tgt}_bpe_EOT_marker <string> (default: </w>)
Marker used to mark the End of Token while applying BPE in mode 'suffix' or 'both'.
-tok_{src,tgt}_bpe_BOT_marker <string> (default: <w>)
Marker used to mark the Beginning of Token while applying BPE in mode 'prefix' or 'both'.
-tok_{src,tgt}_bpe_case_insensitive [<boolean>] (default: false)
Apply BPE internally in lowercase, but still output the truecase units. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
-tok_{src,tgt}_bpe_mode <string> (accepted: suffix, prefix, both, none; default: suffix)
Define the BPE mode. This option will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua. prefix: prepend -bpe_BOT_marker to the beginning of each word to learn prefix-oriented pair statistics; suffix: append -bpe_EOT_marker to the end of each word to learn suffix-oriented pair statistics, as in the original Python script; both: suffix and prefix; none: neither suffix nor prefix.
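
To make the four -bpe_mode values concrete, the Python sketch below shows how a word would be marked before pair statistics are collected, following the description above. The function name mark_word is hypothetical, and the sketch only covers marker placement; it does not perform BPE merging.

```python
def mark_word(word, mode="suffix", bot_marker="<w>", eot_marker="</w>"):
    """Attach the begin/end-of-token markers that the given BPE mode calls for."""
    if mode not in ("suffix", "prefix", "both", "none"):
        raise ValueError("unknown BPE mode: " + mode)
    if mode in ("prefix", "both"):
        word = bot_marker + word  # mark the beginning of the token
    if mode in ("suffix", "both"):
        word = word + eot_marker  # mark the end of the token
    return word
```

The defaults mirror the option defaults listed above (suffix mode, <w> and </w> markers).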

Cuda options

-gpuid <table> (default: 0)
List of GPU identifiers (1-indexed). CPU is used when set to 0.
-fallback_to_cpu [<boolean>] (default: false)
If the GPU can't be used, fall back to the CPU.
-fp16 [<boolean>] (default: false)
Use half-precision float on GPU.
-no_nccl [<boolean>] (default: false)
Disable usage of nccl in parallel mode.

HookManager options

-hook_file <string> (default: '')
Path to a Lua file that registers hooks for the current process.

Logger options

-log_file <string> (default: '')
Output logs to a file under this path instead of stdout. If the file name ends with json, output structured JSON.
-disable_logs [<boolean>] (default: false)
If set, output nothing.
-log_level <string> (accepted: DEBUG, INFO, WARNING, ERROR, NONE; default: INFO)
Output logs at this level and above.

Other options

-time [<boolean>] (default: false)
Measure average translation time.