usage: [-h] [-config CONFIG] [-save_config SAVE_CONFIG]
                     [--data_type DATA_TYPE] --train_src TRAIN_SRC
                     [TRAIN_SRC ...] --train_tgt TRAIN_TGT [TRAIN_TGT ...]
                     [--train_align TRAIN_ALIGN [TRAIN_ALIGN ...]]
                     [--train_ids TRAIN_IDS [TRAIN_IDS ...]]
                     [--valid_src VALID_SRC] [--valid_tgt VALID_TGT]
                     [--valid_align VALID_ALIGN] [--src_dir SRC_DIR]
                     --save_data SAVE_DATA [--max_shard_size MAX_SHARD_SIZE]
                     [--shard_size SHARD_SIZE] [--num_threads NUM_THREADS]
                     [--overwrite] [--src_vocab SRC_VOCAB]
                     [--tgt_vocab TGT_VOCAB]
                     [--features_vocabs_prefix FEATURES_VOCABS_PREFIX]
                     [--src_vocab_size SRC_VOCAB_SIZE]
                     [--tgt_vocab_size TGT_VOCAB_SIZE]
                     [--vocab_size_multiple VOCAB_SIZE_MULTIPLE]
                     [--src_words_min_frequency SRC_WORDS_MIN_FREQUENCY]
                     [--tgt_words_min_frequency TGT_WORDS_MIN_FREQUENCY]
                     [--dynamic_dict] [--share_vocab]
                     [--src_seq_length SRC_SEQ_LENGTH]
                     [--src_seq_length_trunc SRC_SEQ_LENGTH_TRUNC]
                     [--tgt_seq_length TGT_SEQ_LENGTH]
                     [--tgt_seq_length_trunc TGT_SEQ_LENGTH_TRUNC] [--lower]
                     [--filter_valid] [--shuffle SHUFFLE] [--seed SEED]
                     [--report_every REPORT_EVERY] [--log_file LOG_FILE]
                     [--log_file_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET,50,40,30,20,10,0}]
                     [--sample_rate SAMPLE_RATE] [--window_size WINDOW_SIZE]
                     [--window_stride WINDOW_STRIDE] [--window WINDOW]
                     [--image_channel_size {3,1}]
                     [--subword_prefix SUBWORD_PREFIX]
                     [--subword_prefix_is_joiner]

Named Arguments

-config, --config

Path to the config file

-save_config, --save_config

Path where the config file is saved


--data_type, -data_type

Type of the source input. Options are [text|img|audio|vec].

Default: “text”

--train_src, -train_src

Path(s) to the training source data

--train_tgt, -train_tgt

Path(s) to the training target data

--train_align, -train_align

Path(s) to the training src-tgt alignment

Default: [None]

--train_ids, -train_ids

IDs used to name the training shards; used for corpus weighting

Default: [None]

--valid_src, -valid_src

Path to the validation source data

--valid_tgt, -valid_tgt

Path to the validation target data

--valid_align, -valid_align

Path to the validation src-tgt alignment

--src_dir, -src_dir

Source directory for image or audio files.

Default: “”

--save_data, -save_data

Output file for the prepared data

--max_shard_size, -max_shard_size

Deprecated; use shard_size instead.

Default: 0

--shard_size, -shard_size

Split the source and target corpora into smaller files, then build shards from them. Each shard holds shard_size samples, except possibly the last one. shard_size=0 disables sharding; shard_size>0 splits the dataset into shards of shard_size samples each.

Default: 1000000
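
The sharding rule described for shard_size can be sketched as follows. This is a minimal illustration, not the tool's actual code; the function name split_into_shards and the in-memory list inputs are assumptions made for the example.

```python
from typing import Iterator, List, Sequence, Tuple


def split_into_shards(
    src: Sequence[str], tgt: Sequence[str], shard_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (src, tgt) shards of at most shard_size samples each.

    Hypothetical helper: shard_size <= 0 is treated as "no sharding",
    mirroring the documented meaning of shard_size=0.
    """
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of samples")
    if shard_size <= 0:
        # No segmentation: the whole corpus is a single shard.
        yield list(src), list(tgt)
        return
    for start in range(0, len(src), shard_size):
        end = start + shard_size
        # The last shard may be shorter than shard_size.
        yield list(src[start:end]), list(tgt[start:end])


if __name__ == "__main__":
    src = ["s1", "s2", "s3", "s4", "s5"]
    tgt = ["t1", "t2", "t3", "t4", "t5"]
    for i, (s, t) in enumerate(split_into_shards(src, tgt, shard_size=2)):
        print(i, s, t)
```

With five samples and shard_size=2, this produces shards of 2, 2, and 1 samples.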

--num_threads, -num_threads

Number of shards to build in parallel.

Default: 1

--overwrite, -overwrite

Overwrite existing shards if any.

Default: False


--src_vocab, -src_vocab

Path to an existing source vocabulary. Format: one word per line.

Default: “”

--tgt_vocab, -tgt_vocab

Path to an existing target vocabulary. Format: one word per line.

Default: “”

--features_vocabs_prefix, -features_vocabs_prefix

Path prefix to existing features vocabularies

Default: “”

--src_vocab_size, -src_vocab_size

Size of the source vocabulary

Default: 50000

--tgt_vocab_size, -tgt_vocab_size

Size of the target vocabulary

Default: 50000

--vocab_size_multiple, -vocab_size_multiple

Make the vocabulary size a multiple of this value

Default: 1
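
One plausible reading of vocab_size_multiple is that the vocabulary is padded with filler entries until its size is a multiple of the given value. The sketch below illustrates that rounding under this assumption; the function names and the filler token format are hypothetical.

```python
from typing import List


def padded_vocab_size(size: int, multiple: int) -> int:
    """Round size up to the next multiple of `multiple` (integer ceiling)."""
    if multiple < 1:
        raise ValueError("multiple must be >= 1")
    return -(-size // multiple) * multiple


def pad_vocab(tokens: List[str], multiple: int) -> List[str]:
    """Append filler tokens so that len(result) is a multiple of `multiple`.

    The filler name "<filler_N>" is an assumption for illustration only.
    """
    target = padded_vocab_size(len(tokens), multiple)
    fillers = [f"<filler_{i}>" for i in range(target - len(tokens))]
    return tokens + fillers


if __name__ == "__main__":
    print(padded_vocab_size(50004, 8))
    print(pad_vocab(["a", "b", "c"], 4))
```

With the default value of 1, the vocabulary is left unchanged.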

--src_words_min_frequency, -src_words_min_frequency

Default: 0

--tgt_words_min_frequency, -tgt_words_min_frequency

Default: 0
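
The two min-frequency options carry no description here. A common interpretation, assumed in the sketch below, is that words occurring fewer times than the threshold are dropped before the vocabulary is cut to the size limit. The function build_vocab is hypothetical, and special tokens such as padding or unknown-word symbols are left out.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab(sentences: Iterable[str], max_size: int, min_frequency: int) -> List[str]:
    """Return up to max_size words seen at least min_frequency times.

    Assumption: words are whitespace-separated and are ranked by
    descending count, with ties broken alphabetically.
    """
    counter = Counter(tok for line in sentences for tok in line.split())
    threshold = max(min_frequency, 1)  # a threshold of 0 keeps every seen word
    ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    kept = [tok for tok, freq in ranked if freq >= threshold]
    return kept[:max_size]


if __name__ == "__main__":
    corpus = ["a b b c c c", "c d"]
    print(build_vocab(corpus, max_size=50000, min_frequency=2))
```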

--dynamic_dict, -dynamic_dict

Create dynamic dictionaries

Default: False

--share_vocab, -share_vocab

Share source and target vocabulary

Default: False


--src_seq_length, -src_seq_length

Maximum source sequence length

Default: 50

--src_seq_length_trunc, -src_seq_length_trunc

Truncate source sequences to this length.

--tgt_seq_length, -tgt_seq_length

Maximum target sequence length to keep.

Default: 50

--tgt_seq_length_trunc, -tgt_seq_length_trunc

Truncate target sequences to this length.
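
The interaction between the maximum-length and truncation options can be sketched as below. The ordering (truncate first, then filter) and the use of None for "no truncation" are assumptions for this example; prepare_pair is a hypothetical helper, not part of the tool.

```python
from typing import List, Optional, Tuple


def prepare_pair(
    src: List[str],
    tgt: List[str],
    src_seq_length: int = 50,
    tgt_seq_length: int = 50,
    src_seq_length_trunc: Optional[int] = None,
    tgt_seq_length_trunc: Optional[int] = None,
) -> Optional[Tuple[List[str], List[str]]]:
    """Truncate a token pair, then drop it if either side is empty or too long."""
    if src_seq_length_trunc is not None:
        src = src[:src_seq_length_trunc]
    if tgt_seq_length_trunc is not None:
        tgt = tgt[:tgt_seq_length_trunc]
    # Keep the example only if both sides are non-empty and within the limits.
    if 0 < len(src) <= src_seq_length and 0 < len(tgt) <= tgt_seq_length:
        return src, tgt
    return None


if __name__ == "__main__":
    print(prepare_pair("a b c".split(), "x y".split(), src_seq_length=2))
    print(prepare_pair("a b c".split(), "x y".split(), src_seq_length=2,
                       src_seq_length_trunc=2))
```

Under these assumptions, truncation lets an over-long example survive the length filter instead of being discarded.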

--lower, -lower

Lowercase the data

Default: False

--filter_valid, -filter_valid

Filter validation data by src and/or tgt length

Default: False


--shuffle, -shuffle

Shuffle data

Default: 0

--seed, -seed

Random seed

Default: 3435


--report_every, -report_every

Report status every this many sentences

Default: 100000

--log_file, -log_file

Output logs to a file under this path.

Default: “”

--log_file_level, -log_file_level

Possible choices: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET, 50, 40, 30, 20, 10, 0

Default: “0”
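
The accepted choices mix level names and their numeric equivalents from Python's standard logging module. A minimal sketch of resolving either form to an integer level follows; resolve_log_level is a hypothetical helper, not the tool's own parsing code.

```python
import logging

# Names accepted on the command line, mapped to the standard logging constants.
LEVELS = {
    "CRITICAL": logging.CRITICAL,  # 50
    "ERROR": logging.ERROR,        # 40
    "WARNING": logging.WARNING,    # 30
    "INFO": logging.INFO,          # 20
    "DEBUG": logging.DEBUG,        # 10
    "NOTSET": logging.NOTSET,      # 0
}


def resolve_log_level(value: str) -> int:
    """Accept either a level name ("INFO") or its numeric string ("20")."""
    if value in LEVELS:
        return LEVELS[value]
    if value.isdigit() and int(value) in LEVELS.values():
        return int(value)
    raise ValueError(f"unsupported log level: {value!r}")


if __name__ == "__main__":
    print(resolve_log_level("INFO"), resolve_log_level("0"))
```

The default "0" corresponds to NOTSET.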


--sample_rate, -sample_rate

Sample rate.

Default: 16000

--window_size, -window_size

Window size for spectrogram in seconds.

Default: 0.02

--window_stride, -window_stride

Window stride for spectrogram in seconds.

Default: 0.01

--window, -window

Window type for spectrogram generation.

Default: “hamming”
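
Because window_size and window_stride are given in seconds, they are typically converted to sample counts using sample_rate before the spectrogram is computed. The sketch below shows that arithmetic; the names n_fft and hop_length follow common STFT terminology and are assumptions here, as is the function itself.

```python
from typing import Tuple


def stft_params(
    sample_rate: int = 16000,
    window_size: float = 0.02,
    window_stride: float = 0.01,
) -> Tuple[int, int]:
    """Convert window size and stride from seconds to samples."""
    n_fft = int(sample_rate * window_size)         # samples per analysis window
    hop_length = int(sample_rate * window_stride)  # samples between windows
    return n_fft, hop_length


if __name__ == "__main__":
    print(stft_params())
```

With the defaults, each window spans 320 samples and consecutive windows start 160 samples apart, so adjacent windows overlap by half.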

--image_channel_size, -image_channel_size

Possible choices: 3, 1

Using grayscale images can make training faster and the model smaller

Default: 3


--subword_prefix, -subword_prefix

Subword prefix used to build the word-start mask

Default: “▁”

--subword_prefix_is_joiner, -subword_prefix_is_joiner

The mask needs to be inverted if the prefix is a joiner

Default: False
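
A rough sketch of how a word-start mask could be derived from these two options is shown below. It assumes that a token beginning with the prefix opens a new word, and that a joiner prefix instead marks a token attached to the previous one, which is why the mask is inverted. The function word_start_mask and the joiner character used in the example are hypothetical.

```python
from typing import List


def word_start_mask(
    tokens: List[str],
    subword_prefix: str = "▁",
    subword_prefix_is_joiner: bool = False,
) -> List[bool]:
    """Return True for every token that begins a new word."""
    mask = [tok.startswith(subword_prefix) for tok in tokens]
    if subword_prefix_is_joiner:
        # A joiner marks continuation tokens, so word starts are the complement.
        mask = [not flag for flag in mask]
    return mask


if __name__ == "__main__":
    print(word_start_mask(["▁Hel", "lo", "▁world"]))
    print(word_start_mask(["Hel", "￭lo", "world"], "￭", True))
```

Both calls mark the first and third tokens as word starts.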