- -data_type [text] Type of the source input. Options are [text|img].
- -train_src  Path to the training source data
- -train_tgt  Path to the training target data
- -valid_src  Path to the validation source data
- -valid_tgt  Path to the validation target data
- -src_dir  Source directory for image or audio files.
- -save_data  Output file for the prepared data
- -max_shard_size  For a large text corpus, the data is divided into shards of this size for preprocessing. If 0, the data is handled as a whole. The unit is bytes. The optimal value is a multiple of 64 bytes. A commonly used sharding value is
- It is recommended to ensure the corpus is shuffled before sharding.
- -shard_size  Divide src_corpus and tgt_corpus into multiple smaller src_corpus and tgt_corpus files, then build shards. Each shard has opt.shard_size samples, except possibly the last. shard_size=0 means no segmentation; shard_size>0 means the dataset is segmented into multiple shards of shard_size samples each.
- -src_vocab  Path to an existing source vocabulary. Format: one word per line.
- -tgt_vocab  Path to an existing target vocabulary. Format: one word per line.
- -features_vocabs_prefix  Path prefix to existing features vocabularies
- -src_vocab_size  Size of the source vocabulary
- -tgt_vocab_size  Size of the target vocabulary
- -src_words_min_frequency 
- -tgt_words_min_frequency 
- -dynamic_dict  Create dynamic dictionaries
- -share_vocab  Share source and target vocabulary
- -src_seq_length  Maximum source sequence length
- -src_seq_length_trunc  Truncate source sequence length.
- -tgt_seq_length  Maximum target sequence length to keep.
- -tgt_seq_length_trunc  Truncate target sequence length.
- -lower  Lowercase data
- -shuffle  Shuffle data
- -seed  Random seed
- -report_every  Report status every this many sentences
- -log_file  Output logs to a file under this path.
- -sample_rate  Sample rate.
- -window_size [0.02] Window size for spectrogram in seconds.
- -window_stride [0.01] Window stride for spectrogram in seconds.
- -window [hamming] Window type for spectrogram generation.
- -image_channel_size  Using grayscale images can make the model smaller and training faster.
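
The sample-based sharding described for `-shard_size` can be illustrated with a short sketch. This is not the tool's actual implementation; the function name `make_shards` and its arguments are hypothetical, chosen only to show the stated behavior: every shard holds `shard_size` source/target pairs except possibly the last, and `shard_size=0` keeps the data in one piece.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def make_shards(src_lines: Iterable[str],
                tgt_lines: Iterable[str],
                shard_size: int) -> Iterator[List[Pair]]:
    """Yield lists of (source, target) pairs, at most shard_size pairs each.

    Hypothetical helper for illustration only.
    """
    # zip returns an iterator, so islice below consumes it incrementally.
    pairs = zip(src_lines, tgt_lines)

    # shard_size=0 means no segmentation: emit everything as a single shard.
    if shard_size <= 0:
        yield list(pairs)
        return

    while True:
        shard = list(islice(pairs, shard_size))
        if not shard:
            return
        yield shard


if __name__ == "__main__":
    src = ["a", "b", "c", "d", "e"]
    tgt = ["1", "2", "3", "4", "5"]
    for index, shard in enumerate(make_shards(src, tgt, 2)):
        print(index, len(shard))
```

Because the pairs are consumed lazily, the sketch never holds more than one shard in memory at a time when `shard_size` is positive. Shuffling the corpus beforehand, as recommended above, has to happen before the lines reach this function.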