Build Vocab

build_vocab.py

usage: build_vocab.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] -data
                      DATA [-skip_empty_level {silent,warning,error}]
                      [-transforms {filtertoolong,prefix,bart,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} [{filtertoolong,prefix,bart,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} ...]]
                      -save_data SAVE_DATA [-overwrite] [-n_sample N_SAMPLE]
                      [-dump_samples] [-num_threads NUM_THREADS]
                      [-vocab_sample_queue_size VOCAB_SAMPLE_QUEUE_SIZE]
                      -src_vocab SRC_VOCAB [-tgt_vocab TGT_VOCAB]
                      [-share_vocab] [--src_seq_length SRC_SEQ_LENGTH]
                      [--tgt_seq_length TGT_SEQ_LENGTH]
                      [--permute_sent_ratio PERMUTE_SENT_RATIO]
                      [--rotate_ratio ROTATE_RATIO]
                      [--insert_ratio INSERT_RATIO]
                      [--random_ratio RANDOM_RATIO] [--mask_ratio MASK_RATIO]
                      [--mask_length {subword,word,span-poisson}]
                      [--poisson_lambda POISSON_LAMBDA]
                      [--replace_length {-1,0,1}]
                      [-switchout_temperature SWITCHOUT_TEMPERATURE]
                      [-tokendrop_temperature TOKENDROP_TEMPERATURE]
                      [-tokenmask_temperature TOKENMASK_TEMPERATURE]
                      [-src_subword_model SRC_SUBWORD_MODEL]
                      [-tgt_subword_model TGT_SUBWORD_MODEL]
                      [-src_subword_nbest SRC_SUBWORD_NBEST]
                      [-tgt_subword_nbest TGT_SUBWORD_NBEST]
                      [-src_subword_alpha SRC_SUBWORD_ALPHA]
                      [-tgt_subword_alpha TGT_SUBWORD_ALPHA]
                      [-src_subword_vocab SRC_SUBWORD_VOCAB]
                      [-tgt_subword_vocab TGT_SUBWORD_VOCAB]
                      [-src_vocab_threshold SRC_VOCAB_THRESHOLD]
                      [-tgt_vocab_threshold TGT_VOCAB_THRESHOLD]
                      [-src_subword_type {none,sentencepiece,bpe}]
                      [-tgt_subword_type {none,sentencepiece,bpe}]
                      [-src_onmttok_kwargs SRC_ONMTTOK_KWARGS]
                      [-tgt_onmttok_kwargs TGT_ONMTTOK_KWARGS] [--seed SEED]
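In practice, most of these options are supplied through the YAML file passed to `-config` rather than on the command line. A minimal sketch of such a config (all paths and corpus names below are placeholders, not defaults):

```yaml
# Minimal build_vocab.py configuration (illustrative paths).
save_data: run/example              # base path for saved objects
src_vocab: run/example.vocab.src    # where the src vocab will be written
tgt_vocab: run/example.vocab.tgt    # where the tgt vocab will be written
overwrite: false
n_sample: 10000                     # transformed samples per corpus

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
```

This would then be invoked as `python build_vocab.py -config config.yaml`.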

Configuration

-config, --config

Path of the main YAML config file.

-save_config, --save_config

Path where to save the config.

Data

-data, --data

List of datasets and their specifications. See examples/*.yaml for further details.
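Each dataset is a named entry under the `data` key of the YAML config. A sketch with several corpora (names, paths, and weights are illustrative):

```yaml
data:
    europarl:
        path_src: data/europarl.en
        path_tgt: data/europarl.de
        weight: 5            # sampling weight relative to other corpora
    crawl:
        path_src: data/crawl.en
        path_tgt: data/crawl.de
        weight: 1
    valid:                   # validation set; no sampling weight
        path_src: data/valid.en
        path_tgt: data/valid.de
```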

-skip_empty_level, --skip_empty_level

Possible choices: silent, warning, error

Security level when encountering empty examples. silent: silently ignore/skip the empty example; warning: warn when ignoring/skipping an empty example; error: raise an error and stop execution when an empty example is encountered.

Default: “warning”

-transforms, --transforms

Possible choices: filtertoolong, prefix, bart, switchout, tokendrop, tokenmask, sentencepiece, bpe, onmt_tokenize

Default transform pipeline to apply to data. Can be specified in each corpus of data to override.

Default: []
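The default pipeline set here can be overridden per corpus inside the `data` section. A sketch (corpus names and paths are placeholders):

```yaml
transforms: [sentencepiece, filtertoolong]   # default pipeline for all corpora

data:
    clean_corpus:
        path_src: data/clean.src
        path_tgt: data/clean.tgt
        # inherits [sentencepiece, filtertoolong]
    noisy_corpus:
        path_src: data/noisy.src
        path_tgt: data/noisy.tgt
        transforms: [sentencepiece]          # override: skip length filtering
```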

-save_data, --save_data

Output base path for objects that will be saved (vocab, transforms, embeddings, …).

-overwrite, --overwrite

Overwrite existing objects if any.

Default: False

-n_sample, --n_sample

Build vocab using this number of transformed samples per corpus. Valid values: -1 (use the full corpus), 0 (skip), or N > 0.

Default: 5000

-dump_samples, --dump_samples

Dump samples when building vocab. Warning: this may slow down the process.

Default: False

-num_threads, --num_threads

Number of parallel threads to build the vocab.

Default: 1

-vocab_sample_queue_size, --vocab_sample_queue_size

Size of queues used in the build_vocab dump path.

Default: 100

Vocab

-src_vocab, --src_vocab

Path to save src (or shared) vocabulary file. Format: one <word> or <word> <count> per line.

-tgt_vocab, --tgt_vocab

Path to save tgt vocabulary file. Format: one <word> or <word> <count> per line.

-share_vocab, --share_vocab

Share source and target vocabulary.

Default: False
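Both vocabulary file formats mentioned above are plain text with one entry per line. A file written with counts might look like this (tokens and counts are illustrative):

```
the 10204
, 9842
of 6425
and 6101
```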

Transform/Filter

--src_seq_length, -src_seq_length

Maximum source sequence length.

Default: 200

--tgt_seq_length, -tgt_seq_length

Maximum target sequence length.

Default: 200

Transform/BART

Caution

This transform will not take effect when building vocabulary.

--permute_sent_ratio, -permute_sent_ratio

Permute this proportion of sentences (boundaries defined by [‘.’, ‘?’, ‘!’]) in all inputs.

Default: 0.0

--rotate_ratio, -rotate_ratio

Rotate this proportion of inputs.

Default: 0.0

--insert_ratio, -insert_ratio

Insert this percentage of additional random tokens.

Default: 0.0

--random_ratio, -random_ratio

When masking, use a random token instead of <mask> this fraction of the time.

Default: 0.0

--mask_ratio, -mask_ratio

Fraction of words/subwords that will be masked.

Default: 0.0

--mask_length, -mask_length

Possible choices: subword, word, span-poisson

Length of masking window to apply.

Default: “subword”

--poisson_lambda, -poisson_lambda

Lambda for Poisson distribution to sample span length if -mask_length set to span-poisson.

Default: 0.0

--replace_length, -replace_length

Possible choices: -1, 0, 1

When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)

Default: -1

Transform/SwitchOut

Caution

This transform will not take effect when building vocabulary.

-switchout_temperature, --switchout_temperature

Sampling temperature for SwitchOut. \(\tau^{-1}\) in [WPDN18]. Smaller value makes data more diverse.

Default: 1.0

Transform/Token_Drop

-tokendrop_temperature, --tokendrop_temperature

Sampling temperature for token deletion.

Default: 1.0

Transform/Token_Mask

-tokenmask_temperature, --tokenmask_temperature

Sampling temperature for token masking.

Default: 1.0

Transform/Subword/Common

Attention

Common options shared by all subword transforms, including the subword model path, subword regularization/BPE-dropout, and vocabulary restriction.

-src_subword_model, --src_subword_model

Path of subword model for src (or shared).

-tgt_subword_model, --tgt_subword_model

Path of subword model for tgt.

-src_subword_nbest, --src_subword_nbest

Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)

Default: 1

-tgt_subword_nbest, --tgt_subword_nbest

Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)

Default: 1

-src_subword_alpha, --src_subword_alpha

Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)

Default: 0

-tgt_subword_alpha, --tgt_subword_alpha

Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)

Default: 0

-src_subword_vocab, --src_subword_vocab

Path to the vocabulary file for src subword. Format: <word> <count> per line.

Default: “”

-tgt_subword_vocab, --tgt_subword_vocab

Path to the vocabulary file for tgt subword. Format: <word> <count> per line.

Default: “”

-src_vocab_threshold, --src_vocab_threshold

Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.

Default: 0

-tgt_vocab_threshold, --tgt_vocab_threshold

Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.

Default: 0
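As a concrete (hypothetical) example, a shared SentencePiece model with unigram sampling on the source side might be configured as follows; the model path is a placeholder:

```yaml
transforms: [sentencepiece]

share_vocab: true
src_subword_model: spm/shared.model   # placeholder path to a trained SP model
src_subword_nbest: 64                 # sample among 64 segmentation candidates
src_subword_alpha: 0.1                # unigram sampling smoothing parameter
```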

Transform/Subword/ONMTTOK

-src_subword_type, --src_subword_type

Possible choices: none, sentencepiece, bpe

Type of subword model for src (or shared) in pyonmttok.

Default: “none”

-tgt_subword_type, --tgt_subword_type

Possible choices: none, sentencepiece, bpe

Type of subword model for tgt in pyonmttok.

Default: “none”

-src_onmttok_kwargs, --src_onmttok_kwargs

Other pyonmttok options for src, given as a Python dict string, excluding the subword-related options listed earlier.

Default: “{‘mode’: ‘none’}”

-tgt_onmttok_kwargs, --tgt_onmttok_kwargs

Other pyonmttok options for tgt, given as a Python dict string, excluding the subword-related options listed earlier.

Default: “{‘mode’: ‘none’}”
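The kwargs are passed as a Python-style dict string. A sketch assuming BPE models trained separately for each side (model paths are placeholders; `mode` and `joiner_annotate` are standard pyonmttok Tokenizer options):

```yaml
transforms: [onmt_tokenize]

src_subword_type: bpe
src_subword_model: bpe/codes.src      # placeholder path
src_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"
tgt_subword_type: bpe
tgt_subword_model: bpe/codes.tgt      # placeholder path
tgt_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"
```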

Reproducibility

--seed, -seed

Set random seed used for better reproducibility between experiments.

Default: -1