Build Vocab

build_vocab.py

usage: build_vocab.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] -data
                      DATA [-skip_empty_level {silent,warning,error}]
                      [-transforms {switchout,tokendrop,tokenmask,docify,insert_mask_before_placeholder,uppercase,fuzzymatch,inlinetags,clean,sentencepiece,bpe,onmt_tokenize,normalize,inferfeats,filtertoolong,prefix,suffix,terminology,bart} [{switchout,tokendrop,tokenmask,docify,insert_mask_before_placeholder,uppercase,fuzzymatch,inlinetags,clean,sentencepiece,bpe,onmt_tokenize,normalize,inferfeats,filtertoolong,prefix,suffix,terminology,bart} ...]]
                      -save_data SAVE_DATA [-overwrite] [-n_sample N_SAMPLE]
                      [-dump_samples] [-num_threads NUM_THREADS]
                      [-learn_subwords]
                      [-learn_subwords_size LEARN_SUBWORDS_SIZE]
                      [-vocab_sample_queue_size VOCAB_SAMPLE_QUEUE_SIZE]
                      -src_vocab SRC_VOCAB [-tgt_vocab TGT_VOCAB]
                      [-share_vocab]
                      [--decoder_start_token DECODER_START_TOKEN]
                      [--default_specials DEFAULT_SPECIALS [DEFAULT_SPECIALS ...]]
                      [-n_src_feats N_SRC_FEATS]
                      [-src_feats_defaults SRC_FEATS_DEFAULTS]
                      [-switchout_temperature SWITCHOUT_TEMPERATURE]
                      [-tokendrop_temperature TOKENDROP_TEMPERATURE]
                      [-tokenmask_temperature TOKENMASK_TEMPERATURE]
                      [--doc_length DOC_LENGTH] [--max_context MAX_CONTEXT]
                      [--response_pattern RESPONSE_PATTERN]
                      [--upper_corpus_ratio UPPER_CORPUS_RATIO]
                      [--tm_path TM_PATH]
                      [--fuzzy_corpus_ratio FUZZY_CORPUS_RATIO]
                      [--fuzzy_threshold FUZZY_THRESHOLD]
                      [--tm_delimiter TM_DELIMITER]
                      [--fuzzy_token FUZZY_TOKEN]
                      [--fuzzymatch_min_length FUZZYMATCH_MIN_LENGTH]
                      [--fuzzymatch_max_length FUZZYMATCH_MAX_LENGTH]
                      [--tags_dictionary_path TAGS_DICTIONARY_PATH]
                      [--tags_corpus_ratio TAGS_CORPUS_RATIO]
                      [--max_tags MAX_TAGS] [--paired_stag PAIRED_STAG]
                      [--paired_etag PAIRED_ETAG]
                      [--isolated_tag ISOLATED_TAG]
                      [--src_delimiter SRC_DELIMITER] [--src_eq_tgt]
                      [--same_char] [--same_word]
                      [--scripts_ok [SCRIPTS_OK [SCRIPTS_OK ...]]]
                      [--scripts_nok [SCRIPTS_NOK [SCRIPTS_NOK ...]]]
                      [--src_tgt_ratio SRC_TGT_RATIO]
                      [--avg_tok_min AVG_TOK_MIN] [--avg_tok_max AVG_TOK_MAX]
                      [--langid [LANGID [LANGID ...]]]
                      [-src_subword_model SRC_SUBWORD_MODEL]
                      [-tgt_subword_model TGT_SUBWORD_MODEL]
                      [-src_subword_nbest SRC_SUBWORD_NBEST]
                      [-tgt_subword_nbest TGT_SUBWORD_NBEST]
                      [-src_subword_alpha SRC_SUBWORD_ALPHA]
                      [-tgt_subword_alpha TGT_SUBWORD_ALPHA]
                      [-src_subword_vocab SRC_SUBWORD_VOCAB]
                      [-tgt_subword_vocab TGT_SUBWORD_VOCAB]
                      [-src_vocab_threshold SRC_VOCAB_THRESHOLD]
                      [-tgt_vocab_threshold TGT_VOCAB_THRESHOLD]
                      [-src_subword_type {none,sentencepiece,bpe}]
                      [-tgt_subword_type {none,sentencepiece,bpe}]
                      [-src_onmttok_kwargs SRC_ONMTTOK_KWARGS]
                      [-tgt_onmttok_kwargs TGT_ONMTTOK_KWARGS] [--gpt2_pretok]
                      [--src_lang SRC_LANG] [--tgt_lang TGT_LANG]
                      [--penn PENN] [--norm_quote_commas NORM_QUOTE_COMMAS]
                      [--norm_numbers NORM_NUMBERS]
                      [--pre_replace_unicode_punct PRE_REPLACE_UNICODE_PUNCT]
                      [--post_remove_control_chars POST_REMOVE_CONTROL_CHARS]
                      [--reversible_tokenization {joiner,spacer}]
                      [--src_seq_length SRC_SEQ_LENGTH]
                      [--tgt_seq_length TGT_SEQ_LENGTH]
                      [--src_prefix SRC_PREFIX] [--tgt_prefix TGT_PREFIX]
                      [--src_suffix SRC_SUFFIX] [--tgt_suffix TGT_SUFFIX]
                      [--termbase_path TERMBASE_PATH]
                      [--src_spacy_language_model SRC_SPACY_LANGUAGE_MODEL]
                      [--tgt_spacy_language_model TGT_SPACY_LANGUAGE_MODEL]
                      [--term_corpus_ratio TERM_CORPUS_RATIO]
                      [--term_example_ratio TERM_EXAMPLE_RATIO]
                      [--src_term_stoken SRC_TERM_STOKEN]
                      [--tgt_term_stoken TGT_TERM_STOKEN]
                      [--tgt_term_etoken TGT_TERM_ETOKEN]
                      [--term_source_delimiter TERM_SOURCE_DELIMITER]
                      [--permute_sent_ratio PERMUTE_SENT_RATIO]
                      [--rotate_ratio ROTATE_RATIO]
                      [--insert_ratio INSERT_RATIO]
                      [--random_ratio RANDOM_RATIO] [--mask_ratio MASK_RATIO]
                      [--mask_length {subword,word,span-poisson}]
                      [--poisson_lambda POISSON_LAMBDA]
                      [--replace_length {-1,0,1}] [--seed SEED]
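
As a rough illustration of how the options above combine, the following sketch assembles a command line from a few of the documented flags. It only builds and prints the command; it does not run it. The helper name `build_vocab_command` and the file name `my_config.yaml` are hypothetical, and the sketch assumes the required options (-data, -save_data, -src_vocab) are supplied through the YAML config file.

```python
import shlex


def build_vocab_command(config_path, n_sample=5000, overwrite=False):
    """Assemble an argument list for build_vocab.py (hypothetical helper).

    Assumes the required options (-data, -save_data, -src_vocab) are set
    inside the YAML file passed via -config.
    """
    cmd = ["python", "build_vocab.py", "-config", config_path,
           "-n_sample", str(n_sample)]
    if overwrite:
        # -overwrite is a boolean switch, so it takes no value.
        cmd.append("-overwrite")
    return cmd


# "my_config.yaml" is a placeholder path, not a file shipped with the tool.
print(shlex.join(build_vocab_command("my_config.yaml", n_sample=-1)))
```

Keeping the command as a list until the last moment avoids quoting mistakes if it is later passed to `subprocess.run`.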

Configuration

-config, --config

Path of the main YAML config file.

-save_config, --save_config

Path where to save the config.

Data

-data, --data

List of datasets and their specifications. See examples/*.yaml for further details.

-skip_empty_level, --skip_empty_level

Possible choices: silent, warning, error

Security level when encountering empty examples. silent: silently ignore/skip empty examples; warning: issue a warning when ignoring/skipping an empty example; error: raise an error and stop execution when an empty example is encountered.

Default: “warning”
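
The three levels can be pictured with a small sketch. This is not the tool's own code; the function name `handle_empty` and the choice of `ValueError` are assumptions made for illustration.

```python
import warnings


def handle_empty(example, level="warning"):
    """Return True if the example should be kept, False if it is skipped.

    Illustrative only: mirrors the silent / warning / error levels
    described above, using a hypothetical function name.
    """
    if example.strip():
        return True  # non-empty examples are always kept
    if level == "error":
        raise ValueError("empty example encountered")
    if level == "warning":
        warnings.warn("skipping empty example")
    # "silent" and "warning" both skip the example
    return False
```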

-transforms, --transforms

Possible choices: switchout, tokendrop, tokenmask, docify, insert_mask_before_placeholder, uppercase, fuzzymatch, inlinetags, clean, sentencepiece, bpe, onmt_tokenize, normalize, inferfeats, filtertoolong, prefix, suffix, terminology, bart

Default transform pipeline to apply to data. Can be specified in each corpus of data to override.

Default: []

-save_data, --save_data

Output base path for objects that will be saved (vocab, transforms, embeddings, …).

-overwrite, --overwrite

Overwrite existing objects if any.

Default: False

-n_sample, --n_sample

Build vocab using this number of transformed samples/corpus. Can be [-1, 0, N>0]. Set to -1 to go full corpus, 0 to skip.

Default: 5000
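
The three cases of -n_sample can be sketched as follows. The helper `take_samples` is hypothetical, and it assumes that a positive N means the first N examples of each corpus; the actual tool may select examples differently.

```python
from itertools import islice


def take_samples(corpus_lines, n_sample):
    """Pick the examples used for vocab building (illustrative sketch).

    Assumption: a positive n_sample takes the first N examples of the corpus.
    """
    if n_sample == -1:
        return list(corpus_lines)  # full corpus
    if n_sample == 0:
        return []  # skip this corpus entirely
    if n_sample > 0:
        return list(islice(corpus_lines, n_sample))
    raise ValueError("n_sample must be -1, 0, or a positive integer")
```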

-dump_samples, --dump_samples

Dump samples when building vocab. Warning: this may slow down the process.

Default: False

-num_threads, --num_threads

Number of parallel threads to build the vocab.

Default: 1

-learn_subwords, --learn_subwords

Learn subwords prior to building vocab

Default: False

-learn_subwords_size, --learn_subwords_size

Number of subword operations to learn.

Default: 32000

-vocab_sample_queue_size, --vocab_sample_queue_size

Size of queues used in the build_vocab dump path.

Default: 20

Vocab

-src_vocab, --src_vocab

Path to save src (or shared) vocabulary file. Format: one <word> or <word> <count> per line.

-tgt_vocab, --tgt_vocab

Path to save tgt vocabulary file. Format: one <word> or <word> <count> per line.
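
A minimal sketch of reading one line in either accepted format is shown below. The function `parse_vocab_line` is hypothetical, and it assumes the word and the count are separated by whitespace; the real files may use a specific separator such as a tab.

```python
def parse_vocab_line(line):
    """Parse '<word>' or '<word> <count>' into (word, count or None).

    Assumption: word and count are separated by whitespace.
    """
    parts = line.rstrip("\n").rsplit(None, 1)
    if len(parts) == 2 and parts[1].isdigit():
        return parts[0], int(parts[1])
    # Only a word on the line: no count available.
    return line.strip(), None
```

Splitting from the right keeps the count separate even if the word field itself contains whitespace.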

-share_vocab, --share_vocab

Share source and target vocabulary.

Default: False

--decoder_start_token, -decoder_start_token

Default decoder start token. For most ONMT models it is <s> (BOS); some Fairseq models require </s> instead.

Default: “<s>”

--default_specials, -default_specials

Default specials used for vocab initialization. UNK, PAD, BOS, EOS will take IDs 0, 1, 2, 3, typically <unk> <blank> <s> </s>.

Default: [‘<unk>’, ‘<blank>’, ‘<s>’, ‘</s>’]
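
The ID assignment described above can be sketched as follows: the specials come first, so they receive IDs 0 to 3, and regular tokens follow. The function `build_token_ids` is a hypothetical helper, not part of the tool.

```python
def build_token_ids(tokens, specials=("<unk>", "<blank>", "<s>", "</s>")):
    """Map tokens to integer IDs with the specials placed first (sketch)."""
    ids = {}
    for tok in list(specials) + list(tokens):
        if tok not in ids:  # ignore duplicates, keep the first ID
            ids[tok] = len(ids)
    return ids


ids = build_token_ids(["hello", "world", "hello"])
print(ids["<unk>"], ids["<blank>"], ids["<s>"], ids["</s>"], ids["hello"])
```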

Features

-n_src_feats, --n_src_feats

Number of source feats.

Default: 0

-src_feats_defaults, --src_feats_defaults

Default features to apply on the source side in case they are not annotated.

Transform/SwitchOut

Caution

This transform will not take effect when building vocabulary.

-switchout_temperature, --switchout_temperature

Sampling temperature for SwitchOut. \(\tau^{-1}\) in [WPDN18]. Smaller value makes data more diverse.

Default: 1.0

Transform/Token_Drop

-tokendrop_temperature, --tokendrop_temperature

Sampling temperature for token deletion.

Default: 1.0

Transform/Token_Mask

-tokenmask_temperature, --tokenmask_temperature

Sampling temperature for token masking.

Default: 1.0

Transform/Docify

--doc_length, -doc_length

Number of tokens per doc.

Default: 200

--max_context, -max_context

Max context segments.

Default: 1

Transform/InsertMaskBeforePlaceholdersTransform

--response_pattern, -response_pattern

Response pattern used to locate the end of the prompt.

Default: “Response : ⦅newline⦆”

Transform/Uppercase

--upper_corpus_ratio, -upper_corpus_ratio

Corpus ratio to apply uppercasing.

Default: 0.01

Transform/FuzzyMatching

--tm_path, -tm_path

Path to a flat text TM.

--fuzzy_corpus_ratio, -fuzzy_corpus_ratio

Ratio of corpus to augment with fuzzy matches.

Default: 0.1

--fuzzy_threshold, -fuzzy_threshold

The fuzzy matching threshold.

Default: 70

--tm_delimiter, -tm_delimiter

The delimiter used in the flat text TM.

Default: “\t”

--fuzzy_token, -fuzzy_token

The fuzzy token to be added with the matches.

Default: “⦅fuzzy⦆”

--fuzzymatch_min_length, -fuzzymatch_min_length

Min length for TM entries and examples to match.

Default: 4

--fuzzymatch_max_length, -fuzzymatch_max_length

Max length for TM entries and examples to match.

Default: 70

Transform/InlineTags

--tags_dictionary_path, -tags_dictionary_path

Path to a flat term dictionary.

--tags_corpus_ratio, -tags_corpus_ratio

Ratio of corpus to augment with tags.

Default: 0.1

--max_tags, -max_tags

Maximum number of tags that can be added to a single sentence.

Default: 12

--paired_stag, -paired_stag

The format of an opening paired inline tag. Must include the character #.

Default: “⦅ph_#_beg⦆”

--paired_etag, -paired_etag

The format of a closing paired inline tag. Must include the character #.

Default: “⦅ph_#_end⦆”

--isolated_tag, -isolated_tag

The format of an isolated inline tag. Must include the character #.

Default: “⦅ph_#_std⦆”
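
The role of the # character in the three tag formats can be sketched as below, under the assumption that # is a placeholder replaced by a running tag index. The helper `make_tag` is hypothetical.

```python
def make_tag(template, index):
    """Fill the '#' placeholder of a tag format with an index (sketch).

    Assumption: '#' stands for a running number identifying the tag.
    """
    if "#" not in template:
        raise ValueError("tag format must include the character #")
    return template.replace("#", str(index))


print(make_tag("⦅ph_#_beg⦆", 1), make_tag("⦅ph_#_end⦆", 1))
```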

--src_delimiter, -src_delimiter

Any special token used for augmented src sentences. The default is the fuzzy token used in the FuzzyMatch transform.

Default: “⦅fuzzy⦆”

Transform/Clean

--src_eq_tgt, -src_eq_tgt

Remove examples where src == tgt.

Default: False

--same_char, -same_char

Remove examples with the same char more than 4 times.

Default: False

--same_word, -same_word

Remove examples with the same word more than 3 times.

Default: False

--scripts_ok, -scripts_ok

List of unicodedata scripts accepted.

Default: [‘Latin’, ‘Common’]

--scripts_nok, -scripts_nok

List of unicodedata scripts not accepted.

Default: []

--src_tgt_ratio, -src_tgt_ratio

Ratio between src and tgt.

Default: 2

--avg_tok_min, -avg_tok_min

Minimum average token length.

Default: 3

--avg_tok_max, -avg_tok_max

Maximum average token length.

Default: 20

--langid, -langid

List of languages accepted.

Default: []
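
A few of the Clean options can be pictured with the sketch below. It is not the transform's implementation: the function `keep_example` is hypothetical, lengths are assumed to be counted in whitespace-separated tokens, the ratio is assumed to apply in both directions, and average token length is assumed to be measured in characters.

```python
def keep_example(src, tgt, src_eq_tgt=False, src_tgt_ratio=2,
                 avg_tok_min=3, avg_tok_max=20):
    """Return True if the pair passes the sketched cleaning rules.

    Assumptions (not confirmed by the option descriptions):
    - lengths are counted in whitespace-separated tokens;
    - the length ratio is checked in both directions;
    - average token length is measured in characters.
    """
    src_toks, tgt_toks = src.split(), tgt.split()
    if not src_toks or not tgt_toks:
        return False
    if src_eq_tgt and src == tgt:
        return False
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > src_tgt_ratio or ratio < 1 / src_tgt_ratio:
        return False
    for toks in (src_toks, tgt_toks):
        avg = sum(len(t) for t in toks) / len(toks)
        if avg < avg_tok_min or avg > avg_tok_max:
            return False
    return True
```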

Transform/Subword/Common

Attention

Common options shared by all subword transforms, including options to indicate the subword model path, Subword Regularization/BPE-Dropout, and Vocabulary Restriction.

-src_subword_model, --src_subword_model

Path of subword model for src (or shared).

-tgt_subword_model, --tgt_subword_model

Path of subword model for tgt.

-src_subword_nbest, --src_subword_nbest

Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)

Default: 1

-tgt_subword_nbest, --tgt_subword_nbest

Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)

Default: 1

-src_subword_alpha, --src_subword_alpha

Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)

Default: 0

-tgt_subword_alpha, --tgt_subword_alpha

Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)

Default: 0

-src_subword_vocab, --src_subword_vocab

Path to the vocabulary file for src subword. Format: <word> <count> per line.

Default: “”

-tgt_subword_vocab, --tgt_subword_vocab

Path to the vocabulary file for tgt subword. Format: <word> <count> per line.

Default: “”

-src_vocab_threshold, --src_vocab_threshold

Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.

Default: 0

-tgt_vocab_threshold, --tgt_vocab_threshold

Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.

Default: 0
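
The vocabulary restriction described by the two threshold options can be sketched as a simple frequency filter. The helper `allowed_subwords` and the sample counts are hypothetical; the sketch only shows the "frequency >= threshold" rule, not how the subword model consumes the result.

```python
def allowed_subwords(vocab_counts, threshold):
    """Return the subwords whose frequency is >= threshold (sketch)."""
    return {tok for tok, count in vocab_counts.items() if count >= threshold}


# Made-up counts, for illustration only.
counts = {"▁the": 1200, "▁vocab": 35, "ulary": 4}
print(sorted(allowed_subwords(counts, 10)))
```

With the default threshold of 0, every subword listed in the vocabulary file passes the filter.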

Transform/Subword/ONMTTOK

-src_subword_type, --src_subword_type

Possible choices: none, sentencepiece, bpe

Type of subword model for src (or shared) in pyonmttok.

Default: “none”

-tgt_subword_type, --tgt_subword_type

Possible choices: none, sentencepiece, bpe

Type of subword model for tgt in pyonmttok.

Default: “none”

-src_onmttok_kwargs, --src_onmttok_kwargs

Other pyonmttok options for src in dict string, except subword related options listed earlier.

Default: “{‘mode’: ‘none’}”

-tgt_onmttok_kwargs, --tgt_onmttok_kwargs

Other pyonmttok options for tgt in dict string, except subword related options listed earlier.

Default: “{‘mode’: ‘none’}”
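
Since both kwargs options take a dict written as a string, one safe way to turn such a string into a Python dict is `ast.literal_eval`, as sketched below. This is an illustration only and may differ from how the tool parses the value; the helper `parse_kwargs` is hypothetical. Note that the string must use straight quotes, unlike the typographic quotes shown in the rendered default above.

```python
import ast


def parse_kwargs(dict_string):
    """Parse a dict given as a string, e.g. "{'mode': 'none'}" (sketch)."""
    value = ast.literal_eval(dict_string)
    if not isinstance(value, dict):
        raise ValueError("expected a dict string")
    return value


print(parse_kwargs("{'mode': 'none'}"))
```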

--gpt2_pretok, -gpt2_pretok

Preprocess sentence with byte-level mapping

Default: False

Transform/Normalize

--src_lang, -src_lang

Source language code

Default: “”

--tgt_lang, -tgt_lang

Target language code

Default: “”

--penn, -penn

Penn substitution

Default: True

--norm_quote_commas, -norm_quote_commas

Normalize quotations and commas

Default: True

--norm_numbers, -norm_numbers

Normalize numbers

Default: True

--pre_replace_unicode_punct, -pre_replace_unicode_punct

Replace unicode punct

Default: False

--post_remove_control_chars, -post_remove_control_chars

Remove control chars

Default: False

Transform/InferFeats

--reversible_tokenization, -reversible_tokenization

Possible choices: joiner, spacer

Type of reversible tokenization applied on the tokenizer.

Default: “joiner”

Transform/Filter

--src_seq_length, -src_seq_length

Maximum source sequence length.

Default: 192

--tgt_seq_length, -tgt_seq_length

Maximum target sequence length.

Default: 192
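
The length filter can be sketched as follows. The helper `keep_by_length` is hypothetical, and it assumes that length is counted in whitespace-separated tokens and that an example exactly at the limit is kept.

```python
def keep_by_length(src, tgt, src_seq_length=192, tgt_seq_length=192):
    """Return True if both sides fit within the maximum lengths (sketch).

    Assumptions: length is the number of whitespace-separated tokens,
    and the limits are inclusive.
    """
    return (len(src.split()) <= src_seq_length
            and len(tgt.split()) <= tgt_seq_length)
```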

Transform/Prefix

--src_prefix, -src_prefix

String to prepend to all source examples.

Default: “”

--tgt_prefix, -tgt_prefix

String to prepend to all target examples.

Default: “”

Transform/Suffix

--src_suffix, -src_suffix

String to append to all source examples.

Default: “”

--tgt_suffix, -tgt_suffix

String to append to all target examples.

Default: “”
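
The effect of the Prefix and Suffix options on a single example can be sketched as below. The helper `add_prefix_suffix` and the sample token `⦅en⦆` are hypothetical, and the sketch assumes the prefix and suffix are joined to the example with a single space.

```python
def add_prefix_suffix(example, prefix="", suffix=""):
    """Prepend and/or append a string to an example (sketch).

    Assumption: pieces are joined with a single space; empty
    prefix/suffix values (the defaults) leave the example unchanged.
    """
    parts = [p for p in (prefix, example, suffix) if p]
    return " ".join(parts)


# "⦅en⦆" is a made-up language token used only for illustration.
print(add_prefix_suffix("hello world", prefix="⦅en⦆"))
```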

Transform/Terminology

--termbase_path, -termbase_path

Path to a dictionary file with terms.

--src_spacy_language_model, -src_spacy_language_model

Name of the spacy language model for the source corpus.

--tgt_spacy_language_model, -tgt_spacy_language_model

Name of the spacy language model for the target corpus.

--term_corpus_ratio, -term_corpus_ratio

Ratio of corpus to augment with terms.

Default: 0.3

--term_example_ratio, -term_example_ratio

Max terms allowed in an example.

Default: 0.2

--src_term_stoken, -src_term_stoken

The source term start token.

Default: “⦅src_term_start⦆”

--tgt_term_stoken, -tgt_term_stoken

The target term start token.

Default: “⦅tgt_term_start⦆”

--tgt_term_etoken, -tgt_term_etoken

The target term end token.

Default: “⦅tgt_term_end⦆”

--term_source_delimiter, -term_source_delimiter

Any special token used for augmented source sentences. The default is the fuzzy token used in the FuzzyMatch transform.

Default: “⦅fuzzy⦆”

Transform/BART

Caution

This transform will not take effect when building vocabulary.

--permute_sent_ratio, -permute_sent_ratio

Permute this proportion of sentences (boundaries defined by [‘.’, ‘?’, ‘!’]) in all inputs.

Default: 0.0

--rotate_ratio, -rotate_ratio

Rotate this proportion of inputs.

Default: 0.0

--insert_ratio, -insert_ratio

Insert this percentage of additional random tokens.

Default: 0.0

--random_ratio, -random_ratio

Instead of using <mask>, use random token this often.

Default: 0.0

--mask_ratio, -mask_ratio

Fraction of words/subwords that will be masked.

Default: 0.0

--mask_length, -mask_length

Possible choices: subword, word, span-poisson

Length of masking window to apply.

Default: “subword”

--poisson_lambda, -poisson_lambda

Lambda for Poisson distribution to sample span length if -mask_length set to span-poisson.

Default: 3.0

--replace_length, -replace_length

Possible choices: -1, 0, 1

When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)

Default: -1
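
The three values of -replace_length can be illustrated on a single masked span. The helper `replace_span` is hypothetical and covers only the replacement step, not how spans are chosen.

```python
def replace_span(tokens, start, length, replace_length=-1, mask="<mask>"):
    """Replace tokens[start:start+length] with 0, 1, or N mask tokens.

    replace_length: -1 -> N masks (one per masked token),
                     0 -> the span is deleted,
                     1 -> the span collapses to a single mask.
    """
    if replace_length == -1:
        n_masks = length
    elif replace_length in (0, 1):
        n_masks = replace_length
    else:
        raise ValueError("replace_length must be -1, 0, or 1")
    return tokens[:start] + [mask] * n_masks + tokens[start + length:]


toks = ["a", "b", "c", "d"]
for value in (-1, 0, 1):
    print(value, replace_span(toks, 1, 2, replace_length=value))
```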

Reproducibility

--seed, -seed

Set random seed used for better reproducibility between experiments.

Default: -1