Build Vocab¶
build_vocab.py
usage: build_vocab.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] -data
DATA [-skip_empty_level {silent,warning,error}]
[-transforms {insert_mask_before_placeholder,uppercase,inlinetags,bart,terminology,docify,inferfeats,filtertoolong,prefix,suffix,fuzzymatch,clean,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize,normalize} [{insert_mask_before_placeholder,uppercase,inlinetags,bart,terminology,docify,inferfeats,filtertoolong,prefix,suffix,fuzzymatch,clean,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize,normalize} ...]]
-save_data SAVE_DATA [-overwrite] [-n_sample N_SAMPLE]
[-dump_samples] [-num_threads NUM_THREADS]
[-learn_subwords]
[-learn_subwords_size LEARN_SUBWORDS_SIZE]
[-vocab_sample_queue_size VOCAB_SAMPLE_QUEUE_SIZE]
-src_vocab SRC_VOCAB [-tgt_vocab TGT_VOCAB]
[-share_vocab]
[--decoder_start_token DECODER_START_TOKEN]
[--default_specials DEFAULT_SPECIALS [DEFAULT_SPECIALS ...]]
[-n_src_feats N_SRC_FEATS]
[-src_feats_defaults SRC_FEATS_DEFAULTS]
[--response_patterns RESPONSE_PATTERNS [RESPONSE_PATTERNS ...]]
[--upper_corpus_ratio UPPER_CORPUS_RATIO]
[--tags_dictionary_path TAGS_DICTIONARY_PATH]
[--tags_corpus_ratio TAGS_CORPUS_RATIO]
[--max_tags MAX_TAGS] [--paired_stag PAIRED_STAG]
[--paired_etag PAIRED_ETAG]
[--isolated_tag ISOLATED_TAG]
[--src_delimiter SRC_DELIMITER]
[--permute_sent_ratio PERMUTE_SENT_RATIO]
[--rotate_ratio ROTATE_RATIO]
[--insert_ratio INSERT_RATIO]
[--random_ratio RANDOM_RATIO] [--mask_ratio MASK_RATIO]
[--mask_length {subword,word,span-poisson}]
[--poisson_lambda POISSON_LAMBDA]
[--replace_length {-1,0,1}]
[--termbase_path TERMBASE_PATH]
[--src_spacy_language_model SRC_SPACY_LANGUAGE_MODEL]
[--tgt_spacy_language_model TGT_SPACY_LANGUAGE_MODEL]
[--term_corpus_ratio TERM_CORPUS_RATIO]
[--term_example_ratio TERM_EXAMPLE_RATIO]
[--src_term_stoken SRC_TERM_STOKEN]
[--tgt_term_stoken TGT_TERM_STOKEN]
[--tgt_term_etoken TGT_TERM_ETOKEN]
[--term_source_delimiter TERM_SOURCE_DELIMITER]
[--doc_length DOC_LENGTH] [--max_context MAX_CONTEXT]
[--reversible_tokenization {joiner,spacer}]
[--src_seq_length SRC_SEQ_LENGTH]
[--tgt_seq_length TGT_SEQ_LENGTH]
[--src_prefix SRC_PREFIX] [--tgt_prefix TGT_PREFIX]
[--src_suffix SRC_SUFFIX] [--tgt_suffix TGT_SUFFIX]
[--tm_path TM_PATH]
[--fuzzy_corpus_ratio FUZZY_CORPUS_RATIO]
[--fuzzy_threshold FUZZY_THRESHOLD]
[--tm_delimiter TM_DELIMITER]
[--fuzzy_token FUZZY_TOKEN]
[--fuzzymatch_min_length FUZZYMATCH_MIN_LENGTH]
[--fuzzymatch_max_length FUZZYMATCH_MAX_LENGTH]
[--src_eq_tgt] [--same_char] [--same_word]
[--scripts_ok [SCRIPTS_OK ...]]
[--scripts_nok [SCRIPTS_NOK ...]]
[--src_tgt_ratio SRC_TGT_RATIO]
[--avg_tok_min AVG_TOK_MIN] [--avg_tok_max AVG_TOK_MAX]
[--langid [LANGID ...]]
[-switchout_temperature SWITCHOUT_TEMPERATURE]
[-tokendrop_temperature TOKENDROP_TEMPERATURE]
[-tokenmask_temperature TOKENMASK_TEMPERATURE]
[-src_subword_model SRC_SUBWORD_MODEL]
[-tgt_subword_model TGT_SUBWORD_MODEL]
[-src_subword_nbest SRC_SUBWORD_NBEST]
[-tgt_subword_nbest TGT_SUBWORD_NBEST]
[-src_subword_alpha SRC_SUBWORD_ALPHA]
[-tgt_subword_alpha TGT_SUBWORD_ALPHA]
[-src_subword_vocab SRC_SUBWORD_VOCAB]
[-tgt_subword_vocab TGT_SUBWORD_VOCAB]
[-src_vocab_threshold SRC_VOCAB_THRESHOLD]
[-tgt_vocab_threshold TGT_VOCAB_THRESHOLD]
[-src_subword_type {none,sentencepiece,bpe}]
[-tgt_subword_type {none,sentencepiece,bpe}]
[-src_onmttok_kwargs SRC_ONMTTOK_KWARGS]
[-tgt_onmttok_kwargs TGT_ONMTTOK_KWARGS] [--gpt2_pretok]
[--src_lang SRC_LANG] [--tgt_lang TGT_LANG]
[--penn PENN] [--norm_quote_commas NORM_QUOTE_COMMAS]
[--norm_numbers NORM_NUMBERS]
[--pre_replace_unicode_punct PRE_REPLACE_UNICODE_PUNCT]
[--post_remove_control_chars POST_REMOVE_CONTROL_CHARS]
[--seed SEED]
Configuration¶
- -config, --config
Path of the main YAML config file.
- -save_config, --save_config
Path where to save the config.
Data¶
- -data, --data
List of datasets and their specifications. See examples/*.yaml for further details; a minimal configuration is also sketched at the end of this section.
- -skip_empty_level, --skip_empty_level
Possible choices: silent, warning, error
Security level for handling empty examples. silent: silently ignore/skip the empty example; warning: log a warning when ignoring/skipping an empty example; error: raise an error and stop execution when an empty example is encountered.
Default: “warning”
- -transforms, --transforms
Possible choices: insert_mask_before_placeholder, uppercase, inlinetags, bart, terminology, docify, inferfeats, filtertoolong, prefix, suffix, fuzzymatch, clean, switchout, tokendrop, tokenmask, sentencepiece, bpe, onmt_tokenize, normalize
Default transform pipeline to apply to data. Can be overridden per corpus in the data configuration.
Default: []
- -save_data, --save_data
Output base path for objects that will be saved (vocab, transforms, embeddings, …).
- -overwrite, --overwrite
Overwrite existing objects if any.
Default: False
- -n_sample, --n_sample
Build vocab using this number of transformed samples per corpus. Can be [-1, 0, N>0]. Set to -1 to use the full corpus, or 0 to skip.
Default: 5000
- -dump_samples, --dump_samples
Dump samples when building vocab. Warning: this may slow down the process.
Default: False
- -num_threads, --num_threads
Number of parallel threads to build the vocab.
Default: 1
- -learn_subwords, --learn_subwords
Learn subwords prior to building the vocabulary.
Default: False
- -learn_subwords_size, --learn_subwords_size
Number of subword operations to learn.
Default: 32000
- -vocab_sample_queue_size, --vocab_sample_queue_size
Size of queues used in the build_vocab dump path.
Default: 20
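Putting these options together, a minimal YAML configuration for the data section might look as follows (paths and corpus names are illustrative; the required src_vocab entry is covered in the next section):

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
save_data: run/example
n_sample: 10000

It would then be passed on the command line as: python build_vocab.py -config build_vocab.yaml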
Vocab¶
- -src_vocab, --src_vocab
Path to save src (or shared) vocabulary file. Format: one <word> or <word> <count> per line.
- -tgt_vocab, --tgt_vocab
Path to save tgt vocabulary file. Format: one <word> or <word> <count> per line.
- -share_vocab, --share_vocab
Share source and target vocabulary.
Default: False
- --decoder_start_token, -decoder_start_token
Default decoder start token. For most ONMT models it is <s> (BOS); some Fairseq models require </s> instead.
Default: “<s>”
- --default_specials, -default_specials
Default special tokens used for vocabulary initialization. UNK, PAD, BOS, and EOS take IDs 0, 1, 2, and 3 respectively; typically <unk>, <blank>, <s>, </s>.
Default: [‘<unk>’, ‘<blank>’, ‘<s>’, ‘</s>’]
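For example, the vocabulary output paths could be declared as (paths illustrative):

src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
share_vocab: false

The resulting files contain one entry per line, token first, count optional, e.g. (counts illustrative):

the 51234
of 31450
, 29871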
Features¶
- -n_src_feats, --n_src_feats
Number of source features.
Default: 0
- -src_feats_defaults, --src_feats_defaults
Default feature values to apply to source tokens that are not annotated.
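For instance, a source corpus annotated with one word-level feature could be declared as (the default value "0" is illustrative):

n_src_feats: 1
src_feats_defaults: "0"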
Transform/InsertMaskBeforePlaceholdersTransform¶
- --response_patterns, -response_patterns
Response patterns used to locate the end of the prompt.
Default: [‘Response : ⦅newline⦆’]
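A minimal sketch enabling this transform, reusing the default pattern shown above:

transforms: [insert_mask_before_placeholder]
response_patterns: ["Response : ⦅newline⦆"]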
Transform/Uppercase¶
- --upper_corpus_ratio, -upper_corpus_ratio
Corpus ratio to apply uppercasing.
Default: 0.01
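For example, to uppercase roughly 5% of the corpus (ratio illustrative):

transforms: [uppercase]
upper_corpus_ratio: 0.05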
Transform/InlineTags¶
- --tags_dictionary_path, -tags_dictionary_path
Path to a flat term dictionary.
- --tags_corpus_ratio, -tags_corpus_ratio
Ratio of corpus to augment with tags.
Default: 0.1
- --max_tags, -max_tags
Maximum number of tags that can be added to a single sentence.
Default: 12
- --paired_stag, -paired_stag
The format of an opening paired inline tag. Must include the character #.
Default: “⦅ph_#_beg⦆”
- --paired_etag, -paired_etag
The format of a closing paired inline tag. Must include the character #.
Default: “⦅ph_#_end⦆”
- --isolated_tag, -isolated_tag
The format of an isolated inline tag. Must include the character #.
Default: “⦅ph_#_std⦆”
- --src_delimiter, -src_delimiter
Any special token used for augmented src sentences. The default is the fuzzy token used in the FuzzyMatch transform.
Default: “⦅fuzzy⦆”
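An illustrative configuration for this transform (the dictionary path is hypothetical; tag formats are the defaults):

transforms: [inlinetags]
tags_dictionary_path: data/tags_dict.txt
tags_corpus_ratio: 0.1
max_tags: 12
paired_stag: "⦅ph_#_beg⦆"
paired_etag: "⦅ph_#_end⦆"
isolated_tag: "⦅ph_#_std⦆"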
Transform/BART¶
Caution
This transform will not take effect when building vocabulary.
- --permute_sent_ratio, -permute_sent_ratio
Permute this proportion of sentences (boundaries defined by [‘.’, ‘?’, ‘!’]) in all inputs.
Default: 0.0
- --rotate_ratio, -rotate_ratio
Rotate this proportion of inputs.
Default: 0.0
- --insert_ratio, -insert_ratio
Insert this proportion of additional random tokens.
Default: 0.0
- --random_ratio, -random_ratio
Use a random token instead of <mask> this proportion of the time.
Default: 0.0
- --mask_ratio, -mask_ratio
Fraction of words/subwords that will be masked.
Default: 0.0
- --mask_length, -mask_length
Possible choices: subword, word, span-poisson
Length of masking window to apply.
Default: “subword”
- --poisson_lambda, -poisson_lambda
Lambda for the Poisson distribution used to sample span lengths when -mask_length is set to span-poisson.
Default: 3.0
- --replace_length, -replace_length
Possible choices: -1, 0, 1
When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)
Default: -1
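These options are ignored while building the vocabulary (see the caution above), but they are typically declared once in a config shared with training; the values below are illustrative:

transforms: [bart]
mask_ratio: 0.3
mask_length: span-poisson
poisson_lambda: 3.5
replace_length: 1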
Transform/Terminology¶
- --termbase_path, -termbase_path
Path to a dictionary file with terms.
- --src_spacy_language_model, -src_spacy_language_model
Name of the spacy language model for the source corpus.
- --tgt_spacy_language_model, -tgt_spacy_language_model
Name of the spacy language model for the target corpus.
- --term_corpus_ratio, -term_corpus_ratio
Ratio of corpus to augment with terms.
Default: 0.3
- --term_example_ratio, -term_example_ratio
Maximum ratio of terms allowed in an example.
Default: 0.2
- --src_term_stoken, -src_term_stoken
The source term start token.
Default: “⦅src_term_start⦆”
- --tgt_term_stoken, -tgt_term_stoken
The target term start token.
Default: “⦅tgt_term_start⦆”
- --tgt_term_etoken, -tgt_term_etoken
The target term end token.
Default: “⦅tgt_term_end⦆”
- --term_source_delimiter, -term_source_delimiter
Any special token used for augmented source sentences. The default is the fuzzy token used in the FuzzyMatch transform.
Default: “⦅fuzzy⦆”
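An illustrative configuration (the termbase path and spaCy model names are examples, not requirements):

transforms: [terminology]
termbase_path: data/termbase.tsv
src_spacy_language_model: en_core_web_sm
tgt_spacy_language_model: de_core_news_sm
term_corpus_ratio: 0.3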
Transform/Docify¶
- --doc_length, -doc_length
Number of tokens per doc.
Default: 200
- --max_context, -max_context
Max context segments.
Default: 1
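For example, to build document-level segments of up to 200 tokens with one context segment (the defaults):

transforms: [docify]
doc_length: 200
max_context: 1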
Transform/InferFeats¶
- --reversible_tokenization, -reversible_tokenization
Possible choices: joiner, spacer
Type of reversible tokenization applied by the tokenizer.
Default: “joiner”
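A minimal sketch enabling this transform for a joiner-based tokenization:

transforms: [inferfeats]
reversible_tokenization: joiner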
Transform/Filter¶
- --src_seq_length, -src_seq_length
Maximum source sequence length.
Default: 192
- --tgt_seq_length, -tgt_seq_length
Maximum target sequence length.
Default: 192
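For example, to drop all examples longer than 192 tokens on either side (the defaults):

transforms: [filtertoolong]
src_seq_length: 192
tgt_seq_length: 192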
Transform/Prefix¶
- --src_prefix, -src_prefix
String to prepend to all source examples.
Default: “”
- --tgt_prefix, -tgt_prefix
String to prepend to all target examples.
Default: “”
Transform/Suffix¶
- --src_suffix, -src_suffix
String to append to all source examples.
Default: “”
- --tgt_suffix, -tgt_suffix
String to append to all target examples.
Default: “”
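A combined sketch for the prefix and suffix transforms; the target-language tag ⦅de⦆ is purely illustrative:

transforms: [prefix, suffix]
src_prefix: "⦅de⦆"
tgt_prefix: ""
src_suffix: ""
tgt_suffix: ""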
Transform/FuzzyMatching¶
- --tm_path, -tm_path
Path to a flat text translation memory (TM).
- --fuzzy_corpus_ratio, -fuzzy_corpus_ratio
Ratio of corpus to augment with fuzzy matches.
Default: 0.1
- --fuzzy_threshold, -fuzzy_threshold
The fuzzy matching threshold.
Default: 70
- --tm_delimiter, -tm_delimiter
The delimiter used in the flat text TM.
Default: “ ”
- --fuzzy_token, -fuzzy_token
The fuzzy token to be added with the matches.
Default: “⦅fuzzy⦆”
- --fuzzymatch_min_length, -fuzzymatch_min_length
Min length for TM entries and examples to match.
Default: 4
- --fuzzymatch_max_length, -fuzzymatch_max_length
Max length for TM entries and examples to match.
Default: 70
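An illustrative configuration (the TM path is hypothetical; each TM line holds a source and a target segment separated by tm_delimiter):

transforms: [fuzzymatch]
tm_path: data/tm.txt
fuzzy_corpus_ratio: 0.1
fuzzy_threshold: 70
fuzzy_token: "⦅fuzzy⦆"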
Transform/Clean¶
- --src_eq_tgt, -src_eq_tgt
Remove examples where src == tgt.
Default: False
- --same_char, -same_char
Remove examples where the same character is repeated more than 4 times.
Default: False
- --same_word, -same_word
Remove examples where the same word is repeated more than 3 times.
Default: False
- --scripts_ok, -scripts_ok
List of accepted Unicode scripts.
Default: [‘Latin’, ‘Common’]
- --scripts_nok, -scripts_nok
List of rejected Unicode scripts.
Default: []
- --src_tgt_ratio, -src_tgt_ratio
Maximum allowed length ratio between src and tgt.
Default: 2
- --avg_tok_min, -avg_tok_min
Minimum average token length.
Default: 3
- --avg_tok_max, -avg_tok_max
Maximum average token length.
Default: 20
- --langid, -langid
List of accepted languages.
Default: []
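An illustrative configuration combining several of these filters (the language list is an example):

transforms: [clean]
src_eq_tgt: true
same_char: true
same_word: true
scripts_ok: [Latin, Common]
src_tgt_ratio: 2
avg_tok_min: 3
avg_tok_max: 20
langid: [en, de]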
Transform/SwitchOut¶
Caution
This transform will not take effect when building vocabulary.
- -switchout_temperature, --switchout_temperature
Sampling temperature for SwitchOut. \(\tau^{-1}\) in [WPDN18]. Smaller value makes data more diverse.
Default: 1.0
Transform/Token_Drop¶
- -tokendrop_temperature, --tokendrop_temperature
Sampling temperature for token deletion.
Default: 1.0
Transform/Token_Mask¶
- -tokenmask_temperature, --tokenmask_temperature
Sampling temperature for token masking.
Default: 1.0
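SwitchOut is skipped when building the vocabulary (see the caution above), but the three temperatures can be declared together in a config shared with training; the values shown are the defaults:

transforms: [switchout, tokendrop, tokenmask]
switchout_temperature: 1.0
tokendrop_temperature: 1.0
tokenmask_temperature: 1.0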
Transform/Subword/Common¶
Attention
Common options shared by all subword transforms, including the subword model path, subword regularization / BPE-dropout, and vocabulary restriction.
- -src_subword_model, --src_subword_model
Path of subword model for src (or shared).
- -tgt_subword_model, --tgt_subword_model
Path of subword model for tgt.
- -src_subword_nbest, --src_subword_nbest
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)
Default: 1
- -tgt_subword_nbest, --tgt_subword_nbest
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)
Default: 1
- -src_subword_alpha, --src_subword_alpha
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)
Default: 0
- -tgt_subword_alpha, --tgt_subword_alpha
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)
Default: 0
- -src_subword_vocab, --src_subword_vocab
Path to the vocabulary file for src subword. Format: <word> <count> per line.
Default: “”
- -tgt_subword_vocab, --tgt_subword_vocab
Path to the vocabulary file for tgt subword. Format: <word> <count> per line.
Default: “”
- -src_vocab_threshold, --src_vocab_threshold
Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.
Default: 0
- -tgt_vocab_threshold, --tgt_vocab_threshold
Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.
Default: 0
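An illustrative SentencePiece setup with unigram sampling (the shared model path and the regularization values are examples):

transforms: [sentencepiece]
src_subword_model: data/spm.model
tgt_subword_model: data/spm.model
src_subword_nbest: 20
tgt_subword_nbest: 20
src_subword_alpha: 0.1
tgt_subword_alpha: 0.1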
Transform/Subword/ONMTTOK¶
- -src_subword_type, --src_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for src (or shared) in pyonmttok.
Default: “none”
- -tgt_subword_type, --tgt_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for tgt in pyonmttok.
Default: “none”
- -src_onmttok_kwargs, --src_onmttok_kwargs
Other pyonmttok options for src given as a dict string, excluding the subword-related options listed above.
Default: “{‘mode’: ‘none’}”
- -tgt_onmttok_kwargs, --tgt_onmttok_kwargs
Other pyonmttok options for tgt given as a dict string, excluding the subword-related options listed above.
Default: “{‘mode’: ‘none’}”
- --gpt2_pretok, -gpt2_pretok
Preprocess sentences with byte-level mapping (as used by GPT-2).
Default: False
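An illustrative pyonmttok setup with BPE (the model path is hypothetical; 'aggressive' and 'joiner_annotate' are standard pyonmttok options):

transforms: [onmt_tokenize]
src_subword_type: bpe
src_subword_model: data/codes.bpe
src_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"
tgt_subword_type: bpe
tgt_subword_model: data/codes.bpe
tgt_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"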
Transform/Normalize¶
- --src_lang, -src_lang
Source language code.
Default: “”
- --tgt_lang, -tgt_lang
Target language code.
Default: “”
- --penn, -penn
Apply Penn Treebank substitutions.
Default: True
- --norm_quote_commas, -norm_quote_commas
Normalize quotation marks and commas.
Default: True
- --norm_numbers, -norm_numbers
Normalize numbers.
Default: True
- --pre_replace_unicode_punct, -pre_replace_unicode_punct
Replace Unicode punctuation.
Default: False
- --post_remove_control_chars, -post_remove_control_chars
Remove control characters.
Default: False
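For example, to normalize English-German data with the default substitutions enabled:

transforms: [normalize]
src_lang: en
tgt_lang: de
penn: true
norm_quote_commas: true
norm_numbers: true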
Reproducibility¶
- --seed, -seed
Set random seed used for better reproducibility between experiments.
Default: -1