Vocab
- class opennmt.data.Vocab(special_tokens=None)
Vocabulary class.
Example
>>> vocab = opennmt.data.Vocab.from_file("wmtende.vocab")
>>> len(vocab)
32000
>>> "be" in vocab
True
>>> vocab.lookup("be")
377
>>> vocab.lookup(377)
'be'
Inherits from:
builtins.object
- __init__(special_tokens=None)
Initializes a vocabulary.
- Parameters
special_tokens – A list of special tokens (e.g. start of sentence).
- classmethod from_file(path, file_format='default')
Creates from a vocabulary file.
- Parameters
path – The path to the vocabulary file.
file_format – Define the format of the vocabulary file. Can be: default, sentencepiece. “default” is simply one token per line.
- Raises
ValueError – if file_format is invalid.
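As a rough illustration of the "default" format described above, the sketch below parses one token per line and rejects unknown format names. `parse_vocab_lines` is a hypothetical helper written for this example, not part of OpenNMT, and the sentencepiece branch is deliberately left unimplemented.

```python
import io


def parse_vocab_lines(lines, file_format="default"):
    """Hypothetical stand-in: returns the tokens of a 'default' format vocabulary."""
    if file_format not in ("default", "sentencepiece"):
        raise ValueError("Invalid vocabulary format: %s" % file_format)
    if file_format == "sentencepiece":
        # Not covered by this sketch.
        raise NotImplementedError("sentencepiece parsing is not sketched here")
    # One token per line; strip only the trailing line break.
    return [line.rstrip("\r\n") for line in lines]


# An open text file is also an iterable of lines, so it could be passed directly.
tokens = parse_vocab_lines(io.StringIO("<blank>\nthe\nbe\n"))
print(tokens)
```

Taking an iterable of lines rather than a path keeps the sketch independent of the file system.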
- property size
Returns the number of entries of the vocabulary.
- property words
Returns the list of words.
- add_from_text(filename, tokenizer=None)
Fills the vocabulary from a text file.
- Parameters
filename – The file to load from.
tokenizer – A callable to tokenize a line of text.
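The idea of filling a vocabulary from text with an optional tokenizer callable can be sketched as follows. `count_tokens` is a hypothetical helper, and the whitespace-splitting fallback used when no tokenizer is given is an assumption of this example rather than documented behavior.

```python
import collections
import io


def count_tokens(lines, tokenizer=None):
    """Hypothetical helper: counts token frequencies over lines of text."""
    if tokenizer is None:
        # Assumption: fall back to whitespace splitting.
        tokenizer = str.split
    counts = collections.Counter()
    for line in lines:
        counts.update(tokenizer(line.strip()))
    return counts


word_counts = count_tokens(io.StringIO("to be or not to be\n"))
# Any callable mapping a string to a sequence of tokens works, e.g. characters.
char_counts = count_tokens(["ab"], tokenizer=list)
print(word_counts["be"], char_counts["a"])
```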
- serialize(path)
Writes the vocabulary to disk.
- Parameters
path – The path where the vocabulary will be saved.
- load(path, file_format='default')
Loads a serialized vocabulary.
- Parameters
path – The path to the vocabulary to load.
file_format – Define the format of the vocabulary file. Can be: default, sentencepiece. “default” is simply one token per line.
- Raises
ValueError – if file_format is invalid.
- lookup(identifier, default=None)
Looks up an identifier in the vocabulary.
- Parameters
identifier – A string or an index to look up.
default – The value to return if identifier is not found.
- Returns
The value associated with identifier or default.
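The two-way behavior of lookup (string to index, index to string) can be illustrated with a small stand-in class. `TinyVocab` is hypothetical and only mimics the documented interface; treating out-of-range indices as "not found" is an assumption of this sketch.

```python
class TinyVocab:
    """Hypothetical stand-in showing a two-way token/index lookup."""

    def __init__(self, tokens):
        self._id_to_token = list(tokens)
        self._token_to_id = {t: i for i, t in enumerate(self._id_to_token)}

    def lookup(self, identifier, default=None):
        if isinstance(identifier, str):
            # String identifier: return its index.
            return self._token_to_id.get(identifier, default)
        # Integer identifier: return the token stored at that index.
        if 0 <= identifier < len(self._id_to_token):
            return self._id_to_token[identifier]
        return default


vocab = TinyVocab(["<blank>", "the", "be"])
print(vocab.lookup("be"))               # index of "be"
print(vocab.lookup(2))                  # token at index 2
print(vocab.lookup("zzz", default=-1))  # unknown token falls back to default
```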
- prune(max_size=0, min_frequency=1)
Creates a pruned version of the vocabulary.
- Parameters
max_size – The maximum vocabulary size.
min_frequency – The minimum frequency of each entry.
- Returns
A new vocabulary.
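A minimal sketch of how the two pruning parameters could interact, using a plain frequency counter. `prune_counts` is a hypothetical helper; reading max_size=0 as "no size limit" and keeping the most frequent tokens first are assumptions, and special tokens are ignored here.

```python
import collections


def prune_counts(counts, max_size=0, min_frequency=1):
    """Hypothetical helper: returns the tokens kept after pruning."""
    kept = []
    # most_common() yields tokens by decreasing frequency.
    for token, frequency in counts.most_common():
        if max_size > 0 and len(kept) >= max_size:
            break
        if frequency >= min_frequency:
            kept.append(token)
    return kept


counts = collections.Counter({"the": 50, "be": 20, "zyx": 1})
frequent = prune_counts(counts, min_frequency=2)
top_one = prune_counts(counts, max_size=1)
unpruned = prune_counts(counts)
print(frequent, top_one, unpruned)
```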
- pad_to_multiple(multiple, num_oov_buckets=1)
Pads the vocabulary size to a multiple value.
More specifically, this method ensures that:
(vocab_size + num_oov_buckets) % multiple == 0
- Parameters
multiple – The multiple value.
num_oov_buckets – The number of OOV buckets added during training. Usually just 1 for the <unk> token.
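The arithmetic behind the condition above can be worked through with a short helper. `padding_needed` is a hypothetical function that only computes how many filler entries would satisfy the equation; it does not reproduce how the library names or inserts those entries.

```python
def padding_needed(vocab_size, multiple, num_oov_buckets=1):
    """Hypothetical helper: entries to add so the padded total is a multiple."""
    total = vocab_size + num_oov_buckets
    # Python's modulo on a negative operand gives the distance to the next multiple.
    return (-total) % multiple


extra = padding_needed(32000, 8)
# 32000 tokens + 1 OOV bucket = 32001, so 7 more entries reach 32008.
assert (32000 + extra + 1) % 8 == 0
print(extra)
```

With 31999 tokens and one OOV bucket the total is already 32000, so no padding would be needed for a multiple of 8.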