Vocab

class opennmt.data.Vocab(special_tokens=None)

Vocabulary class.

Example

>>> vocab = opennmt.data.Vocab.from_file("wmtende.vocab")
>>> len(vocab)
32000
>>> "be" in vocab
True
>>> vocab.lookup("be")
377
>>> vocab.lookup(377)
'be'

Inherits from: builtins.object

__init__(special_tokens=None)

Initializes a vocabulary.

Parameters: special_tokens – A list of special tokens (e.g. start of sentence).

classmethod from_file(path, file_format='default')

Creates a vocabulary from a vocabulary file.

Parameters

path – The path to the vocabulary file.
file_format – Defines the format of the vocabulary file. Can be: default, sentencepiece. “default” is simply one token per line.

Raises

ValueError – if file_format is invalid.
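
As an illustration of the “default” layout (one token per line), the sketch below reads such a file with plain Python. It is not OpenNMT's implementation: read_default_vocab and the toy file contents are hypothetical, and it assumes that token ids follow the order of the lines in the file.

```python
import os
import tempfile


def read_default_vocab(path):
    """Reads one token per line, preserving order (hypothetical helper)."""
    with open(path, encoding="utf-8") as vocab_file:
        # Strip only the line terminator so each line maps to exactly one token.
        return [line.rstrip("\r\n") for line in vocab_file]


# Write a tiny vocabulary file to show the expected layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "toy.vocab")
    with open(vocab_path, "w", encoding="utf-8") as vocab_file:
        vocab_file.write("<blank>\n<s>\n</s>\nthe\nbe\n")
    tokens = read_default_vocab(vocab_path)

# Assumption: the id of a token is its zero-based line number.
token_to_id = {token: index for index, token in enumerate(tokens)}
```

With the real class, the equivalent call would be opennmt.data.Vocab.from_file(path), as in the example at the top of this page.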

property size: Returns the number of entries of the vocabulary.

property words: Returns the list of words.

__len__(): Returns the number of entries of the vocabulary.

__contains__(token): Returns True if the vocabulary contains token.

add_from_text(filename, tokenizer=None)

Fills the vocabulary from a text file.

Parameters

filename – The file to load from.
tokenizer – A callable to tokenize a line of text.
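
The sketch below shows, under stated assumptions, what filling a vocabulary from a text file amounts to: each line is tokenized and every token's frequency is incremented. count_tokens is a hypothetical helper, not part of OpenNMT, and the fallback to whitespace splitting when tokenizer is None is an assumption.

```python
import os
import tempfile
from collections import Counter


def count_tokens(filename, tokenizer=None):
    """Counts token frequencies in a text file (hypothetical helper)."""
    # Assumption: split on whitespace when no tokenizer is given.
    tokenize = tokenizer if tokenizer is not None else str.split
    counts = Counter()
    with open(filename, encoding="utf-8") as text_file:
        for line in text_file:
            # Each token is added once, or its frequency is increased.
            counts.update(tokenize(line.strip()))
    return counts


with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = os.path.join(tmp_dir, "toy.txt")
    with open(text_path, "w", encoding="utf-8") as text_file:
        text_file.write("the cat\nthe dog\n")
    word_counts = count_tokens(text_path)
    # Any callable that maps a line to a sequence of tokens can be passed.
    char_counts = count_tokens(text_path, tokenizer=list)
```

Passing list as the tokenizer yields a character-level count, which shows that the tokenizer argument only needs to be a callable returning tokens.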

serialize(path)

Writes the vocabulary to disk.

Parameters: path – The path where the vocabulary will be saved.

load(path, file_format='default')

Loads a serialized vocabulary.

Parameters

path – The path to the vocabulary to load.
file_format – Defines the format of the vocabulary file. Can be: default, sentencepiece. “default” is simply one token per line.

Raises

ValueError – if file_format is invalid.

add(token)

Adds a token or increases its frequency.

Parameters: token – The string to add.

lookup(identifier, default=None)

Looks up an identifier in the vocabulary.

Parameters

identifier – A string or an index to look up.
default – The value to return if identifier is not found.

Returns

The value associated with identifier or default.
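
The two-way behavior described above (string to index, index to string, with a fallback value) can be sketched with a small stand-in class. TinyVocab is hypothetical and only mimics the documented contract; it is not how opennmt.data.Vocab is implemented.

```python
class TinyVocab:
    """Hypothetical stand-in that mimics the documented lookup contract."""

    def __init__(self, tokens):
        self._id_to_token = list(tokens)
        self._token_to_id = {
            token: index for index, token in enumerate(self._id_to_token)
        }

    def lookup(self, identifier, default=None):
        # A string is mapped to its index.
        if isinstance(identifier, str):
            return self._token_to_id.get(identifier, default)
        # An in-range index is mapped back to its string.
        if 0 <= identifier < len(self._id_to_token):
            return self._id_to_token[identifier]
        # Anything else falls back to the default value.
        return default


vocab = TinyVocab(["<blank>", "the", "be"])
be_id = vocab.lookup("be")
be_token = vocab.lookup(be_id)
missing = vocab.lookup("zebra", default=-1)
out_of_range = vocab.lookup(99)
```

Passing an explicit default is useful when a missing token should map to a sentinel value rather than None.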

prune(max_size=0, min_frequency=1)

Creates a pruned version of the vocabulary.

Parameters

max_size – The maximum vocabulary size.
min_frequency – The minimum frequency of each entry.

Returns

A new vocabulary.
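
A minimal sketch of the two pruning criteria, written against a plain collections.Counter rather than the real class. prune_counts is a hypothetical helper; it assumes that max_size=0 means "no size limit" (as the default value suggests) and it ignores special tokens, which the actual method may treat differently.

```python
from collections import Counter


def prune_counts(counts, max_size=0, min_frequency=1):
    """Returns the kept tokens, most frequent first (hypothetical helper)."""
    # Drop entries that are below the frequency threshold.
    kept = [
        token
        for token, frequency in counts.most_common()
        if frequency >= min_frequency
    ]
    # Assumption: max_size == 0 disables the size limit.
    if max_size > 0:
        kept = kept[:max_size]
    return kept


counts = Counter("the cat and the dog and the bird".split())
frequent = prune_counts(counts, min_frequency=2)
top_one = prune_counts(counts, max_size=1)
unpruned = prune_counts(counts)
```

Like the documented method, the helper returns a new object and leaves the original counts untouched.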

pad_to_multiple(multiple, num_oov_buckets=1)

Pads the vocabulary size to a multiple value.

More specifically, this method ensures that:

(vocab_size + num_oov_buckets) % multiple == 0

Parameters

multiple – The multiple value.
num_oov_buckets – The number of OOV buckets added during training. Usually just 1 for the <unk> token.
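
The arithmetic behind the condition above can be worked through in a few lines. padding_needed is a hypothetical helper that only computes how many filler entries would be required; how the real method names or stores those entries is not covered here.

```python
def padding_needed(vocab_size, multiple, num_oov_buckets=1):
    """Number of filler entries needed to satisfy the padding condition."""
    total = vocab_size + num_oov_buckets
    # Distance from total up to the next multiple (0 if already aligned).
    return (-total) % multiple


# 32000 entries + 1 OOV bucket = 32001, and the next multiple of 8 is 32008.
extra = padding_needed(32000, 8)
# 31999 entries + 1 OOV bucket = 32000, which is already a multiple of 8.
already_aligned = padding_needed(31999, 8)
```

After adding the computed number of entries, (vocab_size + num_oov_buckets) % multiple == 0 holds for the padded size.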