opennmt.inputters.text_inputter module

Define word-based embedders.

opennmt.inputters.text_inputter.visualize_embeddings(log_dir, embedding_var, vocabulary_file, num_oov_buckets=1)[source]

Registers an embedding variable for visualization in TensorBoard.

This function registers embedding_var in the projector_config.pbtxt file and generates metadata from vocabulary_file to attach a label to each word ID.

Parameters:
  • log_dir – The active log directory.
  • embedding_var – The embedding variable to visualize.
  • vocabulary_file – The associated vocabulary file.
  • num_oov_buckets – The number of additional unknown tokens.
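
A hedged usage sketch; the variable name, log directory, and vocabulary path are assumptions, and the first embedding dimension is assumed to equal the vocabulary size plus num_oov_buckets:

import tensorflow as tf
from opennmt.inputters.text_inputter import visualize_embeddings

# The first dimension is assumed to cover the vocabulary plus the OOV bucket.
embedding_var = tf.get_variable("w_embs", shape=[50000 + 1, 512])

visualize_embeddings(
    "run/baseline",     # log_dir read by TensorBoard
    embedding_var,
    "data/vocab.txt",   # one word per line
    num_oov_buckets=1)
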
opennmt.inputters.text_inputter.load_pretrained_embeddings(embedding_file, vocabulary_file, num_oov_buckets=1, with_header=True, case_insensitive_embeddings=True)[source]

Returns pretrained embeddings aligned with the vocabulary.

The embedding_file must have the following format:

N M
word1 val1 val2 ... valM
word2 val1 val2 ... valM
...
wordN val1 val2 ... valM

or if with_header is False:

word1 val1 val2 ... valM
word2 val1 val2 ... valM
...
wordN val1 val2 ... valM

This function iterates over each embedding in embedding_file and assigns the pretrained vector to the associated word in vocabulary_file, if found. Otherwise, the embedding is ignored.

If case_insensitive_embeddings is True, word embeddings are assumed to be trained on lowercase data. In that case, word matching is case insensitive, meaning the pretrained embedding for “the” will be assigned to “the”, “The”, “THE”, or any other case variant included in vocabulary_file.

Parameters:
  • embedding_file – Path to the embedding file. Entries will be matched against vocabulary_file.
  • vocabulary_file – The vocabulary file containing one word per line.
  • num_oov_buckets – The number of additional unknown tokens.
  • with_header – True if the embedding file starts with a header line like in GloVe embedding files.
  • case_insensitive_embeddings – True if embeddings are trained on lowercase data.
Returns:

A Numpy array of shape [vocabulary_size + num_oov_buckets, embedding_size].
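
A minimal call sketch, assuming an embedding file with the header format shown above; the file paths are assumptions:

from opennmt.inputters.text_inputter import load_pretrained_embeddings

embeddings = load_pretrained_embeddings(
    "data/embeddings.txt",   # "N M" header followed by one embedding per line
    "data/vocab.txt",        # one word per line
    num_oov_buckets=1,
    with_header=True,
    case_insensitive_embeddings=True)
# embeddings.shape == (vocabulary_size + num_oov_buckets, M)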

opennmt.inputters.text_inputter.tokens_to_chars(tokens)[source]

Splits a list of tokens into Unicode characters.

This is an in-graph transformation.

Parameters: tokens – A sequence of tokens.
Returns: The characters as a tf.Tensor of shape [sequence_length, max_word_length] and the length of each word.
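
A hedged in-graph sketch (TensorFlow 1.x session mode; the token values and printed lengths are illustrative only):

import tensorflow as tf
from opennmt.inputters.text_inputter import tokens_to_chars

tokens = tf.constant(["Hello", "world", "!"])
chars, lengths = tokens_to_chars(tokens)

with tf.Session() as sess:
    print(sess.run(lengths))  # e.g. [5 5 1]
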
class opennmt.inputters.text_inputter.TextInputter(tokenizer=<opennmt.tokenizers.tokenizer.SpaceTokenizer object>, dtype=tf.float32)[source]

Bases: opennmt.inputters.inputter.Inputter

An abstract inputter that processes text.

get_length(data)[source]

Returns the length of the input data, if defined.

make_dataset(data_file)[source]

Creates the dataset required by this inputter.

Parameters: data_file – The data file.
Returns: A tf.data.Dataset.
get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters: data_file – The data file.
Returns: The total size.
initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter within the current graph.

For example, one can create lookup tables in this method so that their initializers are added to the current graph's TABLE_INITIALIZERS collection.

Parameters:
  • metadata – A dictionary containing additional metadata set by the user.
  • asset_dir – The directory where assets can be written. If None, no assets are returned.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

transform(inputs, mode)[source]

Transforms inputs.

Parameters:
  • inputs – A (possibly nested structure of) tf.Tensor which depends on the inputter.
  • mode – A tf.estimator.ModeKeys mode.
Returns:

The transformed input.

class opennmt.inputters.text_inputter.WordEmbedder(vocabulary_file_key, embedding_size=None, embedding_file_key=None, embedding_file_with_header=True, case_insensitive_embeddings=True, trainable=True, dropout=0.0, tokenizer=<opennmt.tokenizers.tokenizer.SpaceTokenizer object>, dtype=tf.float32)[source]

Bases: opennmt.inputters.text_inputter.TextInputter

Simple word embedder.

__init__(vocabulary_file_key, embedding_size=None, embedding_file_key=None, embedding_file_with_header=True, case_insensitive_embeddings=True, trainable=True, dropout=0.0, tokenizer=<opennmt.tokenizers.tokenizer.SpaceTokenizer object>, dtype=tf.float32)[source]

Initializes the parameters of the word embedder.

Parameters:
  • vocabulary_file_key – The data configuration key of the vocabulary file containing one word per line.
  • embedding_size – The size of the resulting embedding. If None, an embedding file must be provided.
  • embedding_file_key – The data configuration key of the embedding file.
  • embedding_file_with_header – True if the embedding file starts with a header line like in GloVe embedding files.
  • case_insensitive_embeddings – True if embeddings are trained on lowercase data.
  • trainable – If False, do not optimize embeddings.
  • dropout – The probability to drop units in the embedding.
  • tokenizer – An optional opennmt.tokenizers.tokenizer.Tokenizer to tokenize the input text.
  • dtype – The embedding type.
Raises:

ValueError – if neither embedding_size nor embedding_file_key is set.

See also

The opennmt.inputters.text_inputter.load_pretrained_embeddings() function for details about the pretrained embedding format and behavior.
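
A hedged construction sketch. vocabulary_file_key and embedding_file_key refer to entries of the user data configuration, so the keys and paths below are assumptions chosen for illustration:

# Assumed entries in the data configuration:
#   source_words_vocabulary: data/src-vocab.txt
#   source_words_embedding: data/src-embeddings.txt

from opennmt.inputters.text_inputter import WordEmbedder

inputter = WordEmbedder(
    vocabulary_file_key="source_words_vocabulary",
    embedding_file_key="source_words_embedding",
    embedding_file_with_header=True,
    case_insensitive_embeddings=True,
    trainable=True,
    dropout=0.1)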

initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter within the current graph.

For example, one can create lookup tables in this method so that their initializers are added to the current graph's TABLE_INITIALIZERS collection.

Parameters:
  • metadata – A dictionary containing additional metadata set by the user.
  • asset_dir – The directory where assets can be written. If None, no assets are returned.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.
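
Continuing the construction sketch above, a hedged initialization example; metadata is the user data configuration whose keys must match the *_key constructor arguments, and the paths are assumptions:

metadata = {
    "source_words_vocabulary": "data/src-vocab.txt",
    "source_words_embedding": "data/src-embeddings.txt",
}
# May create lookup tables in the current graph and returns additional assets.
assets = inputter.initialize(metadata)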

visualize(log_dir)[source]

Visualizes the transformation, usually embeddings.

Parameters: log_dir – The active log directory.
transform(inputs, mode)[source]

Transforms inputs.

Parameters:
  • inputs – A (possibly nested structure of) tf.Tensor which depends on the inputter.
  • mode – A tf.estimator.ModeKeys mode.
Returns:

The transformed input.

class opennmt.inputters.text_inputter.CharEmbedder(vocabulary_file_key, embedding_size, dropout=0.0, tokenizer=<opennmt.tokenizers.tokenizer.SpaceTokenizer object>, dtype=tf.float32)[source]

Bases: opennmt.inputters.text_inputter.TextInputter

Base class for character-aware inputters.

__init__(vocabulary_file_key, embedding_size, dropout=0.0, tokenizer=<opennmt.tokenizers.tokenizer.SpaceTokenizer object>, dtype=tf.float32)[source]

Initializes the parameters of the character embedder.

Parameters:
  • vocabulary_file_key – The meta configuration key of the vocabulary file containing one character per line.
  • embedding_size – The size of the character embedding.
  • dropout – The probability to drop units in the embedding.
  • tokenizer – An optional opennmt.tokenizers.tokenizer.Tokenizer to tokenize the input text.
  • dtype – The embedding type.
initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter within the current graph.

For example, one can create lookup tables in this method so that their initializers are added to the current graph's TABLE_INITIALIZERS collection.

Parameters:
  • metadata – A dictionary containing additional metadata set by the user.
  • asset_dir – The directory where assets can be written. If None, no assets are returned.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

visualize(log_dir)[source]

Visualizes the transformation, usually embeddings.

Parameters: log_dir – The active log directory.
transform(inputs, mode)[source]

Transforms inputs.

Parameters:
  • inputs – A (possibly nested structure of) tf.Tensor which depends on the inputter.
  • mode – A tf.estimator.ModeKeys mode.
Returns:

The transformed input.

class opennmt.inputters.text_inputter.CharConvEmbedder(vocabulary_file_key, embedding_size, num_outputs, kernel_size=5, stride=3, dropout=0.0, tokenizer=<opennmt.tokenizers.tokenizer.SpaceTokenizer object>, dtype=tf.float32)[source]

Bases: opennmt.inputters.text_inputter.CharEmbedder

An inputter that applies a convolution over character embeddings.

__init__(vocabulary_file_key, embedding_size, num_outputs, kernel_size=5, stride=3, dropout=0.0, tokenizer=<opennmt.tokenizers.tokenizer.SpaceTokenizer object>, dtype=tf.float32)[source]

Initializes the parameters of the character convolution embedder.

Parameters:
  • vocabulary_file_key – The meta configuration key of the vocabulary file containing one character per line.
  • embedding_size – The size of the character embedding.
  • num_outputs – The dimension of the convolution output space.
  • kernel_size – Length of the convolution window.
  • stride – Length of the convolution stride.
  • dropout – The probability to drop units in the embedding.
  • tokenizer – An optional opennmt.tokenizers.tokenizer.Tokenizer to tokenize the input text.
  • dtype – The embedding type.
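
A hedged construction sketch; the configuration key and sizes are assumptions:

from opennmt.inputters.text_inputter import CharConvEmbedder

inputter = CharConvEmbedder(
    vocabulary_file_key="source_chars_vocabulary",
    embedding_size=30,
    num_outputs=512,
    kernel_size=5,
    stride=3,
    dropout=0.1)
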
transform(inputs, mode)[source]

Transforms inputs.

Parameters:
  • inputs – A (possibly nested structure of) tf.Tensor which depends on the inputter.
  • mode – A tf.estimator.ModeKeys mode.
Returns:

The transformed input.

class opennmt.inputters.text_inputter.CharRNNEmbedder(vocabulary_file_key, embedding_size, num_units, dropout=0.2, encoding='average', cell_class=<class 'tensorflow.python.ops.rnn_cell_impl.LSTMCell'>, tokenizer=<opennmt.tokenizers.tokenizer.SpaceTokenizer object>, dtype=tf.float32)[source]

Bases: opennmt.inputters.text_inputter.CharEmbedder

An inputter that runs a single RNN layer over character embeddings.

__init__(vocabulary_file_key, embedding_size, num_units, dropout=0.2, encoding='average', cell_class=<class 'tensorflow.python.ops.rnn_cell_impl.LSTMCell'>, tokenizer=<opennmt.tokenizers.tokenizer.SpaceTokenizer object>, dtype=tf.float32)[source]

Initializes the parameters of the character RNN embedder.

Parameters:
  • vocabulary_file_key – The meta configuration key of the vocabulary file containing one character per line.
  • embedding_size – The size of the character embedding.
  • num_units – The number of units in the RNN layer.
  • dropout – The probability to drop units in the embedding and the RNN outputs.
  • encoding – “average” or “last” (case insensitive), the encoding vector to extract from the RNN outputs.
  • cell_class – The inner cell class or a callable taking num_units as argument and returning a cell.
  • tokenizer – An optional opennmt.tokenizers.tokenizer.Tokenizer to tokenize the input text.
  • dtype – The embedding type.
Raises:

ValueError – if encoding is invalid.
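
A hedged construction sketch; the configuration key and sizes are assumptions, and the default LSTM cell class is kept:

from opennmt.inputters.text_inputter import CharRNNEmbedder

inputter = CharRNNEmbedder(
    vocabulary_file_key="source_chars_vocabulary",
    embedding_size=30,
    num_units=256,
    dropout=0.2,
    encoding="average")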

transform(inputs, mode)[source]

Transforms inputs.

Parameters:
  • inputs – A (possibly nested structure of) tf.Tensor which depends on the inputter.
  • mode – A tf.estimator.ModeKeys mode.
Returns:

The transformed input.