ParallelInputter

class opennmt.inputters.ParallelInputter(*args, **kwargs)[source]

A multi inputter that processes parallel data.

Inherits from: opennmt.inputters.MultiInputter

Extended by:

opennmt.inputters.ExampleInputter

__init__(inputters, reducer=None, share_parameters=False, combine_features=True)[source]

Initializes a parallel inputter.

Parameters

inputters – A list of opennmt.inputters.Inputter.
reducer – An opennmt.layers.Reducer to merge all inputs. If set, parallel inputs are assumed to have the same length.
share_parameters – Share the inputters' parameters.
combine_features – Combine the features of all inputters in a single dict, or return them separately. This is typically True for multi-source inputs but False for features/labels parallel data.
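
To make the effect of combine_features more concrete, the following plain-Python sketch merges the feature dictionaries of two parallel inputs either into a single dict or into separate entries. It does not call OpenNMT-tf; the function name merge_features and the "inputter_{i}_" key prefix are assumptions chosen for illustration only.

```python
# Plain-Python illustration only; this is not OpenNMT-tf code.
# The "inputter_{i}_" key prefix is a hypothetical naming choice.

def merge_features(per_inputter_features, combine_features=True):
    """Merge the feature dicts produced by several parallel inputters."""
    if not combine_features:
        # Keep one feature dict per inputter, e.g. (features, labels).
        return tuple(per_inputter_features)
    combined = {}
    for index, features in enumerate(per_inputter_features):
        for key, value in features.items():
            # Prefix each key so identical keys from different inputters
            # do not overwrite each other.
            combined[f"inputter_{index}_{key}"] = value
    return combined


source_a = {"tokens": ["hello", "world"], "length": 2}
source_b = {"tokens": ["bonjour", "monde"], "length": 2}

single_dict = merge_features([source_a, source_b], combine_features=True)
separate = merge_features([source_a, source_b], combine_features=False)
```

In this sketch, single_dict holds every feature under a prefixed key, which suits a model that reads several sources at once, while separate keeps the two dicts apart, as one would for features and labels.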

make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters

data_file – The data file.
training – Run in training mode.

Returns

A tf.data.Dataset instance or a list of tf.data.Dataset instances.

get_dataset_size(data_file)[source]

Returns the dataset size.

If the inputter can efficiently compute the dataset size from a training file on disk, it can optionally override this method. Otherwise, we may compute the size later with a generic and slower approach (iterating over the dataset instance).

Parameters

data_file – The data file.

Returns

The dataset size or None.
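
The fallback described above can be sketched in plain Python: prefer a cheap size computation when one is available, and iterate over the dataset only when it returns None. The helper names (fast_size, slow_size, resolve_size) and the one-example-per-line file layout are assumptions for this illustration, not part of the library.

```python
import os
import tempfile


def fast_size(path):
    """Hypothetical override: one example per line, so count lines directly."""
    with open(path, "rb") as data:
        return sum(1 for _ in data)


def slow_size(dataset):
    """Generic fallback: iterate over every element of the built dataset."""
    return sum(1 for _ in dataset)


def make_dataset(path):
    """Stand-in for a real dataset: yields one tokenized line at a time."""
    with open(path, encoding="utf-8") as data:
        for line in data:
            yield line.split()


def resolve_size(path, get_dataset_size):
    """Use the inputter-specific size if defined, else count by iteration."""
    size = get_dataset_size(path)
    if size is None:
        size = slow_size(make_dataset(path))
    return size


with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a b c\nd e\nf\n")
    path = tmp.name

size_fast = resolve_size(path, fast_size)
size_slow = resolve_size(path, lambda _: None)  # no efficient override
os.remove(path)
```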

input_signature()[source]: Returns the input signature of this inputter.

get_length(features, ignore_special_tokens=False)[source]

Returns the length of the input features, if defined.

Parameters

features – The dictionary of input features.
ignore_special_tokens – Ignore special tokens that were added by the inputter (e.g. <s> and/or </s>).

Returns

The length.
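
As a rough illustration of ignore_special_tokens, the sketch below subtracts the start and end markers from a stored length. It is plain Python rather than library code; the "length" feature key and the mark_start / mark_end flags are assumptions made for the example.

```python
# Plain-Python illustration; the "length" key and the mark_start/mark_end
# flags are assumptions, not a confirmed part of the library API.

def get_length(features, ignore_special_tokens=False,
               mark_start=True, mark_end=True):
    """Return the sequence length, optionally without <s> and </s>."""
    length = features["length"]
    if ignore_special_tokens:
        # Each enabled marker contributed exactly one extra token.
        length -= int(mark_start) + int(mark_end)
    return length


features = {"tokens": ["<s>", "hello", "world", "</s>"], "length": 4}
full_length = get_length(features)
content_length = get_length(features, ignore_special_tokens=True)
```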

get_padded_shapes(element_spec, maximum_length=None)[source]

Returns the padded shapes for dataset elements.

For example, this is used during batch size autotuning to pad all batches to the maximum sequence length.

Parameters

element_spec – A nested structure of tf.TensorSpec.
maximum_length – Pad batches to this maximum length.

Returns

A nested structure of tf.TensorShape.
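
The idea of padded shapes can be sketched without TensorFlow by modelling shapes as tuples in which None marks a dimension of unknown size. In this hypothetical helper, every unknown dimension is replaced by maximum_length when it is given and is left as None otherwise, which conventionally means "pad to the longest element in the batch". The representation (dicts and lists for nesting, tuples for shapes) is an assumption of the example.

```python
# Plain-Python illustration: dicts and lists are nesting, tuples are shapes,
# and None stands for a dimension whose size is not known in advance.

def padded_shapes(spec, maximum_length=None):
    """Return a structure like `spec` with unknown dimensions filled in."""
    if isinstance(spec, dict):
        return {
            key: padded_shapes(value, maximum_length)
            for key, value in spec.items()
        }
    if isinstance(spec, list):
        return [padded_shapes(value, maximum_length) for value in spec]
    # Leaf: a shape tuple such as (None,) for a variable-length sequence.
    return tuple(maximum_length if dim is None else dim for dim in spec)


element_spec = {"ids": (None,), "length": ()}
fixed = padded_shapes(element_spec, maximum_length=50)
dynamic = padded_shapes(element_spec)
```

With a fixed maximum length, every batch has the same shape, which is what makes the result usable for measuring the memory cost of the largest possible batch.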

make_features(element=None, features=None, training=None)[source]

Creates features from data.

This is typically called in a data pipeline (such as Dataset.map). Common transformations include tokenization, parsing, vocabulary lookup, etc.

This method accepts either a single element from the dataset or a partially built dictionary of features.

Parameters

element – An element from the dataset returned by opennmt.inputters.Inputter.make_dataset().
features – An optional and possibly partial dictionary of features to augment.
training – Run in training mode.

Returns

A dictionary of tf.Tensor.
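
The two calling conventions (a raw element, or a partially built dictionary of features) can be sketched as follows. This is a plain-Python stand-in using whitespace tokenization; the "tokens" and "length" keys are assumptions for the example and the function is not the library's implementation.

```python
# Plain-Python illustration; the "tokens" and "length" keys are assumptions.

def make_features(element=None, features=None):
    """Build features from a raw element or complete a partial dict."""
    features = dict(features) if features is not None else {}
    if "tokens" not in features:
        if element is None:
            raise ValueError("either element or features['tokens'] is required")
        # Stand-in for real tokenization.
        features["tokens"] = element.split()
    features["length"] = len(features["tokens"])
    return features


from_element = make_features(element="hello world !")
from_partial = make_features(features={"tokens": ["already", "tokenized"]})
```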

keep_for_training(features, maximum_length=None)[source]

Returns True if this example should be kept for training.

Parameters

features – A dictionary of tf.Tensor.
maximum_length – The maximum length used for training.

Returns

A boolean.
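
A length-based filter of this kind might look like the sketch below, written in plain Python. It assumes each parallel input carries a "length" feature and that an example is kept only if every parallel input is non-empty and within maximum_length; both points are assumptions of the illustration rather than documented behavior.

```python
# Plain-Python illustration; the "length" key and the "all inputs must pass"
# rule are assumptions made for this example.

def keep_one(features, maximum_length=None):
    """Keep a single input if it is non-empty and not too long."""
    length = features["length"]
    if length <= 0:
        return False
    return maximum_length is None or length <= maximum_length


def keep_for_training(parallel_features, maximum_length=None):
    """Keep the example only if every parallel input passes the filter."""
    return all(keep_one(f, maximum_length) for f in parallel_features)


short_pair = [{"length": 5}, {"length": 7}]
long_pair = [{"length": 5}, {"length": 120}]
empty_pair = [{"length": 0}, {"length": 3}]

kept = [
    keep_for_training(pair, maximum_length=100)
    for pair in (short_pair, long_pair, empty_pair)
]
```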

build(input_shape)[source]

Creates the variables of the layer (for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call. It is invoked automatically before the first execution of call().

This is typically used to create the weights of Layer subclasses (at the discretion of the subclass implementer).

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
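
The deferred state creation described here can be illustrated with a minimal plain-Python class that mimics the pattern: weights are allocated in build(), which runs once, just before the first call(). The class LazyLayer is hypothetical and only imitates the control flow; it does not use Keras.

```python
# Plain-Python imitation of the build-before-first-call pattern.
# LazyLayer is a hypothetical class and does not depend on Keras.

class LazyLayer:
    def __init__(self, units):
        self.units = units
        self.built = False
        self.weights = None

    def build(self, input_shape):
        # The input depth is only known once the first input is seen.
        depth = input_shape[-1]
        self.weights = [[0.0] * self.units for _ in range(depth)]
        self.built = True

    def call(self, inputs):
        # Matrix product of inputs [batch, depth] and weights [depth, units].
        return [
            [sum(x * w for x, w in zip(row, column))
             for column in zip(*self.weights)]
            for row in inputs
        ]

    def __call__(self, inputs):
        if not self.built:
            self.build((len(inputs), len(inputs[0])))
        return self.call(inputs)


layer = LazyLayer(units=3)
built_before = layer.built
output = layer([[1.0, 2.0]])
built_after = layer.built
```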

call(features, training=None)[source]

Creates the model input from the features (e.g. word embeddings).

Parameters

features – A dictionary of tf.Tensor, the output of opennmt.inputters.Inputter.make_features().
training – Run in training mode.

Returns

The model input.
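
When a reducer is set, the outputs produced for each parallel input have to be merged into one model input, which is why the inputs are assumed to have the same length. The plain-Python sketch below shows one possible reduction, concatenation along the depth dimension; the function concat_reduce is hypothetical and is not the library's reducer.

```python
# Plain-Python illustration of merging parallel inputs by concatenation.
# concat_reduce is a hypothetical helper, not an OpenNMT-tf reducer.

def concat_reduce(inputs):
    """Concatenate per-timestep vectors from several same-length inputs."""
    lengths = {len(sequence) for sequence in inputs}
    if len(lengths) != 1:
        raise ValueError("parallel inputs must have the same length")
    return [
        [value for vector in timestep for value in vector]
        for timestep in zip(*inputs)
    ]


word_vectors = [[0.1, 0.2], [0.3, 0.4]]  # 2 timesteps, depth 2
tag_vectors = [[1.0], [0.0]]             # 2 timesteps, depth 1

merged = concat_reduce([word_vectors, tag_vectors])
```

Each timestep of merged has depth 3 in this example, the sum of the depths of the two inputs.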