Inputter

class opennmt.inputters.Inputter(*args, **kwargs)[source]

Base class for inputters.

Inherits from: keras.src.engine.base_layer.Layer

property asset_prefix

The asset prefix is used to differentiate resources of parallel inputters. The most basic examples are the “source_” and “target_” prefixes.

  • When reading the data configuration, the inputter will read fields that start with this prefix (e.g. “source_vocabulary”).

  • Assets exported by this inputter start with this prefix.
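
For instance, two parallel inputters with the “source_” and “target_” prefixes would read from a configuration like the following (a minimal sketch; keys and file names are illustrative):

    # Hypothetical data configuration: each inputter only reads the fields
    # starting with its own asset prefix.
    data_config = {
        "source_vocabulary": "src-vocab.txt",  # read by the "source_" inputter
        "target_vocabulary": "tgt-vocab.txt",  # read by the "target_" inputter
    }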

property num_outputs

The number of parallel outputs produced by this inputter.

initialize(data_config)[source]

Initializes the inputter.

Parameters

data_config – A dictionary containing the data configuration set by the user.
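
For example, a standalone inputter could be initialized like this (a minimal sketch using opennmt.inputters.WordEmbedder; the vocabulary path is hypothetical and the empty asset prefix of a standalone inputter is an assumption here):

    import opennmt

    # Build a word embedding inputter and initialize it from the user data
    # configuration; with an empty asset prefix the field is read as
    # "vocabulary".
    inputter = opennmt.inputters.WordEmbedder(embedding_size=512)
    inputter.initialize({"vocabulary": "vocab.txt"})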

export_assets(asset_dir)[source]

Exports assets used by this inputter.

Parameters

asset_dir – The directory where assets can be written.

Returns

A dictionary containing additional assets used by the inputter.
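
Continuing the sketch above, exporting the assets of an initialized inputter might look like this (the directory is hypothetical and should already exist):

    # Write the inputter's assets (e.g. the vocabulary file) into an export
    # directory and inspect the returned mapping of asset names to paths.
    assets = inputter.export_assets("export/assets")
    for name, path in assets.items():
        print(name, path)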

abstract make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters
  • data_file – The data file.

  • training – Run in training mode.

Returns

A tf.data.Dataset instance or a list of tf.data.Dataset instances.
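
A minimal subclass sketch, assuming a line-based text format (this mirrors what a text inputter could do and is not the library implementation; other abstract methods are omitted):

    import tensorflow as tf

    import opennmt

    class LineInputter(opennmt.inputters.Inputter):
        """Hypothetical inputter reading one example per line."""

        def make_dataset(self, data_file, training=None):
            # TextLineDataset also accepts a list of file names.
            return tf.data.TextLineDataset(data_file)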

get_dataset_size(data_file)[source]

Returns the dataset size.

Subclasses that can efficiently compute the dataset size from a training file on disk can optionally override this method. Otherwise, the size may be computed later with a generic but slower approach: iterating over the dataset instance.

Parameters

data_file – The data file.

Returns

The dataset size or None.
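
For example, the hypothetical LineInputter sketched above could count lines on disk directly, which is much cheaper than iterating over the dataset:

    def get_dataset_size(self, data_file):
        # One example per line, so the size is the line count.
        with open(data_file, "rb") as data:
            return sum(1 for _ in data)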

make_inference_dataset(features_file, batch_size, batch_type='examples', length_bucket_width=None, num_threads=1, prefetch_buffer_size=None)[source]

Builds a dataset to be used for inference.

For evaluation and training datasets, see opennmt.inputters.ExampleInputter.

Parameters
  • features_file – The test file.

  • batch_size – The batch size to use.

  • batch_type – The batching strategy to use: can be “examples” or “tokens”.

  • length_bucket_width – The width of the length buckets from which batch candidates are selected (for efficiency). Set to None to not constrain batch formation.

  • num_threads – The number of elements processed in parallel.

  • prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value.

Returns

A tf.data.Dataset.
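
Typical usage might look like this (a sketch; the test file name is illustrative):

    # Batch by token count and bucket examples of similar length together,
    # which reduces padding at inference time.
    dataset = inputter.make_inference_dataset(
        "test.txt",
        batch_size=1024,
        batch_type="tokens",
        length_bucket_width=1,
    )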

abstract input_signature()[source]

Returns the input signature of this inputter.
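
A possible implementation for the LineInputter sketched above, receiving one raw text line per batch entry (an assumption, not the signature used by the built-in inputters):

    def input_signature(self):
        return {"text": tf.TensorSpec([None], tf.string)}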

get_length(features, ignore_special_tokens=False)[source]

Returns the length of the input features, if defined.

Parameters
  • features – The dictionary of input features.

  • ignore_special_tokens – Ignore special tokens that were added by the inputter (e.g. <s> and/or </s>).

Returns

The length.
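
In a batched setting, the returned lengths can for example be turned into a padding mask (a sketch; assumes the features were built by opennmt.inputters.Inputter.make_features()):

    lengths = inputter.get_length(features)
    mask = tf.sequence_mask(lengths)  # e.g. to exclude padding positions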

get_padded_shapes(element_spec, maximum_length=None)[source]

Returns the padded shapes for dataset elements.

For example, this is used during batch size autotuning to pad all batches to the maximum sequence length.

Parameters
  • element_spec – A nested structure of tf.TensorSpec.

  • maximum_length – Pad batches to this maximum length.

Returns

A nested structure of tf.TensorShape.
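
For example, to pad every batch to a fixed length (a sketch; the batch size and maximum length are illustrative):

    # Derive the padded shapes from the dataset's element structure and pad
    # all batches to 100 timesteps.
    shapes = inputter.get_padded_shapes(dataset.element_spec, maximum_length=100)
    dataset = dataset.padded_batch(32, padded_shapes=shapes)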

has_prepare_step()[source]

Returns True if this inputter implements a data preparation step in opennmt.inputters.Inputter.prepare_elements().

prepare_elements(elements, training=None)[source]

Prepares dataset elements.

This method is called on a batch of dataset elements. For example, it can be overridden to apply an external pre-tokenization.

Note that the results of this method are unbatched and then passed to opennmt.inputters.Inputter.make_features().

Parameters
  • elements – A batch of dataset elements.

  • training – Run in training mode.

Returns

A (possibly nested) structure of tf.Tensor.
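
A sketch of such an override, using a whitespace split as a stand-in for an external tokenizer (other abstract methods are omitted):

    import tensorflow as tf

    import opennmt

    class PreTokenizedInputter(opennmt.inputters.Inputter):
        """Hypothetical inputter with a batch-level preparation step."""

        def has_prepare_step(self):
            return True

        def prepare_elements(self, elements, training=None):
            # Stand-in for an external tokenizer; the resulting batch is
            # unbatched before make_features() is called.
            return tf.strings.split(elements)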

abstract make_features(element=None, features=None, training=None)[source]

Creates features from data.

This is typically called in a data pipeline (such as Dataset.map). Common transformations include tokenization, parsing, vocabulary lookup, etc.

This method accepts a single element from the dataset, a partially built dictionary of features, or both.

Parameters
  • element – An element from the dataset returned by opennmt.inputters.Inputter.make_dataset().

  • features – An optional and possibly partial dictionary of features to augment.

  • training – Run in training mode.

Returns

A dictionary of tf.Tensor.
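
A minimal implementation sketch for a whitespace-tokenized text input (the feature keys are illustrative, though text inputters commonly expose “tokens” and “length”):

    def make_features(self, element=None, features=None, training=None):
        if features is None:
            features = {}
        if "tokens" not in features:
            # element is a scalar string tensor from a line-based dataset.
            features["tokens"] = tf.strings.split([element]).values
        features["length"] = tf.shape(features["tokens"])[0]
        return features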

keep_for_training(features, maximum_length=None)[source]

Returns True if this example should be kept for training.

Parameters
  • features – A dictionary of tf.Tensor.

  • maximum_length – The maximum length used for training.

Returns

A boolean.
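
For example, a training pipeline might use it as a dataset filter (a sketch; the maximum length is illustrative):

    # Drop examples longer than 100 tokens from the training dataset.
    dataset = dataset.filter(
        lambda features: inputter.keep_for_training(features, maximum_length=100)
    )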

call(features, training=None)[source]

Creates the model input from the features (e.g. word embeddings).

Parameters
  • features – A dictionary of tf.Tensor.

  • training – Run in training mode.

Returns

The model input.
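
For example, calling a word embedding inputter on the features built by opennmt.inputters.Inputter.make_features() returns the embedded sequence (a sketch):

    # Invokes call(); for a WordEmbedder-like inputter the result is
    # e.g. a [batch, time, depth] float tensor of embeddings.
    inputs = inputter(features, training=True)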

visualize(model_root, log_dir)[source]

Visualizes the transformation, usually embeddings.

Parameters
  • model_root – The root model object.

  • log_dir – The active log directory.