opennmt.inputters.inputter module

Define generic inputters.

class opennmt.inputters.inputter.Inputter(dtype=tf.float32)[source]

Bases: tensorflow.python.keras.engine.base_layer.Layer

Base class for inputters.

num_outputs

The number of parallel outputs produced by this inputter.

initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter.

Parameters:
  • metadata – A dictionary containing additional metadata set by the user.
  • asset_dir – The directory where assets can be written. If None, no assets are returned.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

export_assets(asset_dir, asset_prefix='')[source]

Exports assets used by this inputter.

Parameters:
  • asset_dir – The directory where assets can be written.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters:
  • data_file – The data file.
  • training – Run in training mode.
Returns:

A tf.data.Dataset.

make_inference_dataset(features_file, batch_size, bucket_width=None, num_threads=1, prefetch_buffer_size=None)[source]

Builds a dataset to be used for inference.

For evaluation and training datasets, see opennmt.inputters.inputter.ExampleInputter.

Parameters:
  • features_file – The test file.
  • batch_size – The batch size to use.
  • bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
  • num_threads – The number of elements processed in parallel.
  • prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.
Returns:

A tf.data.Dataset.
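
As a hedged sketch of typical usage, the snippet below builds an inference dataset from a word-embedding inputter. The opennmt.inputters.WordEmbedder class, its constructor arguments, and the vocabulary metadata key are assumptions that may differ between OpenNMT-tf versions:

    import opennmt

    # Assumption: WordEmbedder's constructor arguments are version dependent;
    # "source_words_vocabulary" is a hypothetical metadata key.
    inputter = opennmt.inputters.WordEmbedder(
        "source_words_vocabulary", embedding_size=512)
    inputter.initialize({"source_words_vocabulary": "src-vocab.txt"})

    # Batch the test file, grouping sequences of similar length together
    # (bucket_width) to reduce padding.
    dataset = inputter.make_inference_dataset(
        "test.txt", batch_size=32, bucket_width=5)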

get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters: data_file – The data file.
Returns: The total size.
get_serving_input_receiver()[source]

Returns a serving input receiver for this inputter.

Returns: A tf.estimator.export.ServingInputReceiver.
get_receiver_tensors()[source]

Returns the input placeholders for serving.

get_length(features)[source]

Returns the length of the input features, if defined.

make_features(element=None, features=None, training=None)[source]

Creates features from data.

Parameters:
  • element – An element from the dataset.
  • features – An optional dictionary of features to augment.
  • training – Run in training mode.
Returns:

A dictionary of tf.Tensor.

call(features, training=None)[source]

Forwards call to make_inputs().

make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
  • features – A dictionary of tf.Tensor.
  • training – Run in training mode.
Returns:

The model input.
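
As a rough sketch of how make_dataset(), make_features(), get_length(), and make_inputs() fit together (assuming eager execution and an already initialized concrete inputter named inputter):

    # Assumption: `inputter` is a concrete, initialized Inputter instance and
    # eager execution is enabled.
    dataset = inputter.make_dataset("train.txt", training=True)
    dataset = dataset.map(
        lambda element: inputter.make_features(element, training=True))

    for features in dataset.take(1):
        length = inputter.get_length(features)  # sequence length, if defined
        inputs = inputter.make_inputs(features, training=True)  # e.g. embeddings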

visualize(log_dir)[source]

Visualizes the transformation, usually embeddings.

Parameters: log_dir – The active log directory.
set_data_field(data, key, value, volatile=False)[source]

Sets a data field.

Parameters:
  • data – The data dictionary.
  • key – The value key.
  • value – The value to assign.
  • volatile – If True, the key/value pair will be removed once the processing is done.
Returns:

The updated data dictionary.

remove_data_field(data, key)[source]

Removes a data field.

Parameters:
  • data – The data dictionary.
  • key – The value key.
Returns:

The updated data dictionary.

add_process_hooks(hooks)[source]

Adds processing hooks.

Processing hooks are additional, model-specific data processing functions applied after calling this inputter's opennmt.inputters.inputter.Inputter.process() function.

Parameters: hooks – A list of callables with the signature (inputter, data) -> data.
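
For illustration, a minimal hook could derive an extra field with set_data_field(). The "length" field assumed below exists for the standard text inputters but is not guaranteed for every inputter:

    def add_half_length(inputter, data):
        # Hypothetical hook: assumes a "length" field is present in the data
        # dictionary and derives a new field from it.
        if "length" in data:
            data = inputter.set_data_field(
                data, "half_length", data["length"] // 2)
        return data

    # Assumption: `inputter` is a concrete Inputter instance.
    inputter.add_process_hooks([add_half_length])
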
process(data, training=None)[source]

Prepares raw data.

Parameters:
  • data – The raw data.
  • training – Run in training mode.
Returns:

A dictionary of tf.Tensor.

transform_data(data, mode='train', log_dir=None)[source]

Transforms the processed data to an input.

This is usually a simple forward of a data field to opennmt.inputters.inputter.Inputter.transform().

See also process.

Parameters:
  • data – A dictionary of data fields.
  • mode – A tf.estimator.ModeKeys mode.
  • log_dir – The log directory. If set, visualization will be set up.
Returns:

The transformed input.
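
To make the interface concrete, here is a minimal, hypothetical sketch of a custom inputter that reads one whitespace-separated float vector per line. It only implements the methods documented above and assumes a recent TensorFlow version; a production inputter may also need serving-related methods such as get_receiver_tensors():

    import tensorflow as tf
    import opennmt

    class FloatVectorInputter(opennmt.inputters.Inputter):
        """Hypothetical inputter reading one float vector per line."""

        def make_dataset(self, data_file, training=None):
            return tf.data.TextLineDataset(data_file)

        def get_dataset_size(self, data_file):
            return sum(1 for _ in open(data_file, "rb"))

        def make_features(self, element=None, features=None, training=None):
            if features is None:
                features = {}
            values = tf.strings.to_number(tf.strings.split([element]).values)
            features["tensor"] = values
            features["length"] = tf.shape(values)[0]
            return features

        def get_length(self, features):
            return features["length"]

        def make_inputs(self, features, training=None):
            # Add a depth dimension: [time] -> [time, 1]; batching later adds
            # the batch dimension.
            return tf.expand_dims(features["tensor"], -1)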

class opennmt.inputters.inputter.MultiInputter(inputters, reducer=None)[source]

Bases: opennmt.inputters.inputter.Inputter

An inputter that gathers multiple inputters, possibly nested.

num_outputs

The number of parallel outputs produced by this inputter.

get_leaf_inputters()[source]

Returns a list of all leaf Inputter instances.

initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter.

Parameters:
  • metadata – A dictionary containing additional metadata set by the user.
  • asset_dir – The directory where assets can be written. If None, no assets are returned.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

export_assets(asset_dir, asset_prefix='')[source]

Exports assets used by this inputter.

Parameters:
  • asset_dir – The directory where assets can be written.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters:
  • data_file – The data file.
  • training – Run in training mode.
Returns:

A tf.data.Dataset.

get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters: data_file – The data file.
Returns: The total size.
visualize(log_dir)[source]

Visualizes the transformation, usually embeddings.

Parameters: log_dir – The active log directory.
class opennmt.inputters.inputter.ParallelInputter(inputters, reducer=None, share_parameters=False, combine_features=True)[source]

Bases: opennmt.inputters.inputter.MultiInputter

A multi inputter that processes parallel data.

__init__(inputters, reducer=None, share_parameters=False, combine_features=True)[source]

Initializes a parallel inputter.

Parameters:
  • inputters – A list of opennmt.inputters.inputter.Inputter.
  • reducer – An opennmt.layers.reducer.Reducer to merge all inputs. If set, parallel inputs are assumed to have the same length.
  • share_parameters – Whether to share the inputters' parameters.
  • combine_features – Whether to combine each inputter's features into a single dictionary or return them separately. This is typically True for multi-source inputs but False for features/labels parallel data.
Raises:

ValueError – if share_parameters is set but the child inputters are not of the same type.
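
For instance, aligned word and part-of-speech annotations could be embedded in parallel and concatenated. The WordEmbedder constructor arguments and vocabulary metadata keys below are assumptions that may differ across OpenNMT-tf versions:

    import opennmt

    source_inputter = opennmt.inputters.ParallelInputter(
        [
            opennmt.inputters.WordEmbedder(
                "source_words_vocabulary", embedding_size=512),
            opennmt.inputters.WordEmbedder(
                "source_pos_vocabulary", embedding_size=16),
        ],
        # The reducer assumes both inputs have the same length (aligned features).
        reducer=opennmt.layers.ConcatReducer())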

make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters:
  • data_file – The data file.
  • training – Run in training mode.
Returns:

A tf.data.Dataset.

get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters: data_file – The data file.
Returns: The total size.
get_receiver_tensors()[source]

Returns the input placeholders for serving.

get_length(features)[source]

Returns the length of the input features, if defined.

make_features(element=None, features=None, training=None)[source]

Creates features from data.

Parameters:
  • element – An element from the dataset.
  • features – An optional dictionary of features to augment.
  • training – Run in training mode.
Returns:

A dictionary of tf.Tensor.

build(input_shape=None)[source]

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters: input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
  • features – A dictionary of tf.Tensor.
  • training – Run in training mode.
Returns:

The model input.

class opennmt.inputters.inputter.MixedInputter(inputters, reducer=<opennmt.layers.reducer.ConcatReducer object>, dropout=0.0)[source]

Bases: opennmt.inputters.inputter.MultiInputter

A multi inputter that applies several transformations to the same data.

__init__(inputters, reducer=<opennmt.layers.reducer.ConcatReducer object>, dropout=0.0)[source]

Initializes a mixed inputter.

Parameters:
  • inputters – A list of opennmt.inputters.inputter.Inputter.
  • reducer – An opennmt.layers.reducer.Reducer to merge all inputs.
  • dropout – The probability to drop units in the merged inputs.
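
A typical (hypothetical) configuration mixes word-level and character-level representations of the same tokens. The constructor arguments of WordEmbedder and CharConvEmbedder below are assumptions to adapt to your OpenNMT-tf version:

    import opennmt

    mixed_inputter = opennmt.inputters.MixedInputter(
        [
            opennmt.inputters.WordEmbedder(
                "source_words_vocabulary", embedding_size=512),
            opennmt.inputters.CharConvEmbedder(
                "source_chars_vocabulary", embedding_size=30, num_outputs=40),
        ],
        dropout=0.3)  # applied to the merged representation
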
make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters:
  • data_file – The data file.
  • training – Run in training mode.
Returns:

A tf.data.Dataset.

get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters: data_file – The data file.
Returns: The total size.
get_receiver_tensors()[source]

Returns the input placeholders for serving.

get_length(features)[source]

Returns the length of the input features, if defined.

make_features(element=None, features=None, training=None)[source]

Creates features from data.

Parameters:
  • element – An element from the dataset.
  • features – An optional dictionary of features to augment.
  • training – Run in training mode.
Returns:

A dictionary of tf.Tensor.

make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
  • features – A dictionary of tf.Tensor.
  • training – Run in training mode.
Returns:

The model input.

class opennmt.inputters.inputter.ExampleInputter(features_inputter, labels_inputter, share_parameters=False)[source]

Bases: opennmt.inputters.inputter.ParallelInputter

An inputter that returns training examples (parallel features and labels).

__init__(features_inputter, labels_inputter, share_parameters=False)[source]

Initializes this inputter.

Parameters:
  • features_inputter – An inputter producing the features (source).
  • labels_inputter – An inputter producing the labels (target).
  • share_parameters – Whether to share the inputters' parameters.
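
As a hedged sketch (the WordEmbedder arguments and metadata keys are assumptions):

    import opennmt

    example_inputter = opennmt.inputters.ExampleInputter(
        opennmt.inputters.WordEmbedder(
            "source_words_vocabulary", embedding_size=512),
        opennmt.inputters.WordEmbedder(
            "target_words_vocabulary", embedding_size=512))

    # The metadata dictionary resolves the vocabulary keys declared above.
    example_inputter.initialize({
        "source_words_vocabulary": "src-vocab.txt",
        "target_words_vocabulary": "tgt-vocab.txt",
    })
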
initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter.

Parameters:
  • metadata – A dictionary containing additional metadata set by the user.
  • asset_dir – The directory where assets can be written. If None, no assets are returned.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

export_assets(asset_dir, asset_prefix='')[source]

Exports assets used by this inputter.

Parameters:
  • asset_dir – The directory where assets can be written.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

make_inference_dataset(features_file, batch_size, bucket_width=None, num_threads=1, prefetch_buffer_size=None)[source]

Builds a dataset to be used for inference.

For evaluation and training datasets, see opennmt.inputters.inputter.ExampleInputter.

Parameters:
  • features_file – The test file.
  • batch_size – The batch size to use.
  • bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
  • num_threads – The number of elements processed in parallel.
  • prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.
Returns:

A tf.data.Dataset.

make_evaluation_dataset(features_file, labels_file, batch_size, num_threads=1, prefetch_buffer_size=None)[source]

Builds a dataset to be used for evaluation.

Parameters:
  • features_file – The evaluation source file.
  • labels_file – The evaluation target file.
  • batch_size – The batch size to use.
  • num_threads – The number of elements processed in parallel.
  • prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.
Returns:

A tf.data.Dataset.
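
Assuming example_inputter is an initialized ExampleInputter (as sketched above), building the evaluation dataset is a single call; each dataset element is expected to be a (features, labels) pair of tensor dictionaries:

    eval_dataset = example_inputter.make_evaluation_dataset(
        "valid-src.txt", "valid-tgt.txt", batch_size=32)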

make_training_dataset(features_file, labels_file, batch_size, batch_type='examples', batch_multiplier=1, batch_size_multiple=1, shuffle_buffer_size=None, bucket_width=None, maximum_features_length=None, maximum_labels_length=None, single_pass=False, num_shards=1, shard_index=0, num_threads=4, prefetch_buffer_size=None)[source]

Builds a dataset to be used for training. It supports the full training pipeline, including:

  • sharding
  • shuffling
  • filtering
  • bucketing
  • prefetching
Parameters:
  • features_file – The training source file.
  • labels_file – The training target file.
  • batch_size – The batch size to use.
  • batch_type – The training batching strategy to use: can be “examples” or “tokens”.
  • batch_multiplier – The batch size multiplier to prepare for splitting across replicated graph parts.
  • batch_size_multiple – When batch_type is “tokens”, ensure that the result batch size is a multiple of this value.
  • shuffle_buffer_size – The number of elements from which to sample.
  • bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
  • maximum_features_length – The maximum length or list of maximum lengths of the features sequence(s). None to not constrain the length.
  • maximum_labels_length – The maximum length of the labels sequence. None to not constrain the length.
  • single_pass – If True, makes a single pass over the training data.
  • num_shards – The number of data shards (usually the number of workers in a distributed setting).
  • shard_index – The shard index this data pipeline should read from.
  • num_threads – The number of elements processed in parallel.
  • prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.
Returns:

A tf.data.Dataset.
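
A hedged sketch of a token-based training pipeline, reusing the example_inputter from above (file names and hyperparameter values are placeholders):

    train_dataset = example_inputter.make_training_dataset(
        "train-src.txt", "train-tgt.txt",
        batch_size=4096,
        batch_type="tokens",          # batch by token count instead of example count
        batch_size_multiple=8,        # keep batch sizes a multiple of 8
        shuffle_buffer_size=500000,   # shuffling pool size, in examples
        bucket_width=1,               # group sequences of similar length
        maximum_features_length=100,  # filter out longer source sequences
        maximum_labels_length=100,    # filter out longer target sequences
        num_shards=1,
        shard_index=0)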