opennmt.inputters.inputter module
Define generic inputters.
class opennmt.inputters.inputter.Inputter(dtype=tf.float32)

Bases: object

Base class for inputters.
num_outputs

How many parallel outputs does this inputter produce.
initialize(metadata, asset_dir=None, asset_prefix='')

Initializes the inputter within the current graph.

For example, one can create lookup tables in this method so that their initializers are added to the TABLE_INITIALIZERS collection of the current graph.

Parameters:
- metadata – A dictionary containing additional metadata set by the user.
- asset_dir – The directory where assets can be written. If None, no assets are returned.
- asset_prefix – The prefix to attach to asset filenames.

Returns: A dictionary containing additional assets used by the inputter.
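A minimal sketch of such an override, assuming TensorFlow 1.x and a hypothetical user-defined metadata key "vocabulary" pointing to a vocabulary file (the key is illustrative, not part of the API):

    import tensorflow as tf
    from opennmt.inputters.inputter import Inputter

    class MyInputter(Inputter):

        def initialize(self, metadata, asset_dir=None, asset_prefix=""):
            # "vocabulary" is a hypothetical user-defined metadata key.
            self.vocabulary_file = metadata["vocabulary"]
            # The table initializer is automatically added to the
            # TABLE_INITIALIZERS collection of the current graph.
            self.vocabulary = tf.contrib.lookup.index_table_from_file(
                self.vocabulary_file, num_oov_buckets=1)
            # No additional assets to return in this sketch.
            return {}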
make_dataset(data_file, training=None)

Creates the base dataset required by this inputter.

Parameters:
- data_file – The data file.
- training – Run in training mode.

Returns: A tf.data.Dataset.
make_inference_dataset(features_file, batch_size, bucket_width=None, num_threads=1, prefetch_buffer_size=None)

Builds a dataset to be used for inference.

For evaluation and training datasets, see opennmt.inputters.inputter.ExampleInputter.

Parameters:
- features_file – The test file.
- batch_size – The batch size to use.
- bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
- num_threads – The number of elements processed in parallel.
- prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.

Returns: A tf.data.Dataset.
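A usage sketch with a concrete inputter; opennmt.inputters.text_inputter.WordEmbedder is a real subclass, but the vocabulary key and file names below are placeholders:

    from opennmt.inputters.text_inputter import WordEmbedder

    inputter = WordEmbedder("source_words_vocabulary", embedding_size=512)
    inputter.initialize({"source_words_vocabulary": "vocab-src.txt"})
    dataset = inputter.make_inference_dataset(
        "test-src.txt", batch_size=32, bucket_width=5)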
get_dataset_size(data_file)

Returns the size of the dataset.

Parameters:
- data_file – The data file.

Returns: The total size.
get_serving_input_receiver()

Returns a serving input receiver for this inputter.

Returns: A tf.estimator.export.ServingInputReceiver.
make_features(element=None, features=None, training=None)

Creates features from data.

Parameters:
- element – An element from the dataset.
- features – An optional dictionary of features to augment.
- training – Run in training mode.

Returns: A dictionary of tf.Tensor.
make_inputs(features, training=None)

Creates the model input from the features.

Parameters:
- features – A dictionary of tf.Tensor.
- training – Run in training mode.

Returns: The model input.
visualize(log_dir)

Visualizes the transformation, usually embeddings.

Parameters:
- log_dir – The active log directory.
set_data_field(data, key, value, volatile=False)

Sets a data field.

Parameters:
- data – The data dictionary.
- key – The value key.
- value – The value to assign.
- volatile – If True, the key/value pair will be removed once the processing is done.

Returns: The updated data dictionary.
remove_data_field(data, key)

Removes a data field.

Parameters:
- data – The data dictionary.
- key – The value key.

Returns: The updated data dictionary.
add_process_hooks(hooks)

Adds processing hooks.

Processing hooks are additional, model specific data processing functions applied after calling this inputter's opennmt.inputters.inputter.Inputter.process() function.

Parameters:
- hooks – A list of callables with the signature (inputter, data) -> data.
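A sketch of a hook matching this signature; the "tokens" and "length" field names are assumptions about what process() produced for a text inputter:

    import tensorflow as tf

    def cap_length(inputter, data):
        # Hypothetical hook: truncate the assumed "tokens" field to 100
        # tokens and update the assumed "length" field accordingly.
        data = inputter.set_data_field(data, "tokens", data["tokens"][:100])
        data = inputter.set_data_field(data, "length", tf.minimum(data["length"], 100))
        return data

    # Register on an existing inputter instance.
    inputter.add_process_hooks([cap_length])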
process(data, training=None)

Prepares raw data.

Parameters:
- data – The raw data.
- training – Run in training mode.

Returns: A dictionary of tf.Tensor.
transform_data(data, mode='train', log_dir=None)

Transforms the processed data to an input.

This is usually a simple forward of a data field to opennmt.inputters.inputter.Inputter.transform().

See also: process.

Parameters:
- data – A dictionary of data fields.
- mode – A tf.estimator.ModeKeys mode.
- log_dir – The log directory. If set, visualization will be set up.

Returns: The transformed input.
class opennmt.inputters.inputter.MultiInputter(inputters, reducer=None)

Bases: opennmt.inputters.inputter.Inputter

An inputter that gathers multiple inputters.
num_outputs

How many parallel outputs does this inputter produce.
initialize(metadata, asset_dir=None, asset_prefix='')

Initializes the inputter within the current graph.

For example, one can create lookup tables in this method so that their initializers are added to the TABLE_INITIALIZERS collection of the current graph.

Parameters:
- metadata – A dictionary containing additional metadata set by the user.
- asset_dir – The directory where assets can be written. If None, no assets are returned.
- asset_prefix – The prefix to attach to asset filenames.

Returns: A dictionary containing additional assets used by the inputter.
make_dataset(data_file, training=None)

Creates the base dataset required by this inputter.

Parameters:
- data_file – The data file.
- training – Run in training mode.

Returns: A tf.data.Dataset.
class opennmt.inputters.inputter.ParallelInputter(inputters, reducer=None, share_parameters=False, combine_features=True)

Bases: opennmt.inputters.inputter.MultiInputter

A multi inputter that processes parallel data.
__init__(inputters, reducer=None, share_parameters=False, combine_features=True)

Initializes a parallel inputter.

Parameters:
- inputters – A list of opennmt.inputters.inputter.Inputter.
- reducer – An opennmt.layers.reducer.Reducer to merge all inputs. If set, parallel inputs are assumed to have the same length.
- share_parameters – Share the inputters' parameters.
- combine_features – Combine each inputter's features in a single dict or return them separately. This is typically True for multi-source inputs but False for features/labels parallel data.
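A construction sketch for a two-source input whose embeddings are concatenated; the vocabulary keys are placeholders that must match entries in the user metadata:

    from opennmt.inputters.inputter import ParallelInputter
    from opennmt.inputters.text_inputter import WordEmbedder
    from opennmt.layers.reducer import ConcatReducer

    source_inputter = ParallelInputter(
        [WordEmbedder("source_1_vocabulary", embedding_size=256),
         WordEmbedder("source_2_vocabulary", embedding_size=256)],
        reducer=ConcatReducer())  # with a reducer, both sources must have the same length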
make_dataset(data_file, training=None)

Creates the base dataset required by this inputter.

Parameters:
- data_file – The data file.
- training – Run in training mode.

Returns: A tf.data.Dataset.
get_dataset_size(data_file)

Returns the size of the dataset.

Parameters:
- data_file – The data file.

Returns: The total size.
class opennmt.inputters.inputter.MixedInputter(inputters, reducer=<opennmt.layers.reducer.ConcatReducer object>, dropout=0.0)

Bases: opennmt.inputters.inputter.MultiInputter

A multi inputter that applies several transformations to the same data.
__init__(inputters, reducer=<opennmt.layers.reducer.ConcatReducer object>, dropout=0.0)

Initializes a mixed inputter.

Parameters:
- inputters – A list of opennmt.inputters.inputter.Inputter.
- reducer – An opennmt.layers.reducer.Reducer to merge all inputs.
- dropout – The probability to drop units in the merged inputs.
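A construction sketch mixing word-level and character-level embeddings of the same text; the vocabulary keys and hyperparameter values below are placeholders:

    from opennmt.inputters.inputter import MixedInputter
    from opennmt.inputters.text_inputter import WordEmbedder, CharConvEmbedder
    from opennmt.layers.reducer import ConcatReducer

    inputter = MixedInputter(
        [WordEmbedder("words_vocabulary", embedding_size=256),
         CharConvEmbedder("chars_vocabulary", embedding_size=30, num_outputs=40)],
        reducer=ConcatReducer(),  # concatenate word and character representations
        dropout=0.3)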
make_dataset(data_file, training=None)

Creates the base dataset required by this inputter.

Parameters:
- data_file – The data file.
- training – Run in training mode.

Returns: A tf.data.Dataset.
get_dataset_size(data_file)

Returns the size of the dataset.

Parameters:
- data_file – The data file.

Returns: The total size.
class opennmt.inputters.inputter.ExampleInputter(features_inputter, labels_inputter, share_parameters=False)

Bases: opennmt.inputters.inputter.ParallelInputter

An inputter that returns training examples (parallel features and labels).
__init__(features_inputter, labels_inputter, share_parameters=False)

Initializes this inputter.

Parameters:
- features_inputter – An inputter producing the features (source).
- labels_inputter – An inputter producing the labels (target).
- share_parameters – Share the inputters' parameters.
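A construction sketch with word-level source and target inputters; the vocabulary keys and file names are placeholders:

    from opennmt.inputters.inputter import ExampleInputter
    from opennmt.inputters.text_inputter import WordEmbedder

    inputter = ExampleInputter(
        WordEmbedder("source_words_vocabulary", embedding_size=512),
        WordEmbedder("target_words_vocabulary", embedding_size=512))
    inputter.initialize({
        "source_words_vocabulary": "vocab-src.txt",
        "target_words_vocabulary": "vocab-tgt.txt"})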
initialize(metadata, asset_dir=None, asset_prefix='')

Initializes the inputter within the current graph.

For example, one can create lookup tables in this method so that their initializers are added to the TABLE_INITIALIZERS collection of the current graph.

Parameters:
- metadata – A dictionary containing additional metadata set by the user.
- asset_dir – The directory where assets can be written. If None, no assets are returned.
- asset_prefix – The prefix to attach to asset filenames.

Returns: A dictionary containing additional assets used by the inputter.
make_inference_dataset(features_file, batch_size, bucket_width=None, num_threads=1, prefetch_buffer_size=None)

Builds a dataset to be used for inference.

For evaluation and training datasets, see make_evaluation_dataset and make_training_dataset below.

Parameters:
- features_file – The test file.
- batch_size – The batch size to use.
- bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
- num_threads – The number of elements processed in parallel.
- prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.

Returns: A tf.data.Dataset.
make_evaluation_dataset(features_file, labels_file, batch_size, num_threads=1, prefetch_buffer_size=None)

Builds a dataset to be used for evaluation.

Parameters:
- features_file – The evaluation source file.
- labels_file – The evaluation target file.
- batch_size – The batch size to use.
- num_threads – The number of elements processed in parallel.
- prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.

Returns: A tf.data.Dataset.
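A usage sketch, reusing the initialized ExampleInputter from the example above; the file names are placeholders:

    eval_dataset = inputter.make_evaluation_dataset(
        "valid-src.txt", "valid-tgt.txt", batch_size=32)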
make_training_dataset(features_file, labels_file, batch_size, batch_type='examples', batch_multiplier=1, batch_size_multiple=1, shuffle_buffer_size=None, bucket_width=None, maximum_features_length=None, maximum_labels_length=None, single_pass=False, num_shards=1, shard_index=0, num_threads=4, prefetch_buffer_size=None)

Builds a dataset to be used for training. It supports the full training pipeline, including:

- sharding
- shuffling
- filtering
- bucketing
- prefetching

Parameters:
- features_file – The training source file.
- labels_file – The training target file.
- batch_size – The batch size to use.
- batch_type – The training batching strategy to use: can be "examples" or "tokens".
- batch_multiplier – The batch size multiplier to prepare splitting across replicated graph parts.
- batch_size_multiple – When batch_type is "tokens", ensure that the resulting batch size is a multiple of this value.
- shuffle_buffer_size – The number of elements from which to sample.
- bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
- maximum_features_length – The maximum length or list of maximum lengths of the features sequence(s). None to not constrain the length.
- maximum_labels_length – The maximum length of the labels sequence. None to not constrain the length.
- single_pass – If True, makes a single pass over the training data.
- num_shards – The number of data shards (usually the number of workers in a distributed setting).
- shard_index – The shard index this data pipeline should read from.
- num_threads – The number of elements processed in parallel.
- prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.

Returns: A tf.data.Dataset.
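A usage sketch for token-based training batches, continuing the ExampleInputter example above; all file names and values are placeholders:

    train_dataset = inputter.make_training_dataset(
        "train-src.txt",
        "train-tgt.txt",
        batch_size=4096,
        batch_type="tokens",          # batch by token count, not example count
        bucket_width=1,               # group sequences of similar length
        shuffle_buffer_size=500000,
        maximum_features_length=100,
        maximum_labels_length=100)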