Inputter
- class opennmt.inputters.Inputter(*args, **kwargs)[source]
Base class for inputters.
Inherits from:
keras.src.engine.base_layer.Layer
- property asset_prefix
The asset prefix is used to differentiate resources of parallel inputters. The most basic examples are the “source_” and “target_” prefixes.
When reading the data configuration, the inputter will read fields that start with this prefix (e.g. “source_vocabulary”).
Assets exported by this inputter start with this prefix.
- property num_outputs
The number of parallel outputs produced by this inputter.
- initialize(data_config)[source]
Initializes the inputter.
- Parameters
data_config – A dictionary containing the data configuration set by the user.
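For illustration, a minimal sketch of what this call could look like. The file names are assumptions; only the "source_" / "target_" prefix convention comes from the asset_prefix description above.

    # Hypothetical data configuration: an inputter whose asset_prefix is
    # "source_" reads the fields starting with "source_" (and similarly
    # for "target_").
    data_config = {
        "source_vocabulary": "src-vocab.txt",  # assumed file name
        "target_vocabulary": "tgt-vocab.txt",  # assumed file name
    }

    inputter.initialize(data_config)  # "inputter": any concrete Inputter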
- export_assets(asset_dir)[source]
Exports assets used by this inputter.
- Parameters
asset_dir – The directory where assets can be written.
- Returns
A dictionary containing additional assets used by the inputter.
- abstract make_dataset(data_file, training=None)[source]
Creates the base dataset required by this inputter.
- Parameters
data_file – The data file.
training – Run in training mode.
- Returns
A tf.data.Dataset instance or a list of tf.data.Dataset instances.
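As a hedged sketch (not the library's own implementation), a hypothetical line-based subclass could satisfy this contract with a tf.data.TextLineDataset. The class name and structure below are illustrative only.

    import tensorflow as tf
    import opennmt

    class MyLineInputter(opennmt.inputters.Inputter):
        """Hypothetical inputter reading one example per line of a text file."""

        def make_dataset(self, data_file, training=None):
            # One dataset element per line of the file.
            return tf.data.TextLineDataset(data_file)

        # The abstract make_features() must also be implemented;
        # a sketch follows its documentation below.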
- get_dataset_size(data_file)[source]
Returns the dataset size.
If the inputter can efficiently compute the dataset size from a training file on disk, it can optionally override this method. Otherwise, we may compute the size later with a generic and slower approach (iterating over the dataset instance).
- Parameters
data_file – The data file.
- Returns
The dataset size or None.
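Continuing the hypothetical MyLineInputter sketch, a line-based inputter could count lines cheaply instead of iterating over the dataset (an assumption, not the library's implementation):

    # Continuing the hypothetical MyLineInputter sketch:
    def get_dataset_size(self, data_file):
        # Cheap line count; returning None instead would trigger the
        # generic (slower) size computation over the dataset.
        with tf.io.gfile.GFile(data_file, mode="rb") as f:
            return sum(1 for _ in f)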
- make_inference_dataset(features_file, batch_size, batch_type='examples', length_bucket_width=None, num_threads=1, prefetch_buffer_size=None)[source]
Builds a dataset to be used for inference.
For evaluation and training datasets, see opennmt.inputters.ExampleInputter.
- Parameters
features_file – The test file.
batch_size – The batch size to use.
batch_type – The batching strategy to use: can be “examples” or “tokens”.
length_bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
num_threads – The number of elements processed in parallel.
prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value.
- Returns
A tf.data.Dataset.
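A hedged usage sketch, assuming inputter is an initialized instance of a concrete subclass and test.txt is a hypothetical features file:

    dataset = inputter.make_inference_dataset(
        "test.txt",             # hypothetical test file
        batch_size=32,
        batch_type="examples",
        length_bucket_width=5,  # bucket by length for efficiency
    )
    for features in dataset:    # each element is a batch of features
        ...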
- get_length(features, ignore_special_tokens=False)[source]
Returns the length of the input features, if defined.
- Parameters
features – The dictionary of input features.
ignore_special_tokens – Ignore special tokens that were added by the inputter (e.g. <s> and/or </s>).
- Returns
The length.
- get_padded_shapes(element_spec, maximum_length=None)[source]
Returns the padded shapes for dataset elements.
For example, this is used during batch size autotuning to pad all batches to the maximum sequence length.
- Parameters
element_spec – A nested structure of tf.TensorSpec.
maximum_length – Pad batches to this maximum length.
- Returns
A nested structure of tf.TensorShape.
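For example, the returned shapes could be fed to tf.data.Dataset.padded_batch. This pairing is an assumption about typical usage, not a documented recipe:

    shapes = inputter.get_padded_shapes(dataset.element_spec, maximum_length=64)
    dataset = dataset.padded_batch(32, padded_shapes=shapes)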
- has_prepare_step()[source]
Returns True if this inputter implements a data preparation step in the method opennmt.inputters.Inputter.prepare_elements().
- prepare_elements(elements, training=None)[source]
Prepares dataset elements.
This method is called on a batch of dataset elements. For example, it can be overridden to apply an external pre-tokenization.
Note that the results of the method are unbatched and then passed to the method opennmt.inputters.Inputter.make_features().
- Parameters
elements – A batch of dataset elements.
training – Run in training mode.
- Returns
A (possibly nested) structure of tf.Tensor.
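Continuing the hypothetical MyLineInputter sketch, a batched preparation step could look like this; lowercasing stands in for a real external pre-tokenization.

    # Continuing the hypothetical MyLineInputter sketch:
    def has_prepare_step(self):
        return True

    def prepare_elements(self, elements, training=None):
        # "elements" is a batch of raw lines; process them all at once.
        # The result is unbatched before make_features() is called.
        return tf.strings.lower(elements)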
- abstract make_features(element=None, features=None, training=None)[source]
Creates features from data.
This is typically called in a data pipeline (such as Dataset.map). Common transformations include tokenization, parsing, vocabulary lookup, etc.
This method accepts either a single element from the dataset or a partially built dictionary of features.
- Parameters
element – An element from the dataset returned by opennmt.inputters.Inputter.make_dataset().
features – An optional and possibly partial dictionary of features to augment.
training – Run in training mode.
- Returns
A dictionary of tf.Tensor.
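Continuing the hypothetical MyLineInputter sketch, a minimal implementation might tokenize on whitespace and record the sequence length. The "tokens" and "length" keys are illustrative; a real inputter would also perform the vocabulary lookup.

    # Continuing the hypothetical MyLineInputter sketch:
    def make_features(self, element=None, features=None, training=None):
        if features is None:
            features = {}
        if "tokens" in features:
            return features  # already built (partial features case)
        tokens = tf.strings.split(element)  # whitespace tokenization
        features["tokens"] = tokens
        features["length"] = tf.shape(tokens)[0]
        return features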
- keep_for_training(features, maximum_length=None)[source]
Returns True if this example should be kept for training.
- Parameters
features – A dictionary of tf.Tensor.
maximum_length – The maximum length used for training.
- Returns
A boolean.
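A hedged sketch of a length-based filter, assuming the "length" feature from the make_features() sketch above:

    # Continuing the hypothetical MyLineInputter sketch:
    def keep_for_training(self, features, maximum_length=None):
        length = features["length"]
        if maximum_length is None:
            return length > 0
        return tf.logical_and(length > 0, length <= maximum_length)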
- call(features, training=None)[source]
Creates the model input from the features (e.g. word embeddings).
- Parameters
features – A dictionary of tf.Tensor, the output of opennmt.inputters.Inputter.make_features().
training – Run in training mode.
- Returns
The model input.
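Since Inputter is a Keras layer, call() can consume weights created in build(). A minimal sketch, assuming make_features() produced integer ids under a hypothetical "ids" key and that self.vocabulary_size was set during initialize():

    # Continuing the hypothetical MyLineInputter sketch:
    def build(self, input_shape):
        # Hypothetical embedding matrix; self.vocabulary_size is assumed
        # to have been stored in initialize().
        self.embedding = self.add_weight(
            "embedding", shape=[self.vocabulary_size, 512]
        )
        super().build(input_shape)

    def call(self, features, training=None):
        # Map token ids to their word embeddings: the model input.
        return tf.nn.embedding_lookup(self.embedding, features["ids"])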