ExampleInputterAdapter
- class opennmt.inputters.ExampleInputterAdapter
Extends an inputter with methods to build evaluation and training datasets.
Inherits from:
builtins.object
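In practice these methods are reached through an inputter class that mixes in this adapter, such as opennmt.inputters.ExampleInputter. The following is a minimal sketch of building one; the embedding sizes and vocabulary paths are illustrative assumptions, not values from this reference.

```python
import opennmt

# Compose source and target WordEmbedder inputters into an ExampleInputter,
# which inherits the dataset-building methods documented below.
inputter = opennmt.inputters.ExampleInputter(
    opennmt.inputters.WordEmbedder(embedding_size=512),
    opennmt.inputters.WordEmbedder(embedding_size=512),
)

# Vocabulary paths are hypothetical; adapt them to your data configuration.
inputter.initialize(
    {
        "source_vocabulary": "src-vocab.txt",
        "target_vocabulary": "tgt-vocab.txt",
    }
)
```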
- make_evaluation_dataset(features_file, labels_file, batch_size, batch_type='examples', length_bucket_width=None, num_threads=1, prefetch_buffer_size=None)
Builds a dataset to be used for evaluation.
- Parameters
features_file – The evaluation source file.
labels_file – The evaluation target file.
batch_size – The batch size to use.
batch_type – The batching strategy to use: can be “examples” or “tokens”.
length_bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
num_threads – The number of elements processed in parallel.
prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value.
- Returns
A tf.data.Dataset.
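As a usage sketch, assuming the inputter built in the class example above and hypothetical evaluation files:

```python
# Build an evaluation dataset batched by number of examples.
dataset = inputter.make_evaluation_dataset(
    "valid.src",  # hypothetical evaluation source file
    "valid.tgt",  # hypothetical evaluation target file
    batch_size=32,
)

# Each element is a (features, labels) pair of tensor dictionaries; with
# WordEmbedder inputters the dictionaries include "tokens", "ids", and
# "length" entries.
for features, labels in dataset.take(1):
    print(features["length"], labels["length"])
```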
- make_training_dataset(features_file, labels_file, batch_size, batch_type='examples', batch_multiplier=1, batch_size_multiple=1, shuffle_buffer_size=None, length_bucket_width=None, pad_to_bucket_boundary=False, maximum_features_length=None, maximum_labels_length=None, single_pass=False, num_shards=1, shard_index=0, num_threads=4, prefetch_buffer_size=None, cardinality_multiple=1, weights=None, batch_autotune_mode=False)
Builds a dataset to be used for training. It supports the full training pipeline, including:
sharding
shuffling
filtering
bucketing
prefetching
- Parameters
features_file – The source file or a list of training source files.
labels_file – The target file or a list of training target files.
batch_size – The batch size to use.
batch_type – The training batching strategy to use: can be “examples” or “tokens”.
batch_multiplier – The batch size multiplier to prepare splitting across replicated graph parts.
batch_size_multiple – When batch_type is “tokens”, ensure that the resulting batch size is a multiple of this value.
shuffle_buffer_size – The number of elements from which to sample.
length_bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
pad_to_bucket_boundary – Pad each batch to the length bucket boundary.
maximum_features_length – The maximum length or list of maximum lengths of the features sequence(s). None to not constrain the length.
maximum_labels_length – The maximum length of the labels sequence. None to not constrain the length.
single_pass – If True, makes a single pass over the training data.
num_shards – The number of data shards (usually the number of workers in a distributed setting).
shard_index – The shard index this data pipeline should read from.
num_threads – The number of elements processed in parallel.
prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value.
cardinality_multiple – Ensure that the dataset cardinality is a multiple of this value when single_pass is True.
weights – An optional list of weights to create a weighted dataset out of multiple training files.
batch_autotune_mode – When enabled, all batches are padded to the maximum sequence length.
- Returns
A tf.data.Dataset.
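A corresponding training sketch, again assuming the inputter from the class example and hypothetical training files:

```python
# Build a shuffled, length-bucketed training dataset batched by token count.
dataset = inputter.make_training_dataset(
    "train.src",  # hypothetical training source file
    "train.tgt",  # hypothetical training target file
    batch_size=4096,
    batch_type="tokens",
    shuffle_buffer_size=500000,
    length_bucket_width=1,
    maximum_features_length=100,
    maximum_labels_length=100,
)
```

In a multi-worker setting, each worker would additionally pass num_shards and shard_index so that the pipelines read disjoint shards of the training files.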