training_pipeline
- opennmt.data.training_pipeline(batch_size, batch_type='examples', batch_multiplier=1, batch_size_multiple=1, process_fn=None, transform_fns=None, length_bucket_width=None, pad_to_bucket_boundary=False, features_length_fn=None, labels_length_fn=None, maximum_features_length=None, maximum_labels_length=None, single_pass=False, num_shards=1, shard_index=0, num_threads=None, dataset_size=None, shuffle_buffer_size=None, prefetch_buffer_size=None, cardinality_multiple=1)
Transformation that applies most of the dataset operations commonly used for training on sequence data (a simplified sketch of these steps follows the list):
sharding
shuffling
processing
filtering
bucketization
batching
prefetching
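These steps are roughly equivalent to chaining standard tf.data operations. The sketch below is illustrative only: it is not the library's implementation, and the simplified_pipeline function, its parameters, and the omission of the bucketing logic are assumptions made for this example.

import tensorflow as tf

def simplified_pipeline(dataset,
                        process_fn,
                        batch_size=64,
                        num_shards=1,
                        shard_index=0,
                        shuffle_buffer_size=10000,
                        maximum_length=100):
    # Illustrative sketch of the steps listed above, for a dataset of
    # 1-D tensors; real bucketized batching is more involved.
    dataset = dataset.shard(num_shards, shard_index)    # sharding
    dataset = dataset.shuffle(shuffle_buffer_size)      # shuffling
    dataset = dataset.map(process_fn,                   # processing
                          num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.filter(                           # filtering
        lambda x: tf.shape(x)[0] <= maximum_length)
    dataset = dataset.padded_batch(batch_size)          # batching (bucketing omitted)
    return dataset.prefetch(tf.data.AUTOTUNE)           # prefetching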
Example
>>> dataset = dataset.apply(opennmt.data.training_pipeline(...))
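A fuller call might look like the following sketch. The toy dataset, the length functions, and all parameter values here are illustrative assumptions, not values prescribed by the library:

import tensorflow as tf
import opennmt

# Toy dataset of variable-length (features, labels) pairs of token IDs,
# made up for illustration.
def generate():
    for length in (5, 8, 13, 21):
        yield tf.range(length, dtype=tf.int64), tf.range(length + 1, dtype=tf.int64)

dataset = tf.data.Dataset.from_generator(
    generate,
    output_signature=(
        tf.TensorSpec([None], tf.int64),
        tf.TensorSpec([None], tf.int64),
    ),
)

dataset = dataset.apply(
    opennmt.data.training_pipeline(
        4096,
        batch_type="tokens",      # count the batch size in tokens, not examples
        batch_size_multiple=8,    # keep batch sizes hardware friendly
        length_bucket_width=1,    # batch together examples of similar lengths
        features_length_fn=lambda features: tf.shape(features)[0],
        labels_length_fn=lambda labels: tf.shape(labels)[0],
        maximum_features_length=100,
        maximum_labels_length=100,
        shuffle_buffer_size=1000,
    )
)

for features, labels in dataset.take(1):
    print(features.shape, labels.shape)  # padded, batched tensors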
- Parameters
batch_size – The batch size to use.
batch_type – The training batching strategy to use: can be “examples” or “tokens”.
batch_multiplier – The batch size multiplier.
batch_size_multiple – When batch_type is “tokens”, ensure that the resulting batch size is a multiple of this value.
process_fn – The processing function to apply on each element.
transform_fns – List of dataset transformation functions (applied after process_fn if defined).
length_bucket_width – The width of the length buckets to select batch candidates from. None to not constrain batch formation.
pad_to_bucket_boundary – Whether to pad each batch to the length bucket boundary.
features_length_fn – A function mapping features to a sequence length.
labels_length_fn – A function mapping labels to a sequence length.
maximum_features_length – The maximum length or list of maximum lengths of the features sequence(s). None to not constrain the length.
maximum_labels_length – The maximum length of the labels sequence. None to not constrain the length.
single_pass – If True, makes a single pass over the training data.
num_shards – The number of data shards (usually the number of workers in a distributed setting); see the sharding sketch below.
shard_index – The shard index this data pipeline should read from.
num_threads – The number of elements processed in parallel.
dataset_size – If the dataset size is already known, it can be passed here to avoid a slower generic computation of the dataset size later.
shuffle_buffer_size – The number of elements from which to sample.
prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value.
cardinality_multiple – Ensure that the dataset cardinality is a multiple of this value when single_pass is True; see the single-pass sketch at the end of this section.
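For num_shards and shard_index, a hypothetical two-worker setup could look like the sketch below (the make_pipeline helper and the values are assumptions for illustration): each worker applies the same pipeline but with its own shard index, so the workers read disjoint parts of the training data.

import opennmt

def make_pipeline(shard_index):
    # Hypothetical helper: the same pipeline on every worker, but each
    # worker passes its own shard index.
    return opennmt.data.training_pipeline(
        64,
        batch_type="examples",
        num_shards=2,             # total number of workers
        shard_index=shard_index,  # this worker's shard
        shuffle_buffer_size=10000,
    )

# On worker 0: dataset = dataset.apply(make_pipeline(0))
# On worker 1: dataset = dataset.apply(make_pipeline(1))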
- Returns
A tf.data.Dataset transformation.
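For single_pass and cardinality_multiple, here is a sketch of a hypothetical single-epoch run on 4 synchronized replicas; the replica count, the batch size, and the stated motivation are illustrative assumptions rather than documented behavior.

import opennmt

# Assumption: making the number of batches a multiple of the replica count
# helps keep synchronized replicas in step through the end of the epoch.
pipeline = opennmt.data.training_pipeline(
    64,
    single_pass=True,        # a single pass over the data, no repetition
    cardinality_multiple=4,  # assumed number of synchronized replicas
    shuffle_buffer_size=10000,
)
# dataset = dataset.apply(pipeline)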