Data Loaders¶

Data Iterator¶

class onmt.inputters.DynamicDatasetIter(corpora, corpora_info, transforms, vocabs, task, batch_type, batch_size, batch_size_multiple, data_type='text', bucket_size=2048, bucket_size_init=-1, bucket_size_increment=0, copy=False, device=device(type='cpu'), skip_empty_level='warning', stride=1, offset=0)[source]¶

Bases: IterableDataset

Yield batches from one or more plain-text corpora.

Parameters:

corpora (dict[str, ParallelCorpus]) – collections of corpora to iterate;
corpora_info (dict[str, dict]) – corpus information corresponding to the entries of corpora;
transforms (dict[str, Transform]) – transforms that may be applied to the corpora;
vocabs (dict[str, Vocab]) – vocab dict used to convert corpora into tensors;
task (str) – CorpusTask.TRAIN/VALID/INFER;
batch_type (str) – batching type to count on, choices=[tokens, sents];
batch_size (int) – number of examples in a batch;
batch_size_multiple (int) – make the batch size a multiple of this;
data_type (str) – input data type, currently only text;
bucket_size (int) – accumulate this number of examples in a dynamic dataset;
bucket_size_init (int) – initialize the bucket with this amount of examples;
bucket_size_increment (int) – increment the bucket size with this amount of examples;
copy (bool) – if True, add the specific items needed for copy_attn;
skip_empty_level (str) – security level when encountering an empty line;
stride (int) – iterate data files with this stride;
offset (int) – iterate data files with this offset.

Variables:

sort_key (function) – function that defines how to sort examples;
mixer (MixingStrategy) – the strategy to iterate corpora.
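
The interplay of bucket_size, bucket_size_init and bucket_size_increment can be pictured with a minimal sketch. It assumes that a positive bucket_size_init sets the size of the first bucket and that each later bucket grows by bucket_size_increment until bucket_size is reached. The helper name iter_buckets is hypothetical, and the code is an illustration under those assumptions, not the library's implementation.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_buckets(
    examples: Iterable[T],
    bucket_size: int = 2048,
    bucket_size_init: int = -1,
    bucket_size_increment: int = 0,
) -> Iterator[List[T]]:
    """Group examples into buckets whose size may grow (hypothetical helper)."""
    # Assumption: a non-positive bucket_size_init means "start at full size".
    current = bucket_size_init if bucket_size_init > 0 else bucket_size
    bucket: List[T] = []
    for example in examples:
        bucket.append(example)
        if len(bucket) >= current:
            yield bucket
            bucket = []
            # Assumption: grow by the increment, capped at bucket_size.
            current = min(bucket_size, current + bucket_size_increment)
    if bucket:  # flush the final, possibly smaller, bucket
        yield bucket


sizes = [
    len(b)
    for b in iter_buckets(
        range(20), bucket_size=8, bucket_size_init=2, bucket_size_increment=3
    )
]
print(sizes)  # [2, 5, 8, 5]
```

Starting with a small bucket lets the first batches be produced quickly, while later buckets hold more examples to sort and batch together.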

batch_iter(data, batch_size, batch_type='sents', batch_size_multiple=1)[source]¶: Yield elements from data in chunks of batch_size, where each chunk size is a multiple of batch_size_multiple.
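
As a rough illustration of chunking with a size multiple, the sketch below covers only the batch_type='sents' case. It assumes that batch_size is rounded down to the nearest multiple of batch_size_multiple; the helper name chunk_by_sents is hypothetical, and the actual batch_iter may handle rounding and token-based counting differently.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_by_sents(
    data: Iterable[T], batch_size: int, batch_size_multiple: int = 1
) -> Iterator[List[T]]:
    """Yield lists of examples, counting sentences only (hypothetical helper)."""
    # Assumption: round batch_size down to a multiple of batch_size_multiple,
    # but never below one multiple.
    target = max(batch_size_multiple, batch_size - batch_size % batch_size_multiple)
    batch: List[T] = []
    for example in data:
        batch.append(example)
        if len(batch) == target:
            yield batch
            batch = []
    if batch:  # the last chunk may be shorter when the data runs out
        yield batch


batch_lengths = [
    len(b) for b in chunk_by_sents(range(23), batch_size=10, batch_size_multiple=4)
]
print(batch_lengths)  # [8, 8, 7]
```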

classmethod from_opt(corpora, transforms, vocabs, opt, task, copy, device, stride=1, offset=0)[source]¶: Initialize DynamicDatasetIter with options parsed from opt.

class onmt.inputters.MixingStrategy(iterables, weights)[source]¶

Bases: object

Mixing strategy that should be used in Data Iterator.

class onmt.inputters.SequentialMixer(iterables, weights)[source]¶

Bases: MixingStrategy

Generate data sequentially from iterables that are exhaustible.
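
A minimal sketch of sequential mixing follows. It assumes that iterables and weights are dicts keyed by corpus name and that each weight is an integer repeat count; both are assumptions made for illustration, and the function name sequential_mix is hypothetical.

```python
from typing import Dict, Iterable, Iterator, TypeVar

T = TypeVar("T")


def sequential_mix(
    iterables: Dict[str, Iterable[T]], weights: Dict[str, int]
) -> Iterator[T]:
    """Exhaust each corpus in turn, then stop (hypothetical helper)."""
    for name, weight in weights.items():
        # Assumption: the weight is the number of passes over that corpus.
        for _ in range(weight):
            # Needs a re-iterable source (e.g. a list), not a one-shot generator.
            yield from iterables[name]


mixed = list(sequential_mix({"a": [1, 2], "b": [3]}, {"a": 1, "b": 2}))
print(mixed)  # [1, 2, 3, 3]
```

Because every source is exhaustible, the mixed stream ends, which suits validation or inference where each example should be seen a fixed number of times.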

class onmt.inputters.WeightedMixer(iterables, weights)[source]¶

Bases: MixingStrategy

A mixing strategy that mixes data according to the given weights and iterates infinitely.
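
The sketch below shows one way a weighted, never-ending mix could work. It assumes dict inputs keyed by corpus name, integer weights read as "take this many examples per round", and non-empty, re-iterable sources that are restarted once exhausted. The function name weighted_mix is hypothetical, and this is not presented as the library's implementation.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, TypeVar

T = TypeVar("T")


def weighted_mix(
    iterables: Dict[str, Iterable[T]], weights: Dict[str, int]
) -> Iterator[T]:
    """Interleave corpora by weight, restarting exhausted ones (hypothetical helper)."""
    iterators = {name: iter(source) for name, source in iterables.items()}
    while True:
        for name, weight in weights.items():
            for _ in range(weight):
                try:
                    yield next(iterators[name])
                except StopIteration:
                    # Assumption: sources are non-empty and re-iterable, so an
                    # exhausted corpus is simply reopened from the start.
                    iterators[name] = iter(iterables[name])
                    yield next(iterators[name])


stream = weighted_mix({"a": [1, 2], "b": [10]}, {"a": 2, "b": 1})
first_seven = list(islice(stream, 7))
print(first_seven)  # [1, 2, 10, 1, 2, 10, 1]
```

Since the stream never ends, the consumer decides when to stop, for example after a fixed number of training steps.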

Dataset¶

class onmt.inputters.ParallelCorpus(name, src, tgt, align=None, n_src_feats=0, src_feats_defaults=None)[source]¶

Bases: object

A pair of parallel corpus files that can be loaded and iterated over.

load(offset=0, stride=1)[source]¶: Load the files and iterate over them line by line. offset and stride restrict iteration to every stride-th example, starting from offset.
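
One plausible reading of offset and stride is sketched below with itertools.islice: keep the examples whose index is offset, offset + stride, offset + 2 * stride, and so on. The helper name strided is hypothetical, and the exact selection rule used by the library may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def strided(lines: Iterable[T], offset: int = 0, stride: int = 1) -> Iterator[T]:
    """Keep every stride-th item, starting at index offset (hypothetical helper)."""
    return islice(lines, offset, None, stride)


lines = [f"line {i}" for i in range(10)]
# Three readers with offsets 0, 1, 2 and a shared stride of 3 split the data
# into disjoint shards that together cover every line.
shards = [list(strided(lines, offset=k, stride=3)) for k in range(3)]
print(shards[1])  # ['line 1', 'line 4', 'line 7']
```

Giving each of several parallel readers its own offset with a shared stride is a common way to avoid reading the same example twice.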

class onmt.inputters.ParallelCorpusIterator(corpus, transform, skip_empty_level='warning', stride=1, offset=0)[source]¶

Bases: object

An iterator dedicated to ParallelCorpus.

Parameters:

corpus (ParallelCorpus) – corpus to iterate;
transform (TransformPipe) – transforms to be applied to corpus;
skip_empty_level (str) – security level when encountering an empty line;
stride (int) – iterate corpus with this line stride;
offset (int) – iterate corpus with this line offset.