Data Loaders

Data Iterator

class onmt.inputters.DynamicDatasetIter(corpora, corpora_info, transforms, vocabs, task, batch_type, batch_size, batch_size_multiple, data_type='text', bucket_size=2048, bucket_size_init=-1, bucket_size_increment=0, copy=False, device=device(type='cpu'), skip_empty_level='warning', stride=1, offset=0)[source]

Bases: IterableDataset

Yield batches from one or more plain-text corpora.

Parameters:
  • corpora (dict[str, ParallelCorpus]) – collections of corpora to iterate;

  • corpora_info (dict[str, dict]) – per-corpus information, keyed like corpora;

  • transforms (dict[str, Transform]) – transforms that may be applied to the corpora;

  • vocabs (dict[str, Vocab]) – vocab dict used to convert corpora into Tensors;

  • task (str) – CorpusTask.TRAIN/VALID/INFER;

  • batch_type (str) – unit in which the batch size is counted, choices=[tokens, sents];

  • batch_size (int) – size of a batch, counted in the unit given by batch_type;

  • batch_size_multiple (int) – make the batch size a multiple of this;

  • data_type (str) – input data type, currently only text;

  • bucket_size (int) – accumulate this number of examples in a dynamic dataset;

  • bucket_size_init (int) – initialize the bucket with this amount of examples;

  • bucket_size_increment (int) – increment the bucket size with this amount of examples;

  • copy (bool) – if True, add the specific items needed for copy_attn;

  • skip_empty_level (str) – severity level used when an empty line is encountered;

  • stride (int) – iterate data files with this stride;

  • offset (int) – iterate data files with this offset.

Variables:
  • sort_key (function) – function that defines how to sort examples;

  • mixer (MixingStrategy) – the strategy to iterate corpora.

batch_iter(data, batch_size, batch_type='sents', batch_size_multiple=1)[source]

Yield elements from data in chunks of batch_size, where each chunk size is a multiple of batch_size_multiple.
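
The chunking described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper name batch_by_sents is hypothetical, it covers only the batch_type='sents' case, and it assumes that the target size is rounded down to a multiple of batch_size_multiple and that a short final batch is emitted unchanged.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_by_sents(data: Iterable[T], batch_size: int,
                   batch_size_multiple: int = 1) -> Iterator[List[T]]:
    """Hypothetical helper: chunk `data` by example count."""
    # Assumption: round the target down to a multiple of
    # batch_size_multiple, but never below one multiple.
    target = max(batch_size - batch_size % batch_size_multiple,
                 batch_size_multiple)
    batch: List[T] = []
    for example in data:
        batch.append(example)
        if len(batch) == target:
            yield batch
            batch = []
    if batch:
        # Assumption: the last, possibly short, batch is emitted as-is.
        yield batch


batches = list(batch_by_sents(range(10), batch_size=5, batch_size_multiple=2))
sizes = [len(b) for b in batches]
```

With batch_size=5 and batch_size_multiple=2, the sketch targets 4 examples per batch. Counting tokens instead of sentences would need a running token total in place of len(batch).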

classmethod from_opt(corpora, transforms, vocabs, opt, task, copy, device, stride=1, offset=0)[source]

Initialize DynamicDatasetIter with options parsed from opt.

class onmt.inputters.MixingStrategy(iterables, weights)[source]

Bases: object

Mixing strategy that should be used in Data Iterator.

class onmt.inputters.SequentialMixer(iterables, weights)[source]

Bases: MixingStrategy

Generate data sequentially from the iterables; the resulting stream is exhaustible.
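
As a rough illustration of sequential mixing, the sketch below reads each corpus in turn and stops when all are consumed. The function sequential_mix is a hypothetical stand-in, not the class above, and treating an integer weight as a repeat count is an assumption made for the example.

```python
from typing import Dict, Iterable, Iterator


def sequential_mix(iterables: Dict[str, Iterable[str]],
                   weights: Dict[str, int]) -> Iterator[str]:
    """Hypothetical stand-in for a sequential mixing strategy."""
    for name, weight in weights.items():
        # Assumption: an integer weight means "read this corpus
        # `weight` times", so the iterable must be re-iterable.
        for _ in range(weight):
            yield from iterables[name]


corpora = {"a": ["a1", "a2"], "b": ["b1"]}
mixed = list(sequential_mix(corpora, {"a": 1, "b": 2}))
```

Because every corpus is read a finite number of times, the generator ends, which is the sense of "exhaustible" assumed here.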

class onmt.inputters.WeightedMixer(iterables, weights)[source]

Bases: MixingStrategy

A mixing strategy that mixes data according to the weights and iterates infinitely.
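
One way to picture weighted, infinite mixing is a round-robin loop that draws weight items from each corpus per round and restarts any corpus that runs out. The sketch below is only that picture: weighted_mix is a hypothetical function, and it assumes integer weights, at least one positive weight, and non-empty, re-iterable corpora.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def weighted_mix(iterables: Dict[str, Iterable[str]],
                 weights: Dict[str, int]) -> Iterator[str]:
    """Hypothetical stand-in for a weighted, never-ending mixing strategy."""
    iterators = {name: iter(it) for name, it in iterables.items()}
    while True:
        for name, weight in weights.items():
            # Draw `weight` items from this corpus in each round.
            for _ in range(weight):
                try:
                    yield next(iterators[name])
                except StopIteration:
                    # Restart an exhausted corpus so the stream never ends.
                    # Assumption: the corpus is non-empty and re-iterable.
                    iterators[name] = iter(iterables[name])
                    yield next(iterators[name])


corpora = {"a": ["a1", "a2", "a3"], "b": ["b1"]}
# The stream is infinite, so take a finite prefix with islice.
first_eight = list(islice(weighted_mix(corpora, {"a": 2, "b": 1}), 8))
```

Since the generator never terminates, a consumer has to bound it explicitly, for example by a step count as with islice above.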

Dataset

class onmt.inputters.ParallelCorpus(name, src, tgt, align=None, n_src_feats=0, src_feats_defaults=None)[source]

Bases: object

A parallel corpus file pair that can be loaded to iterate.

load(offset=0, stride=1)[source]

Load the file and iterate over its lines. offset and stride restrict iteration to every stride-th example, starting from offset.
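
The offset/stride selection can be sketched with itertools.islice. The helper strided is hypothetical and works on any iterable of lines rather than on a corpus file; splitting a corpus into disjoint shards, one per reader, is an assumed use case and not something the description above states.

```python
from itertools import islice
from typing import Iterable, Iterator


def strided(lines: Iterable[str], offset: int = 0,
            stride: int = 1) -> Iterator[str]:
    """Hypothetical helper: keep lines offset, offset + stride, ..."""
    return islice(lines, offset, None, stride)


lines = [f"line{i}" for i in range(7)]
# Three readers with stride=3 and offsets 0, 1, 2 see disjoint shards.
shard0 = list(strided(lines, offset=0, stride=3))
shard1 = list(strided(lines, offset=1, stride=3))
shard2 = list(strided(lines, offset=2, stride=3))
```

With the defaults offset=0 and stride=1, every line is kept.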

class onmt.inputters.ParallelCorpusIterator(corpus, transform, skip_empty_level='warning', stride=1, offset=0)[source]

Bases: object

An iterator dedicated to ParallelCorpus.

Parameters:
  • corpus (ParallelCorpus) – corpus to iterate;

  • transform (TransformPipe) – transforms to be applied to corpus;

  • skip_empty_level (str) – severity level used when an empty line is encountered;

  • stride (int) – iterate corpus with this line stride;

  • offset (int) – iterate corpus with this line offset.