Data Loaders

Data Readers

exception onmt.inputters.datareader_base.MissingDependencyException

Bases: Exception

class onmt.inputters.DataReaderBase

Bases: object

Read data from the file system and yield it as dicts.


Raises: onmt.inputters.datareader_base.MissingDependencyException – Several DataReaders need additional packages. This exception is raised if any of them are missing.

classmethod from_opt(opt)

Alternative constructor.


Parameters: opt (argparse.Namespace) – The parsed arguments.

read(data, side)

Read data from the file system and yield it as dicts.
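The reader contract described above (an alternative constructor, a dependency check, and a read() generator that yields dicts) can be sketched without importing the library. Everything in this block is a hypothetical stand-in written for illustration: SketchReaderBase, LineReader, and the required_packages attribute are assumptions, not the library's actual classes or attributes.

```python
import importlib.util


class MissingDependencyException(Exception):
    """Local stand-in for the exception documented above."""


class SketchReaderBase:
    """Hypothetical stand-in for a reader base class (not the library class)."""

    # Top-level names of optional packages this reader needs (assumed attribute).
    required_packages = ()

    def __init__(self):
        # Fail early if an optional package cannot be found.
        missing = [name for name in self.required_packages
                   if importlib.util.find_spec(name) is None]
        if missing:
            raise MissingDependencyException(
                "Missing packages: " + ", ".join(missing))

    @classmethod
    def from_opt(cls, opt):
        """Alternative constructor: build a reader from parsed arguments."""
        # This sketch ignores opt; a real reader might pull settings from it.
        return cls()

    def read(self, data, side):
        """Yield one dict per item; subclasses define the details."""
        raise NotImplementedError


class LineReader(SketchReaderBase):
    """Hypothetical reader that yields each item under the key `side`."""

    def read(self, data, side):
        for item in data:
            yield {side: item}
```

A subclass that lists an unavailable package in required_packages would raise MissingDependencyException at construction time, which is one plausible place to surface the error before any data is read.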

class onmt.inputters.TextDataReader

Bases: onmt.inputters.datareader_base.DataReaderBase

read(sequences, side)

Read text data from disk.

Parameters:

  • sequences (str or Iterable[str]) – Path to a text file, or an iterable of the actual text data.

  • side (str) – Prefix used in the returned dict. Usually "src" or "tgt".


Yields: Dictionaries whose keys are the names of fields and whose values are approximately the result of tokenizing with those fields.
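As a rough illustration of the shape of the yielded dicts, the following self-contained sketch mimics the documented behavior: a string is treated as a file path, any other iterable as the text itself, and each item is yielded under the key given by side. The function name read_text and the extra "indices" key are assumptions made for this sketch; they are not confirmed by the text above.

```python
def read_text(sequences, side):
    """Yield one dict per sequence, keyed by `side` (illustrative stand-in)."""
    if isinstance(sequences, str):
        # A string is interpreted as a path to a UTF-8 text file.
        with open(sequences, encoding="utf-8") as handle:
            sequences = [line.rstrip("\n") for line in handle]
    for i, seq in enumerate(sequences):
        if isinstance(seq, bytes):
            seq = seq.decode("utf-8")
        # "indices" (the item's position) is an assumed extra key.
        yield {side: seq, "indices": i}


examples = list(read_text(["hello world", "good morning"], "src"))
```

Here examples holds one dict per input line, each with the raw text under "src"; tokenization would be applied later by the fields.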


class onmt.inputters.Dataset(fields, readers, data, sort_key, filter_pred=None)


Contain data and process it.

A dataset is an object that accepts sequences of raw data (sentence pairs in the case of machine translation) and fields that describe how this raw data should be processed to produce tensors. When a dataset is instantiated, it applies the fields’ preprocessing pipeline (but not the steps that numericalize the data or turn it into batch tensors) to the raw data, producing a list of Example objects. torchtext’s iterators then know how to use these examples to make batches.

Parameters:

  • fields (dict[str, Field]) – A dict with the structure returned by onmt.inputters.get_fields(). Keys are usually the dataset side, "src" or "tgt", and match the keys of the items yielded by the readers; values are lists of (name, Field) pairs. An attribute with each name is created on every Example object, and its value is the result of applying the Field to the data that matches the key. Having a sequence of fields for each piece of raw input lets the dataset store multiple “views” of each input, which makes it easy to implement token-level features, mixed word- and character-level models, and so on. (See also onmt.inputters.TextMultiField.)

  • readers (Iterable[onmt.inputters.DataReaderBase]) – Reader objects for disk-to-dict. The yielded dicts are then processed according to fields.

  • data (Iterable[Tuple[str, Any]]) – (name, data_arg) pairs where data_arg is passed to the read() method of the reader in readers at that position. (See the reader object for details on the Any type.)

  • sort_key (Callable[[Example], Any]) – A function that returns the value on which examples are sorted (e.g. length).

  • filter_pred (Callable[[Example], bool]) – A function that accepts an Example object and returns a boolean indicating whether to include that example in the dataset.
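The interplay of these parameters can be sketched with plain Python objects. This is a simplified illustration under stated assumptions, not the library's implementation: build_examples is a hypothetical helper, ordinary callables stand in for Field preprocessing, and SimpleNamespace stands in for Example objects.

```python
from types import SimpleNamespace


def chars(text):
    """Character-level view: every non-space character as its own token."""
    return [c for c in text if c != " "]


def build_examples(records, fields, filter_pred=None):
    """Apply each (name, preprocess) pair to the record entry with that key."""
    examples = []
    for record in records:
        ex = SimpleNamespace()
        for key, views in fields.items():
            for name, preprocess in views:
                # One attribute per view, named after the (name, ...) pair.
                setattr(ex, name, preprocess(record[key]))
        if filter_pred is None or filter_pred(ex):
            examples.append(ex)
    return examples


# Two views of the source side (word-level and character-level), one of the target.
fields = {
    "src": [("src", str.split), ("src_char", chars)],
    "tgt": [("tgt", str.split)],
}
records = [
    {"src": "a b c", "tgt": "x y z"},
    {"src": "a", "tgt": "x"},
    {"src": "a b c d e f", "tgt": "u v w x y z"},
]


def sort_key(ex):
    return len(ex.src)  # sort by source length


def filter_pred(ex):
    return len(ex.src) <= 4  # drop overly long source sentences


examples = sorted(build_examples(records, fields, filter_pred), key=sort_key)
```

In this sketch the six-token record is filtered out and the remaining two examples are ordered by source length, each carrying both the word-level src view and the character-level src_char view.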


Variables: src_vocabs (List[]) – Used with dynamic dict/copy attention. Each source example has its own very short vocab that contains just that example's source words, so that, for example, the generator can predict copying them.
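The idea of one very short vocab per source example can be illustrated as follows. This is a minimal sketch under assumptions: build_src_vocab is a hypothetical helper, the special tokens and first-occurrence ordering are choices made for this example, and the src_maps list of per-example token indices is illustrative rather than a documented attribute.

```python
def build_src_vocab(src_tokens, specials=("<unk>", "<pad>")):
    """Build a tiny vocab holding only the tokens of one source example."""
    itos = list(specials)
    for tok in src_tokens:
        if tok not in itos:
            itos.append(tok)  # keep first-occurrence order
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


src_examples = [["the", "cat", "sat"], ["the", "the", "dog"]]

# One small vocab per source example.
src_vocabs = [build_src_vocab(tokens) for tokens in src_examples]

# Each source token mapped to its index in that example's own vocab.
src_maps = [
    [stoi[tok] for tok in tokens]
    for tokens, (_, stoi) in zip(src_examples, src_vocabs)
]
```

Because each vocab covers only one sentence, a copy mechanism can score "copy source token k" over a handful of entries instead of the full target vocabulary.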