Doc: Data Loaders

Datasets

class onmt.io.TextDataset(fields, src_examples_iter, tgt_examples_iter, num_src_feats=0, num_tgt_feats=0, src_seq_length=0, tgt_seq_length=0, dynamic_dict=True, use_filter_pred=True)

Dataset for data_type==’text’

Build Example objects, Field objects, and filter_pred function from text corpus.

Parameters:
  • fields (dict) – a dictionary of torchtext.data.Field. Keys are like ‘src’, ‘tgt’, ‘src_map’, and ‘alignment’.
  • src_examples_iter (dict iter) – preprocessed source example dictionary iterator.
  • tgt_examples_iter (dict iter) – preprocessed target example dictionary iterator.
  • num_src_feats (int) – number of source side features.
  • num_tgt_feats (int) – number of target side features.
  • src_seq_length (int) – maximum source sequence length.
  • tgt_seq_length (int) – maximum target sequence length.
  • dynamic_dict (bool) – create dynamic dictionaries?
  • use_filter_pred (bool) – use a custom filter predicate to filter out examples?
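The src_seq_length, tgt_seq_length, and use_filter_pred parameters together control length filtering. A minimal sketch of this kind of filter predicate (not the library's own implementation), using a hypothetical example object with `src` and `tgt` token lists:

```python
# Sketch: keep an example only if both sides fit within the configured
# maximum lengths, with 0 meaning unlimited on that side.

def make_filter_pred(src_seq_length, tgt_seq_length):
    def filter_pred(ex):
        ok_src = not src_seq_length or len(ex.src) <= src_seq_length
        ok_tgt = not tgt_seq_length or len(ex.tgt) <= tgt_seq_length
        return ok_src and ok_tgt
    return filter_pred

class Ex:  # hypothetical stand-in for a torchtext Example
    def __init__(self, src, tgt):
        self.src, self.tgt = src, tgt

pred = make_filter_pred(3, 3)
print(pred(Ex(["a", "b"], ["x"])))            # True: both within limits
print(pred(Ex(["a", "b", "c", "d"], ["x"])))  # False: source too long
```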
static collapse_copy_scores(scores, batch, tgt_vocab, src_vocabs)

Given scores from an expanded dictionary corresponding to a batch, sums together copies, with a dictionary word when it is ambiguous.
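The collapsing step can be illustrated without the library. A minimal sketch using plain lists instead of tensors (the real method operates on batched scores and torchtext vocabularies): when a copied source word also exists in the fixed target vocabulary, its copy probability is folded into the vocabulary entry and the copy slot is zeroed.

```python
# Sketch: merge copy scores into base-vocabulary scores for ambiguous words.
# `base_vocab` maps word -> index in the fixed target vocabulary;
# `src_vocab` lists the words of one example's dynamic (copy) dictionary.

def collapse_copy_scores_sketch(scores, base_vocab, src_vocab):
    """`scores` is a flat list: the first len(base_vocab) entries score
    vocabulary words, the remaining len(src_vocab) entries score copies."""
    n_base = len(base_vocab)
    out = list(scores)
    for i, word in enumerate(src_vocab):
        j = base_vocab.get(word)
        if j is not None:  # ambiguous: word is in both dictionaries
            out[j] += out[n_base + i]   # fold copy mass into the vocab word
            out[n_base + i] = 0.0       # zero the now-redundant copy slot
    return out

base_vocab = {"the": 0, "cat": 1, "<unk>": 2}
src_vocab = ["cat", "Mittens"]        # "cat" is ambiguous, "Mittens" copy-only
scores = [0.2, 0.3, 0.1, 0.25, 0.15]  # 3 vocab scores + 2 copy scores
print(collapse_copy_scores_sketch(scores, base_vocab, src_vocab))
```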

static get_fields(n_src_features, n_tgt_features)
Parameters:
  • n_src_features (int) – the number of source features to create torchtext.data.Field for.
  • n_tgt_features (int) – the number of target features to create torchtext.data.Field for.
Returns:

A dictionary whose keys are strings and whose values are the corresponding Field objects.

static get_num_features(corpus_file, side)

Peek at the first line of the corpus and count its features (all lines must have the same number of features). For a text corpus, both sides are in text form, so the method works the same for either side.

Parameters:
  • corpus_file (str) – file path to get the features.
  • side (str) – ‘src’ or ‘tgt’.
Returns:

number of features on side.
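The peek-and-count logic can be sketched as follows, assuming a hypothetical format in which features are appended to each token with a '|' separator (the separator the library actually uses may differ):

```python
import os
import tempfile

# Sketch: read only the first line of the corpus and count the feature
# fields attached to its first token, e.g. "dogs|NNS|pl" has 2 features.

def get_num_features_sketch(corpus_file):
    with open(corpus_file, encoding="utf-8") as f:
        tokens = f.readline().split()
    if not tokens:
        return 0
    return len(tokens[0].split("|")) - 1  # fields after the word itself

# Example: a two-feature corpus line.
tmp = os.path.join(tempfile.gettempdir(), "tiny_corpus.txt")
with open(tmp, "w", encoding="utf-8") as f:
    f.write("dogs|NNS|pl bark|VBP|pl\n")
print(get_num_features_sketch(tmp))  # 2
```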

static make_text_examples_nfeats_tpl(path, truncate, side)
Parameters:
  • path (str) – location of a src or tgt file.
  • truncate (int) – maximum sequence length (0 for unlimited).
  • side (str) – “src” or “tgt”.
Returns:

(example_dict iterator, num_feats) tuple.

static read_text_file(path, truncate, side)
Parameters:
  • path (str) – location of a src or tgt file.
  • truncate (int) – maximum sequence length (0 for unlimited).
  • side (str) – “src” or “tgt”.
Yields:

(word, features, nfeat) triples for each line.
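A sketch of a reader with the same yield shape, again assuming a hypothetical '|' feature separator (and taking an iterable of lines rather than a file path, to keep the example self-contained):

```python
# Sketch: yield (words, features, n_feats) per line, where features[k]
# collects the (k+1)-th field of every token, and `truncate` caps the
# number of tokens (0 meaning unlimited).

def read_text_lines_sketch(lines, truncate=0):
    for line in lines:
        tokens = line.strip().split()
        if truncate:
            tokens = tokens[:truncate]
        split = [tok.split("|") for tok in tokens]
        words = [parts[0] for parts in split]
        n_feats = len(split[0]) - 1 if split else 0
        features = [[parts[k + 1] for parts in split] for k in range(n_feats)]
        yield words, features, n_feats

for words, feats, n in read_text_lines_sketch(["run|VB fast|RB"]):
    print(words, feats, n)  # ['run', 'fast'] [['VB', 'RB']] 1
```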

sort_key(ex)

Sort using the length of the source sentence.
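Sorting by source length lets the iterator bucket similarly sized sentences together, which minimizes padding inside a batch. A minimal sketch, with a hypothetical stand-in for the example object:

```python
# Sketch: length-based sort key over examples with a `src` token list.

class Example:  # hypothetical stand-in for a torchtext Example
    def __init__(self, src):
        self.src = src

def sort_key(ex):
    return len(ex.src)

examples = [Example("a b c".split()), Example("a".split()), Example("a b".split())]
examples.sort(key=sort_key)
print([len(e.src) for e in examples])  # [1, 2, 3]
```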

class onmt.io.ImageDataset(fields, src_examples_iter, tgt_examples_iter, num_src_feats=0, num_tgt_feats=0, tgt_seq_length=0, use_filter_pred=True)

Dataset for data_type==’img’

Build Example objects, Field objects, and filter_pred function from image corpus.

Parameters:
  • fields (dict) – a dictionary of torchtext.data.Field.
  • src_examples_iter (dict iter) – preprocessed source example dictionary iterator.
  • tgt_examples_iter (dict iter) – preprocessed target example dictionary iterator.
  • num_src_feats (int) – number of source side features.
  • num_tgt_feats (int) – number of target side features.
  • tgt_seq_length (int) – maximum target sequence length.
  • use_filter_pred (bool) – use a custom filter predicate to filter out examples?
static get_fields(n_src_features, n_tgt_features)
Parameters:
  • n_src_features – the number of source features to create torchtext.data.Field for.
  • n_tgt_features – the number of target features to create torchtext.data.Field for.
Returns:

A dictionary whose keys are strings and whose values are the corresponding Field objects.

static get_num_features(corpus_file, side)

For an image corpus, the source side consists of images and therefore has no features, while the target side is text, so its text features can be extracted.

Parameters:
  • corpus_file (str) – file path to get the features.
  • side (str) – ‘src’ or ‘tgt’.
Returns:

number of features on side.

static make_image_examples_nfeats_tpl(path, img_dir)
Parameters:
  • path (str) – location of a src file containing image paths.
  • img_dir (str) – location of source images.
Returns:

(example_dict iterator, num_feats) tuple

static read_img_file(path, src_dir, side, truncate=None)
Parameters:
  • path (str) – location of a src file containing image paths
  • src_dir (str) – location of source images
  • side (str) – ‘src’ or ‘tgt’
  • truncate (tuple) – maximum image size ((0, 0) or None for unlimited).
Yields:

a dictionary containing image data, path and index for each line.

sort_key(ex)

Sort using the size of the image: (width, height).
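Tuples compare lexicographically, so this orders images by width first and height second. A minimal sketch, with a hypothetical stand-in for the example object (in the library the dimensions would come from the image tensor itself):

```python
# Sketch: (width, height) sort key for image examples.

class ImgExample:  # hypothetical stand-in holding image dimensions
    def __init__(self, height, width):
        self.height, self.width = height, width

def img_sort_key(ex):
    return (ex.width, ex.height)

imgs = [ImgExample(32, 200), ImgExample(32, 100), ImgExample(16, 100)]
imgs.sort(key=img_sort_key)
print([(i.width, i.height) for i in imgs])  # [(100, 16), (100, 32), (200, 32)]
```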

class onmt.io.AudioDataset(fields, src_examples_iter, tgt_examples_iter, num_src_feats=0, num_tgt_feats=0, tgt_seq_length=0, sample_rate=0, window_size=0.0, window_stride=0.0, window=None, normalize_audio=True, use_filter_pred=True)

Dataset for data_type==’audio’

Build Example objects, Field objects, and filter_pred function from audio corpus.

Parameters:
  • fields (dict) – a dictionary of torchtext.data.Field.
  • src_examples_iter (dict iter) – preprocessed source example dictionary iterator.
  • tgt_examples_iter (dict iter) – preprocessed target example dictionary iterator.
  • num_src_feats (int) – number of source side features.
  • num_tgt_feats (int) – number of target side features.
  • tgt_seq_length (int) – maximum target sequence length.
  • sample_rate (int) – sample rate.
  • window_size (float) – window size for spectrogram in seconds.
  • window_stride (float) – window stride for spectrogram in seconds.
  • window (str) – window type for spectrogram generation.
  • normalize_audio (bool) – whether to normalize the spectrogram by subtracting its mean and dividing by its standard deviation.
  • use_filter_pred (bool) – use a custom filter predicate to filter out examples?
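The normalization enabled by normalize_audio can be sketched as follows (a plain-list illustration over a 2-D spectrogram, not the library's tensor code):

```python
import math

# Sketch: subtract the spectrogram's mean and divide by its standard
# deviation, both computed over all time-frequency bins.

def normalize(spec):
    flat = [v for row in spec for v in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))
    if std == 0:  # constant spectrogram: avoid division by zero
        return [[0.0 for _ in row] for row in spec]
    return [[(v - mean) / std for v in row] for row in spec]

spec = [[1.0, 3.0], [1.0, 3.0]]
print(normalize(spec))  # [[-1.0, 1.0], [-1.0, 1.0]]
```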
static get_fields(n_src_features, n_tgt_features)
Parameters:
  • n_src_features – the number of source features to create torchtext.data.Field for.
  • n_tgt_features – the number of target features to create torchtext.data.Field for.
Returns:

A dictionary whose keys are strings and whose values are the corresponding Field objects.

static get_num_features(corpus_file, side)

For an audio corpus, the source side consists of audio and therefore has no features, while the target side is text, so its text features can be extracted.

Parameters:
  • corpus_file (str) – file path to get the features.
  • side (str) – ‘src’ or ‘tgt’.
Returns:

number of features on side.

static make_audio_examples_nfeats_tpl(path, audio_dir, sample_rate, window_size, window_stride, window, normalize_audio, truncate=None)
Parameters:
  • path (str) – location of a src file containing audio paths.
  • audio_dir (str) – location of source audio files.
  • sample_rate (int) – sample rate of the audio.
  • window_size (float) – window size for spectrogram in seconds.
  • window_stride (float) – window stride for spectrogram in seconds.
  • window (str) – window type for spectrogram generation.
  • normalize_audio (bool) – whether to normalize the spectrogram by subtracting its mean and dividing by its standard deviation.
  • truncate (int) – maximum audio length (0 or None for unlimited).
Returns:

(example_dict iterator, num_feats) tuple

static read_audio_file(path, src_dir, side, sample_rate, window_size, window_stride, window, normalize_audio, truncate=None)
Parameters:
  • path (str) – location of a src file containing audio paths.
  • src_dir (str) – location of source audio files.
  • side (str) – ‘src’ or ‘tgt’.
  • sample_rate (int) – sample rate of the audio.
  • window_size (float) – window size for spectrogram in seconds.
  • window_stride (float) – window stride for spectrogram in seconds.
  • window (str) – window type for spectrogram generation.
  • normalize_audio (bool) – whether to normalize the spectrogram by subtracting its mean and dividing by its standard deviation.
  • truncate (int) – maximum audio length (0 or None for unlimited).
Yields:

a dictionary containing audio data for each line.

sort_key(ex)

Sort using the duration of the sound spectrogram.
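The duration proxy can be sketched as a spectrogram frame count. Assuming standard STFT framing (the library's exact count may differ by a frame), a signal of n_samples samples with window size and stride given in seconds yields:

```python
# Sketch: number of spectrogram frames for a signal, used as a sort key.

def n_frames(n_samples, sample_rate, window_size, window_stride):
    win = int(sample_rate * window_size)   # window length in samples
    hop = int(sample_rate * window_stride) # hop length in samples
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop

# 1 s of 16 kHz audio, 20 ms window, 10 ms stride:
print(n_frames(16000, 16000, 0.02, 0.01))  # 99
```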