Preparation

The data preparation (or preprocessing) stage passes over the data to generate the word vocabularies and the sequences of indices used by the training.

Generally, the global process includes several steps:

  • tokenization (for text files): splitting the corpus into space-separated tokens, possibly annotated with features. See the tokenization tool here.
  • preprocessing: building a data file from the tokenized source and target training and validation corpora, optionally shuffling the sentences and sorting them by sentence length (see the example below).
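
For example, a typical bitext preprocessing command looks like the following (the file paths are placeholders for your own corpus):

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
  -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

This produces the serialized training data (e.g. data/demo-train.t7) along with the *.dict vocabulary files described in the Vocabularies section below.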

Note

It is possible to perform tokenization and preprocessing dynamically during training using so-called "Dynamic Datasets", as documented here.

Data type

By default, the data type is bitext, which consists of aligned source and target files. Alignment is done at the line level by default, but can also be done through an aligned index (see Index files).

For training language models, the data type is monotext, which consists of a single language file.

Finally, you can also use the feattext data type (see Input vectors), which allows you to encode sequences of vectors (e.g. sequences of features generated by a device).

Note

Input vectors can only be used for the source.
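
The data type is selected with the -data_type option of preprocess.lua. As a sketch, a monolingual preprocessing run for a language model could look like the command below (the -train and -valid option names for the monolingual files are an assumption here; check th preprocess.lua -h for the exact names in your version):

th preprocess.lua -data_type monotext -train data/train.txt \
  -valid data/valid.txt -save_data data/lm-demo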

Delimiters

Training data (for the bitext and monotext data types) is expected to follow this format:

  • sentences are newline-separated
  • tokens are space-separated

Index files

Index files align different files by index instead of by line. For instance, the following two files are aligned by index:

line1 First line
line2 Second line

line2 Deuxième ligne
line1 Première ligne

where the first token of each line is an index which must have an equivalent (at any position) in the aligned files.

The option -idx_files is used (in preprocess.lua or translate.lua) to enable this feature.
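
For example, indexed bitext files could be preprocessed with a command along these lines (the file names are placeholders):

th preprocess.lua -idx_files -train_src data/src-train-indexed.txt \
  -train_tgt data/tgt-train-indexed.txt -valid_src data/src-val-indexed.txt \
  -valid_tgt data/tgt-val-indexed.txt -save_data data/demo-idx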

Input vectors

OpenNMT supports the use of vector sequences instead of word sequences on the source side.

The data type is feattext and uses the Kaldi text format (.ark files). For instance, the following entry, indexed by KEY, represents a sequence of m vectors of n values:

KEY [
FEAT1.1 FEAT1.2 FEAT1.3 ... FEAT1.n
...
FEATm.1 FEATm.2 FEATm.3 ... FEATm.n ]

Warning

You need to use index files to represent input vectors.
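
Putting it together, a feattext preprocessing run could be sketched as follows (the file names are placeholders, and it is assumed here that the vector sequences are still passed through the -train_src and -valid_src options):

th preprocess.lua -data_type feattext -idx_files \
  -train_src data/feats-train.ark -train_tgt data/tgt-train.txt \
  -valid_src data/feats-val.ark -valid_tgt data/tgt-val.txt -save_data data/speech-demo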

Vocabularies

The main goal of the preprocessing is to build the word and feature vocabularies and assign each word to an index within these dictionaries.

By default, word vocabularies are limited to 50,000 entries. You can change this value with the -src_vocab_size and -tgt_vocab_size options. Alternatively, you can prune the vocabulary by setting the minimum word frequency with the -src_words_min_frequency and -tgt_words_min_frequency options.
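
For example, the following command (with placeholder paths) limits the source vocabulary to 30,000 entries and only keeps target words seen at least 3 times:

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
  -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo \
  -src_vocab_size 30000 -tgt_words_min_frequency 3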

Note

When pruning vocabularies to 50,000, the preprocessing will actually report a vocabulary size of 50,004 because of 4 special tokens that are automatically added.

The preprocessing script will generate *.dict files containing the vocabularies: source and target token vocabularies are named PREFIX.src.dict and PREFIX.tgt.dict, while feature vocabularies are named PREFIX.{source,target}_feature_N.dict.

These files are optional for the rest of the workflow. However, it is common to reuse vocabularies across datasets using the -src_vocab and -tgt_vocab options. This is particularly needed when retraining a model on new data: the vocabulary has to be the same.
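
For example, to reuse previously generated vocabularies when preprocessing a new dataset (paths are placeholders):

th preprocess.lua -train_src data/src-train-new.txt -train_tgt data/tgt-train-new.txt \
  -valid_src data/src-val-new.txt -valid_tgt data/tgt-val-new.txt \
  -src_vocab data/demo.src.dict -tgt_vocab data/demo.tgt.dict -save_data data/demo-new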

Tip

Vocabularies can be generated beforehand with the tools/build_vocab.lua script.

Each line of a dictionary file consists of space-separated fields:

  • token: the vocabulary entry.
  • ID: its index, used internally to map tokens to integers as entries of the lookup tables.
  • (optional) the frequency of the token in the corpus the vocabulary was extracted from. This field is generated automatically.
  • any other fields are ignored.

Note

If you provide your own vocabulary, be sure to include the 4 special tokens: <blank> <unk> <s> </s>. A good practice is to keep them at the beginning of the file with the respective indices 1, 2, 3 and 4.
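
For illustration, the first lines of a dictionary file could look as follows (the word entries and frequencies are invented for the example):

<blank> 1
<unk> 2
<s> 3
</s> 4
the 5 634210
, 6 598431
of 7 312876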

Shuffling and sorting

By default, OpenNMT both shuffles and sorts the data before training. This process follows from 2 constraints of batch training:

  • shuffling: sentences within a batch should come from different parts of the corpus
  • sorting: sentences within a batch should have the same source length (i.e. no padding) to maximize efficiency

Note

During the training, batches are also randomly selected unless the -curriculum option is used.

Sentence length

During preprocessing, sentences that are too long (source longer than -src_seq_length or target longer than -tgt_seq_length) are discarded from the corpus. You can get an idea of the sentence length distribution in your training corpus by looking at the preprocessing log, where a table gives the percentage of sentences with length 1-10, 11-20, 21-30, ..., 90+:

[04/14/17 00:40:10 INFO]  * Source Sentence Length (range of 10): [ 7% ; 35% ; 32% ; 16% ; 7% ; 0% ; 0% ; 0% ; 0% ; 0% ]
[04/14/17 00:40:10 INFO]  * Target Sentence Length (range of 10): [ 9% ; 38% ; 30% ; 15% ; 5% ; 0% ; 0% ; 0% ; 0% ; 0% ]
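
For example, to only keep sentences of at most 80 source tokens and 80 target tokens (an illustrative threshold), add the corresponding options to the preprocessing command:

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
  -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo \
  -src_seq_length 80 -tgt_seq_length 80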

Note

Limiting the maximal sentence length is a key parameter for reducing the GPU memory footprint during training: memory usage grows linearly with the maximal sentence length.