The data preparation (or preprocessing) pass goes over the data to generate the word vocabularies and the sequences of indices used by the training.
Generally the global process includes several steps:
- tokenization (for text files): splitting the corpus into space-separated tokens, possibly associated with features. See the tokenization tool here.
- preprocessing: building a data file from the tokenized source and target training and validation corpora, optionally shuffling the sentences and sorting by sentence length.
It is possible to perform tokenization and preprocessing dynamically during the training using so-called "Dynamic Datasets" as documented here
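The overall flow can be sketched in a few lines of Python. This is a toy illustration only, not OpenNMT's actual implementation (the helper names are made up); the real work is done by the preprocessing script:

```python
# Toy sketch of the preparation flow: tokenize, build a vocabulary,
# then turn each sentence into a sequence of indices.

def tokenize(line):
    # Space-separated tokens, as in OpenNMT's default text format.
    return line.strip().split()

def build_vocab(sentences):
    # Map each distinct token to an integer index.
    vocab = {}
    for sent in sentences:
        for tok in tokenize(sent):
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_indices(sentence, vocab):
    # The training consumes these index sequences, not raw text.
    return [vocab[tok] for tok in tokenize(sentence)]

corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)
sequences = [to_indices(s, vocab) for s in corpus]
```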
By default, the data type is `bitext`: aligned source and target files. Alignment is by default done at the line level, but can also be done through aligned indices (see Index files). For training language models, the data type is `monotext`: a single language file. Finally, you can also manipulate the `feattext` data type (see Input vectors), which allows encoding sequences of vectors (e.g. sequences of features generated by a device). Input vectors can only be used on the source side.
Training data (for `bitext` and `monotext` data types) are expected to follow this format:
- sentences are newline-separated
- tokens are space-separated
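As a quick illustration of this format (a hypothetical sketch with made-up content), a bitext pair is simply two files with the same number of newline-separated, pre-tokenized sentences:

```python
# Minimal sanity check that a source/target bitext pair is line-aligned.
src_lines = "a b c\nd e\n".splitlines()
tgt_lines = "x y\nz\n".splitlines()

# One sentence per line: both sides must have the same line count...
assert len(src_lines) == len(tgt_lines)

# ...and tokens are space-separated on each line.
src_tokens = [line.split() for line in src_lines]
```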
Index files align different files by index instead of by line. For instance, the following two files are aligned by index:

```text
line1 First line
line2 Second line
```

```text
line2 Deuxième ligne
line1 Première ligne
```

where the first token of each line is an index which must have an equivalent (at any position) in the aligned files.
The `-idx_files` option (in `translate.lua`) is used to enable this feature.
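The alignment logic can be sketched as follows (a toy illustration, not OpenNMT's actual code): each line starts with an index token, and files are joined on that index rather than on line position.

```python
# Sketch of index-based alignment between two files.

def read_indexed(text):
    # Split each line into its leading index and the rest of the line.
    entries = {}
    for line in text.splitlines():
        idx, _, content = line.partition(" ")
        entries[idx] = content
    return entries

src = read_indexed("line1 First line\nline2 Second line")
tgt = read_indexed("line2 Deuxième ligne\nline1 Première ligne")

# Align by shared index keys, regardless of line order.
pairs = {k: (src[k], tgt[k]) for k in src if k in tgt}
```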
OpenNMT supports the use of vector sequences instead of word sequences on the source side. The data type is `feattext` and uses the Kaldi text format (`.ark` files). For instance, the following entry, indexed by `KEY`, represents a sequence of `m` vectors of `n` values:

```text
KEY [ FEAT1.1 FEAT1.2 FEAT1.3 ... FEAT1.n ... FEATm.1 FEATm.2 FEATm.3 ... FEATm.n ]
```
Note that you need to use index files for representing input vectors.
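To make the layout concrete, here is a rough parser for one such entry. This is a sketch based on the flattened form shown above, assuming whitespace-separated values between the brackets and a known vector size `n`; it is not the parser OpenNMT actually uses:

```python
def parse_feattext_entry(line, n):
    # Split "KEY [ v v v ... ]" into the key and a flat list of floats,
    # then regroup the floats into m vectors of n values each.
    key, rest = line.split("[", 1)
    values = [float(v) for v in rest.rstrip(" ]").split()]
    assert len(values) % n == 0, "value count must be a multiple of n"
    vectors = [values[i:i + n] for i in range(0, len(values), n)]
    return key.strip(), vectors

# Two vectors of three values each.
key, vecs = parse_feattext_entry("KEY [ 0.1 0.2 0.3 0.4 0.5 0.6 ]", n=3)
```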
The main goal of the preprocessing is to build the word and features vocabularies and assign each word to an index within these dictionaries.
By default, word vocabularies are limited to 50,000 entries. You can change this value with the `-src_vocab_size` and `-tgt_vocab_size` options. Alternatively, you can prune the vocabulary size by setting the minimum frequency of words with the `-src_words_min_frequency` and `-tgt_words_min_frequency` options.
When pruning vocabularies to 50,000, the preprocessing will actually report a vocabulary size of 50,004 because of the 4 special tokens that are automatically added.
The preprocessing script will generate `*.dict` files containing the vocabularies: source and target token vocabularies are named `PREFIX.src.dict` and `PREFIX.tgt.dict`, while feature vocabularies are saved in similarly named files.
These files are optional for the rest of the workflow. However, it is common to reuse vocabularies across datasets using the `-src_vocab` and `-tgt_vocab` options. This is particularly needed when retraining a model on new data: the vocabulary has to be the same.
Vocabularies can also be generated beforehand with the `tools/build_vocab.lua` script.
Each line of the dictionary files contains space-separated fields:

- `token`: the vocab entry
- `ID`: its index, used internally to map tokens to integers as entries of lookup tables
- (optional) the token frequency in the corpus it was extracted from; this field is generated by the preprocessing
- other fields are ignored
If you provide your own vocabulary, be sure to include the 4 special tokens: `<blank> <unk> <s> </s>`. A good practice is to keep them at the beginning of the file with the respective indices 1, 2, 3 and 4.
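The conventions above can be sketched as follows. This is a toy example of producing `token ID frequency` lines with the special tokens first; the real files are written by the preprocessing script:

```python
from collections import Counter

SPECIALS = ["<blank>", "<unk>", "<s>", "</s>"]

def build_dict(sentences, min_frequency=1):
    # Count token frequencies, then emit "token ID frequency" lines with
    # the 4 special tokens kept at the beginning (indices 1 to 4).
    counts = Counter(tok for sent in sentences for tok in sent.split())
    lines = [f"{tok} {i + 1}" for i, tok in enumerate(SPECIALS)]
    idx = len(SPECIALS) + 1
    for tok, freq in counts.most_common():
        # Frequency-based pruning, as with the min-frequency options.
        if freq >= min_frequency:
            lines.append(f"{tok} {idx} {freq}")
            idx += 1
    return lines

# Only "the" appears at least twice, so it is the only regular entry kept.
entries = build_dict(["the cat sat", "the dog"], min_frequency=2)
```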
Shuffling and sorting
By default, OpenNMT both shuffles and sorts the data before the training. This process comes from two constraints of batch training:

- shuffling: sentences within a batch should come from different parts of the corpus
- sorting: sentences within a batch should have the same source length (i.e. no padding) to maximize efficiency
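Both constraints can be sketched together (a toy illustration, not OpenNMT's actual batching code): shuffle the whole corpus first, then sort by source length before cutting it into batches.

```python
import random

def make_batches(pairs, batch_size):
    # Shuffle first so batches mix different parts of the corpus, then
    # sort by source length so each batch needs little or no padding.
    pairs = list(pairs)
    random.shuffle(pairs)
    pairs.sort(key=lambda p: len(p[0]))
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

# (source tokens, target tokens) pairs of varying source lengths.
corpus = [(["a"], ["x"]), (["a", "b", "c"], ["y"]),
          (["a", "b"], ["z"]), (["d"], ["w"])]
batches = make_batches(corpus, batch_size=2)
```

After sorting, the first batch contains the two length-1 sources, so no padding would be needed within it.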
During the training, batches are also randomly selected unless the `-curriculum` option is used.
During preprocessing, sentences that are too long (source longer than `-src_seq_length` or target longer than `-tgt_seq_length`) are discarded from the corpus. You can get an idea of the distribution of sentence lengths in your training corpus by looking at the preprocessing log, where a table gives the percentage of sentences with length 1-10, 11-20, 21-30, ..., 90+:
```text
[04/14/17 00:40:10 INFO]  * Source Sentence Length (range of 10): [ 7% ; 35% ; 32% ; 16% ; 7% ; 0% ; 0% ; 0% ; 0% ; 0% ]
[04/14/17 00:40:10 INFO]  * Target Sentence Length (range of 10): [ 9% ; 38% ; 30% ; 15% ; 5% ; 0% ; 0% ; 0% ; 0% ; 0% ]
```
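Such a table is a simple histogram over length buckets of 10. A sketch of how it could be computed (not the actual logging code):

```python
def length_distribution(sentences, bucket=10, max_bucket=10):
    # Percentage of sentences whose token count falls in 1-10, 11-20, ...
    # The last bucket collects everything beyond (max_bucket - 1) * bucket.
    counts = [0] * max_bucket
    for sent in sentences:
        n = len(sent.split())
        counts[min((n - 1) // bucket, max_bucket - 1)] += 1
    total = len(sentences)
    return [round(100 * c / total) for c in counts]

# Two short sentences (bucket 1-10) and one 12-token sentence (bucket 11-20).
dist = length_distribution(["a b c", "a b c d e", "a " * 12])
```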
Limiting the maximal sentence length is a key parameter to reduce the GPU memory footprint during training: the memory indeed grows linearly with the maximal sentence length.