Preparation
The data preparation (or preprocessing) stage passes over the data to generate the word vocabularies and the sequences of indices used by the training.
Generally, the global process includes several steps:
- tokenization (for text files): splits the corpus into space-separated tokens, possibly annotated with features. See the tokenization tools.
- preprocessing: builds a data file from the tokenized training and validation corpora, optionally shuffling the sentences and sorting them by sentence length.
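For reference, here is a minimal sketch of a bitext preprocessing run; the file paths are placeholders, but the options are the standard preprocess.lua ones:

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
                  -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
                  -save_data data/demo

This should produce the serialized training data and the vocabulary files under the data/demo prefix.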
Note
It is possible to perform tokenization and preprocessing dynamically during training using so-called "Dynamic Datasets", as documented here.
Data type
By default, the data type is bitext: aligned source and target files. The alignment is done at the line level by default, but can also be done through aligned indexes (see Index files).
For training language models, the data type is monotext, which is a single language file.
Finally, you can also manipulate the feattext data type (see Input vectors), which encodes sequences of vectors (e.g. sequences of features generated by a device).
Note
Input vectors can only be used for the source.
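For example, the data type is selected with the -data_type option of preprocess.lua. The sketch below prepares data for a language model; the single-corpus option names (-train, -valid) are an assumption here, so check preprocess.lua -h for the exact flags:

# monotext: one language only (corpus option names assumed)
th preprocess.lua -data_type monotext \
                  -train data/lm-train.txt -valid data/lm-val.txt \
                  -save_data data/lm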
Delimiters
Training data (for the bitext and monotext data types) are expected to follow this format:
- sentences are newline-separated
- tokens are space-separated
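For illustration, a hypothetical pair of aligned and tokenized files would look like:

First sentence .
Second sentence .

and

Première phrase .
Deuxième phrase .

where line N of the source file is aligned with line N of the target file.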
Index files
Index files align entries across files by index rather than by line. For instance, the following files are aligned by index:
line1 First line
line2 Second line

line2 Deuxième ligne
line1 Première ligne
where the first token of each line is an index that must have an equivalent (at any position) in the aligned files.
The -idx_files option (in preprocess.lua or translate.lua) enables this feature.
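As a sketch, enabling this mode only adds the flag to the usual command (paths are placeholders; depending on the option parser the flag may also need to be written as -idx_files true):

th preprocess.lua -idx_files \
                  -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
                  -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
                  -save_data data/demo-idx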
Input vectors
OpenNMT supports the use of vector sequences instead of word sequences on the source side.
The data type is feattext and uses the Kaldi text format (.ark files). For instance, the following entry, indexed by KEY, represents a sequence of m vectors of n values:
KEY [ FEAT1.1 FEAT1.2 FEAT1.3 ... FEAT1.n ... FEATm.1 FEATm.2 FEATm.3 ... FEATm.n ]
Warning
Note that you need to use index files to represent input vectors.
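To make the layout concrete, here is a hypothetical feattext source file with two entries of 2 vectors of 3 values each:

utt1 [ 0.12 0.05 0.98 0.44 0.27 0.61 ]
utt2 [ 0.30 0.11 0.75 0.09 0.84 0.52 ]

and the matching index-aligned target file:

utt2 second target sentence
utt1 first target sentence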
Vocabularies
The main goal of the preprocessing is to build the word and feature vocabularies and assign each word to an index within these dictionaries.
By default, word vocabularies are limited to 50,000 entries. You can change this value with the -src_vocab_size and -tgt_vocab_size options. Alternatively, you can prune the vocabulary size by setting the minimum frequency of words with the -src_words_min_frequency and -tgt_words_min_frequency options.
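For example, to build 30,000-entry vocabularies (the value is illustrative):

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
                  -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
                  -save_data data/demo \
                  -src_vocab_size 30000 -tgt_vocab_size 30000

Replacing the two size options with -src_words_min_frequency and -tgt_words_min_frequency prunes by frequency instead.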
Note
When pruning vocabularies to 50,000, the preprocessing will actually report a vocabulary size of 50,004 because of 4 special tokens that are automatically added.
The preprocessing script will generate *.dict files containing the vocabularies: source and target token vocabularies are named PREFIX.src.dict and PREFIX.tgt.dict, while feature vocabulary files are named PREFIX.{source,target}_feature_N.dict.
These files are optional for the rest of the workflow. However, it is common to reuse vocabularies across datasets using the -src_vocab and -tgt_vocab options. This is particularly needed when retraining a model on new data: the vocabulary has to be the same.
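As a sketch, reusing previously generated dictionaries when preprocessing new data could look like this (paths are placeholders):

th preprocess.lua -train_src data/new-src-train.txt -train_tgt data/new-tgt-train.txt \
                  -valid_src data/new-src-val.txt -valid_tgt data/new-tgt-val.txt \
                  -src_vocab data/demo.src.dict -tgt_vocab data/demo.tgt.dict \
                  -save_data data/demo-retrain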
Tip
Vocabularies can be generated beforehand with the tools/build_vocab.lua script.
Each line of a dictionary file consists of space-separated fields:
- token: the vocab entry
- ID: its index, used internally to map tokens to integers as entries of lookup tables
- (optional) the vocab frequency in the corpus it was extracted from; this field is generated automatically
- other fields are ignored
Note
If you provide your own vocabulary, be sure to include the 4 special tokens: <blank> <unk> <s> </s>. A good practice is to keep them at the beginning of the file with the respective indexes 1, 2, 3, 4.
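For illustration, the first lines of a hand-built source dictionary could look like this (frequency fields omitted, the last two entries are hypothetical):

<blank> 1
<unk> 2
<s> 3
</s> 4
the 5
, 6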
Shuffling and sorting
By default, OpenNMT both shuffles and sorts the data before training. This process comes from 2 constraints of batch training:
- shuffling: sentences within a batch should come from different parts of the corpus
- sorting: sentences within a batch should have the same source length (i.e. no padding, to maximize efficiency)
Note
During the training, batches are also randomly selected unless the -curriculum option is used.
Sentence length
During preprocessing, sentences that are too long (source longer than -src_seq_length or target longer than -tgt_seq_length) are discarded from the corpus. You can get an idea of the distribution of sentence lengths in your training corpus by looking at the preprocess log, where a table gives the percentage of sentences with length 1-10, 11-20, 21-30, ..., 90+:
[04/14/17 00:40:10 INFO]  * Source Sentence Length (range of 10): [ 7% ; 35% ; 32% ; 16% ; 7% ; 0% ; 0% ; 0% ; 0% ; 0% ]
[04/14/17 00:40:10 INFO]  * Target Sentence Length (range of 10): [ 9% ; 38% ; 30% ; 15% ; 5% ; 0% ; 0% ; 0% ; 0% ; 0% ]
Note
Limiting the maximal sentence length is a key way to reduce the GPU memory footprint during training: the memory grows linearly with the maximal sentence length.
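As a sketch, setting the length limits explicitly is done with the two options below (the values are illustrative):

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
                  -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
                  -save_data data/demo \
                  -src_seq_length 80 -tgt_seq_length 80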