Step 0: Install OpenNMT-py
pip install --upgrade pip
pip install OpenNMT-py
Step 1: Prepare the data
To get started, we propose to download a toy English-German dataset for machine translation containing 10k tokenized sentences:
wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
tar xf toy-ende.tar.gz
cd toy-ende
The data consists of parallel source (src) and target (tgt) files containing one sentence per line, with tokens separated by a space.
Validation files are used to evaluate the convergence of the training. They usually contain no more than 5k sentences.
$ head -n 2 toy-ende/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
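Because source and target are aligned by line number, it can be worth checking the alignment before building anything on top of the files. The following is a minimal Python sketch, not part of OpenNMT-py: `check_parallel` and `read_lines` are hypothetical helper names, and the only checks made are equal line counts and non-empty sentences.

```python
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 text file as a list of lines without trailing newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def check_parallel(src_lines, tgt_lines):
    """Return the number of aligned sentence pairs, or raise if misaligned."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} source vs {len(tgt_lines)} target"
        )
    for line_no, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        # Tokens are separated by spaces, so an empty split means no tokens.
        if not src.split() or not tgt.split():
            raise ValueError(f"empty sentence on line {line_no}")
    return len(src_lines)
```

With the toy data in place, a call such as `check_parallel(read_lines("toy-ende/src-train.txt"), read_lines("toy-ende/tgt-train.txt"))` would return the number of training pairs or raise on the first problem found.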
We need to build a YAML configuration file to specify the data that will be used:
# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
...
From this configuration, we can build the vocab(s) needed to train the model:
onmt_build_vocab -config toy_en_de.yaml -n_sample 10000
-n_sample is required here; it represents the number of lines sampled from each corpus to build the vocab.
This configuration is the simplest possible, without any tokenization or other transforms. See other example configurations for more complex pipelines.
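To make the role of -n_sample concrete, here is a rough Python sketch of counting token frequencies over a limited number of lines. It is an illustration only and not how onmt_build_vocab is implemented: `count_tokens` is a hypothetical helper, and taking the first `n_sample` lines is an assumption made for simplicity.

```python
from collections import Counter
from itertools import islice


def count_tokens(lines, n_sample):
    """Count space-separated tokens over the first n_sample lines.

    Assumption: the sample is simply the first n_sample lines; the real
    tool may select lines differently.
    """
    counts = Counter()
    for line in islice(lines, n_sample):
        counts.update(line.split())
    return counts


corpus = ["the cat sat", "the dog sat", "a bird flew"]
counts = count_tokens(corpus, n_sample=2)
# Only the first two lines contribute, so "bird" is never counted.
print(counts["the"], counts["bird"])
```

A smaller sample builds faster but can miss rare tokens, which is why the sample size matters on larger corpora.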
Step 2: Train the model
To train a model, we need to add the following to the YAML configuration file:
the vocabulary path(s) to use, which can be those generated by onmt_build_vocab;
training-specific parameters.
# toy_en_de.yaml

...

# Vocabulary files that were just created
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
save_model: toy-ende/run/model
save_checkpoint_steps: 500
train_steps: 1000
valid_steps: 500
Then you can simply run:
onmt_train -config toy_en_de.yaml
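As a rough guide to what this run should leave on disk, the sketch below lists the checkpoint files implied by save_model, save_checkpoint_steps and train_steps. It is a hypothetical helper, not an OpenNMT-py function. The `<save_model>_step_<N>.pt` naming is inferred from the checkpoint name used in Step 3, and the sketch assumes a checkpoint is written only at each multiple of save_checkpoint_steps.

```python
def checkpoint_paths(save_model, save_checkpoint_steps, train_steps):
    """List the checkpoint files expected after a complete training run.

    Assumptions: files are named "<save_model>_step_<N>.pt" and are written
    only when the step count is a multiple of save_checkpoint_steps.
    """
    steps = range(save_checkpoint_steps, train_steps + 1, save_checkpoint_steps)
    return [f"{save_model}_step_{step}.pt" for step in steps]


for path in checkpoint_paths("toy-ende/run/model", 500, 1000):
    print(path)
```

Under those assumptions, the values from the configuration above give two checkpoints, at steps 500 and 1000.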
This configuration will run the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and decoder. It will run on a single GPU (world_size 1 & gpu_ranks [0]).
Before the training process actually starts, the *.vocab.pt together with *.transforms.pt can be dumped to -save_data, using the configuration specified in the -config YAML file, by enabling the -dump_transforms flag. It is also possible to generate transformed samples to simplify any visual inspection that may be required. The number of sample lines to dump per corpus is set with the -n_sample flag.
Step 3: Translate
onmt_translate -model toy-ende/run/model_step_1000.pt -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose
Now you have a model that you can use to predict on new data. We do this by running beam search. This will output predictions into toy-ende/pred_1000.txt.
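For readers unfamiliar with beam search, the toy Python sketch below shows the general idea: keep the `beam_size` best partial translations at each step instead of only the single best one. It is not OpenNMT-py's implementation, which also handles batching, length penalties and more. `beam_search`, `toy_step` and `TOY_MODEL` are hypothetical names, and the probabilities are made up for illustration.

```python
import math


def beam_search(step_fn, start, end, beam_size, max_len):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return a non-empty dict mapping each possible next
    token to a probability greater than zero.
    """
    beams = [([start], 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, prob in step_fn(prefix).items():
                candidates.append((prefix + [token], score + math.log(prob)))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == end:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len still compete
    return max(finished, key=lambda cand: cand[1])[0]


# Made-up next-token probabilities; any unlisted prefix ends the sentence.
TOY_MODEL = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.6, "y": 0.4},
    ("<s>", "b"): {"z": 0.95, "x": 0.05},
}


def toy_step(prefix):
    return TOY_MODEL.get(tuple(prefix), {"</s>": 1.0})


print(beam_search(toy_step, "<s>", "</s>", beam_size=1, max_len=5))
print(beam_search(toy_step, "<s>", "</s>", beam_size=2, max_len=5))
```

In this toy model, a beam of 1 (greedy decoding) commits to "a" first and ends with a sequence of probability 0.36, while a beam of 2 keeps "b" alive long enough to find the sequence with probability 0.38.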
The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets! For example, you can download millions of parallel sentences for translation or summarization.