Features

Data Preparation

Tokenizer - full option list

Preprocess - full option list

Pre-trained Embeddings

When training with small amounts of data, performance can be improved by starting with pretrained embeddings. The arguments -pre_word_vecs_dec and -pre_word_vecs_enc can be used to specify these files. The pretrained embeddings must be manually constructed torch serialized matrices that correspond to the src and tgt dictionary files. By default these embeddings will be updated during training, but they can be held fixed using -fix_word_vecs_enc and -fix_word_vecs_dec.
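To make the correspondence with the dictionary files concrete, here is a minimal Python sketch of the alignment step only. It assumes the vocabulary is already available as a list in dictionary order and the pretrained vectors as a plain mapping; the function name `build_embedding_rows` and the random initialization range are hypothetical choices for illustration. Serializing the result as a Torch matrix is not shown.

```python
import random


def build_embedding_rows(vocab, vectors, dim, seed=0):
    """Return one embedding row per vocabulary entry, in dictionary order.

    vocab:   list of tokens, ordered as in the dictionary file (assumption).
    vectors: dict mapping token -> list of floats (pretrained vectors).
    Tokens without a pretrained vector get small random values
    (a hypothetical choice, not necessarily what OpenNMT expects).
    """
    rng = random.Random(seed)
    rows = []
    for word in vocab:
        if word in vectors:
            row = list(vectors[word])
            if len(row) != dim:
                raise ValueError("vector for %r has wrong dimension" % word)
        else:
            row = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        rows.append(row)
    return rows


vocab = ["<unk>", "the", "cat"]
vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
rows = build_embedding_rows(vocab, vectors, dim=2)
print(len(rows))  # 3
```

Row i of the result belongs to entry i of the dictionary, which is the property the pretrained embedding file has to satisfy.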

Word Features

OpenNMT supports additional features on source and target words in the form of discrete labels.

To use additional features, modify your data directly by appending labels to each word, separated by the special character with Unicode code point FFE8. A word can carry an arbitrary number of additional features in the form word│feat1│feat2│...│featN, but every word must have the same number of features, in the same order. Source and target data can have different numbers of additional features.

As an example, see data/src-train-case.txt, which uses a separate feature to represent the case of each word. Using case as a feature is a way to optimize the word dictionary (no duplicated words like “the” and “The”) and gives the system additional information that can be useful for optimizing its objective function.

it│C is│l not│l acceptable│l that│l ,│n with│l the│l help│l of│l the│l national│l bureaucracies│l ,│n parliament│C 's│l legislative│l prerogative│l should│l be│l made│l null│l and│l void│l by│l means│l of│l implementing│l provisions│l whose│l content│l ,│n purpose│l and│l extent│l are│l not│l laid│l down│l in│l advance│l .│n

You can generate this case feature with OpenNMT’s tokenization script and the -case_feature flag.
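For readers who want to see what such an annotation involves, the following Python sketch reproduces the labels visible in the example above (C for capitalized, l for lowercase, n for tokens without letters). It is not the OpenNMT tokenizer: the function names are hypothetical, and the "M" label for any other casing pattern is an assumption added so the function is total.

```python
SEP = "\uffe8"  # feature separator, Unicode character FFE8


def case_label(token):
    """Return a case label for one token (labels taken from the example)."""
    if not any(ch.isalpha() for ch in token):
        return "n"  # punctuation, digits: no case information
    if token == token.lower():
        return "l"  # all lowercase
    if token[0].isupper() and token[1:] == token[1:].lower():
        return "C"  # capitalized
    return "M"  # hypothetical label for any other pattern (assumption)


def add_case_feature(line):
    """Lowercase each space-separated token and append its case label."""
    return " ".join(tok.lower() + SEP + case_label(tok) for tok in line.split())


annotated = add_case_feature("It is , Parliament 's")
print(annotated)
```

Because every token is lowercased before the label is attached, “the” and “The” share one dictionary entry and differ only in the feature.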

Training

Train - Full option list

Data options

Model options

Optimization options

Other options

Training From Snapshots

Because training translation models can take a long time (sometimes many weeks), OpenNMT supports resuming a model from a snapshot. By default, it saves a snapshot every epoch, but this can be altered with the -save_every option. Snapshots are fast to save, but can be quite large. To resume from a snapshot, use the -train_from option with the starting snapshot. By default, the system will start training from the snapshot's parameters using the newly passed-in options. To override this and continue from the previous location, use the -continue option.

Parallel Training

To accelerate training, you can use data parallelism, that is, several GPUs training on parallel batches on different replicas. To enable it, use the -nparallel option, which requires as many GPUs to be available. There are 2 different modes:

To select a subset of the GPUs available on your machine, you should use the CUDA_VISIBLE_DEVICES environment variable. For example, if you want to use the first and last GPU on a 4-GPU server:

CUDA_VISIBLE_DEVICES=0,3 th train.lua -gpuid 1 -nparallel 2 -data data/demo-train.t7 -save_model demo

Here, GPU 0 will be seen as the first GPU for the process and GPU 3 the second. Note that CUDA_VISIBLE_DEVICES is 0-indexed while -gpuid is 1-indexed.
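The two indexing conventions are easy to mix up, so here is a small Python sketch of the mapping described above. The helper name `gpuid_mapping` is hypothetical; it only illustrates how 1-indexed -gpuid values line up with the 0-indexed physical devices listed in CUDA_VISIBLE_DEVICES.

```python
def gpuid_mapping(cuda_visible_devices):
    """Map 1-indexed -gpuid values to 0-indexed physical GPU ids."""
    physical = [int(d) for d in cuda_visible_devices.split(",") if d.strip()]
    # The process sees the listed devices in order, numbered from 1 for -gpuid.
    return {gpuid: dev for gpuid, dev in enumerate(physical, start=1)}


mapping = gpuid_mapping("0,3")
print(mapping)  # {1: 0, 2: 3}
```

With CUDA_VISIBLE_DEVICES=0,3, the value -gpuid 1 therefore refers to physical GPU 0, and the second visible GPU is physical GPU 3.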

Translation

Translation - full option list

Data options

Beam Search options

Other options

By default, translation is done using beam search. The -beam_size option can be used to trade off translation time against search accuracy, with -beam_size 1 giving greedy search. The small default beam size is often enough in practice. Beam search can also provide an approximate n-best list of translations by setting -n_best to a value greater than 1. For analysis, the translation command also takes an oracle/gold -tgt file and will output a comparison of scores.
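To illustrate the trade-off, the following is a simplified beam search over a toy scoring table, written in Python. It is a sketch under stated assumptions, not OpenNMT's implementation: `beam_search`, `toy_step`, the `TABLE` of probabilities, and the `</s>` end symbol are all hypothetical. It shows how a beam of size 1 behaves greedily and how a larger beam can find a higher-scoring hypothesis and return an n-best list.

```python
import math


def beam_search(step_fn, beam_size, n_best=1, max_len=10, eos="</s>"):
    """Return up to n_best (log-probability, tokens) hypotheses, best first."""
    beam = [(0.0, [])]  # live hypotheses
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beam:
            for token, logp in step_fn(tokens).items():
                candidates.append((score + logp, tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        # Keep only the beam_size best expansions at each step.
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beam.append((score, tokens))
        if not beam:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:n_best]


# Toy next-token probabilities, keyed by the prefix generated so far.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy_step(tokens):
    """Return log-probabilities of the next token; unknown prefixes end."""
    probs = TABLE.get(tuple(tokens), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


print(beam_search(toy_step, beam_size=1)[0][1])  # ['a', 'x', '</s>']
print(beam_search(toy_step, beam_size=2)[0][1])  # ['b', 'z', '</s>']
```

In this toy table the greedy choice "a" leads to a total probability of 0.33, while the beam of size 2 keeps "b" alive and finds the better sequence with probability 0.36. Calling `beam_search(toy_step, beam_size=2, n_best=2)` returns both finished hypotheses.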

Translating Unknown Words

The default translation mode allows the model to produce the UNK symbol when it is not sure of the specific target word. Often, UNK symbols correspond to proper names that can be directly transposed between languages. The -replace_unk option will substitute UNK with a source word using the attention of the model.

Alternatively, advanced users may prefer to provide a preconstructed phrase table from an external aligner (such as fast_align) using the -phrase_table option to allow for non-identity replacement. Instead of copying the source token with the highest attention, the translator will look it up in the phrase table for a possible translation. If a valid replacement is not found, the source token will be copied.

The phrase table is a file with one translation per line in the format:

source|||target

Here, source and target are single, case-sensitive tokens.
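The replacement logic described in this section can be sketched in a few lines of Python. This is an illustration under assumptions, not the code behind -replace_unk: the function names, the `<unk>` spelling of the UNK symbol, and the attention matrix layout (one row of source weights per target token) are all hypothetical.

```python
def load_phrase_table(lines):
    """Parse 'source|||target' lines into a dict of single-token entries."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        source, target = line.split("|||", 1)
        table[source] = target
    return table


def replace_unk(src_tokens, tgt_tokens, attention, phrase_table=None, unk="<unk>"):
    """Replace each unk token using the most attended source token.

    attention[t][s] is the weight of source token s for target token t.
    If the source token is in the phrase table, its translation is used;
    otherwise the source token itself is copied.
    """
    phrase_table = phrase_table or {}
    output = []
    for t, token in enumerate(tgt_tokens):
        if token == unk:
            weights = attention[t]
            best = max(range(len(src_tokens)), key=lambda s: weights[s])
            src_word = src_tokens[best]
            token = phrase_table.get(src_word, src_word)
        output.append(token)
    return output


src = ["Herr", "Schmidt", "lacht"]
tgt = ["Mr", "<unk>", "<unk>"]
attn = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]
table = load_phrase_table(["lacht|||laughs"])
print(replace_unk(src, tgt, attn, table))  # ['Mr', 'Schmidt', 'laughs']
```

The proper name is copied because it has no phrase table entry, while the verb is replaced by its table translation. Lookups are exact string matches, which reflects the case-sensitive, single-token format.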

CPU Translation

After training a model on the GPU, you may want to release it to run on the CPU with the release_model.lua script.

th tools/release_model.lua -model model.t7 -gpuid 1

By default, it will create a model_release.t7 file. See th tools/release_model.lua -h for advanced options.

C++ Translator

OpenNMT also includes an optimized C++-only translator for CPU deployment. The code has no dependencies on Torch or Lua and can be run out of the box with standard OpenNMT models. Simply follow the CPU instructions above to release the model, and then use the installation instructions.

The C++ version takes the same arguments as translate.lua.

cli/translate --model model_release.t7 --src src-val.txt

Translation Server

OpenNMT includes a translation server for running translation remotely. This is also an easy way to use models from other languages such as Java and Python.

The server uses 0MQ for RPC. You can install 0MQ and the Lua bindings on Ubuntu by running:

sudo apt-get install libzmq-dev
luarocks install json
luarocks install lua-zmq ZEROMQ_LIBDIR=/usr/lib/x86_64-linux-gnu/ ZEROMQ_INCDIR=/usr/include

You will also need to install OpenNMT as a library.

luarocks make rocks/opennmt-scm-1.rockspec

The translation server can be run using any of the arguments from translate.lua.

th tools/translation_server.lua -host ... -port ... -model ...

Note: the default host is set to 127.0.0.1 which only allows local access. If you want to support remote access, use 0.0.0.0 instead.

It runs as a message queue that takes in a JSON batch of src sentences. For example the following 5 lines of Python code can be used to send a single sentence for translation.

# Python 3 version; requires the pyzmq package.
import zmq, sys, json
sock = zmq.Context().socket(zmq.REQ)
sock.connect("tcp://127.0.0.1:5556")
sock.send_string(json.dumps([{"src": " ".join(sys.argv[1:])}]))
print(sock.recv_string())

For a longer example, see our Python/Flask server in development.

Extending the System (Image-to-Text)

OpenNMT is explicitly separated out into a library and application section. All modeling and training code can be directly used within other Torch applications.

As an example use case, we have released an extension for translating from images to text. This model replaces the source-side word embeddings with a convolutional image network. The full model is available at OpenNMT/im2text.