OpenNMT-tf uses TensorBoard to log information during training. Start tensorboard with the log directory set to the active one, then open the URL displayed in the shell to monitor and visualize several kinds of data, including:
- training and evaluation loss
- training speed
- learning rate
- gradients norm
- computation graphs
- word embeddings
- decoder sampling probability
OpenNMT-tf training can make use of multiple GPUs with in-graph replication. In this mode, the main section of the graph is replicated over multiple devices and batches are processed in parallel. The resulting graph is equivalent to training with batches N times larger, where N is the number of GPUs used.
For example, if your machine has 4 GPUs, simply add the --num_gpus option:
onmt-main train [...] --num_gpus 4
Note that evaluation and inference will run on a single device.
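The equivalence between N replicas and an N times larger batch can be checked on a toy problem. The following is a minimal sketch, not OpenNMT-tf code: it assumes that each replica receives an equally sized shard and that the per-replica gradients are averaged, and every name in it is illustrative.

```python
# Sketch: averaging per-replica gradients matches the gradient of one large batch.
# Toy model: y = w * x with a mean squared error loss. All names are illustrative.

def mse_gradient(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def replicated_gradient(w, batch, num_replicas):
    """Split the batch into equal shards, compute one gradient per shard, average."""
    shard_size = len(batch) // num_replicas
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_replicas)]
    return sum(mse_gradient(w, shard) for shard in shards) / num_replicas


# 4 hypothetical GPUs, each receiving 2 examples: an effective batch of 8.
data = [(float(i), 3.0 * i + 1.0) for i in range(8)]
full = mse_gradient(0.5, data)
replicated = replicated_gradient(0.5, data, num_replicas=4)
print(abs(full - replicated) < 1e-9)
```

In practice this means that hyperparameters tuned for a given batch size, such as the learning rate, may need to be revisited when the number of GPUs changes.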
OpenNMT-tf also supports asynchronous distributed training with between-graph replication. In this mode, each graph replica processes a batch independently, computes the gradients, and asynchronously updates a shared set of parameters.
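To make the asynchronous update pattern concrete, here is a small single-process simulation. It is only a sketch under stated assumptions: threads stand in for worker hosts, a lock-protected object stands in for the parameter server, and the ParameterServer class and worker function are hypothetical names, not part of OpenNMT-tf or TensorFlow.

```python
import threading


class ParameterServer:
    """Holds the shared parameter; workers push gradients whenever they are ready."""

    def __init__(self, w, learning_rate):
        self.w = w
        self.learning_rate = learning_rate
        self.num_updates = 0
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return self.w

    def push(self, gradient):
        # Applied immediately, without waiting for the other workers.
        with self._lock:
            self.w -= self.learning_rate * gradient
            self.num_updates += 1


def worker(server, batches):
    for batch in batches:
        w = server.pull()  # may already be stale when the gradient is pushed
        gradient = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        server.push(gradient)


# Toy data for y = 3x; each of the 2 workers processes 200 batches on its own.
data = [(i / 10.0, 3.0 * i / 10.0) for i in range(1, 11)]
server = ParameterServer(w=0.0, learning_rate=0.1)
threads = [threading.Thread(target=worker, args=(server, [data] * 200)) for _ in range(2)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(server.num_updates, round(server.w, 3))
```

Because a worker may compute its gradient from parameters that another worker has since updated, the exact trajectory depends on scheduling, although this toy problem still converges to a value close to 3.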
To enable distributed training, the user should use the train_and_eval run type and set on the command line:
- a chief worker host that runs a training loop and manages checkpoints, summaries, etc.
- a list of worker hosts that run a training loop
- a list of parameter server hosts that synchronize the parameters
Then a training instance should be started on each host with a selected task, e.g.:
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval [...] \
    --ps_hosts localhost:2222 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2224,localhost:2225 \
    --task_type worker \
    --task_index 1
This starts worker 1 on the current machine, using the first GPU. By setting CUDA_VISIBLE_DEVICES correctly, asynchronous distributed training can also be run on a single multi-GPU machine.
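Since every instance receives the same cluster description and differs only in its task, the per-host command lines can be generated rather than typed by hand. The helper below is a hypothetical sketch: build_commands is not part of OpenNMT-tf, and the task type names ps and chief are assumptions (only worker appears in the example above).

```python
def build_commands(ps_hosts, chief_host, worker_hosts,
                   base="onmt-main train_and_eval [...]"):
    """Return one command line per task in the cluster (illustrative helper)."""
    cluster_flags = (
        f"--ps_hosts {','.join(ps_hosts)} "
        f"--chief_host {chief_host} "
        f"--worker_hosts {','.join(worker_hosts)}"
    )
    # Assumed task type names: "ps" and "chief"; "worker" comes from the example above.
    tasks = [("ps", i) for i in range(len(ps_hosts))]
    tasks.append(("chief", 0))
    tasks += [("worker", i) for i in range(len(worker_hosts))]
    return [
        f"{base} {cluster_flags} --task_type {task_type} --task_index {index}"
        for task_type, index in tasks
    ]


commands = build_commands(
    ps_hosts=["localhost:2222"],
    chief_host="localhost:2223",
    worker_hosts=["localhost:2224", "localhost:2225"],
)
for command in commands:
    print(command)
```

On a single multi-GPU machine, each generated command would additionally need its own CUDA_VISIBLE_DEVICES value so that the instances do not compete for the same device.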
Note: distributed training will also split the training directory model_dir across the instances. This could impact features that restore checkpoints, such as inference, manual export, or checkpoint averaging. The recommended approach to properly support these features while running distributed training is to store model_dir on a shared filesystem, e.g. by using HDFS.
Mixed precision training
Thanks to work from NVIDIA, OpenNMT-tf supports training models using FP16 computation. Mixed precision training is automatically enabled when the data type of the inputters is defined to be tf.float16. See for example the predefined model TransformerFP16:
onmt-main train [...] --model_type TransformerFP16
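One reason FP16 training needs extra care is that small gradient values can underflow to zero in half precision. The sketch below illustrates this with the standard library only; the gradient value and the scale of 1024 are arbitrary choices for illustration, and to_fp16 is a hypothetical helper rather than anything OpenNMT-tf provides.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


small_gradient = 1e-8  # arbitrary value below the smallest positive FP16 number
scale = 1024.0         # arbitrary scaling factor

# Stored directly in FP16, the value is lost.
unscaled = to_fp16(small_gradient)

# Scaled before the FP16 round trip and unscaled afterwards in higher precision,
# the value survives with a small rounding error.
scaled = to_fp16(small_gradient * scale) / scale

print(unscaled, scaled)
```

Multiplying the loss by a constant before backpropagation has the same effect on every gradient, which is the idea behind loss scaling.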
Additional training configurations are available to tune the loss scaling algorithm:
params:
  # (optional) For mixed precision training, the loss scaling to apply (a constant value or
  # an automatic scaling algorithm: "backoff", "logmax", default: "backoff")
  loss_scale: backoff
  # (optional) For mixed precision training, the additional parameters to pass the loss scale
  # (see the source file opennmt/optimizers/mixed_precision_wrapper.py).
  loss_scale_params:
    scale_min: 1.0
    step_factor: 2.0
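As a rough illustration of what a backoff style algorithm does with parameters such as scale_min and step_factor, consider the sketch below. It is not the implementation in opennmt/optimizers/mixed_precision_wrapper.py: the class name, the initial_scale and window parameters, and the exact update rule are all assumptions made for the example.

```python
class BackoffLossScaler:
    """Illustrative dynamic loss scaler: shrink on overflow, grow after stable steps."""

    def __init__(self, initial_scale=8.0, scale_min=1.0, step_factor=2.0, window=3):
        self.scale = initial_scale      # hypothetical starting value
        self.scale_min = scale_min      # lower bound on the scale
        self.step_factor = step_factor  # factor used to grow and shrink the scale
        self.window = window            # hypothetical: stable steps before growing
        self.good_steps = 0

    def update(self, gradients_are_finite):
        """Adjust the scale; return True if the optimizer step should be applied."""
        if not gradients_are_finite:
            # Overflow: back off, never going below scale_min, and skip this step.
            self.scale = max(self.scale_min, self.scale / self.step_factor)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.window:
            # Enough stable steps in a row: try a larger scale again.
            self.scale *= self.step_factor
            self.good_steps = 0
        return True


scaler = BackoffLossScaler()
history = [True, True, False, True, True, True, True]  # False marks an overflow
applied = [scaler.update(is_finite) for is_finite in history]
print(scaler.scale, applied.count(True))
```

Under these assumptions, a larger step_factor reacts faster to overflows but makes the scale oscillate more, while scale_min prevents the scale from collapsing after a burst of overflows.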
For more information about the implementation and expert recommendations on how to maximize performance, see OpenSeq2Seq’s documentation. Currently, mixed precision training requires Volta GPUs and NVIDIA’s TensorFlow Docker image.
If you want to convert an existing checkpoint from FP32 to FP16 (or vice versa), see the script onmt-convert-checkpoint. This is typically useful when you want to train using FP16 but still release a model in FP32, e.g.:
onmt-convert-checkpoint --model_dir ende-fp16/ --output_dir ende-fp32/ --target_dtype float32
The checkpoint generated in ende-fp32/ can then be used in export run types.