# Performance tips
Below are some general recommendations to further improve performance. Many of these recommendations were used in the WNGT 2020 efficiency task submission.
- Set the compute type to "auto" to automatically select the fastest execution path on the current system.
- Reduce the beam size to the minimum value that meets your quality requirement.
- When using a beam size of 1, keep `return_scores` disabled if you are not using the prediction scores: the final softmax layer can be skipped.
- Set `max_batch_size` and pass a larger batch to `*_batch` methods: the input sentences will be sorted by length and split in chunks of `max_batch_size` elements for improved efficiency.
- Prefer the "tokens" `batch_type` to make the total number of elements in a batch more constant.
- Consider using dynamic vocabulary reduction for translation.
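To see why length sorting and a token-based batch size help, here is a minimal pure-Python sketch of the idea. This is an illustration only, not CTranslate2's actual implementation; the function name and chunking logic are assumptions.

```python
# Hypothetical sketch of length-sorted, token-budget batching, similar in
# spirit to what max_batch_size with batch_type="tokens" does internally.
# This is NOT CTranslate2's code, only an illustration of the idea.

def make_token_batches(sentences, max_batch_size):
    """Sort sentences by length, then split them into chunks whose total
    number of tokens stays within max_batch_size."""
    # Sorting by length groups similarly sized sentences together,
    # which reduces padding inside each batch.
    ordered = sorted(sentences, key=len, reverse=True)
    batches, current, current_tokens = [], [], 0
    for sentence in ordered:
        if current and current_tokens + len(sentence) > max_batch_size:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += len(sentence)
    if current:
        batches.append(current)
    return batches

corpus = [
    ["▁Hello", "▁world", "!"],
    ["▁Hi", "!"],
    ["▁This", "▁is", "▁a", "▁longer", "▁sentence", "."],
]
batches = make_token_batches(corpus, max_batch_size=8)
```

Because every batch holds a similar total number of tokens rather than a fixed number of sentences, the compute cost per batch stays roughly constant regardless of sentence length.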
## On CPU

- Use an Intel CPU supporting AVX512.
- If you are processing a large volume of data, prefer increasing `inter_threads` over `intra_threads`, and use stream methods (methods whose name ends with `_file` or `_iterable`).
- Avoid letting the total number of threads `inter_threads * intra_threads` exceed the number of physical cores.
- For single core execution on Intel CPUs, consider enabling packed GEMM (set the environment variable `CT2_USE_EXPERIMENTAL_PACKED_GEMM=1`).
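The thread budget rule above can be sketched as a small helper. The parameter names mirror CTranslate2's options, but the helper itself is hypothetical — CTranslate2 does not provide this function.

```python
# Hypothetical helper illustrating the rule of thumb above: keep
# inter_threads * intra_threads within the number of physical cores.
# The option names mirror CTranslate2's, but this check is ours.

def pick_intra_threads(physical_cores, inter_threads):
    """Return the largest intra_threads value such that
    inter_threads * intra_threads <= physical_cores."""
    if inter_threads <= 0 or inter_threads > physical_cores:
        raise ValueError("inter_threads must be between 1 and physical_cores")
    return physical_cores // inter_threads

# With 8 physical cores and 4 parallel workers (inter_threads=4),
# each worker should use at most 2 computation threads.
```

In this split, `inter_threads` controls how many inputs are processed in parallel while `intra_threads` controls the threads used inside a single computation, so oversubscribing their product causes the workers to compete for the same cores.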
## On GPU

- Use a larger batch size.
- Use an NVIDIA GPU with Tensor Cores (Compute Capability >= 7.0).
- Pass multiple GPU IDs to `device_index` to execute on multiple GPUs.
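To illustrate why passing several IDs to `device_index` increases throughput, here is a hypothetical sketch of dispatching batches across devices. CTranslate2 performs this dispatch internally; the function below is only an illustration, and its round-robin policy is an assumption.

```python
# Hypothetical sketch of spreading batches over the GPUs listed in
# device_index. CTranslate2 handles this internally; this function only
# illustrates how several device IDs allow batches to run in parallel.

def assign_round_robin(num_batches, device_index):
    """Map each batch number to a GPU ID, cycling through device_index."""
    return {b: device_index[b % len(device_index)] for b in range(num_batches)}

# Five batches over GPUs 0 and 1: the assigned devices alternate.
assignment = assign_round_robin(5, [0, 1])
```

With two devices, half of the batches land on each GPU, so the work is processed concurrently instead of queuing on a single device.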