# Performance tips
Below are some general recommendations to further improve performance.
## CPU

* Use int8 quantization
* Use an Intel CPU supporting AVX512
* If you are processing a large volume of data, prefer increasing `inter_threads` over `intra_threads`, and use the stream methods (methods whose name ends with `_file` or `_iterable`), as in the sketch after this list
* Make sure the total number of threads `inter_threads * intra_threads` is not larger than the number of physical cores
* For single-core execution on Intel CPUs, consider enabling packed GEMM (set the environment variable `CT2_USE_EXPERIMENTAL_PACKED_GEMM=1`)
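A minimal sketch of the threading and streaming recommendations above, assuming a converted model directory `ende_ctranslate2/` and a pre-tokenized input file `input.tok.txt` (both placeholder names):

```python
import ctranslate2

# Placeholder paths: adjust to your converted model and tokenized data.
translator = ctranslate2.Translator(
    "ende_ctranslate2/",
    device="cpu",
    compute_type="int8",  # int8 quantization
    inter_threads=4,      # number of batches translated in parallel
    intra_threads=2,      # threads per translation; keep 4 * 2 <= physical cores
)

# Stream a large file instead of loading it in memory: the file is read,
# batched, translated, and written incrementally.
translator.translate_file("input.tok.txt", "output.tok.txt")
```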
## GPU

* Use a larger batch size whenever possible
* Use an NVIDIA GPU with Tensor Cores (Compute Capability >= 7.0)
* Pass multiple GPU IDs to `device_index` to execute on multiple GPUs (see the sketch below)
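For example, a translator replicated on two GPUs (the model path and device indices are placeholders):

```python
import ctranslate2

translator = ctranslate2.Translator(
    "ende_ctranslate2/",     # placeholder model path
    device="cuda",
    compute_type="float16",  # FP16 runs on Tensor Cores (Compute Capability >= 7.0)
    device_index=[0, 1],     # load a model replica on GPU 0 and GPU 1
)
```

Batches submitted concurrently (for example through the stream methods) are then dispatched to the replicas in parallel.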
## Translator

* The default beam size for translation is 2, but consider setting `beam_size=1` to improve performance (these options are combined in the sketch after this list)
* When using a beam size of 1, keep `return_scores` disabled if you are not using the prediction scores: the final softmax layer can then be skipped
* Set `max_batch_size` and pass a larger batch to the `*_batch` methods: the input sentences will be sorted by length and split into chunks of `max_batch_size` elements for improved efficiency
* Prefer the "tokens" `batch_type` to make the total number of elements in a batch more constant
* Consider using dynamic vocabulary reduction for translation
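A sketch combining these settings in a single `translate_batch` call; the model path and token sequences are placeholders for your own converted model and tokenized input:

```python
import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/")  # placeholder path

# Placeholder tokenized sentences.
batch = [
    ["▁Hello", "▁world", "!"],
    ["▁How", "▁are", "▁you", "?"],
]

results = translator.translate_batch(
    batch,
    beam_size=1,          # greedy search instead of the default beam of 2
    return_scores=False,  # with beam_size=1, the final softmax is skipped
    max_batch_size=1024,  # sort by length and split into chunks of this size
    batch_type="tokens",  # count the batch size in tokens, not examples
)

print(results[0].hypotheses[0])
```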
See also: The WNGT 2020 efficiency task submission, which applies many of these recommendations to optimize machine translation models.
## Generator

* Set `include_prompt_in_result=False` so that the input prompt can be forwarded in the decoder at once (see the sketch after this list)
* If the model uses a system prompt, consider passing it to the argument `static_prompt` so that it is cached
* When using a beam size of 1, keep `return_scores` disabled if you are not using the prediction scores: the final softmax layer can then be skipped
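A sketch of these options together in a `generate_batch` call; the model path, prompt tokens, and system prompt below are placeholders:

```python
import ctranslate2

generator = ctranslate2.Generator("llm_ctranslate2/", device="cuda")  # placeholder path

# Placeholder token sequences: a system prompt reused across calls and a user prompt.
system_prompt = ["<|system|>", "▁You", "▁are", "▁helpful", "."]
prompt = ["<|user|>", "▁Hello", "!"]

results = generator.generate_batch(
    [prompt],
    max_length=128,
    beam_size=1,
    return_scores=False,             # skip the final softmax with greedy search
    include_prompt_in_result=False,  # forward the prompt in a single decoder pass
    static_prompt=system_prompt,     # computed once, then cached for later calls
)

print(results[0].sequences[0])
```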