# Performance tips
Below are some general recommendations to further improve performance.
## CPU

* Use int8 quantization
* Use an Intel CPU supporting AVX512
* If you are processing a large volume of data, prefer increasing `inter_threads` over `intra_threads`, and use the stream methods (methods whose name ends with `_file` or `_iterable`), as in the sketch after this list
* Make sure the total number of threads `inter_threads * intra_threads` is not larger than the number of physical cores
* For single-core execution on Intel CPUs, consider enabling packed GEMM (set the environment variable `CT2_USE_EXPERIMENTAL_PACKED_GEMM=1`)
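A minimal sketch of the threading and streaming recommendations above, assuming a converted model directory `ende_ctranslate2/` and a pre-tokenized input file `input.tok.txt` (both placeholder names):

```python
import ctranslate2

# Placeholder paths: adjust to your converted model and tokenized data.
translator = ctranslate2.Translator(
    "ende_ctranslate2/",
    device="cpu",
    compute_type="int8",  # int8 quantization
    inter_threads=4,      # number of batches translated in parallel
    intra_threads=2,      # threads per translation; keep 4 * 2 <= physical cores
)

# Stream a large file instead of loading it in memory: the file is read,
# batched, translated, and written incrementally.
translator.translate_file("input.tok.txt", "output.tok.txt")
```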
## GPU

* Use a larger batch size whenever possible
* Use an NVIDIA GPU with Tensor Cores (Compute Capability >= 7.0)
* Pass multiple GPU IDs to `device_index` to execute on multiple GPUs (see the sketch below)
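For example, a translator replicated on two GPUs (the model path and device indices are placeholders):

```python
import ctranslate2

translator = ctranslate2.Translator(
    "ende_ctranslate2/",     # placeholder model path
    device="cuda",
    compute_type="float16",  # FP16 runs on Tensor Cores (Compute Capability >= 7.0)
    device_index=[0, 1],     # load a model replica on GPU 0 and GPU 1
)
```

Batches submitted concurrently (for example through the stream methods) are then dispatched to the replicas in parallel.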
## Translator

* The default beam size for translation is 2, but consider setting `beam_size=1` to improve performance (these options are combined in the sketch after this list)
* When using a beam size of 1, keep `return_scores` disabled if you are not using the prediction scores: the final softmax layer can then be skipped
* Set `max_batch_size` and pass a larger batch to the `*_batch` methods: the input sentences will be sorted by length and split into chunks of `max_batch_size` elements for improved efficiency
* Prefer the "tokens" `batch_type` to make the total number of elements in a batch more constant
* Consider using dynamic vocabulary reduction for translation
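A sketch combining these settings in a single `translate_batch` call; the model path and token sequences are placeholders for your own converted model and tokenized input:

```python
import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/")  # placeholder path

# Placeholder tokenized sentences.
batch = [
    ["▁Hello", "▁world", "!"],
    ["▁How", "▁are", "▁you", "?"],
]

results = translator.translate_batch(
    batch,
    beam_size=1,          # greedy search instead of the default beam of 2
    return_scores=False,  # with beam_size=1, the final softmax is skipped
    max_batch_size=1024,  # sort by length and split into chunks of this size
    batch_type="tokens",  # count the batch size in tokens, not examples
)

print(results[0].hypotheses[0])
```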
See also: The WNGT 2020 efficiency task submission, which applies many of these recommendations to optimize machine translation models.
## Generator

* Set `include_prompt_in_result=False` so that the input prompt can be forwarded in the decoder at once (see the sketch after this list)
* If the model uses a system prompt, consider passing it to the argument `static_prompt` so that it is cached
* When using a beam size of 1, keep `return_scores` disabled if you are not using the prediction scores: the final softmax layer can then be skipped
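A sketch of these options together in a `generate_batch` call; the model path, prompt tokens, and system prompt below are placeholders:

```python
import ctranslate2

generator = ctranslate2.Generator("llm_ctranslate2/", device="cuda")  # placeholder path

# Placeholder token sequences: a system prompt reused across calls and a user prompt.
system_prompt = ["<|system|>", "▁You", "▁are", "▁helpful", "."]
prompt = ["<|user|>", "▁Hello", "!"]

results = generator.generate_batch(
    [prompt],
    max_length=128,
    beam_size=1,
    return_scores=False,             # skip the final softmax with greedy search
    include_prompt_in_result=False,  # forward the prompt in a single decoder pass
    static_prompt=system_prompt,     # computed once, then cached for later calls
)

print(results[0].sequences[0])
```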