Performance tips

Below are some general recommendations to further improve performance. Many of these recommendations were used in the WNGT 2020 efficiency task submission.

  • Set the compute type to “auto” to automatically select the fastest execution path on the current system

  • Reduce the beam size to the minimum value that meets your quality requirement

  • When using a beam size of 1, keep return_scores disabled if you are not using prediction scores: the final softmax layer can be skipped

  • Set max_batch_size and pass a larger batch to *_batch methods: the input sentences will be sorted by length and split into chunks of max_batch_size elements for improved efficiency

  • Prefer the “tokens” batch_type to make the total number of elements in a batch more constant

  • Consider using Dynamic vocabulary reduction for translation
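The decoding options above can be gathered in one place. The following is a minimal sketch, assuming the CTranslate2 Python API (`ctranslate2.Translator` and `translate_batch`); the helper names `fast_decoding_options` and `translate`, the default of 1024 for max_batch_size, and the `model_dir` argument are hypothetical and chosen only for illustration.

```python
def fast_decoding_options(beam_size=1, max_batch_size=1024, need_scores=False):
    """Build keyword arguments for translate_batch (hypothetical helper).

    The default values are illustrative assumptions, not recommendations
    from the library authors.
    """
    if beam_size < 1:
        raise ValueError("beam_size must be at least 1")
    return {
        # Smallest beam that still meets the quality requirement.
        "beam_size": beam_size,
        # Leave scores disabled unless they are actually consumed.
        "return_scores": need_scores,
        # With batch_type="tokens", max_batch_size counts tokens, not sentences.
        "max_batch_size": max_batch_size,
        "batch_type": "tokens",
    }


def translate(model_dir, tokenized_sentences):
    """Translate a large list of tokenized sentences in one call."""
    # Imported lazily so the helper above can be used without the package.
    import ctranslate2

    translator = ctranslate2.Translator(model_dir, device="auto", compute_type="auto")
    return translator.translate_batch(tokenized_sentences, **fast_decoding_options())
```

Passing the whole input to a single `translate_batch` call, rather than looping over small batches, is what lets the library sort by length and rebatch internally.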


CPU

  • Use an Intel CPU supporting AVX512

  • If you are processing a large volume of data, prefer increasing inter_threads over intra_threads and use stream methods (methods whose name ends with _file or _iterable)

  • Keep the total number of threads (inter_threads * intra_threads) at or below the number of physical cores

  • For single core execution on Intel CPUs, consider enabling packed GEMM (set the environment variable CT2_USE_EXPERIMENTAL_PACKED_GEMM=1)
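As an illustration of the threading advice above, the sketch below derives a thread layout from a core count and uses it with a stream method. It assumes the CTranslate2 Python API (`Translator` with `inter_threads` and `intra_threads`, and `translate_iterable`); the helpers `thread_layout` and `translate_stream` and their arguments are hypothetical. The number of physical cores is passed in explicitly because `os.cpu_count()` reports logical cores.

```python
def thread_layout(physical_cores, inter_threads):
    """Return (inter_threads, intra_threads) with a product <= physical_cores.

    Hypothetical helper: it favors inter_threads, as suggested for
    high-volume workloads, and gives the remaining cores to intra_threads.
    """
    if physical_cores < 1 or inter_threads < 1:
        raise ValueError("both arguments must be at least 1")
    inter = min(inter_threads, physical_cores)
    intra = max(1, physical_cores // inter)
    return inter, intra


def translate_stream(model_dir, tokenized_lines, physical_cores, inter_threads):
    """Lazily translate an iterable of tokenized sentences on CPU."""
    # Imported lazily so thread_layout can be used without the package.
    import ctranslate2

    inter, intra = thread_layout(physical_cores, inter_threads)
    translator = ctranslate2.Translator(
        model_dir, device="cpu", inter_threads=inter, intra_threads=intra
    )
    # translate_iterable consumes the input incrementally and yields results.
    return translator.translate_iterable(tokenized_lines, max_batch_size=64)
```

For example, `thread_layout(8, 4)` returns `(4, 2)`, and a request for more workers than cores, such as `thread_layout(4, 16)`, is capped at `(4, 1)`.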


GPU

  • Use a larger batch size

  • Use an NVIDIA GPU with Tensor Cores (Compute Capability >= 7.0)

  • Pass multiple GPU IDs to device_index to execute on multiple GPUs
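Multi-GPU execution can be configured roughly as follows. This is a sketch under the assumption that `ctranslate2.Translator` accepts a list for `device_index` and that `ctranslate2.get_cuda_device_count()` is available; the helpers `gpu_translator_kwargs` and `load_on_all_gpus` are hypothetical.

```python
def gpu_translator_kwargs(gpu_ids):
    """Build Translator keyword arguments for one or more GPUs (hypothetical helper)."""
    ids = list(gpu_ids)
    if not ids:
        raise ValueError("at least one GPU ID is required")
    if len(set(ids)) != len(ids):
        raise ValueError("GPU IDs must be unique")
    return {"device": "cuda", "device_index": ids, "compute_type": "auto"}


def load_on_all_gpus(model_dir):
    """Load the model on every visible GPU."""
    # Imported lazily so the helper above can be used without the package.
    import ctranslate2

    gpu_count = ctranslate2.get_cuda_device_count()
    return ctranslate2.Translator(model_dir, **gpu_translator_kwargs(range(gpu_count)))
```

The GPUs can only be busy at the same time if there is enough work to split, so this setup is most useful together with a large input and a max_batch_size that breaks it into several batches.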