Multithreading and parallelism
Intra-op multithreading on CPU
Most model operations (matmul, softmax, etc.) are using multiple threads on CPU. The number of threads can be configured with the parameter
intra_threads (the default value is 4):
translator = ctranslate2.Translator(model_path, device="cpu", intra_threads=8)
This multithreading is generally implemented with OpenMP so the threads behavior can also be customized with the different
OMP_* environment variables.
When OpenMP is disabled (which is the case for example in the Python ARM64 wheels for macOS), the multithreading is implemented with
# Create a CPU translator with 4 workers each using 1 intra-op thread: translator = ctranslate2.Translator(model_path, device="cpu", inter_threads=4, intra_threads=1) # Create a GPU translator with 4 workers each running on a separate GPU: translator = ctranslate2.Translator(model_path, device="cuda", device_index=[0, 1, 2, 3]) # Create a GPU translator with 4 workers each using a different CUDA stream: # (Note: depending on the workload and GPU specifications this may not improve the global throughput.) translator = ctranslate2.Translator(model_path, device="cuda", inter_threads=4)
When the workers are running on the same device, the model weights are shared to save on memory.
Multiple batches should be submitted concurrently to enable this parallelization. Parallel executions are enabled in the following cases:
When calling methods from multiple Python threads.
When calling methods multiple times with
When calling file-based or stream-based methods.
max_batch_size: the input will be split according to
max_batch_sizeand each sub-batch will be executed in parallel.
Parallelization with multiple Python threads is possible because all computation methods release the Python GIL.
Model and tensor parallelism
These types of parallelism are not yet implemented in CTranslate2.
Some methods can run asynchronously with
asynchronous=True. In this mode, the method returns immediately and the result can be retrieved later:
async_results =  for batch in batch_generator(): async_results.extend(translator.translate_batch(batch, asynchronous=True)) for async_result in async_results: print(async_result.result()) # This method blocks until the result is available.
Instances supporting asynchronous execution have a limited queue size by default. When the queue of batches is full, the method will block even with
asynchronous=True. See the parameter
max_queued_batches in their constructor to configure the queue size.