Multithreading and parallelism
Intra-op multithreading on CPU
Most model operations (matmul, softmax, etc.) use multiple threads on CPU. The number of threads can be configured with the parameter intra_threads (the default value is 4):
translator = ctranslate2.Translator(model_path, device="cpu", intra_threads=8)
This multithreading is generally implemented with OpenMP, so the thread behavior can also be customized with the various OMP_* environment variables.
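For example, here is a minimal sketch of customizing the OpenMP runtime from Python. OMP_WAIT_POLICY and OMP_PROC_BIND are standard OpenMP variables, but the values below are only illustrative, and the assumption is that they are set before the OpenMP runtime initializes (typically before the first computation):
import os

# Standard OpenMP variables; set them before the first computation so the
# OpenMP runtime sees them at initialization.
os.environ["OMP_WAIT_POLICY"] = "PASSIVE"  # idle threads yield the CPU instead of spinning
os.environ["OMP_PROC_BIND"] = "TRUE"       # pin the OpenMP threads to cores

import ctranslate2
translator = ctranslate2.Translator(model_path, device="cpu", intra_threads=8)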
When OpenMP is disabled (which is the case, for example, in the Python ARM64 wheels for macOS), the multithreading is implemented with BS::thread_pool.
Data parallelism
Objects running models, such as Translator and Generator, can be configured to process multiple batches in parallel, including on multiple GPUs:
# Create a CPU translator with 4 workers each using 1 intra-op thread:
translator = ctranslate2.Translator(model_path, device="cpu", inter_threads=4, intra_threads=1)
# Create a GPU translator with 4 workers each running on a separate GPU:
translator = ctranslate2.Translator(model_path, device="cuda", device_index=[0, 1, 2, 3])
# Create a GPU translator with 4 workers each using a different CUDA stream:
# (Note: depending on the workload and GPU specifications this may not improve the global throughput.)
translator = ctranslate2.Translator(model_path, device="cuda", inter_threads=4)
When the workers are running on the same device, the model weights are shared to save on memory.
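For example, a sketch of stacking workers on the same devices so the weights are loaded only once per GPU. This assumes a machine with 2 GPUs, and the repeated device indices are the assumption to verify here:
# Create 4 workers on 2 GPUs; the 2 workers sharing a GPU also share the model weights.
translator = ctranslate2.Translator(model_path, device="cuda", device_index=[0, 0, 1, 1])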
Multiple batches should be submitted concurrently to enable this parallelization. Parallel execution is enabled in the following cases:
- When calling methods from multiple Python threads (as sketched after this list).
- When calling methods multiple times with asynchronous=True.
- When calling file-based or stream-based methods.
- When setting max_batch_size: the input will be split according to max_batch_size and each sub-batch will be executed in parallel.
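For example, here is a minimal sketch of the first case, submitting batches from multiple Python threads (the token batch below is a hypothetical placeholder; real inputs depend on the model's tokenization):
import concurrent.futures
import ctranslate2

translator = ctranslate2.Translator(model_path, device="cpu", inter_threads=4, intra_threads=1)

# Hypothetical pre-tokenized batches; real tokens depend on the model's vocabulary.
batches = [[["▁Hello", "▁world", "!"]]] * 8

# Submit batches from multiple Python threads so the 4 workers run in parallel.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(translator.translate_batch, batch) for batch in batches]
    results = [future.result() for future in futures]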
Note: Parallelization with multiple Python threads is possible because all computation methods release the Python GIL.
Model and tensor parallelism
These types of parallelism are not yet implemented in CTranslate2.
Asynchronous execution
Some methods can run asynchronously with asynchronous=True. In this mode, the method returns immediately and the result can be retrieved later:
async_results = []
for batch in batch_generator():
    async_results.extend(translator.translate_batch(batch, asynchronous=True))

for async_result in async_results:
    print(async_result.result())  # This method blocks until the result is available.
Attention: Instances supporting asynchronous execution have a limited queue size by default. When the queue of batches is full, the method will block even with asynchronous=True. See the parameter max_queued_batches in their constructor to configure the queue size.
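For example, here is a minimal sketch of configuring the queue at construction time. The value 16 is arbitrary, and the exact semantics of special values (such as an automatic or unlimited size) are assumptions to verify against the constructor documentation:
# Hypothetical queue size: up to 16 batches can be queued before
# translate_batch blocks, even with asynchronous=True.
translator = ctranslate2.Translator(
    model_path,
    device="cpu",
    inter_threads=4,
    max_queued_batches=16,
)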