Multithreading and parallelism
CTranslate2 has 2 levels of parallelization:
inter_threads, which is the maximum number of batches executed in parallel.
=> Increase this value to increase the throughput.
intra_threads, which is the number of OpenMP threads used per batch.
=> Increase this value to decrease the latency on CPU.
The total number of computing threads launched by the process is
inter_threads * intra_threads.
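As a rough illustration of how the two settings combine, the sketch below derives an intra_threads value from a core count and a chosen inter_threads. The helper name plan_threads and the rule of dividing cores evenly between workers are assumptions made for this example; they are not part of the CTranslate2 API.

```python
import os


def plan_threads(inter_threads, cpu_cores=None):
    """Split the available cores evenly across parallel workers.

    Hypothetical helper: the even split is an illustrative heuristic,
    not a CTranslate2 function or an official recommendation.
    """
    if inter_threads < 1:
        raise ValueError("inter_threads must be at least 1")
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1
    # Each worker gets at least one thread, even on small machines.
    intra_threads = max(1, cpu_cores // inter_threads)
    # Total computing threads launched by the process.
    total_threads = inter_threads * intra_threads
    return intra_threads, total_threads


intra_threads, total_threads = plan_threads(inter_threads=4, cpu_cores=8)
print(intra_threads, total_threads)
```

The two values could then be passed as inter_threads and intra_threads when constructing the model object.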
Even though the model data are shared between parallel replicas, increasing
inter_threads will still increase the memory usage as some internal buffers are duplicated for thread safety.
On GPU, batches processed in parallel use separate CUDA streams. Depending on the workload and GPU specifications, this may or may not improve the overall throughput. For better parallelism on GPU, consider using multiple GPUs as described below.
Objects running models, such as the Translator and Generator, can be configured to process multiple batches in parallel, including on multiple GPUs:
    # Create a CPU translator with 4 workers each using 1 thread:
    translator = ctranslate2.Translator(model_path, device="cpu", inter_threads=4, intra_threads=1)

    # Create a GPU translator with 4 workers each running on a separate GPU:
    translator = ctranslate2.Translator(model_path, device="cuda", device_index=[0, 1, 2, 3])

    # Create a GPU translator with 4 workers each using a different CUDA stream:
    translator = ctranslate2.Translator(model_path, device="cuda", inter_threads=4)
Multiple batches should be submitted concurrently to enable this parallelization. Parallel executions are enabled in the following cases:
When calling methods from multiple Python threads.
When calling methods multiple times with asynchronous=True.
When calling file-based or stream-based methods.
When setting max_batch_size: the input will be split according to max_batch_size and each sub-batch will be executed in parallel.
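To make the max_batch_size case concrete, here is a minimal sketch of splitting one input into sub-batches. The function split_batch is hypothetical and only illustrates the idea of capping sub-batch size; the library's actual splitting may differ, for example in how it orders or counts the examples.

```python
def split_batch(examples, max_batch_size):
    """Yield consecutive sub-batches of at most max_batch_size examples.

    Hypothetical illustration only, not CTranslate2's implementation.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(examples), max_batch_size):
        yield examples[start:start + max_batch_size]


# Five tokenized examples submitted as a single input.
batch = [["Hello"], ["How", "are", "you"], ["Bye"], ["Thanks"], ["Yes"]]
sub_batches = list(split_batch(batch, max_batch_size=2))
print([len(sub_batch) for sub_batch in sub_batches])  # [2, 2, 1]
```

Each of the resulting sub-batches is a unit of work that a parallel worker could pick up independently.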
Parallelization with multiple Python threads is possible because all computation methods release the Python GIL.
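The multi-threaded case can be sketched with the standard library's thread pool. To keep the example runnable without a model, translate_batch below is a hypothetical stand-in that upper-cases tokens; in real use it would be replaced by a call on the model object. Because this stand-in is pure Python, it does not release the GIL and shows only the submission pattern, not an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_batch(batch):
    """Stand-in for a model call (hypothetical placeholder).

    It upper-cases every token so the sketch runs without a model.
    """
    return [[token.upper() for token in tokens] for tokens in batch]


batches = [
    [["hello", "world"]],
    [["good", "morning"]],
    [["see", "you"]],
    [["thank", "you"]],
]

# Four Python threads, so up to four batches are submitted concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in the same order as the submitted batches.
    results = list(pool.map(translate_batch, batches))

print(results[0])
```

Matching the number of Python threads to inter_threads is a design choice assumed here so that every worker can receive a batch at the same time.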
Some methods can run asynchronously with
asynchronous=True. In this mode, the method returns immediately and the result can be retrieved later:
    async_results = []
    for batch in batch_generator():
        async_results.extend(translator.translate_batch(batch, asynchronous=True))

    for async_result in async_results:
        print(async_result.result())  # This method blocks until the result is available.
Instances supporting asynchronous execution have a limited queue size by default. When the queue of batches is full, the method will block even with
asynchronous=True. See the parameter
max_queued_batches in their constructor to configure the queue size.
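The effect of a bounded queue can be illustrated with Python's standard queue module. This is only an analogy for the behavior described above, not the library's internal queue, and the limit of 2 is an arbitrary value chosen for the example.

```python
import queue

max_queued_batches = 2  # illustrative value, not the library default
pending = queue.Queue(maxsize=max_queued_batches)

accepted = []
rejected = []
for batch_id in range(4):
    try:
        # put_nowait() raises queue.Full where a blocking put() would wait.
        pending.put_nowait(batch_id)
        accepted.append(batch_id)
    except queue.Full:
        rejected.append(batch_id)

print(accepted, rejected)  # [0, 1] [2, 3]
```

With no consumer draining the queue, the third and fourth submissions find it full. A blocking submission would wait at that point until a worker frees a slot, which is why an asynchronous call can still block.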