Multithreading and parallelism
Intra-op multithreading on CPU
Most model operations (matmul, softmax, etc.) use multiple threads on CPU. The number of threads can be configured with the parameter intra_threads (the default value is 4):
translator = ctranslate2.Translator(model_path, device="cpu", intra_threads=8)
This multithreading is generally implemented with OpenMP, so the thread behavior can also be customized with the various OMP_* environment variables.
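For example, an OMP_* variable can be set before the OpenMP runtime starts (a minimal sketch; OMP_WAIT_POLICY is just one illustrative setting, and exporting the variable in the shell before launching Python is the safer option):
import os

# Assumption: this must run before the OpenMP runtime is initialized,
# i.e. before ctranslate2 is imported and the first CPU computation runs.
os.environ["OMP_WAIT_POLICY"] = "PASSIVE"

import ctranslate2

translator = ctranslate2.Translator(model_path, device="cpu", intra_threads=8)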
When OpenMP is disabled (which is the case, for example, in the Python ARM64 wheels for macOS), the multithreading is implemented with BS::thread_pool.
Data parallelism
Objects running models, such as Translator and Generator, can be configured to process multiple batches in parallel, including on multiple GPUs:
# Create a CPU translator with 4 workers each using 1 intra-op thread:
translator = ctranslate2.Translator(model_path, device="cpu", inter_threads=4, intra_threads=1)
# Create a GPU translator with 4 workers each running on a separate GPU:
translator = ctranslate2.Translator(model_path, device="cuda", device_index=[0, 1, 2, 3])
# Create a GPU translator with 4 workers each using a different CUDA stream:
# (Note: depending on the workload and GPU specifications this may not improve the global throughput.)
translator = ctranslate2.Translator(model_path, device="cuda", inter_threads=4)
When the workers are running on the same device, the model weights are shared to save on memory.
Multiple batches should be submitted concurrently to enable this parallelization. Parallel executions are enabled in the following cases:
When calling methods from multiple Python threads.
When calling methods multiple times with asynchronous=True.
When calling file-based or stream-based methods.
When setting max_batch_size: the input will be split according to max_batch_size and each sub-batch will be executed in parallel.
Note
Parallelization with multiple Python threads is possible because all computation methods release the Python GIL.
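For example, batches can be submitted from multiple Python threads with a standard thread pool (a minimal sketch; model_path and batch_generator are placeholders):
import concurrent.futures

import ctranslate2

# 4 workers can process batches in parallel because translate_batch releases the GIL.
translator = ctranslate2.Translator(model_path, device="cpu", inter_threads=4, intra_threads=1)

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(translator.translate_batch, batch) for batch in batch_generator()]
    results = [future.result() for future in futures]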
Model and tensor parallelism
Models used with Translator and Generator can be split across multiple GPUs. This is very useful when the model is too big to fit on a single GPU.
translator = ctranslate2.Translator(model_path, device="cuda", tensor_parallel=True)
Set up the environment:
Install Open MPI.
Configure Open MPI by creating a hostfile, for example:
[ipaddress or dns] slots=nbGPU1
[other ipaddress or dns] slots=nbGPU2
Run:
Run the application with multiple processes to use tensor parallelism:
mpirun -np nbGPUExpected -hostfile hostfile python3 script
If you’re trying to use tensor parallelism across multiple machines, some additional configuration is needed:
Make sure the master and slave nodes can connect to each other over SSH with public key authentication.
Export all the necessary environment variables from the master to the slaves, as in the example below:
mpirun -x VIRTUAL_ENV_PROMPT -x PATH -x VIRTUAL_ENV -x _ -x LD_LIBRARY_PATH -np nbGPUExpected -hostfile hostfile python3 script
See the Open MPI documentation for more information.
In this mode, the application runs as multiple processes. Actions such as printing can be restricted to the master process with:
if ctranslate2.MpiInfo.getCurRank() == 0:
    print(...)
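Putting these pieces together, a script launched with mpirun might look like the following (a minimal sketch; model_path and the example tokens are placeholders):
import ctranslate2

# Each MPI process loads its shard of the model.
translator = ctranslate2.Translator(model_path, device="cuda", tensor_parallel=True)

# Placeholder input: a single tokenized sentence.
results = translator.translate_batch([["Hello", "world", "!"]])

# Only the master process (rank 0) prints the result.
if ctranslate2.MpiInfo.getCurRank() == 0:
    print(results[0].hypotheses[0])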
Note
Running a model in tensor parallel mode on a single machine can boost performance, but sharing the model between multiple machines may be slower because of the network latency between them.
Note
In tensor parallel mode, inter_threads is still supported to run multiple workers. However, device_index no longer has any effect, because tensor parallel mode only considers the GPUs available on the system and the number of GPUs you want to use.
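For example, the following should run two workers that share the tensor-parallel model (a minimal sketch based on the note above; model_path is a placeholder):
translator = ctranslate2.Translator(model_path, device="cuda", tensor_parallel=True, inter_threads=2)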
Asynchronous execution
Some methods can run asynchronously with asynchronous=True. In this mode, the method returns immediately and the result can be retrieved later:
async_results = []
for batch in batch_generator():
    async_results.extend(translator.translate_batch(batch, asynchronous=True))

for async_result in async_results:
    print(async_result.result())  # This method blocks until the result is available.
Attention
Instances supporting asynchronous execution have a limited queue size by default. When the queue of batches is full, the method will block even with asynchronous=True. See the parameter max_queued_batches in their constructor to configure the queue size.
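For example, the queue size could be raised when many batches are submitted asynchronously (a minimal sketch; the value 64 is arbitrary and the accepted values are described in the constructor documentation):
translator = ctranslate2.Translator(model_path, device="cpu", inter_threads=4, max_queued_batches=64)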