Encoder
- class ctranslate2.Encoder
A text encoder.
Example
>>> encoder = ctranslate2.Encoder("model/", device="cpu")
>>> encoder.forward_batch([["▁Hello", "▁world", "!"]])
Inherits from:
pybind11_builtins.pybind11_object
Methods:
- __init__(model_path: str, device: str = 'cpu', *, device_index: Union[int, List[int]] = 0, compute_type: Union[str, Dict[str, str]] = 'default', inter_threads: int = 1, intra_threads: int = 0, max_queued_batches: int = 0, flash_attention: bool = False, tensor_parallel: bool = False, files: object = None) -> None
Initializes the encoder.
- Parameters
model_path – Path to the CTranslate2 model directory.
device – Device to use (possible values are: cpu, cuda, auto).
device_index – Device ID(s) on which to place this encoder.
compute_type – Model computation type or a dictionary mapping a device name to the computation type (possible values are: default, auto, int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32).
inter_threads – Maximum number of parallel generations.
intra_threads – Number of OpenMP threads per encoder (0 to use a default value).
max_queued_batches – Maximum numbers of batches in the queue (-1 for unlimited, 0 for an automatic value). When the queue is full, future requests will block until a free slot is available.
flash_attention – Run the model with Flash Attention 2 for the self-attention layers.
tensor_parallel – Run the model in tensor parallel mode.
files – Load model files from memory. This argument is a dictionary mapping file names to file contents as file-like or bytes objects. If this is set, model_path acts as an identifier for this model.
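The files argument above can be sketched as follows. The file names and contents here are purely illustrative placeholders, not the actual layout of a CTranslate2 model directory:

```python
import io

# Hypothetical in-memory model files: a dictionary mapping file names to
# file-like objects or raw bytes (names and contents are illustrative).
model_files = {
    "model.bin": io.BytesIO(b"\x00\x01\x02"),  # file-like object
    "config.json": b"{}",                      # raw bytes also accepted
}

# With `files` set, `model_path` is only an identifier for this model:
# encoder = ctranslate2.Encoder("my-model", device="cpu", files=model_files)
```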
- forward_batch(inputs: Union[List[List[str]], List[List[int]], StorageView], lengths: Optional[StorageView] = None, token_type_ids: Optional[List[List[int]]] = None) -> EncoderForwardOutput
Forwards a batch of sequences in the encoder.
- Parameters
inputs – A batch of sequences either as string tokens or token IDs. This argument can also be a dense int32 array with shape [batch_size, max_length] (e.g. created from a Numpy array or PyTorch tensor).
lengths – The length of each sequence as an int32 array with shape [batch_size]. Required when inputs is a dense array.
token_type_ids – A batch of token type IDs with the same shape as inputs: [batch_size, max_length].
- Returns
The encoder model output.
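Building the dense inputs and lengths arrays described above can be sketched with NumPy. The token IDs are made-up values; the final forward call is shown commented out since it needs a loaded encoder:

```python
import numpy as np

# Two token-ID sequences of different lengths (values are illustrative).
batch = [[101, 2054, 2003, 102], [101, 2129, 102]]

max_length = max(len(seq) for seq in batch)
# Length of each sequence: int32 array with shape [batch_size].
lengths = np.array([len(seq) for seq in batch], dtype=np.int32)
# Dense padded input: int32 array with shape [batch_size, max_length].
inputs = np.zeros((len(batch), max_length), dtype=np.int32)
for i, seq in enumerate(batch):
    inputs[i, : len(seq)] = seq

# `lengths` is required when `inputs` is a dense array:
# output = encoder.forward_batch(inputs, lengths=lengths)
```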
- load_model(keep_cache: bool = False) -> None
Loads the model back to the initial device.
- Parameters
keep_cache – If True, the model cache in CPU memory is not deleted if it exists.
- unload_model(to_cpu: bool = False) -> None
Unloads the model attached to this encoder but keeps enough runtime context to quickly resume encoding on the initial device.
- Parameters
to_cpu – If True, the model is moved to CPU memory instead of being fully unloaded.
- property compute_type
Computation type used by the model.
- property device
Device this encoder is running on.
- property device_index
List of device IDs on which this encoder is running.
- property model_is_loaded
Whether the model is loaded on the initial device and ready to be used.
- property num_active_batches
Number of batches waiting to be processed or currently processed.
- property num_encoders
Number of encoders backing this instance.
- property num_queued_batches
Number of batches waiting to be processed.
- property tensor_parallel
Run model with tensor parallel mode.