Translator

class ctranslate2.Translator

A text translator.

Example

>>> translator = ctranslate2.Translator("model/", device="cpu")
>>> translator.translate_batch([["▁Hello", "▁world", "!"]])

Inherits from: pybind11_builtins.pybind11_object

__init__(model_path: str, device: str = 'cpu', *, device_index: Union[int, List[int]] = 0, compute_type: Union[str, Dict[str, str]] = 'default', inter_threads: int = 1, intra_threads: int = 0, max_queued_batches: int = 0, flash_attention: bool = False, tensor_parallel: bool = False, files: object = None) → None

Initializes the translator.

Parameters
  • model_path – Path to the CTranslate2 model directory.

  • device – Device to use (possible values are: cpu, cuda, auto).

  • device_index – Device IDs on which to place this translator.

  • compute_type – Model computation type or a dictionary mapping a device name to the computation type (possible values are: default, auto, int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32).

  • inter_threads – Maximum number of parallel translations.

  • intra_threads – Number of OpenMP threads per translator (0 to use a default value).

  • max_queued_batches – Maximum numbers of batches in the queue (-1 for unlimited, 0 for an automatic value). When the queue is full, future requests will block until a free slot is available.

  • flash_attention – Run the model with Flash Attention 2 for the self-attention layers.

  • tensor_parallel – Run the model in tensor parallel mode.

  • files – Load model files from memory. This argument is a dictionary mapping file names to file contents as file-like or bytes objects. If this is set, model_path acts as an identifier for this model.
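
As an illustrative sketch of these options (the model directory, devices, and thread counts below are placeholders to adapt to your setup):

import ctranslate2

# Hypothetical model directory; adjust device and compute_type to your hardware.
translator = ctranslate2.Translator(
    "model/",
    device="cuda",
    device_index=[0, 1],          # place workers on two GPUs
    compute_type="int8_float16",  # quantized weights with float16 activations
    inter_threads=2,              # up to 2 translations run in parallel
    intra_threads=4,              # OpenMP threads per worker
)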

generate_tokens(source: List[str], target_prefix: Optional[List[str]] = None, *, max_decoding_length: int = 256, min_decoding_length: int = 1, sampling_topk: int = 1, sampling_topp: float = 1, sampling_temperature: float = 1, return_log_prob: bool = False, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, disable_unk: bool = False, suppress_sequences: Optional[List[List[str]]] = None, end_token: Optional[Union[str, List[str], List[int]]] = None, max_input_length: int = 1024, use_vmap: bool = False) → Iterable[GenerationStepResult]

Yields tokens as they are generated by the model.

Parameters
  • source – Source tokens.

  • target_prefix – Optional target prefix tokens.

  • max_decoding_length – Maximum prediction length.

  • min_decoding_length – Minimum prediction length.

  • sampling_topk – Randomly sample predictions from the top K candidates.

  • sampling_topp – Keep the most probable tokens whose cumulative probability exceeds this value.

  • sampling_temperature – Sampling temperature to generate more random samples.

  • return_log_prob – Include the token log probability in the result.

  • repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize).

  • no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable).

  • disable_unk – Disable the generation of the unknown token.

  • suppress_sequences – Disable the generation of some sequences of tokens.

  • end_token – Stop the decoding on one of these tokens (defaults to the model EOS token).

  • max_input_length – Truncate inputs after this many tokens (set 0 to disable).

  • use_vmap – Use the vocabulary mapping file saved in this model.

Returns

A generator iterator over ctranslate2.GenerationStepResult instances.

Note

This generation method is not compatible with beam search, which requires a complete decoding.
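
A minimal streaming sketch; the SentencePiece-style tokens are placeholders for the output of your own tokenizer:

# Stream tokens as they are produced (greedy or sampling decoding only).
step_results = translator.generate_tokens(
    ["▁Hello", "▁world", "!"],
    max_decoding_length=256,
    return_log_prob=True,
)

for step in step_results:
    # Each item is a ctranslate2.GenerationStepResult.
    print(step.step, step.token, step.log_prob)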

load_model(keep_cache: bool = False) → None

Loads the model back to the initial device.

Parameters

keep_cache – If True, the model cache in the CPU memory is not deleted if it exists.

score_batch(source: List[List[str]], target: List[List[str]], *, max_batch_size: int = 0, batch_type: str = 'examples', max_input_length: int = 1024, offset: int = 0, asynchronous: bool = False) → Union[List[ScoringResult], List[AsyncScoringResult]]

Scores a batch of parallel tokens.

Parameters
  • source – Batch of source tokens.

  • target – Batch of target tokens.

  • max_batch_size – The maximum batch size. If the number of inputs is greater than max_batch_size, the inputs are sorted by length and split by chunks of max_batch_size examples so that the number of padding positions is minimized.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • max_input_length – Truncate inputs after this many tokens (0 to disable).

  • offset – Ignore the first n tokens of the target in the score calculation.

  • asynchronous – Run the scoring asynchronously.

Returns

A list of scoring results.
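
For illustration, a small sketch that scores one source/target pair and reads the token-level log probabilities (the tokens are placeholders):

results = translator.score_batch(
    [["▁Hello", "▁world", "!"]],
    [["▁Hallo", "▁Welt", "!"]],
)

result = results[0]  # ctranslate2.ScoringResult
# Average log probability over the scored target tokens.
avg_log_prob = sum(result.log_probs) / len(result.log_probs)
print(result.tokens, avg_log_prob)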

score_file(source_path: str, target_path: str, output_path: str, *, max_batch_size: int = 32, read_batch_size: int = 0, batch_type: str = 'examples', max_input_length: int = 1024, offset: int = 0, with_tokens_score: bool = False, source_tokenize_fn: Callable[[str], List[str]] = None, target_tokenize_fn: Callable[[str], List[str]] = None, target_detokenize_fn: Callable[[List[str]], str] = None) → ExecutionStats

Scores a parallel tokenized text file.

Each line in output_path will have the format:

<score> ||| <target> [||| <score_token_0> <score_token_1> ...]

The score is normalized by the target length, which includes the end-of-sentence token </s>.

Parameters
  • source_path – Path to the source file.

  • target_path – Path to the target file.

  • output_path – Path to the output file.

  • max_batch_size – The maximum batch size.

  • read_batch_size – The number of examples to read from the file before sorting by length and splitting by chunks of max_batch_size examples (set 0 for an automatic value).

  • batch_type – Whether max_batch_size and read_batch_size are the number of “examples” or “tokens”.

  • max_input_length – Truncate inputs after this many tokens (0 to disable).

  • offset – Ignore the first n tokens of the target in the score calculation.

  • with_tokens_score – Include the token-level scores in the output file.

  • source_tokenize_fn – Function to tokenize source lines.

  • target_tokenize_fn – Function to tokenize target lines.

  • target_detokenize_fn – Function to detokenize target outputs.

Returns

A statistics object.
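
A possible invocation, assuming whitespace-tokenized source.txt and target.txt files (the paths and tokenization functions are placeholders):

stats = translator.score_file(
    "source.txt",
    "target.txt",
    "scores.txt",
    with_tokens_score=True,
    source_tokenize_fn=lambda line: line.strip().split(),
    target_tokenize_fn=lambda line: line.strip().split(),
    target_detokenize_fn=lambda tokens: " ".join(tokens),
)

# ExecutionStats exposes counters such as num_examples, num_tokens, and total_time_in_ms.
print(stats.num_examples, stats.num_tokens, stats.total_time_in_ms)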

score_iterable(source: Iterable[List[str]], target: Iterable[List[str]], max_batch_size: int = 64, batch_type: str = 'examples', **kwargs) → Iterable[ScoringResult]

Scores an iterable of tokenized examples.

This method is built on top of ctranslate2.Translator.score_batch() to efficiently score an arbitrarily large stream of data. It enables the following optimizations:

  • stream processing (the iterable is not fully materialized in memory)

  • parallel scoring (if the translator has multiple workers)

  • asynchronous batch prefetching

  • local sorting by length

Parameters
  • source – An iterable of tokenized source examples.

  • target – An iterable of tokenized target examples.

  • max_batch_size – The maximum batch size.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • **kwargs – Any scoring options accepted by ctranslate2.Translator.score_batch().

Returns

A generator iterator over ctranslate2.ScoringResult instances.
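
Analogous to the translate_iterable example further below, a sketch that streams scores for two parallel text files (file names and tokenization are placeholders):

tokenize_fn = lambda line: line.strip().split()

with open("source.txt") as src_file, open("target.txt") as tgt_file:
    source = map(tokenize_fn, src_file)
    target = map(tokenize_fn, tgt_file)

    for result in translator.score_iterable(source, target, max_batch_size=64):
        print(sum(result.log_probs) / len(result.log_probs))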

translate_batch(source: List[List[str]], target_prefix: Optional[List[Optional[List[str]]]] = None, *, max_batch_size: int = 0, batch_type: str = 'examples', asynchronous: bool = False, beam_size: int = 2, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, coverage_penalty: float = 0, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, disable_unk: bool = False, suppress_sequences: Optional[List[List[str]]] = None, end_token: Optional[Union[str, List[str], List[int]]] = None, return_end_token: bool = False, prefix_bias_beta: float = 0, max_input_length: int = 1024, max_decoding_length: int = 256, min_decoding_length: int = 1, use_vmap: bool = False, return_scores: bool = False, return_attention: bool = False, return_alternatives: bool = False, min_alternative_expansion_prob: float = 0, sampling_topk: int = 1, sampling_topp: float = 1, sampling_temperature: float = 1, replace_unknowns: bool = False, callback: Callable[[GenerationStepResult], bool] = None) → Union[List[TranslationResult], List[AsyncTranslationResult]]

Translates a batch of tokens.

Parameters
  • source – Batch of source tokens.

  • target_prefix – Optional batch of target prefix tokens.

  • max_batch_size – The maximum batch size. If the number of inputs is greater than max_batch_size, the inputs are sorted by length and split by chunks of max_batch_size examples so that the number of padding positions is minimized.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • asynchronous – Run the translation asynchronously.

  • beam_size – Beam size (1 for greedy search).

  • patience – Beam search patience factor, as described in https://arxiv.org/abs/2204.05424. The decoding will continue until beam_size*patience hypotheses are finished.

  • num_hypotheses – Number of hypotheses to return.

  • length_penalty – Exponential penalty applied to the length during beam search.

  • coverage_penalty – Coverage penalty weight applied during beam search.

  • repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize).

  • no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable).

  • disable_unk – Disable the generation of the unknown token.

  • suppress_sequences – Disable the generation of some sequences of tokens.

  • end_token – Stop the decoding on one of these tokens (defaults to the model EOS token).

  • return_end_token – Include the end token in the results.

  • prefix_bias_beta – Parameter for biasing translations towards the given prefix.

  • max_input_length – Truncate inputs after this many tokens (set 0 to disable).

  • max_decoding_length – Maximum prediction length.

  • min_decoding_length – Minimum prediction length.

  • use_vmap – Use the vocabulary mapping file saved in this model.

  • return_scores – Include the scores in the output.

  • return_attention – Include the attention vectors in the output.

  • return_alternatives – Return alternatives at the first unconstrained decoding position.

  • min_alternative_expansion_prob – Minimum initial probability to expand an alternative.

  • sampling_topk – Randomly sample predictions from the top K candidates.

  • sampling_topp – Keep the most probable tokens whose cumulative probability exceeds this value.

  • sampling_temperature – Sampling temperature to generate more random samples.

  • replace_unknowns – Replace unknown target tokens by the source token with the highest attention.

  • callback – Optional function that is called for each generated token when beam_size is 1. If the callback function returns True, the decoding will stop for this batch.

Returns

A list of translation results.

See also

TranslationOptions structure in the C++ library.
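
A non-authoritative sketch of the most common options: the call below requests two hypotheses with their scores (the tokens are placeholders for your tokenizer's output):

results = translator.translate_batch(
    [["▁Hello", "▁world", "!"]],
    beam_size=4,
    num_hypotheses=2,
    return_scores=True,
)

result = results[0]  # ctranslate2.TranslationResult
for tokens, score in zip(result.hypotheses, result.scores):
    print(score, tokens)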

translate_file(source_path: str, output_path: str, target_path: Optional[str] = None, *, max_batch_size: int = 32, read_batch_size: int = 0, batch_type: str = 'examples', beam_size: int = 2, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, coverage_penalty: float = 0, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, disable_unk: bool = False, suppress_sequences: Optional[List[List[str]]] = None, end_token: Optional[Union[str, List[str], List[int]]] = None, prefix_bias_beta: float = 0, max_input_length: int = 1024, max_decoding_length: int = 256, min_decoding_length: int = 1, use_vmap: bool = False, with_scores: bool = False, sampling_topk: int = 1, sampling_topp: float = 1, sampling_temperature: float = 1, replace_unknowns: bool = False, source_tokenize_fn: Callable[[str], List[str]] = None, target_tokenize_fn: Callable[[str], List[str]] = None, target_detokenize_fn: Callable[[List[str]], str] = None) → ExecutionStats

Translates a tokenized text file.

Parameters
  • source_path – Path to the source file.

  • output_path – Path to the output file.

  • target_path – Path to the target prefix file.

  • max_batch_size – The maximum batch size.

  • read_batch_size – The number of examples to read from the file before sorting by length and splitting by chunks of max_batch_size examples (set 0 for an automatic value).

  • batch_type – Whether max_batch_size and read_batch_size are the number of “examples” or “tokens”.

  • beam_size – Beam size (1 for greedy search).

  • patience – Beam search patience factor, as described in https://arxiv.org/abs/2204.05424. The decoding will continue until beam_size*patience hypotheses are finished.

  • num_hypotheses – Number of hypotheses to return.

  • length_penalty – Exponential penalty applied to the length during beam search.

  • coverage_penalty – Coverage penalty weight applied during beam search.

  • repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize).

  • no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable).

  • disable_unk – Disable the generation of the unknown token.

  • suppress_sequences – Disable the generation of some sequences of tokens.

  • end_token – Stop the decoding on one of these tokens (defaults to the model EOS token).

  • prefix_bias_beta – Parameter for biasing translations towards the given prefix.

  • max_input_length – Truncate inputs after this many tokens (set 0 to disable).

  • max_decoding_length – Maximum prediction length.

  • min_decoding_length – Minimum prediction length.

  • use_vmap – Use the vocabulary mapping file saved in this model.

  • with_scores – Include the scores in the output.

  • sampling_topk – Randomly sample predictions from the top K candidates.

  • sampling_topp – Keep the most probable tokens whose cumulative probability exceeds this value.

  • sampling_temperature – Sampling temperature to generate more random samples.

  • replace_unknowns – Replace unknown target tokens by the source token with the highest attention.

  • source_tokenize_fn – Function to tokenize source lines.

  • target_tokenize_fn – Function to tokenize target lines.

  • target_detokenize_fn – Function to detokenize target outputs.

Returns

A statistics object.

See also

TranslationOptions structure in the C++ library.
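
A possible call, assuming a whitespace-tokenized input.txt (the paths and tokenization functions are placeholders):

stats = translator.translate_file(
    "input.txt",
    "output.txt",
    beam_size=4,
    with_scores=True,
    source_tokenize_fn=lambda line: line.strip().split(),
    target_detokenize_fn=lambda tokens: " ".join(tokens),
)

print(stats.num_tokens, stats.total_time_in_ms)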

translate_iterable(source: Iterable[List[str]], target_prefix: Optional[Iterable[List[str]]] = None, max_batch_size: int = 32, batch_type: str = 'examples', **kwargs) → Iterable[TranslationResult]

Translates an iterable of tokenized examples.

This method is built on top of ctranslate2.Translator.translate_batch() to efficiently translate an arbitrarily large stream of data. It enables the following optimizations:

  • stream processing (the iterable is not fully materialized in memory)

  • parallel translations (if the translator has multiple workers)

  • asynchronous batch prefetching

  • local sorting by length

Parameters
  • source – An iterable of tokenized source examples.

  • target_prefix – An optional iterable of tokenized target prefixes.

  • max_batch_size – The maximum batch size.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • **kwargs – Any translation options accepted by ctranslate2.Translator.translate_batch().

Returns

A generator iterator over ctranslate2.TranslationResult instances.

Example

This method can be used to efficiently translate text files:

# Replace with your own tokenization and detokenization functions.
tokenize_fn = lambda line: line.strip().split()
detokenize_fn = lambda tokens: " ".join(tokens)

with open("input.txt") as input_file:
    source = map(tokenize_fn, input_file)
    results = translator.translate_iterable(source, max_batch_size=64)

    for result in results:
        tokens = result.hypotheses[0]
        target = detokenize_fn(tokens)
        print(target)

unload_model(to_cpu: bool = False) → None

Unloads the model attached to this translator but keeps enough runtime context to quickly resume translation on the initial device. The model is not guaranteed to be unloaded if translations are running concurrently.

Parameters

to_cpu – If True, the model is moved to the CPU memory and not fully unloaded.
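
A sketch of an unload/reload cycle that keeps the weights cached in CPU memory so the model can quickly return to the initial device:

translator.unload_model(to_cpu=True)    # free device memory, keep a CPU copy
assert not translator.model_is_loaded

translator.load_model(keep_cache=True)  # move the cached weights back
assert translator.model_is_loaded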

property compute_type

Computation type used by the model.

property device

Device this translator is running on.

property device_index

List of device IDs on which this translator is running.

property model_is_loaded

Whether the model is loaded on the initial device and ready to be used.

property num_active_batches

Number of batches waiting to be processed or currently processed.

property num_queued_batches

Number of batches waiting to be processed.

property num_translators

Number of translators backing this instance.

property tensor_parallel

Whether the model runs in tensor parallel mode.