Translator

class ctranslate2.Translator

A text translator.

Example

>>> translator = ctranslate2.Translator("model/", device="cpu")
>>> translator.translate_batch([["▁Hello", "▁world", "!"]])

Inherits from: pybind11_builtins.pybind11_object

__init__(model_path: str, device: str = 'cpu', *, device_index: Union[int, List[int]] = 0, compute_type: Union[str, Dict[str, str]] = 'default', inter_threads: int = 1, intra_threads: int = 0, max_queued_batches: int = 0, flash_attention: bool = False, tensor_parallel: bool = False, files: object = None) → None

Initializes the translator.

Parameters
  • model_path – Path to the CTranslate2 model directory.

  • device – Device to use (possible values are: cpu, cuda, auto).

  • device_index – Device IDs on which to place this translator.

  • compute_type – Model computation type or a dictionary mapping a device name to the computation type (possible values are: default, auto, int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32).

  • inter_threads – Maximum number of parallel translations.

  • intra_threads – Number of OpenMP threads per translator (0 to use a default value).

  • max_queued_batches – Maximum numbers of batches in the queue (-1 for unlimited, 0 for an automatic value). When the queue is full, future requests will block until a free slot is available.

  • flash_attention – Run the model with Flash Attention 2 for the self-attention layers.

  • tensor_parallel – Run the model in tensor parallel mode.

  • files – Load model files from memory. This argument is a dictionary mapping file names to file contents as file-like or bytes objects. If this is set, model_path acts as an identifier for this model.
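
As an illustrative sketch of these options (the model directory, devices, and thread counts below are placeholders to adapt to your setup):

import ctranslate2

# Hypothetical model directory; adjust device and compute_type to your hardware.
translator = ctranslate2.Translator(
    "model/",
    device="cuda",
    device_index=[0, 1],          # place workers on two GPUs
    compute_type="int8_float16",  # quantized weights with float16 activations
    inter_threads=2,              # up to 2 translations run in parallel
    intra_threads=4,              # OpenMP threads per worker
)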

generate_tokens(source: List[str], target_prefix: Optional[List[str]] = None, *, max_decoding_length: int = 256, min_decoding_length: int = 1, sampling_topk: int = 1, sampling_topp: float = 1, sampling_temperature: float = 1, return_log_prob: bool = False, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, disable_unk: bool = False, suppress_sequences: Optional[List[List[str]]] = None, end_token: Optional[Union[str, List[str], List[int]]] = None, max_input_length: int = 1024, use_vmap: bool = False) → Iterable[GenerationStepResult]

Yields tokens as they are generated by the model.

Parameters
  • source – Source tokens.

  • target_prefix – Optional target prefix tokens.

  • max_decoding_length – Maximum prediction length.

  • min_decoding_length – Minimum prediction length.

  • sampling_topk – Randomly sample predictions from the top K candidates.

  • sampling_topp – Keep the most probable tokens whose cumulative probability exceeds this value.

  • sampling_temperature – Sampling temperature to generate more random samples.

  • return_log_prob – Include the token log probability in the result.

  • repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize).

  • no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable).

  • disable_unk – Disable the generation of the unknown token.

  • suppress_sequences – Disable the generation of some sequences of tokens.

  • end_token – Stop the decoding on one of these tokens (defaults to the model EOS token).

  • max_input_length – Truncate inputs after this many tokens (set 0 to disable).

  • use_vmap – Use the vocabulary mapping file saved in this model.

Returns

A generator iterator over ctranslate2.GenerationStepResult instances.

Note

This generation method is not compatible with beam search, which requires a complete decoding.
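
A minimal streaming sketch; the SentencePiece-style tokens are placeholders for the output of your own tokenizer:

# Stream tokens as they are produced (greedy or sampling decoding only).
step_results = translator.generate_tokens(
    ["▁Hello", "▁world", "!"],
    max_decoding_length=256,
    return_log_prob=True,
)

for step in step_results:
    # Each item is a ctranslate2.GenerationStepResult.
    print(step.step, step.token, step.log_prob)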

load_model(keep_cache: bool = False) → None

Loads the model back to the initial device.

Parameters

keep_cache – If True, the model cache in the CPU memory is not deleted if it exists.

score_batch(source: List[List[str]], target: List[List[str]], *, max_batch_size: int = 0, batch_type: str = 'examples', max_input_length: int = 1024, offset: int = 0, asynchronous: bool = False) → Union[List[ScoringResult], List[AsyncScoringResult]]

Scores a batch of parallel tokens.

Parameters
  • source – Batch of source tokens.

  • target – Batch of target tokens.

  • max_batch_size – The maximum batch size. If the number of inputs is greater than max_batch_size, the inputs are sorted by length and split by chunks of max_batch_size examples so that the number of padding positions is minimized.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • max_input_length – Truncate inputs after this many tokens (0 to disable).

  • offset – Ignore the first n tokens of the target in the score calculation.

  • asynchronous – Run the scoring asynchronously.

Returns

A list of scoring results.
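
For illustration, a small sketch that scores one source/target pair and reads the token-level log probabilities (the tokens are placeholders):

results = translator.score_batch(
    [["▁Hello", "▁world", "!"]],
    [["▁Hallo", "▁Welt", "!"]],
)

result = results[0]  # ctranslate2.ScoringResult
# Average log probability over the scored target tokens.
avg_log_prob = sum(result.log_probs) / len(result.log_probs)
print(result.tokens, avg_log_prob)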

score_file(source_path: str, target_path: str, output_path: str, *, max_batch_size: int = 32, read_batch_size: int = 0, batch_type: str = 'examples', max_input_length: int = 1024, offset: int = 0, with_tokens_score: bool = False, source_tokenize_fn: Callable[[str], List[str]] = None, target_tokenize_fn: Callable[[str], List[str]] = None, target_detokenize_fn: Callable[[List[str]], str] = None) → ExecutionStats

Scores a parallel tokenized text file.

Each line in output_path will have the format:

<score> ||| <target> [||| <score_token_0> <score_token_1> ...]

The score is normalized by the target length, which includes the end-of-sentence token </s>.

Parameters
  • source_path – Path to the source file.

  • target_path – Path to the target file.

  • output_path – Path to the output file.

  • max_batch_size – The maximum batch size.

  • read_batch_size – The number of examples to read from the file before sorting by length and splitting by chunks of max_batch_size examples (set 0 for an automatic value).

  • batch_type – Whether max_batch_size and read_batch_size are the number of “examples” or “tokens”.

  • max_input_length – Truncate inputs after this many tokens (0 to disable).

  • offset – Ignore the first n tokens of the target in the score calculation.

  • with_tokens_score – Include the token-level scores in the output file.

  • source_tokenize_fn – Function to tokenize source lines.

  • target_tokenize_fn – Function to tokenize target lines.

  • target_detokenize_fn – Function to detokenize target outputs.

Returns

A statistics object.
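
A possible invocation, assuming whitespace-tokenized source.txt and target.txt files (the paths and tokenization functions are placeholders):

stats = translator.score_file(
    "source.txt",
    "target.txt",
    "scores.txt",
    with_tokens_score=True,
    source_tokenize_fn=lambda line: line.strip().split(),
    target_tokenize_fn=lambda line: line.strip().split(),
    target_detokenize_fn=lambda tokens: " ".join(tokens),
)

# ExecutionStats exposes counters such as num_examples, num_tokens, and total_time_in_ms.
print(stats.num_examples, stats.num_tokens, stats.total_time_in_ms)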

score_iterable(source: Iterable[List[str]], target: Iterable[List[str]], max_batch_size: int = 64, batch_type: str = 'examples', **kwargs) → Iterable[ScoringResult]

Scores an iterable of tokenized examples.

This method is built on top of ctranslate2.Translator.score_batch() to efficiently score an arbitrarily large stream of data. It enables the following optimizations:

  • stream processing (the iterable is not fully materialized in memory)

  • parallel scoring (if the translator has multiple workers)

  • asynchronous batch prefetching

  • local sorting by length

Parameters
  • source – An iterable of tokenized source examples.

  • target – An iterable of tokenized target examples.

  • max_batch_size – The maximum batch size.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • **kwargs – Any scoring options accepted by ctranslate2.Translator.score_batch().

Returns

A generator iterator over ctranslate2.ScoringResult instances.
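
Analogous to the translate_iterable example further below, a sketch that streams scores for two parallel text files (file names and tokenization are placeholders):

tokenize_fn = lambda line: line.strip().split()

with open("source.txt") as src_file, open("target.txt") as tgt_file:
    source = map(tokenize_fn, src_file)
    target = map(tokenize_fn, tgt_file)

    for result in translator.score_iterable(source, target, max_batch_size=64):
        print(sum(result.log_probs) / len(result.log_probs))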

translate_batch(source: List[List[str]], target_prefix: Optional[List[Optional[List[str]]]] = None, *, max_batch_size: int = 0, batch_type: str = 'examples', asynchronous: bool = False, beam_size: int = 2, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, coverage_penalty: float = 0, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, disable_unk: bool = False, suppress_sequences: Optional[List[List[str]]] = None, end_token: Optional[Union[str, List[str], List[int]]] = None, return_end_token: bool = False, prefix_bias_beta: float = 0, max_input_length: int = 1024, max_decoding_length: int = 256, min_decoding_length: int = 1, use_vmap: bool = False, return_scores: bool = False, return_attention: bool = False, return_alternatives: bool = False, min_alternative_expansion_prob: float = 0, sampling_topk: int = 1, sampling_topp: float = 1, sampling_temperature: float = 1, replace_unknowns: bool = False, callback: Callable[[GenerationStepResult], bool] = None) → Union[List[TranslationResult], List[AsyncTranslationResult]]

Translates a batch of tokens.

Parameters
  • source – Batch of source tokens.

  • target_prefix – Optional batch of target prefix tokens.

  • max_batch_size – The maximum batch size. If the number of inputs is greater than max_batch_size, the inputs are sorted by length and split by chunks of max_batch_size examples so that the number of padding positions is minimized.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • asynchronous – Run the translation asynchronously.

  • beam_size – Beam size (1 for greedy search).

  • patience – Beam search patience factor, as described in https://arxiv.org/abs/2204.05424. The decoding will continue until beam_size*patience hypotheses are finished.

  • num_hypotheses – Number of hypotheses to return.

  • length_penalty – Exponential penalty applied to the length during beam search.

  • coverage_penalty – Coverage penalty weight applied during beam search.

  • repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize).

  • no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable).

  • disable_unk – Disable the generation of the unknown token.

  • suppress_sequences – Disable the generation of some sequences of tokens.

  • end_token – Stop the decoding on one of these tokens (defaults to the model EOS token).

  • return_end_token – Include the end token in the results.

  • prefix_bias_beta – Parameter for biasing translations towards the given prefix.

  • max_input_length – Truncate inputs after this many tokens (set 0 to disable).

  • max_decoding_length – Maximum prediction length.

  • min_decoding_length – Minimum prediction length.

  • use_vmap – Use the vocabulary mapping file saved in this model.

  • return_scores – Include the scores in the output.

  • return_attention – Include the attention vectors in the output.

  • return_alternatives – Return alternatives at the first unconstrained decoding position.

  • min_alternative_expansion_prob – Minimum initial probability to expand an alternative.

  • sampling_topk – Randomly sample predictions from the top K candidates.

  • sampling_topp – Keep the most probable tokens whose cumulative probability exceeds this value.

  • sampling_temperature – Sampling temperature to generate more random samples.

  • replace_unknowns – Replace unknown target tokens by the source token with the highest attention.

  • callback – Optional function that is called for each generated token when beam_size is 1. If the callback function returns True, the decoding will stop for this batch.

Returns

A list of translation results.

See also

TranslationOptions structure in the C++ library.
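
A non-authoritative sketch of the most common options: the call below requests two hypotheses with their scores (the tokens are placeholders for your tokenizer's output):

results = translator.translate_batch(
    [["▁Hello", "▁world", "!"]],
    beam_size=4,
    num_hypotheses=2,
    return_scores=True,
)

result = results[0]  # ctranslate2.TranslationResult
for tokens, score in zip(result.hypotheses, result.scores):
    print(score, tokens)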

translate_file(source_path: str, output_path: str, target_path: Optional[str] = None, *, max_batch_size: int = 32, read_batch_size: int = 0, batch_type: str = 'examples', beam_size: int = 2, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, coverage_penalty: float = 0, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, disable_unk: bool = False, suppress_sequences: Optional[List[List[str]]] = None, end_token: Optional[Union[str, List[str], List[int]]] = None, prefix_bias_beta: float = 0, max_input_length: int = 1024, max_decoding_length: int = 256, min_decoding_length: int = 1, use_vmap: bool = False, with_scores: bool = False, sampling_topk: int = 1, sampling_topp: float = 1, sampling_temperature: float = 1, replace_unknowns: bool = False, source_tokenize_fn: Callable[[str], List[str]] = None, target_tokenize_fn: Callable[[str], List[str]] = None, target_detokenize_fn: Callable[[List[str]], str] = None) → ExecutionStats

Translates a tokenized text file.

Parameters
  • source_path – Path to the source file.

  • output_path – Path to the output file.

  • target_path – Path to the target prefix file.

  • max_batch_size – The maximum batch size.

  • read_batch_size – The number of examples to read from the file before sorting by length and splitting by chunks of max_batch_size examples (set 0 for an automatic value).

  • batch_type – Whether max_batch_size and read_batch_size are the number of “examples” or “tokens”.

  • beam_size – Beam size (1 for greedy search).

  • patience – Beam search patience factor, as described in https://arxiv.org/abs/2204.05424. The decoding will continue until beam_size*patience hypotheses are finished.

  • num_hypotheses – Number of hypotheses to return.

  • length_penalty – Exponential penalty applied to the length during beam search.

  • coverage_penalty – Coverage penalty weight applied during beam search.

  • repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize).

  • no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable).

  • disable_unk – Disable the generation of the unknown token.

  • suppress_sequences – Disable the generation of some sequences of tokens.

  • end_token – Stop the decoding on one of these tokens (defaults to the model EOS token).

  • prefix_bias_beta – Parameter for biasing translations towards the given prefix.

  • max_input_length – Truncate inputs after this many tokens (set 0 to disable).

  • max_decoding_length – Maximum prediction length.

  • min_decoding_length – Minimum prediction length.

  • use_vmap – Use the vocabulary mapping file saved in this model.

  • with_scores – Include the scores in the output.

  • sampling_topk – Randomly sample predictions from the top K candidates.

  • sampling_topp – Keep the most probable tokens whose cumulative probability exceeds this value.

  • sampling_temperature – Sampling temperature to generate more random samples.

  • replace_unknowns – Replace unknown target tokens by the source token with the highest attention.

  • source_tokenize_fn – Function to tokenize source lines.

  • target_tokenize_fn – Function to tokenize target lines.

  • target_detokenize_fn – Function to detokenize target outputs.

Returns

A statistics object.

See also

TranslationOptions structure in the C++ library.
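
A possible call, assuming a whitespace-tokenized input.txt (the paths and tokenization functions are placeholders):

stats = translator.translate_file(
    "input.txt",
    "output.txt",
    beam_size=4,
    with_scores=True,
    source_tokenize_fn=lambda line: line.strip().split(),
    target_detokenize_fn=lambda tokens: " ".join(tokens),
)

print(stats.num_tokens, stats.total_time_in_ms)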

translate_iterable(source: Iterable[List[str]], target_prefix: Optional[Iterable[List[str]]] = None, max_batch_size: int = 32, batch_type: str = 'examples', **kwargs) → Iterable[TranslationResult]

Translates an iterable of tokenized examples.

This method is built on top of ctranslate2.Translator.translate_batch() to efficiently translate an arbitrarily large stream of data. It enables the following optimizations:

  • stream processing (the iterable is not fully materialized in memory)

  • parallel translations (if the translator has multiple workers)

  • asynchronous batch prefetching

  • local sorting by length

Parameters
  • source – An iterable of tokenized source examples.

  • target_prefix – An optional iterable of tokenized target prefixes.

  • max_batch_size – The maximum batch size.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • **kwargs – Any translation options accepted by ctranslate2.Translator.translate_batch().

Returns

A generator iterator over ctranslate2.TranslationResult instances.

Example

This method can be used to efficiently translate text files:

# Replace with your own tokenization and detokenization functions.
tokenize_fn = lambda line: line.strip().split()
detokenize_fn = lambda tokens: " ".join(tokens)

with open("input.txt") as input_file:
    source = map(tokenize_fn, input_file)
    results = translator.translate_iterable(source, max_batch_size=64)

    for result in results:
        tokens = result.hypotheses[0]
        target = detokenize_fn(tokens)
        print(target)

unload_model(to_cpu: bool = False) → None

Unloads the model attached to this translator but keeps enough runtime context to quickly resume translation on the initial device. The model is not guaranteed to be unloaded if translations are running concurrently.

Parameters

to_cpu – If True, the model is moved to the CPU memory and not fully unloaded.
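
A sketch of an unload/reload cycle that keeps the weights cached in CPU memory so the model can quickly return to the initial device:

translator.unload_model(to_cpu=True)    # free device memory, keep a CPU copy
assert not translator.model_is_loaded

translator.load_model(keep_cache=True)  # move the cached weights back
assert translator.model_is_loaded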

property compute_type

Computation type used by the model.

property device

Device this translator is running on.

property device_index

List of device IDs on which this translator is running.

property model_is_loaded

Whether the model is loaded on the initial device and ready to be used.

property num_active_batches

Number of batches waiting to be processed or currently processed.

property num_queued_batches

Number of batches waiting to be processed.

property num_translators

Number of translators backing this instance.

property tensor_parallel

Whether the model runs in tensor parallel mode.