Generator

class ctranslate2.Generator

A text generator.

Example

>>> generator = ctranslate2.Generator("model/", device="cpu")
>>> generator.generate_batch([["<s>"]], max_length=50, sampling_topk=20)

Inherits from: pybind11_builtins.pybind11_object

__init__(model_path: str, device: str = 'cpu', *, device_index: Union[int, List[int]] = 0, compute_type: Union[str, Dict[str, str]] = 'default', inter_threads: int = 1, intra_threads: int = 0, max_queued_batches: int = 0, flash_attention: bool = False, tensor_parallel: bool = False, files: object = None) None

Initializes the generator.

Parameters
  • model_path – Path to the CTranslate2 model directory.

  • device – Device to use (possible values are: cpu, cuda, auto).

  • device_index – Device ID or list of device IDs on which to place this generator.

  • compute_type – Model computation type or a dictionary mapping a device name to the computation type (possible values are: default, auto, int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32).

  • inter_threads – Maximum number of parallel generations.

  • intra_threads – Number of OpenMP threads per generator (0 to use a default value).

  • max_queued_batches – Maximum number of batches in the queue (-1 for unlimited, 0 for an automatic value). When the queue is full, future requests will block until a free slot is available.

  • flash_attention – Run the model with flash attention 2 for the self-attention layers.

  • tensor_parallel – Run the model in tensor parallel mode.

  • files – Load model files from the memory. This argument is a dictionary mapping file names to file contents as file-like or bytes objects. If this is set, model_path acts as an identifier for this model.
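For instance, a generator placed on two GPUs with 8-bit weights might be created as follows (the model path and device indices are placeholders, not values taken from this documentation):

>>> generator = ctranslate2.Generator(
...     "model/",                     # path to a converted CTranslate2 model
...     device="cuda",
...     device_index=[0, 1],          # load a replica on each listed GPU
...     compute_type="int8_float16",  # quantized weights with float16 activations
... )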

async async_generate_tokens(prompt: Union[List[str], List[List[str]]], max_batch_size: int = 0, batch_type: str = 'examples', *, max_length: int = 512, min_length: int = 0, sampling_topk: int = 1, sampling_topp: float = 1, sampling_temperature: float = 1, return_log_prob: bool = False, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, disable_unk: bool = False, suppress_sequences: Optional[List[List[str]]] = None, end_token: Optional[Union[str, List[str], List[int]]] = None, static_prompt: Optional[List[str]] = None, cache_static_prompt: bool = True, callback: Optional[Callable[[GenerationStepResult], bool]] = None) AsyncIterable[GenerationStepResult]

Yields tokens asynchronously as they are generated by the model.

Parameters
  • prompt – Batch of start tokens. If the decoder starts from a special start token like <s>, this token should be added to this input.

  • max_batch_size – The maximum batch size.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • max_length – Maximum generation length.

  • min_length – Minimum generation length.

  • sampling_topk – Randomly sample predictions from the top K candidates.

  • sampling_topp – Keep the most probable tokens whose cumulative probability exceeds this value.

  • sampling_temperature – Sampling temperature to generate more random samples.

  • return_log_prob – Include the token log probability in the result.

  • repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize).

  • no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable).

  • disable_unk – Disable the generation of the unknown token.

  • suppress_sequences – Disable the generation of some sequences of tokens.

  • end_token – Stop the decoding on one of these tokens (defaults to the model EOS token).

  • static_prompt – If the model expects a static prompt (a.k.a. system prompt) it can be set here to simplify the inputs and optionally cache the model state for this prompt to accelerate future generations.

  • cache_static_prompt – Cache the model state after the static prompt and reuse it for future generations using the same static prompt.

  • callback – Optional function that is called for each generated token when beam_size is 1. If the callback function returns True, the decoding will stop for this batch index.

Returns

An async generator iterator over ctranslate2.GenerationStepResult instances.

Note

This generation method is not compatible with beam search, which requires a complete decoding.
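A minimal asynchronous streaming sketch, assuming the prompt below is an already tokenized sequence (the tokens shown are illustrative):

>>> import asyncio
>>>
>>> async def stream(generator, prompt):
...     # Print each token as soon as the model produces it.
...     async for step in generator.async_generate_tokens(prompt, max_length=128):
...         print(step.token, end="", flush=True)
...
>>> asyncio.run(stream(generator, ["<s>", "▁Hello"]))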

forward_batch(inputs: Union[List[List[str]], List[List[int]], StorageView], lengths: Optional[StorageView] = None, *, return_log_probs: bool = False) StorageView

Forwards a batch of sequences in the generator.

Parameters
  • inputs – A batch of sequences either as string tokens or token IDs. This argument can also be a dense int32 array with shape [batch_size, max_length] (e.g. created from a Numpy array or PyTorch tensor).

  • lengths – The length of each sequence as a int32 array with shape [batch_size]. Required when inputs is a dense array.

  • return_log_probs – If True, the method returns the log probabilities instead of the unscaled logits.

Returns

The output logits, or the output log probabilities if return_log_probs is enabled.
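As an illustrative sketch, a CPU output can be converted to a NumPy array through the array interface of StorageView (the input tokens are placeholders):

>>> import numpy as np
>>> output = generator.forward_batch([["<s>", "▁Hello"]], return_log_probs=True)
>>> log_probs = np.array(output)  # shape [batch_size, sequence_length, vocab_size]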

generate_batch(start_tokens: List[List[str]], *, max_batch_size: int = 0, batch_type: str = 'examples', asynchronous: bool = False, beam_size: int = 1, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, disable_unk: bool = False, suppress_sequences: Optional[List[List[str]]] = None, end_token: Optional[Union[str, List[str], List[int]]] = None, return_end_token: bool = False, max_length: int = 512, min_length: int = 0, static_prompt: Optional[List[str]] = None, cache_static_prompt: bool = True, include_prompt_in_result: bool = True, return_scores: bool = False, return_alternatives: bool = False, min_alternative_expansion_prob: float = 0, sampling_topk: int = 1, sampling_topp: float = 1, sampling_temperature: float = 1, callback: Callable[[GenerationStepResult], bool] = None) Union[List[GenerationResult], List[AsyncGenerationResult]]

Generates from a batch of start tokens.

Note

The way the start tokens are forwarded in the decoder depends on the argument include_prompt_in_result:

  • If include_prompt_in_result is True (the default), the decoding loop is constrained to generate the start tokens that are then included in the result.

  • If include_prompt_in_result is False, the start tokens are forwarded in the decoder at once to initialize its state (i.e. the KV cache for Transformer models). For variable-length inputs, only the tokens up to the minimum length in the batch are forwarded at once. The remaining tokens are generated in the decoding loop with constrained decoding.

Consider setting include_prompt_in_result=False to increase the performance for long inputs.

Parameters
  • start_tokens – Batch of start tokens. If the decoder starts from a special start token like <s>, this token should be added to this input.

  • max_batch_size – The maximum batch size. If the number of inputs is greater than max_batch_size, the inputs are sorted by length and split by chunks of max_batch_size examples so that the number of padding positions is minimized.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • asynchronous – Run the generation asynchronously.

  • beam_size – Beam size (1 for greedy search).

  • patience – Beam search patience factor, as described in https://arxiv.org/abs/2204.05424. The decoding will continue until beam_size*patience hypotheses are finished.

  • num_hypotheses – Number of hypotheses to return.

  • length_penalty – Exponential penalty applied to the length during beam search.

  • repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize).

  • no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable).

  • disable_unk – Disable the generation of the unknown token.

  • suppress_sequences – Disable the generation of some sequences of tokens.

  • end_token – Stop the decoding on one of these tokens (defaults to the model EOS token).

  • return_end_token – Include the end token in the results.

  • max_length – Maximum generation length.

  • min_length – Minimum generation length.

  • static_prompt – If the model expects a static prompt (a.k.a. system prompt) it can be set here to simplify the inputs and optionally cache the model state for this prompt to accelerate future generations.

  • cache_static_prompt – Cache the model state after the static prompt and reuse it for future generations using the same static prompt.

  • include_prompt_in_result – Include the start_tokens in the result.

  • return_scores – Include the scores in the output.

  • return_alternatives – Return alternatives at the first unconstrained decoding position.

  • min_alternative_expansion_prob – Minimum initial probability to expand an alternative.

  • sampling_topk – Randomly sample predictions from the top K candidates.

  • sampling_topp – Keep the most probable tokens whose cumulative probability exceeds this value.

  • sampling_temperature – Sampling temperature to generate more random samples.

  • callback – Optional function that is called for each generated token when beam_size is 1. If the callback function returns True, the decoding will stop for this batch index.

Returns

A list of generation results.

See also

GenerationOptions structure in the C++ library.
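For example, sampled continuations for a batch of two prompts could be requested as follows (the token sequences are illustrative, not a real tokenization):

>>> results = generator.generate_batch(
...     [["<s>", "▁Hello"], ["<s>", "▁How", "▁are"]],
...     max_length=64,
...     sampling_topk=10,
...     sampling_temperature=0.8,
...     include_prompt_in_result=False,  # return only the generated continuation
... )
>>> results[0].sequences[0]  # list of generated tokens for the first prompt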

generate_iterable(start_tokens: Iterable[List[str]], max_batch_size: int = 32, batch_type: str = 'examples', **kwargs) Iterable[GenerationResult]

Generates from an iterable of tokenized prompts.

This method is built on top of ctranslate2.Generator.generate_batch() to efficiently run generation on an arbitrarily large stream of data. It enables the following optimizations:

  • stream processing (the iterable is not fully materialized in memory)

  • parallel generations (if the generator has multiple workers)

  • asynchronous batch prefetching

  • local sorting by length

Parameters
  • start_tokens – An iterable of tokenized prompts.

  • max_batch_size – The maximum batch size.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • **kwargs – Any generation options accepted by ctranslate2.Generator.generate_batch().

Returns

A generator iterator over ctranslate2.GenerationResult instances.
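A sketch of streaming prompts from a file, assuming one whitespace-pretokenized prompt per line (the file name is a placeholder):

>>> def prompts():
...     with open("prompts.txt") as f:
...         for line in f:
...             yield line.strip().split()
...
>>> for result in generator.generate_iterable(prompts(), max_batch_size=32):
...     print(" ".join(result.sequences[0]))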

generate_tokens(prompt: Union[List[str], List[List[str]]], max_batch_size: int = 0, batch_type: str = 'examples', *, max_length: int = 512, min_length: int = 0, sampling_topk: int = 1, sampling_topp: float = 1, sampling_temperature: float = 1, return_log_prob: bool = False, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, disable_unk: bool = False, suppress_sequences: Optional[List[List[str]]] = None, end_token: Optional[Union[str, List[str], List[int]]] = None, static_prompt: Optional[List[str]] = None, cache_static_prompt: bool = True, callback: Optional[Callable[[GenerationStepResult], bool]] = None) Iterable[GenerationStepResult]

Yields tokens as they are generated by the model.

Parameters
  • prompt – Batch of start tokens. If the decoder starts from a special start token like <s>, this token should be added to this input.

  • max_batch_size – The maximum batch size.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • max_length – Maximum generation length.

  • min_length – Minimum generation length.

  • sampling_topk – Randomly sample predictions from the top K candidates.

  • sampling_topp – Keep the most probable tokens whose cumulative probability exceeds this value.

  • sampling_temperature – Sampling temperature to generate more random samples.

  • return_log_prob – Include the token log probability in the result.

  • repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize).

  • no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable).

  • disable_unk – Disable the generation of the unknown token.

  • suppress_sequences – Disable the generation of some sequences of tokens.

  • end_token – Stop the decoding on one of these tokens (defaults to the model EOS token).

  • static_prompt – If the model expects a static prompt (a.k.a. system prompt) it can be set here to simplify the inputs and optionally cache the model state for this prompt to accelerate future generations.

  • cache_static_prompt – Cache the model state after the static prompt and reuse it for future generations using the same static prompt.

  • callback – Optional function that is called for each generated token when beam_size is 1. If the callback function returns True, the decoding will stop for this batch index.

Returns

A generator iterator over ctranslate2.GenerationStepResult instances.

Note

This generation method is not compatible with beam search, which requires a complete decoding.
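A minimal streaming sketch for a single prompt (the tokens shown are illustrative):

>>> for step in generator.generate_tokens(
...     ["<s>", "▁Hello"], max_length=64, sampling_temperature=0.7
... ):
...     print(step.token, end="", flush=True)
...     if step.is_last:
...         print()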

load_model(keep_cache: bool = False) None

Loads the model back to the initial device.

Parameters

keep_cache – If True, the model cache in the CPU memory is not deleted if it exists.

score_batch(tokens: List[List[str]], *, max_batch_size: int = 0, batch_type: str = 'examples', max_input_length: int = 1024, asynchronous: bool = False) Union[List[ScoringResult], List[AsyncScoringResult]]

Scores a batch of tokens.

Parameters
  • tokens – Batch of tokens to score. If the model expects special start or end tokens, they should also be added to this input.

  • max_batch_size – The maximum batch size. If the number of inputs is greater than max_batch_size, the inputs are sorted by length and split by chunks of max_batch_size examples so that the number of padding positions is minimized.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • max_input_length – Truncate inputs after this many tokens (0 to disable).

  • asynchronous – Run the scoring asynchronously.

Returns

A list of scoring results.
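For example, per-token log probabilities for one sequence could be retrieved as follows (the tokens are illustrative):

>>> results = generator.score_batch([["<s>", "▁Hello", "▁world", "</s>"]])
>>> list(zip(results[0].tokens, results[0].log_probs))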

score_iterable(tokens: Iterable[List[str]], max_batch_size: int = 64, batch_type: str = 'examples', **kwargs) Iterable[ScoringResult]

Scores an iterable of tokenized examples.

This method is built on top of ctranslate2.Generator.score_batch() to efficiently score an arbitrarily large stream of data. It enables the following optimizations:

  • stream processing (the iterable is not fully materialized in memory)

  • parallel scoring (if the generator has multiple workers)

  • asynchronous batch prefetching

  • local sorting by length

Parameters
  • tokens – An iterable of tokenized examples.

  • max_batch_size – The maximum batch size.

  • batch_type – Whether max_batch_size is the number of “examples” or “tokens”.

  • **kwargs – Any score options accepted by ctranslate2.Generator.score_batch().

Returns

A generator iterator over ctranslate2.ScoringResult instances.
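A sketch of scoring a large pretokenized corpus without materializing it in memory (the file name is a placeholder):

>>> def examples():
...     with open("corpus.tok") as f:
...         for line in f:
...             yield line.strip().split()
...
>>> for result in generator.score_iterable(examples(), max_batch_size=64):
...     print(sum(result.log_probs))  # total log probability of the example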

unload_model(to_cpu: bool = False) None

Unloads the model attached to this generator but keeps enough runtime context to quickly resume generation on the initial device. The model is not guaranteed to be unloaded if generations are running concurrently.

Parameters

to_cpu – If True, the model is moved to the CPU memory and not fully unloaded.
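A sketch of temporarily freeing device memory and restoring the model later:

>>> generator.unload_model(to_cpu=True)  # keep the weights in CPU memory
>>> # ... run other workloads on the device ...
>>> generator.load_model()               # move the model back to the initial device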

property compute_type

Computation type used by the model.

property device

Device this generator is running on.

property device_index

List of device IDs where this generator is running.

property model_is_loaded

Whether the model is loaded on the initial device and ready to be used.

property num_active_batches

Number of batches waiting to be processed or currently processed.

property num_generators

Number of generators backing this instance.

property num_queued_batches

Number of batches waiting to be processed.

property tensor_parallel

Whether the model is run in tensor parallel mode.