Whisper
- class ctranslate2.models.Whisper
- Implements the Whisper speech recognition model published by OpenAI.
- Inherits from: pybind11_builtins.pybind11_object
- __init__(model_path: str, device: str = 'cpu', *, device_index: Union[int, List[int]] = 0, compute_type: Union[str, Dict[str, str]] = 'default', inter_threads: int = 1, intra_threads: int = 0, max_queued_batches: int = 0, flash_attention: bool = False, tensor_parallel: bool = False, files: object = None) -> None
- Initializes a Whisper model from a converted model.
- Parameters
- model_path – Path to the CTranslate2 model directory. 
- device – Device to use (possible values are: cpu, cuda, auto). 
- device_index – Device IDs on which to place this model. 
- compute_type – Model computation type or a dictionary mapping a device name to the computation type (possible values are: default, auto, int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32). 
- inter_threads – Number of workers to allow executing multiple batches in parallel. 
- intra_threads – Number of OpenMP threads per worker (0 to use a default value). 
- max_queued_batches – Maximum number of batches in the worker queue (-1 for unlimited, 0 for an automatic value). When the queue is full, future requests will block until a free slot is available. 
- flash_attention – Run the model with Flash Attention 2 for the self-attention layers. 
- tensor_parallel – Run the model in tensor parallel mode. 
- files – Load model files from memory. This argument is a dictionary mapping file names to file contents as file-like or bytes objects. If this is set, model_path acts as an identifier for this model.
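As a minimal sketch of how these constructor arguments fit together, the following collects a set of options and then loads a model. The helper names build_whisper_options and load_whisper and the chosen compute types are illustrative assumptions, not part of CTranslate2. The import is deferred so the option helper can be inspected without the library installed.

```python
def build_whisper_options(inter_threads: int = 1) -> dict:
    """Return keyword arguments for ctranslate2.models.Whisper (illustrative values)."""
    return {
        # Let the library pick between CPU and CUDA.
        "device": "auto",
        # compute_type may be a dictionary mapping a device name to a type.
        "compute_type": {"cpu": "int8", "cuda": "float16"},
        # Number of workers that can execute batches in parallel.
        "inter_threads": inter_threads,
        # 0 lets the library choose the number of OpenMP threads per worker.
        "intra_threads": 0,
    }


def load_whisper(model_path: str, **overrides):
    """Load a converted model; model_path must be a CTranslate2 model directory."""
    import ctranslate2  # deferred import: only needed when a model is loaded

    options = {**build_whisper_options(), **overrides}
    return ctranslate2.models.Whisper(model_path, **options)
```

A call such as load_whisper("whisper-tiny-ct2", inter_threads=2) would then load the model, where the directory name is hypothetical.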
 
 
 - align(features: StorageView, start_sequence: List[int], text_tokens: List[List[int]], num_frames: Union[int, List[int]], *, median_filter_width: int = 7) -> List[WhisperAlignmentResult]
- Computes the alignments between the text tokens and the audio.
- Parameters
- features – Mel spectrogram of the audio, as a float array with shape [batch_size, n_mels, chunk_length]. This method also accepts the encoded features returned by the method ctranslate2.models.Whisper.encode(), which have shape [batch_size, chunk_length // 2, d_model].
- start_sequence – The start sequence tokens. 
- text_tokens – Batch of text tokens to align. 
- num_frames – Number of non-padding frames in the features. 
- median_filter_width – Width of the median filter kernel. 
 
- Returns
- A list of alignment results. 
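The alignment results can be turned into per-token start times. The sketch below assumes that each result exposes its alignment path as (text token index, time index) pairs through an alignments attribute, and that each time index covers 20 ms of audio, as with the standard Whisper feature extraction. Both points are assumptions to check against your model, and token_start_times is a hypothetical helper.

```python
def token_start_times(alignments, seconds_per_index: float = 0.02):
    """Return the start time in seconds of each text token.

    alignments is assumed to be an iterable of (token_index, time_index)
    pairs; the first time index seen for a token is taken as its start.
    """
    starts = {}
    for token_index, time_index in alignments:
        if token_index not in starts:
            starts[token_index] = time_index * seconds_per_index
    return [starts[index] for index in sorted(starts)]
```

With results = model.align(features, start_sequence, text_tokens, num_frames), one would call token_start_times(results[0].alignments) for the first batch item.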
 
 - detect_language(features: StorageView) -> List[List[Tuple[str, float]]]
- Returns the probability of each language.
- Parameters
- features – Mel spectrogram of the audio, as a float array with shape [batch_size, n_mels, chunk_length]. This method also accepts the encoded features returned by the method ctranslate2.models.Whisper.encode(), which have shape [batch_size, chunk_length // 2, d_model].
- Returns
- For each item in the batch, a list of (language, probability) pairs ordered from most to least probable. 
- Raises
- RuntimeError – if the model is not multilingual. 
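Since the candidates are already sorted, the most likely language of each batch item is the first pair. The following sketch picks it and applies an optional confidence threshold. It assumes the language is returned as a Whisper token such as <|en|>; the helper name best_languages and the threshold are hypothetical.

```python
def best_languages(results, min_probability: float = 0.0):
    """Pick the top language for each batch item of a detect_language() result.

    Returns one (language_code, probability) tuple per item, or None when the
    best candidate is below min_probability.
    """
    best = []
    for candidates in results:
        token, probability = candidates[0]  # candidates are sorted best first
        if probability < min_probability:
            best.append(None)
        else:
            # Assumed token format "<|en|>"; strip the markers to get "en".
            best.append((token.strip("<|>"), probability))
    return best
```

For example, best_languages(model.detect_language(features), min_probability=0.5) leaves low-confidence items as None so that the caller can fall back to a default language.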
 
 - encode(features: StorageView, to_cpu: bool = False) -> StorageView
- Encodes the input features.
- Parameters
- features – Mel spectrogram of the audio, as a float array with shape [batch_size, n_mels, chunk_length].
- to_cpu – Copy the encoder output to the CPU before returning the value. 
 
- Returns
- The encoder output. 
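Because detect_language() and generate() both accept the encoder output, the encoder only needs to run once when the two are chained. A minimal sketch of that pattern follows; detect_and_generate and the make_prompt callback are hypothetical names, and model is expected to be a ctranslate2.models.Whisper instance.

```python
def detect_and_generate(model, features, make_prompt):
    """Run the encoder once, then reuse its output for detection and generation.

    make_prompt is a caller-supplied function that maps the detected language
    to the prompt for that batch item.
    """
    # Keep the encoder output on the model device so it can be fed back in.
    encoded = model.encode(features, to_cpu=False)
    detections = model.detect_language(encoded)
    # The first candidate of each batch item is the most probable language.
    prompts = [make_prompt(candidates[0][0]) for candidates in detections]
    return model.generate(encoded, prompts)
```

The function only relies on the three methods documented here, so it does not depend on how the features were produced.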
 
 - generate(features: StorageView, prompts: Union[List[List[str]], List[List[int]]], *, asynchronous: bool = False, beam_size: int = 5, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, max_length: int = 448, return_scores: bool = False, return_logits_vocab: bool = False, return_no_speech_prob: bool = False, max_initial_timestamp_index: int = 50, suppress_blank: bool = True, suppress_tokens: Optional[List[int]] = [-1], sampling_topk: int = 1, sampling_temperature: float = 1) -> Union[List[WhisperGenerationResult], List[WhisperGenerationResultAsync]]
- Encodes the input features and generates from the given prompt.
- Parameters
- features – Mel spectrogram of the audio, as a float array with shape [batch_size, n_mels, chunk_length]. This method also accepts the encoded features returned by the method ctranslate2.models.Whisper.encode(), which have shape [batch_size, chunk_length // 2, d_model].
- prompts – Batch of initial string tokens or token IDs. 
- asynchronous – Run the model asynchronously. 
- beam_size – Beam size (1 for greedy search). 
- patience – Beam search patience factor, as described in https://arxiv.org/abs/2204.05424. The decoding will continue until beam_size*patience hypotheses are finished. 
- num_hypotheses – Number of hypotheses to return. 
- length_penalty – Exponential penalty applied to the length during beam search. 
- repetition_penalty – Penalty applied to the score of previously generated tokens (set > 1 to penalize). 
- no_repeat_ngram_size – Prevent repetitions of ngrams with this size (set 0 to disable). 
- max_length – Maximum generation length. 
- return_scores – Include the scores in the output. 
- return_logits_vocab – Include the log probabilities in the output. 
- return_no_speech_prob – Include the probability of the no speech token in the result. 
- max_initial_timestamp_index – Maximum index of the first predicted timestamp. 
- suppress_blank – Suppress blank outputs at the beginning of the sampling. 
- suppress_tokens – List of token IDs to suppress. -1 will suppress a default set of symbols as defined in the model config.json file.
- sampling_topk – Randomly sample predictions from the top K candidates. 
- sampling_temperature – Sampling temperature to generate more random samples. 
 
- Returns
- A list of generation results. 
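To illustrate the prompts argument, the sketch below builds a prompt from string tokens and collects the best hypothesis of each batch item. It assumes the usual Whisper special tokens, that each result exposes its hypotheses through a sequences attribute, and that asynchronous results provide a result() method. build_prompt and generate_tokens are hypothetical helpers.

```python
def build_prompt(language_token: str, task: str = "transcribe", timestamps: bool = False):
    """Build one prompt as string tokens, e.g. for language_token "<|en|>"."""
    prompt = ["<|startoftranscript|>", language_token, f"<|{task}|>"]
    if not timestamps:
        # Ask the model not to predict timestamp tokens.
        prompt.append("<|notimestamps|>")
    return prompt


def generate_tokens(model, features, prompts, **options):
    """Call model.generate() and return the first hypothesis of each batch item."""
    results = model.generate(features, prompts, **options)
    if options.get("asynchronous", False):
        # Assumption: asynchronous results are resolved by calling result().
        results = [pending.result() for pending in results]
    return [result.sequences[0] for result in results]
```

A call such as generate_tokens(model, features, [build_prompt("<|en|>")], beam_size=1) would run greedy search and return one token list per batch item.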
 
 - load_model(keep_cache: bool = False) -> None
- Loads the model back to the initial device.
- Parameters
- keep_cache – If True, the model cache in the CPU memory is not deleted if it exists.
 
 - unload_model(to_cpu: bool = False) -> None
- Unloads the model attached to this Whisper instance but keeps enough runtime context to quickly resume it on the initial device.
- Parameters
- to_cpu – If True, the model is moved to the CPU memory and not fully unloaded.
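One way to pair unload_model() with load_model() is a small context manager that frees the device for a while and guarantees the model is restored afterwards. This is a sketch under the assumption that no batches are in flight when the model is unloaded; model_unloaded is a hypothetical helper, not part of the library.

```python
from contextlib import contextmanager


@contextmanager
def model_unloaded(model, to_cpu: bool = True):
    """Temporarily unload a model, then load it back to its initial device."""
    # With to_cpu=True the weights are kept in CPU memory for a faster reload.
    model.unload_model(to_cpu=to_cpu)
    try:
        yield model
    finally:
        # Always restore the model, even if the body raised an exception.
        model.load_model()
```

Inside a "with model_unloaded(model):" block the device memory is available for other work, and model.model_is_loaded can be checked after the block to confirm that the model is ready again.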
 
 - property compute_type
- Computation type used by the model. 
 - property device
- Device this model is running on. 
 - property device_index
- List of device IDs on which this model is running. 
 - property is_multilingual
- Returns True if this model is multilingual.
 - property model_is_loaded
- Whether the model is loaded on the initial device and ready to be used. 
 - property n_mels
- Returns the dimension of the mel input features. 
 - property num_active_batches
- Number of batches waiting to be processed or currently processed. 
 - property num_languages
- Returns the number of languages supported. 
 - property num_queued_batches
- Number of batches waiting to be processed. 
 - property num_workers
- Number of model workers backing this instance. 
 - property tensor_parallel
- Whether the model runs in tensor parallel mode.