TransformerDecoderModelSpec

class ctranslate2.specs.TransformerDecoderModelSpec

Describes a Transformer decoder model (e.g. GPT-2).

Inherits from: ctranslate2.specs.LanguageModelSpec

__init__(decoder: TransformerDecoderSpec)

Initializes a Transformer decoder model specification.

Parameters

decoder – The decoder specification.

classmethod from_config(num_layers: int, num_heads: int, pre_norm: bool = True, activation: Activation = Activation.RELU, layernorm_embedding: bool = False, no_final_norm: bool = False, project_in_out: bool = False, with_relative_position: bool = False, ffn_glu: bool = False, rms_norm: bool = False, alibi: bool = False, alibi_use_positive_positions: bool = False, scale_alibi: bool = False, rotary_dim: Optional[int] = None, rotary_interleave: bool = True, rotary_scaling_type: Optional[RotaryScalingType] = None, rotary_scaling_factor: float = 1, rotary_base: float = 10000, original_max_position_embeddings: int = 0, max_position_embeddings: int = 0, parallel_residual: bool = False, shared_layer_norm: bool = False, pre_post_layer_norm: bool = False, multi_query_attention: bool = False, num_heads_kv: Optional[int] = None, head_dim: Optional[int] = None, sliding_window: Optional[int] = None, quant_type: Optional[Quantization] = None, quant_group_size: Optional[int] = None, quant_bits: Optional[int] = None)

Creates a Transformer decoder model specification (see the example after the parameter list).

Parameters
  • num_layers – Number of decoder layers.

  • num_heads – Number of attention heads.

  • pre_norm – Enable the pre-norm Transformer architecture.

  • activation – Activation to apply in the feed-forward network.

  • layernorm_embedding – Apply layer normalization after the embedding layer.

  • no_final_norm – Do not apply layer normalization after the last decoder block.

  • project_in_out – Add a linear layer after the embedding layer and another one before the final output projection.

  • with_relative_position – Enable relative position representations modules.

  • ffn_glu – Use gated linear units in the FFN layers as described in https://arxiv.org/abs/2002.05202.

  • rms_norm – Use the root mean square layer normalization.

  • alibi – Use attention with linear biases.

  • alibi_use_positive_positions – Use positive positions in the ALiBi definition.

  • scale_alibi – Apply the dot product scale factor to ALiBi.

  • rotary_dim – Apply rotary embeddings to the first N dimensions. If 0, rotary embeddings are applied to all dimensions.

  • rotary_interleave – Interleave the head dimensions when rotary embeddings are applied. Otherwise the head dimensions are sliced in half.

  • rotary_scaling_type – Type of RoPE scaling.

  • rotary_scaling_factor – Factor used in the RoPE scaling.

  • rotary_base – The base period of the rotary embeddings.

  • original_max_position_embeddings – The original maximum position embeddings, used by the Su RoPE scaling type.

  • max_position_embeddings – The maximum position embeddings, used by the Su RoPE scaling type.

  • parallel_residual – Use parallel residual connections in each layer block, as used by the GPT-J and GPT-NeoX models.

  • shared_layer_norm – When using parallel residual, share the input and post attention layer norms.

  • pre_post_layer_norm – Add a post layer normalization for each pre-norm layer.

  • multi_query_attention – Use multi-query attention (alias for num_heads_kv=1).

  • num_heads_kv – Number of attention heads for the key and value.

  • head_dim – Dimension of each attention head.

  • sliding_window – Maximum sequence length to retain in the KV cache (sliding window attention).

  • quant_type – Quantization type used for low-bit weight quantization (e.g. AWQ).

  • quant_group_size – Group size of the low-bit quantization.

  • quant_bits – Number of bits of the quantization (e.g. 4-bit).
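
For example, a converter script might build a small GPT-2-style specification as in the sketch below (the layer and head counts are illustrative, not tied to a real checkpoint):

    import ctranslate2

    # A minimal sketch of a GPT-2-style decoder specification.
    spec = ctranslate2.specs.TransformerDecoderModelSpec.from_config(
        num_layers=12,
        num_heads=12,
        pre_norm=True,  # GPT-2 uses the pre-norm Transformer variant
        activation=ctranslate2.specs.Activation.GELU,
    )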

get_default_config()

Returns the default configuration used by this model.
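
For example, the returned configuration object can be adjusted before saving (the special token field names below are an assumption based on typical language model configurations):

    # Inspect and adjust the default configuration before saving.
    config = spec.get_default_config()
    config.bos_token = "<|endoftext|>"  # assumed field name
    config.eos_token = "<|endoftext|>"  # assumed field name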

get_vocabulary_size()

Returns the vocabulary size expected by the model.

optimize(quantization: Optional[str] = None) → None

Recursively applies some optimizations to this layer:

  • Alias variables with the same shape and value.

  • Quantize weights.

Parameters

quantization – Weight quantization scheme (possible values are: int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32).
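
For example, to quantize the weights to 8-bit integers before saving:

    # Quantize weights to 8-bit integers; other schemes such as "float16"
    # use the same call pattern.
    spec.optimize(quantization="int8")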

register_file(path: str, filename: Optional[str] = None) → None

Registers a file to be saved in the model directory.
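
For example, to bundle an external tokenizer file with the converted model (the path is illustrative):

    # Copy a tokenizer file into the model directory, optionally renaming it.
    spec.register_file("/path/to/tokenizer.model", filename="tokenizer.model")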

register_vocabulary(tokens: List[str]) → None

Registers the vocabulary of tokens.

Parameters

tokens – List of tokens.
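
For example (the token list below is a placeholder for the vocabulary extracted from the original model):

    tokens = ["<unk>", "<s>", "</s>", "hello", "world"]  # placeholder vocabulary
    spec.register_vocabulary(tokens)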

save(output_dir: str) → None

Saves this model on disk.

Parameters

output_dir – Output directory where the model is saved.
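
A conversion script typically calls save() last, once all weights are set; a sketch of that flow:

    # Typical end of a conversion script: check, quantize, then write to disk.
    spec.validate()
    spec.optimize(quantization="int8")
    spec.save("gpt2_ct2")  # output directory for the converted model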

validate() → None

Verifies that the required weights are set.

Raises

ValueError – If a required weight is not set in the specification.
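
Since a missing weight raises ValueError, a conversion script can fail early with a readable message:

    try:
        spec.validate()
    except ValueError as e:
        # A required weight was never assigned on the specification.
        print("Incomplete specification:", e)
        raise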

variables(prefix: str = '', ordered: bool = False) → Dict[str, ndarray]

Recursively returns the weights from this layer and its children.

Parameters
  • prefix – Prefix to prepend to all variable names.

  • ordered – If set, an ordered list is returned instead.

Returns

Dictionary mapping variable names to values.
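
For example, once the weights have been set, the flattened variable map can be inspected (variable names and shapes depend on the model):

    for name, value in spec.variables(prefix="model/").items():
        print(name, value.shape)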

property config

The model configuration.

property name

The name of the model specification.

property revision

The model specification revision.

This value is incremented each time the weights layout of the model is changed (e.g. a weight is renamed).