TransformerDecoderSpec

class ctranslate2.specs.TransformerDecoderSpec

Inherits from: ctranslate2.specs.LayerSpec

Attributes:
  • config

Methods:
  • __init__
  • optimize
  • validate
  • variables

__init__(num_layers: int, num_heads: int, pre_norm: bool = True, activation: Activation = Activation.RELU, layernorm_embedding: bool = False, with_encoder_attention: bool = True, no_final_norm: bool = False, project_in_out: bool = False, relative_position: bool = False, relative_attention_bias: bool = False, alignment_layer: int = -1, alignment_heads: int = 1, ffn_glu: bool = False, rms_norm: bool = False, alibi: bool = False, alibi_use_positive_positions: bool = False, scale_alibi: bool = False, rotary_dim: Optional[int] = None, rotary_interleave: bool = True, rotary_scaling_type: Optional[RotaryScalingType] = None, rotary_scaling_factor: float = 1, rotary_base: float = 10000, original_max_position_embeddings: int = 0, max_position_embeddings: int = 0, parallel_residual: bool = False, shared_layer_norm: bool = False, pre_post_layer_norm: bool = False, multi_query_attention: bool = False, num_heads_kv: Optional[int] = None, head_dim: Optional[int] = None, sliding_window: Optional[int] = None, quant_type: Optional[Quantization] = None, quant_group_size: Optional[int] = None, quant_bits: Optional[int] = None)

Initializes a Transformer decoder specification.

Parameters
  • num_layers – Number of layers.

  • num_heads – Number of attention heads.

  • pre_norm – Enable the pre-norm Transformer architecture.

  • activation – Activation to apply in the feed-forward network.

  • layernorm_embedding – Apply layer normalization after the embedding layer.

  • with_encoder_attention – Enable the encoder attention sublayers.

  • no_final_norm – Disable the final layer norm in the pre-norm architecture.

  • project_in_out – Add linear transformations after the embedding layer and before the final layer.

  • relative_position – Use relative position representations in the self-attention layers as described in https://arxiv.org/abs/1803.02155.

  • relative_attention_bias – Use relative attention bias in the self-attention layers as described in the T5 paper https://arxiv.org/abs/1910.10683.

  • alignment_layer – Layer index selected for alignment.

  • alignment_heads – Number of attention heads selected for alignment.

  • ffn_glu – Use gated linear units in the FFN layers as described in https://arxiv.org/abs/2002.05202.

  • rms_norm – Use the root mean square layer normalization.

  • alibi – Use attention with linear biases.

  • alibi_use_positive_positions – Use positive positions in the ALiBi definition.

  • scale_alibi – Apply the dot product scale factor to ALiBi.

  • rotary_dim – Apply rotary embeddings to these first N dimensions. If 0, rotary embeddings are applied to all dimensions.

  • rotary_interleave – Interleave the head dimensions when rotary embeddings are applied. Otherwise the head dimensions are sliced in half.

  • rotary_scaling_type – Type of RoPE scaling.

  • rotary_scaling_factor – Factor used in the RoPE scaling.

  • rotary_base – The base period of the rotary embeddings.

  • original_max_position_embeddings – The original maximum position embeddings for Su RoPE embeddings.

  • max_position_embeddings – The maximum position embeddings for Su RoPE embeddings.

  • parallel_residual – Use parallel residual connections in each layer block, as used by the GPT-J and GPT-NeoX models.

  • shared_layer_norm – When using parallel residual, share the input and post attention layer norms.

  • pre_post_layer_norm – Add a post layer norm for each pre-norm layer.

  • multi_query_attention – Use multi-query attention (alias for num_heads_kv=1).

  • num_heads_kv – Number of attention heads for the key and value.

  • sliding_window – Maximum sequence length to retain in the KV cache.

  • quant_type – Quantization type used for lower-bit quantization (e.g. AWQ).

  • quant_group_size – Group size of the lower-bit quantization.

  • quant_bits – Number of bits of the quantization (e.g. 4-bit).
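As a usage illustration, the snippet below builds a small decoder-only specification. It is a minimal sketch: the layer and head counts are arbitrary, and it assumes Activation is importable from ctranslate2.specs as used in the signature above.

import ctranslate2

# Minimal sketch: a 6-layer, 8-head decoder-only spec with RMSNorm and
# rotary embeddings applied to all head dimensions (values are illustrative).
spec = ctranslate2.specs.TransformerDecoderSpec(
    num_layers=6,
    num_heads=8,
    activation=ctranslate2.specs.Activation.GELU,
    with_encoder_attention=False,  # decoder-only: no cross-attention sublayers
    rms_norm=True,
    rotary_dim=0,  # 0 applies rotary embeddings to all dimensions
)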

optimize(quantization: Optional[str] = None) → None

Recursively applies some optimizations to this layer:

  • Alias variables with the same shape and value.

  • Quantize weights.

Parameters

quantization – Weight quantization scheme (possible values are: int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32).
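For example, the following sketch quantizes a specification's weights to 8-bit integers before serialization; it assumes the spec has already been populated with weights by a converter.

# Alias duplicate variables and quantize the weights to INT8.
spec.optimize(quantization="int8")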

validate() → None

Verify that the required weights are set.

Raises

ValueError – If a required weight is not set in the specification.
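A converter can call this method to fail early when the specification is incomplete, as in this sketch:

try:
    spec.validate()
except ValueError as error:
    # A required weight was not set on the specification.
    print(f"Invalid specification: {error}")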

variables(prefix: str = '', ordered: bool = False) → Dict[str, ndarray]

Recursively returns the weights from this layer and its children.

Parameters
  • prefix – Prefix to prepend to all variable names.

  • ordered – If set, an ordered list is returned instead.

Returns

Dictionary mapping variable names to values.
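For instance, the sketch below lists the weight names and shapes of a populated specification, prepending a hypothetical "decoder/" prefix:

# Collect all weights from the spec and its children.
weights = spec.variables(prefix="decoder/")
for name, value in weights.items():
    print(name, value.shape)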

property config