TransformerEncoderSpec
- class ctranslate2.specs.TransformerEncoderSpec
Inherits from: ctranslate2.specs.LayerSpec
Methods:
- __init__(num_layers: int, num_heads: int, pre_norm: bool = True, no_final_norm: bool = False, activation: Activation = Activation.RELU, num_source_embeddings: int = 1, embeddings_merge: EmbeddingsMerge = EmbeddingsMerge.CONCAT, layernorm_embedding: bool = False, relative_position: bool = False, relative_attention_bias: bool = False, ffn_glu: bool = False, rms_norm: bool = False, multi_query_attention: bool = False, num_heads_kv: Optional[int] = None, head_dim: Optional[int] = None, rotary_dim: Optional[int] = None, rotary_interleave: bool = True, rotary_scaling_type: Optional[RotaryScalingType] = None, rotary_scaling_factor: float = 1, rotary_base: float = 10000, sliding_window: Optional[int] = None, qk_norm: Optional[bool] = False, pre_post_layer_norm: bool = False)
Initializes a Transformer encoder specification.
- Parameters
num_layers – Number of layers.
num_heads – Number of attention heads.
pre_norm – Enable the pre-norm Transformer architecture.
no_final_norm – Disable the final layer norm in the pre-norm architecture.
activation – Activation to apply in the feed-forward network.
num_source_embeddings – Number of source embeddings.
embeddings_merge – When num_source_embeddings > 1, specify how the embeddings are merged.
layernorm_embedding – Apply layer normalization after the embedding layer.
relative_position – Use relative position representations in the self-attention layers as described in https://arxiv.org/abs/1803.02155.
relative_attention_bias – Use relative attention bias in the self-attention layers as described in the T5 paper https://arxiv.org/abs/1910.10683.
ffn_glu – Use gated linear units in the FFN layers as described in https://arxiv.org/abs/2002.05202.
rms_norm – Use the root mean square layer normalization.
multi_query_attention – Use multi-query attention (alias for num_heads_kv=1).
num_heads_kv – Number of attention heads for the key and value.
head_dim – Number of dimensions per attention head.
rotary_dim – Apply rotary embeddings to these first N dimensions. If 0, rotary embeddings are applied to all dimensions.
rotary_interleave – Interleave the head dimensions when rotary embeddings are applied. Otherwise the head dimensions are sliced in half.
rotary_scaling_type – Type of RoPE scaling.
rotary_scaling_factor – Factor used in the RoPE scaling.
rotary_base – The base period of the rotary embeddings.
sliding_window – Max sequence length to retain in KV Cache.
qk_norm – Apply layer normalization to the query and key projections.
pre_post_layer_norm – Add post layer norm for each pre norm layer.
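For illustration, a minimal sketch of building a specification directly. The layer counts and optional flags below are arbitrary example values, not defaults taken from any particular model; only parameters listed above are used.

from ctranslate2 import specs

# A small encoder: 6 layers, 8 attention heads, default pre-norm + ReLU.
spec = specs.TransformerEncoderSpec(num_layers=6, num_heads=8)

# Optional features are enabled through keyword arguments, e.g. a gated
# feed-forward network with RMSNorm and grouped key/value heads.
spec_gqa = specs.TransformerEncoderSpec(
    num_layers=6,
    num_heads=8,
    ffn_glu=True,
    rms_norm=True,
    num_heads_kv=2,
)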
- optimize(quantization: Optional[str] = None) → None
Recursively applies some optimizations to this layer:
- Alias variables with the same shape and value.
- Quantize weights.
- Parameters
quantization – Weight quantization scheme (possible values are: int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32).
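A hedged usage sketch: in practice a converter fills in the weight arrays before optimization is applied, so the call below only illustrates passing one of the documented quantization schemes.

from ctranslate2 import specs

spec = specs.TransformerEncoderSpec(num_layers=6, num_heads=8)
# ... a converter would normally assign the weight arrays here ...

# Quantize the weights to 8-bit integers and alias identical variables.
spec.optimize(quantization="int8")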
- validate() → None
Verify that the required weights are set.
- Raises
ValueError – If a required weight is not set in the specification.
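For example, validating a freshly constructed specification is expected to fail, since no weights have been assigned yet:

from ctranslate2 import specs

spec = specs.TransformerEncoderSpec(num_layers=6, num_heads=8)

# No weights have been set, so validation should report them as missing.
try:
    spec.validate()
except ValueError as error:
    print("Incomplete specification:", error)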
- variables(prefix: str = '', ordered: bool = False) → Dict[str, ndarray]
Recursively returns the weights from this layer and its children.
- Parameters
prefix – Prefix to prepend to all variable names.
ordered – If set, an ordered list is returned instead.
- Returns
Dictionary mapping variable names to values.
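A short sketch of inspecting the variables. The "encoder/" prefix is an illustrative choice, and the values are NumPy arrays once the weights have been set by a converter.

from ctranslate2 import specs

spec = specs.TransformerEncoderSpec(num_layers=2, num_heads=4)

# Print the variable names declared by the specification. With a populated
# spec each value is a NumPy array; here only the names are inspected.
for name in spec.variables(prefix="encoder/"):
    print(name)

# With ordered=True, an ordered list is returned instead of a dictionary.
ordered_variables = spec.variables(ordered=True)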