class opennmt.models.Transformer(*args, **kwargs)[source]

Attention-based sequence-to-sequence model as described in Attention Is All You Need (Vaswani et al., 2017).

Inherits from: opennmt.models.SequenceToSequence

__init__(source_inputter=None, target_inputter=None, num_layers=6, num_units=512, num_heads=8, ffn_inner_dim=2048, dropout=0.1, attention_dropout=0.1, ffn_dropout=0.1, ffn_activation=<function relu>, position_encoder_class=<class 'opennmt.layers.position.SinusoidalPositionEncoder'>, share_embeddings=0, share_encoders=False, maximum_relative_position=None, attention_reduction=<MultiHeadAttentionReduction.FIRST_HEAD_LAST_LAYER: 1>, pre_norm=True)[source]

Initializes a Transformer model.

  • source_inputter – An opennmt.inputters.Inputter to process the source data. If this inputter returns parallel inputs, a multi-source Transformer architecture is constructed. Defaults to an opennmt.inputters.WordEmbedder with num_units as the embedding size.

  • target_inputter – An opennmt.inputters.Inputter to process the target data. Currently, only opennmt.inputters.WordEmbedder is supported. Defaults to an opennmt.inputters.WordEmbedder with num_units as the embedding size.

  • num_layers – The number of layers or a 2-tuple with the number of encoder layers and decoder layers.

  • num_units – The number of hidden units.

  • num_heads – The number of heads in each self-attention layer.

  • ffn_inner_dim – The inner dimension of the feed forward layers.

  • dropout – The probability to drop units in each layer output.

  • attention_dropout – The probability to drop units from the attention.

  • ffn_dropout – The probability to drop units from the ReLU activation in the feed forward layer.

  • ffn_activation – The activation function to apply between the two linear transformations of the feed forward layer.

  • position_encoder_class – The opennmt.layers.PositionEncoder class to use for position encoding (or a callable that returns an instance).

  • share_embeddings – Level of embeddings sharing, see opennmt.models.EmbeddingsSharingLevel for possible values.

  • share_encoders – In a multi-source architecture, whether to share the parameters of the separate encoders.

  • maximum_relative_position – Maximum relative position representation (from Self-Attention with Relative Position Representations, Shaw et al., 2018).

  • attention_reduction – A opennmt.layers.MultiHeadAttentionReduction value to specify how to reduce target-source multi-head attention matrices.

  • pre_norm – If True, layer normalization is applied before each sub-layer. Otherwise it is applied after. The original paper uses pre_norm=False, but the authors later suggested that pre_norm=True “seems better for harder-to-learn models, so it should probably be the default.”
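To illustrate the default position_encoder_class, below is a minimal pure-Python sketch of sinusoidal position encoding following the standard formulation from Vaswani et al., 2017 (sin on even dimensions, cos on odd dimensions). This is a sketch of the general technique, not the library's exact implementation:

```python
import math

def sinusoidal_position_encoding(max_len, num_units):
    """Build a [max_len, num_units] table of sinusoidal position encodings.

    Sketch of the technique used by SinusoidalPositionEncoder (an assumption
    based on the standard formulation, not the library's exact code).
    """
    table = []
    for pos in range(max_len):
        row = []
        for i in range(num_units):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / num_units))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

encodings = sinusoidal_position_encoding(max_len=4, num_units=8)
# Position 0 encodes to sin(0) = 0.0 on even dimensions and cos(0) = 1.0 on odd ones.
```

In the model, this table is added to the scaled word embeddings before the first layer, which is why no learned position parameters are needed.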

property ctranslate2_spec

The equivalent CTranslate2 model specification.


auto_config(num_replicas=1)[source]

Returns automatic configuration values specific to this model.

  • num_replicas – The number of synchronous model replicas used for the training.

Returns: A partial training configuration.

map_v1_weights(weights)[source]

Maps current weights to V1 weights.

  • weights – A nested dictionary following the scope names used in V1. The leaves are tuples with the variable value and optionally the optimizer slots.

Returns: A list of tuples associating variables and their V1 equivalent.
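A minimal usage sketch, assuming OpenNMT-tf is installed (the import is guarded so the snippet degrades gracefully when it is not; the auto_config call reflects the configuration method documented above):

```python
# Hedged usage sketch: construct a base Transformer with the documented
# defaults. Assumes OpenNMT-tf (`pip install OpenNMT-tf`) and TensorFlow
# are available; the guard skips construction otherwise.
try:
    import opennmt

    # All constructor arguments are optional; the inputters default to
    # WordEmbedder instances with num_units as the embedding size.
    model = opennmt.models.Transformer(num_layers=6, num_units=512, num_heads=8)

    # Retrieve the partial training configuration for a single replica.
    config = model.auto_config(num_replicas=1)
    available = True
except ImportError:
    # OpenNMT-tf is not installed in this environment.
    model, config, available = None, None, False
```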