Transformer

class opennmt.models.Transformer(*args, **kwargs)[source]

Attention-based sequence-to-sequence model as described in https://arxiv.org/abs/1706.03762.

Inherits from: opennmt.models.SequenceToSequence

__init__(source_inputter=None, target_inputter=None, num_layers=6, num_units=512, num_heads=8, ffn_inner_dim=2048, dropout=0.1, attention_dropout=0.1, ffn_dropout=0.1, ffn_activation=<function relu>, mha_bias=True, position_encoder_class=<class 'opennmt.layers.position.SinusoidalPositionEncoder'>, share_embeddings=0, share_encoders=False, maximum_relative_position=None, attention_reduction=MultiHeadAttentionReduction.FIRST_HEAD_LAST_LAYER, pre_norm=True, output_layer_bias=True)[source]

Initializes a Transformer model (see the construction sketch after the parameter list).

Parameters
  • source_inputter – An opennmt.inputters.Inputter to process the source data. If this inputter returns parallel inputs, a multi-source Transformer architecture is constructed. Defaults to an opennmt.inputters.WordEmbedder with num_units as the embedding size.

  • target_inputter – An opennmt.inputters.Inputter to process the target data. Currently, only opennmt.inputters.WordEmbedder is supported. Defaults to an opennmt.inputters.WordEmbedder with num_units as the embedding size.

  • num_layers – The number of layers or a 2-tuple with the number of encoder layers and decoder layers.

  • num_units – The number of hidden units.

  • num_heads – The number of heads in each self-attention layer.

  • ffn_inner_dim – The inner dimension of the feed forward layers.

  • dropout – The probability to drop units in each layer output.

  • attention_dropout – The probability to drop units from the attention.

  • ffn_dropout – The probability to drop units from the activation output in the feed forward layer.

  • ffn_activation – The activation function to apply between the two linear transformations of the feed forward layer.

  • mha_bias – Whether to add a bias term to the linear layers in the multi-head attention.

  • position_encoder_class – The opennmt.layers.PositionEncoder class to use for position encoding (or a callable that returns an instance).

  • share_embeddings – Level of embeddings sharing, see opennmt.models.EmbeddingsSharingLevel for possible values.

  • share_encoders – In a multi-source architecture, whether to share the parameters of the separate encoders.

  • maximum_relative_position – Maximum relative position representation (from https://arxiv.org/abs/1803.02155).

  • attention_reduction – A opennmt.layers.MultiHeadAttentionReduction value to specify how to reduce target-source multi-head attention matrices.

  • pre_norm – If True, layer normalization is applied before each sub-layer. Otherwise it is applied after. The original paper uses pre_norm=False, but the authors later suggested that pre_norm=True “seems better for harder-to-learn models, so it should probably be the default.”

  • output_layer_bias – Whether to add a bias term to the output layer.
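
A minimal construction sketch, following the library's custom model pattern of subclassing the Transformer; the subclass name and hyperparameter values below are illustrative only, not part of this API:

    import opennmt

    # A smaller Transformer variant; class name and sizes are illustrative.
    class MyTinyTransformer(opennmt.models.Transformer):
        def __init__(self):
            super().__init__(
                source_inputter=opennmt.inputters.WordEmbedder(embedding_size=256),
                target_inputter=opennmt.inputters.WordEmbedder(embedding_size=256),
                num_layers=4,
                num_units=256,
                num_heads=4,
                ffn_inner_dim=1024,
            )

    # Calling the constructor without arguments builds the base configuration
    # described above (6 layers, 512 units, 8 heads, ...).
    base_model = opennmt.models.Transformer()
    small_model = MyTinyTransformer()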

auto_config(num_replicas=1)[source]

Returns automatic configuration values specific to this model.

Parameters

num_replicas – The number of synchronous model replicas used for training.

Returns

A partial training configuration.
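
A brief sketch of calling this method directly; in practice the returned values are assumed to be merged with the user configuration by the training entry point (for example, opennmt.Runner with auto_config enabled):

    import opennmt

    model = opennmt.models.Transformer()  # base Transformer with default arguments

    # Partial training configuration tuned for this model; values that depend on
    # the training setup are scaled through num_replicas.
    config = model.auto_config(num_replicas=1)
    print(config)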

map_v1_weights(weights)[source]

Maps current weights to V1 weights.

Parameters

weights – A nested dictionary following the scope names used in V1. The leaves are tuples with the variable value and optionally the optimizer slots.

Returns

A list of tuples associating variables and their V1 equivalent.