Transformer
- class opennmt.models.Transformer(*args, **kwargs)[source]
Attention-based sequence-to-sequence model as described in https://arxiv.org/abs/1706.03762.
Inherits from:
opennmt.models.SequenceToSequence
- __init__(source_inputter=None, target_inputter=None, num_layers=6, num_units=512, num_heads=8, ffn_inner_dim=2048, dropout=0.1, attention_dropout=0.1, ffn_dropout=0.1, ffn_activation=<function relu>, mha_bias=True, position_encoder_class=<class 'opennmt.layers.position.SinusoidalPositionEncoder'>, share_embeddings=0, share_encoders=False, maximum_relative_position=None, attention_reduction=MultiHeadAttentionReduction.FIRST_HEAD_LAST_LAYER, pre_norm=True, output_layer_bias=True)[source]
Initializes a Transformer model.
- Parameters
source_inputter – An opennmt.inputters.Inputter to process the source data. If this inputter returns parallel inputs, a multi-source Transformer architecture will be constructed. Defaults to an opennmt.inputters.WordEmbedder with num_units as the embedding size.
target_inputter – An opennmt.inputters.Inputter to process the target data. Currently, only opennmt.inputters.WordEmbedder is supported. Defaults to an opennmt.inputters.WordEmbedder with num_units as the embedding size.
num_layers – The number of layers, or a 2-tuple with the number of encoder layers and decoder layers.
num_units – The number of hidden units.
num_heads – The number of heads in each self-attention layer.
ffn_inner_dim – The inner dimension of the feed forward layers.
dropout – The probability to drop units in each layer output.
attention_dropout – The probability to drop units from the attention.
ffn_dropout – The probability to drop units from the activation output in the feed forward layer.
ffn_activation – The activation function to apply between the two linear transformations of the feed forward layer.
mha_bias – Whether to add a bias after the linear layers in the multi-head attention.
position_encoder_class – The opennmt.layers.PositionEncoder class to use for position encoding (or a callable that returns an instance).
share_embeddings – Level of embeddings sharing, see opennmt.models.EmbeddingsSharingLevel for possible values.
share_encoders – In case of a multi-source architecture, whether to share the parameters of the separate encoders or not.
maximum_relative_position – Maximum relative position representation (from https://arxiv.org/abs/1803.02155).
attention_reduction – An opennmt.layers.MultiHeadAttentionReduction value specifying how to reduce the target-source multi-head attention matrices.
pre_norm – If True, layer normalization is applied before each sub-layer; otherwise it is applied after. The original paper uses pre_norm=False, but the authors later suggested that pre_norm=True “seems better for harder-to-learn models, so it should probably be the default.”
output_layer_bias – Whether to add a bias after the output layer.
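As a minimal sketch (assuming OpenNMT-tf 2.x), a base Transformer could be constructed with explicit inputters as follows; the embedding sizes and layer dimensions simply restate the defaults documented above rather than required values:

```python
import opennmt

# Sketch of a base Transformer: explicit word embedders on both sides and
# the dimensions listed in the constructor defaults above.
model = opennmt.models.Transformer(
    source_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
    target_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
    num_layers=6,
    num_units=512,
    num_heads=8,
    ffn_inner_dim=2048,
    dropout=0.1,
    attention_dropout=0.1,
    ffn_dropout=0.1,
)
```

Passing a source inputter that returns parallel inputs (e.g. a parallel inputter wrapping several word embedders) would instead build the multi-source variant described above.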
- auto_config(num_replicas=1)[source]
Returns automatic configuration values specific to this model.
- Parameters
num_replicas – The number of synchronous model replicas used for training.
- Returns
A partial training configuration.
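For illustration, the partial configuration can be retrieved and inspected before merging it with a user configuration. The sketch below reuses the model instance from the previous example; the exact keys returned depend on the OpenNMT-tf version:

```python
import pprint

# Ask the model for its recommended partial training configuration when
# training on a single replica, then inspect it. The returned dictionary is
# merged with the user configuration by the training entry points.
config = model.auto_config(num_replicas=1)
pprint.pprint(config)
```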
- map_v1_weights(weights)[source]
Maps current weights to V1 weights.
- Parameters
weights – A nested dictionary following the scope names used in V1. The leaves are tuples with the variable value and optionally the optimizer slots.
- Returns
A list of tuples associating variables and their V1 equivalent.