Doc: Modules

Core Modules

class onmt.modules.Embeddings(word_vec_size, word_vocab_size, word_padding_idx, position_encoding=False, feat_merge='concat', feat_vec_exponent=0.7, feat_vec_size=-1, feat_padding_idx=[], feat_vocab_sizes=[], dropout=0)

Word embeddings for encoder/decoder.

Additionally includes the ability to add sparse input features based on “Linguistic Input Features Improve Neural Machine Translation” [SH16].

graph LR A[Input] C[Feature 1 Lookup] A-->B[Word Lookup] A-->C A-->D[Feature N Lookup] B-->E[MLP/Concat] C-->E D-->E E-->F[Output]
Parameters:
  • word_vec_size (int) – size (dimensionality) of the word embedding vectors.
  • word_padding_idx (int) – padding index for words in the embeddings.
  • feat_padding_idx (list of int) – padding index for a list of features in the embeddings.
  • word_vocab_size (int) – size of dictionary of embeddings for words.
  • feat_vocab_sizes ([int], optional) – list of size of dictionary of embeddings for each feature.
  • position_encoding (bool) – see onmt.modules.PositionalEncoding
  • feat_merge (string) – merge action for the features embeddings: concat, sum or mlp.
  • feat_vec_exponent (float) – when using -feat_merge concat, feature embedding size is N^feat_vec_exponent, where N is the number of values the feature takes.
  • feat_vec_size (int) – embedding dimension for features when using -feat_merge mlp
  • dropout (float) – dropout probability.
forward(input)

Computes the embeddings for words and features.

Parameters: input (LongTensor) – index tensor [len x batch x nfeat]
Returns: word embeddings [len x batch x embedding_size]
Return type: FloatTensor
load_pretrained_vectors(emb_file, fixed)

Load in pretrained embeddings.

Parameters:
  • emb_file (str) – path to torch serialized embeddings
  • fixed (bool) – if true, embeddings are not updated
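
A minimal usage sketch, assuming the constructor and forward() signatures documented above; the vocabulary sizes, padding indices and input values below are illustrative only:

    import torch
    from onmt.modules import Embeddings

    # One word feature in addition to the words themselves (nfeat = 2).
    emb = Embeddings(word_vec_size=500,
                     word_vocab_size=10000,
                     word_padding_idx=1,
                     feat_merge='concat',
                     feat_padding_idx=[1],
                     feat_vocab_sizes=[40])

    # Index tensor [len x batch x nfeat]: column 0 holds word indices,
    # column 1 holds the feature indices.
    inp = torch.randint(2, 40, (7, 3, 2)).long()
    out = emb(inp)   # FloatTensor [len x batch x embedding_size]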

Encoders

class onmt.modules.EncoderBase

Base encoder class. Specifies the interface used by different encoder types and required by onmt.Models.NMTModel.

graph BT A[Input] subgraph RNN C[Pos 1] D[Pos 2] E[Pos N] end F[Context] G[Final] A-->C A-->D A-->E C-->F D-->F E-->F E-->G
forward(input, lengths=None, hidden=None)
Parameters:
  • input (LongTensor) – padded sequences of sparse indices [src_len x batch x nfeat]
  • lengths (LongTensor) – length of each sequence [batch]
  • hidden (class specific) – initial hidden state.
Returns:
(tuple of FloatTensor, FloatTensor):
  • final encoder state, used to initialize decoder
    [layers x batch x hidden]
  • contexts for attention, [src_len x batch x hidden]
class onmt.modules.MeanEncoder(num_layers, embeddings)

A trivial non-recurrent encoder. Simply applies mean pooling.

Parameters:
  • num_layers (int) – number of replicated layers
  • embeddings (onmt.modules.Embeddings) – embedding module to use
forward(input, lengths=None, hidden=None)

See EncoderBase.forward()
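
A standalone sketch of the pooling this encoder performs (shapes follow EncoderBase.forward(); this is an illustration, not the onmt implementation):

    import torch

    src_len, batch, dim, num_layers = 7, 3, 256, 2
    emb = torch.randn(src_len, batch, dim)              # embedded source [src_len x batch x dim]

    context = emb                                       # contexts for attention
    mean = emb.mean(0)                                  # [batch x dim]
    final_state = mean.expand(num_layers, batch, dim)   # replicated for each layer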

class onmt.modules.RNNEncoder(rnn_type, bidirectional, num_layers, hidden_size, dropout=0.0, embeddings=None)

A generic recurrent neural network encoder.

Parameters:
  • rnn_type (str) – style of recurrent unit to use, one of [RNN, LSTM, GRU, SRU]
  • bidirectional (bool) – use a bidirectional RNN
  • num_layers (int) – number of stacked layers
  • hidden_size (int) – hidden size of each layer
  • dropout (float) – dropout value for nn.Dropout
  • embeddings (onmt.modules.Embeddings) – embedding module to use
forward(input, lengths=None, hidden=None)

See EncoderBase.forward()
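
A usage sketch, assuming the constructor and forward() signatures documented above; all sizes and indices are illustrative only:

    import torch
    from onmt.modules import Embeddings, RNNEncoder

    emb = Embeddings(word_vec_size=256, word_vocab_size=10000, word_padding_idx=1)
    enc = RNNEncoder(rnn_type='LSTM', bidirectional=True,
                     num_layers=2, hidden_size=256, embeddings=emb)

    src = torch.randint(2, 10000, (7, 3, 1)).long()   # [src_len x batch x nfeat]
    lengths = torch.LongTensor([7, 7, 7])             # [batch]
    final_state, context = enc(src, lengths)
    # final_state initializes the decoder; context is [src_len x batch x hidden].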

Decoders

class onmt.modules.RNNDecoderBase(rnn_type, bidirectional_encoder, num_layers, hidden_size, attn_type='general', coverage_attn=False, context_gate=None, copy_attn=False, dropout=0.0, embeddings=None)

Base recurrent attention-based decoder class. Specifies the interface used by different decoder types and required by onmt.Models.NMTModel.

graph BT A[Input] subgraph RNN C[Pos 1] D[Pos 2] E[Pos N] end G[Decoder State] H[Decoder State] I[Outputs] F[Context] A--emb-->C A--emb-->D A--emb-->E H-->C C-- attn --- F D-- attn --- F E-- attn --- F C-->I D-->I E-->I E-->G F---I
Parameters:
  • rnn_type (str) – style of recurrent unit to use, one of [RNN, LSTM, GRU, SRU]
  • bidirectional_encoder (bool) – use with a bidirectional encoder
  • num_layers (int) – number of stacked layers
  • hidden_size (int) – hidden size of each layer
  • attn_type (str) – see onmt.modules.GlobalAttention
  • coverage_attn (str) – see onmt.modules.GlobalAttention
  • context_gate (str) – see onmt.modules.ContextGate
  • copy_attn (bool) – setup a separate copy attention mechanism
  • dropout (float) – dropout value for nn.Dropout
  • embeddings (onmt.modules.Embeddings) – embedding module to use
forward(input, context, state, context_lengths=None)
Parameters:
  • input (LongTensor) – sequences of padded tokens [tgt_len x batch x nfeats].
  • context (FloatTensor) – vectors from the encoder [src_len x batch x hidden].
  • state (onmt.Models.DecoderState) – decoder state object to initialize the decoder
  • context_lengths (LongTensor) – the padded source lengths [batch].
Returns:

  • outputs: output from the decoder
    [tgt_len x batch x hidden].
  • state: final hidden state from the decoder
  • attns: distribution over src at each tgt
    [tgt_len x batch x src_len].

Return type:

(FloatTensor, onmt.Models.DecoderState, FloatTensor)

class onmt.modules.StdRNNDecoder(rnn_type, bidirectional_encoder, num_layers, hidden_size, attn_type='general', coverage_attn=False, context_gate=None, copy_attn=False, dropout=0.0, embeddings=None)

Standard fully batched RNN decoder with attention. A faster implementation that uses CuDNN internally. See RNNDecoderBase for options.

Based around the approach from “Neural Machine Translation By Jointly Learning To Align and Translate” [BCB14]

Implemented without input_feeding and currently with no coverage_attn or copy_attn support.

class onmt.modules.InputFeedRNNDecoder(rnn_type, bidirectional_encoder, num_layers, hidden_size, attn_type='general', coverage_attn=False, context_gate=None, copy_attn=False, dropout=0.0, embeddings=None)

Input feeding based decoder. See RNNDecoderBase for options.

Based around the input feeding approach from “Effective Approaches to Attention-based Neural Machine Translation” [LPM15]

graph BT A[Input n-1] AB[Input n] subgraph RNN E[Pos n-1] F[Pos n] E --> F end G[Encoder] H[Context n-1] A --> E AB --> F E --> H G --> H
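
A standalone sketch of the input-feed step itself, not the onmt implementation: the attentional output from step n-1 is concatenated to the current target embedding before it enters the RNN (all sizes are illustrative):

    import torch
    import torch.nn as nn

    emb_dim, hidden, batch = 500, 500, 3
    cell = nn.LSTMCell(emb_dim + hidden, hidden)   # RNN input grows by `hidden`

    tgt_emb = torch.randn(batch, emb_dim)          # embedding of target word n
    prev_attn_out = torch.randn(batch, hidden)     # attentional hidden from step n-1
    h, c = torch.zeros(batch, hidden), torch.zeros(batch, hidden)

    rnn_in = torch.cat([tgt_emb, prev_attn_out], dim=1)
    h, c = cell(rnn_in, (h, c))                    # attention is then applied to h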

Attention

class onmt.modules.GlobalAttention(dim, coverage=False, attn_type='dot')

Global attention takes a matrix and a query vector. It then computes a parameterized convex combination of the matrix based on the input query.

Constructs a unit mapping a query q of size dim and a source matrix H of size n x dim, to an output of size dim.

graph BT A[Query] subgraph RNN C[H 1] D[H 2] E[H N] end F[Attn] G[Output] A --> F C --> F D --> F E --> F C -.-> G D -.-> G E -.-> G F --> G

All models compute the output as \(c = \sum_{j=1}^{SeqLength} a_j H_j\), where \(a_j\) is the softmax of a score function. A projection layer is then applied to \([q, c]\).

However, they differ in how they compute the attention score.

  • Luong Attention (dot, general):
    • dot: \(score(H_j,q) = H_j^T q\)
    • general: \(score(H_j, q) = H_j^T W_a q\)
  • Bahdanau Attention (mlp):
    • \(score(H_j, q) = v_a^T \tanh(W_a q + U_a H_j)\)
Parameters:
  • dim (int) – dimensionality of query and key
  • coverage (bool) – use coverage term
  • attn_type (str) – type of attention to use, options [dot,general,mlp]
forward(input, context, context_lengths=None, coverage=None)
Parameters:
  • input (FloatTensor) – query vectors [batch x tgt_len x dim]
  • context (FloatTensor) – source vectors [batch x src_len x dim]
  • context_lengths (LongTensor) – the source context lengths [batch]
  • coverage (FloatTensor) – None (not supported yet)
Returns:

  • Computed vector [tgt_len x batch x dim]
  • Attention distributions for each query
    [tgt_len x batch x src_len]

Return type:

(FloatTensor, FloatTensor)

score(h_t, h_s)
Parameters:
  • h_t (FloatTensor) – sequence of queries [batch x tgt_len x dim]
  • h_s (FloatTensor) – sequence of sources [batch x src_len x dim]
Returns:

raw attention scores (unnormalized) for each src index [batch x tgt_len x src_len]

Return type:

FloatTensor
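
A standalone sketch of the three score functions with the shapes used by score(); here W_a, U_a and v_a are random illustrative parameters, not the module's weights:

    import torch

    batch, tgt_len, src_len, dim = 2, 4, 6, 8
    h_t = torch.randn(batch, tgt_len, dim)   # queries
    h_s = torch.randn(batch, src_len, dim)   # sources

    # dot: score(H_j, q) = H_j^T q
    dot = torch.bmm(h_t, h_s.transpose(1, 2))             # [batch x tgt_len x src_len]

    # general: score(H_j, q) = H_j^T W_a q
    W_a = torch.randn(dim, dim)
    general = torch.bmm(h_t @ W_a, h_s.transpose(1, 2))   # [batch x tgt_len x src_len]

    # mlp: score(H_j, q) = v_a^T tanh(W_a q + U_a H_j)
    W_m, U_a, v_a = torch.randn(dim, dim), torch.randn(dim, dim), torch.randn(dim)
    wq = (h_t @ W_m).unsqueeze(2)                         # [batch x tgt_len x 1 x dim]
    uh = (h_s @ U_a).unsqueeze(1)                         # [batch x 1 x src_len x dim]
    mlp = torch.tanh(wq + uh) @ v_a                       # [batch x tgt_len x src_len]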

Architecture: Transformer

class onmt.modules.PositionalEncoding(dropout, dim, max_len=5000)

Implements the sinusoidal positional encoding for non-recurrent neural networks.

Implementation based on “Attention Is All You Need” [DBLP:journals/corr/VaswaniSPUJGKP17]

Parameters:
  • dropout (float) – dropout parameter
  • dim (int) – embedding size
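
A standalone sketch of the sinusoidal table from “Attention Is All You Need”, where \(PE(pos, 2i) = \sin(pos / 10000^{2i/dim})\) and \(PE(pos, 2i+1)\) uses the cosine; the table is added to the word embeddings:

    import math
    import torch

    def sinusoidal_encoding(max_len, dim):
        # dim is assumed even here for simplicity.
        pe = torch.zeros(max_len, dim)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, dim, 2).float() * -(math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe   # [max_len x dim]

    pe = sinusoidal_encoding(5000, 512)
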
class onmt.modules.PositionwiseFeedForward(size, hidden_size, dropout=0.1)

A two-layer Feed-Forward-Network with residual layer norm.

Parameters:
  • size (int) – the size of input for the first-layer of the FFN.
  • hidden_size (int) – the size of the hidden (inner) layer of the FFN.
  • dropout (float) – dropout probability (0-1.0).
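
A minimal standalone sketch of such a block; the exact placement of layer norm and dropout here is illustrative and may differ from the onmt implementation:

    import torch
    import torch.nn as nn

    class FeedForwardSketch(nn.Module):
        def __init__(self, size, hidden_size, dropout=0.1):
            super().__init__()
            self.w_1 = nn.Linear(size, hidden_size)
            self.w_2 = nn.Linear(hidden_size, size)
            self.norm = nn.LayerNorm(size)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            # two linear layers with a non-linearity, plus a residual connection
            inner = self.dropout(torch.relu(self.w_1(self.norm(x))))
            return self.w_2(inner) + x
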
class onmt.modules.TransformerEncoder(num_layers, hidden_size, dropout, embeddings)

The Transformer encoder from “Attention is All You Need”.

graph BT A[input] B[multi-head self-attn] C[feed forward] O[output] A --> B B --> C C --> O
Parameters:
  • num_layers (int) – number of encoder layers
  • hidden_size (int) – number of hidden units
  • dropout (float) – dropout parameters
  • embeddings (onmt.modules.Embeddings) – embeddings to use, should have positional encodings
forward(input, lengths=None, hidden=None)

See EncoderBase.forward()

class onmt.modules.TransformerDecoder(num_layers, hidden_size, attn_type, copy_attn, dropout, embeddings)

The Transformer decoder from “Attention is All You Need”.

graph BT A[input] B[multi-head self-attn] BB[multi-head src-attn] C[feed forward] O[output] A --> B B --> BB BB --> C C --> O
Parameters:
  • num_layers (int) – number of decoder layers.
  • hidden_size (int) – number of hidden units
  • dropout (float) – dropout parameters
  • embeddings (onmt.modules.Embeddings) – embeddings to use, should have positional encodings
  • attn_type (str) – type of attention to use, see onmt.modules.GlobalAttention
  • copy_attn (bool) – whether to set up a separate copy attention mechanism
forward(input, context, state, context_lengths=None)

See onmt.modules.RNNDecoderBase.forward()

class onmt.modules.MultiHeadedAttention(head_count, model_dim, dropout=0.1)

Multi-Head Attention module from “Attention is All You Need” [DBLP:journals/corr/VaswaniSPUJGKP17].

Similar to standard dot attention but uses multiple attention distributions simultaneously to select relevant items.

graph BT A[key] B[value] C[query] O[output] subgraph Attn D[Attn 1] E[Attn 2] F[Attn N] end A --> D C --> D A --> E C --> E A --> F C --> F D --> O E --> O F --> O B --> O

Also includes several additional tricks.

Parameters:
  • head_count (int) – number of parallel heads
  • model_dim (int) – the dimension of keys/values/queries, must be divisible by head_count
  • dropout (float) – dropout parameter
forward(key, value, query, mask=None)

Compute the context vector and the attention vectors.

Parameters:
  • key (FloatTensor) – set of key_len key vectors [batch, key_len, dim]
  • value (FloatTensor) – set of key_len value vectors [batch, key_len, dim]
  • query (FloatTensor) – set of query_len query vectors [batch, query_len, dim]
  • mask – binary mask indicating which keys have non-zero attention [batch, query_len, key_len]
Returns:

  • output context vectors [batch, query_len, dim]
  • one of the attention vectors [batch, query_len, key_len]

Return type:

(FloatTensor, FloatTensor)
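
A standalone sketch of the underlying scaled dot-product attention with head splitting; the linear projections of key/value/query and the final output projection are omitted, and all sizes are illustrative:

    import torch

    batch, key_len, query_len, model_dim, head_count = 2, 6, 4, 64, 8
    dim_per_head = model_dim // head_count

    key = torch.randn(batch, key_len, model_dim)
    value = torch.randn(batch, key_len, model_dim)
    query = torch.randn(batch, query_len, model_dim)

    def split_heads(x, length):
        # [batch x len x model_dim] -> [batch x heads x len x dim_per_head]
        return x.view(batch, length, head_count, dim_per_head).transpose(1, 2)

    k, v, q = split_heads(key, key_len), split_heads(value, key_len), split_heads(query, query_len)
    scores = torch.matmul(q, k.transpose(2, 3)) / dim_per_head ** 0.5   # [b x h x q_len x k_len]
    attn = torch.softmax(scores, dim=-1)
    context = torch.matmul(attn, v)                                     # [b x h x q_len x d_head]
    output = context.transpose(1, 2).contiguous().view(batch, query_len, model_dim)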

Architecture: Conv2Conv

(These methods are from a user contribution and have not been thoroughly tested.)

class onmt.modules.CNNEncoder(num_layers, hidden_size, cnn_kernel_width, dropout, embeddings)

Encoder built on CNN, based on [DBLP:journals/corr/GehringAGYD17].

forward(input, lengths=None, hidden=None)

See onmt.modules.EncoderBase.forward()

class onmt.modules.CNNDecoder(num_layers, hidden_size, attn_type, copy_attn, cnn_kernel_width, dropout, embeddings)

Decoder built on CNN, based on [DBLP:journals/corr/GehringAGYD17].

Consists of residual convolutional layers, with ConvMultiStepAttention.

forward(input, context, state, context_lengths=None)

See onmt.modules.RNNDecoderBase.forward()

class onmt.modules.ConvMultiStepAttention(input_size)

Conv attention takes a key matrix, a value matrix and a query vector. Attention weights are computed from the key matrix and the query vector, and then used to form a weighted sum over the value matrix. The same operation is applied in each decoder conv layer.

forward(base_target_emb, input, encoder_out_top, encoder_out_combine)
Parameters:
  • base_target_emb – target embedding tensor
  • input – output of the decoder conv layer
  • encoder_out_top – the key matrix for computing the attention weights, which is the top output of the encoder conv layers
  • encoder_out_combine – the value matrix for the attention-weighted sum, which is the combination of the base embeddings and the top output of the encoder
onmt.modules.WeightNorm

alias of onmt.modules.WeightNorm

Architecture: SRU

onmt.modules.SRU

alias of onmt.modules.SRU

Alternative Encoders

onmt.modules.AudioEncoder

class onmt.modules.AudioEncoder(num_layers, bidirectional, rnn_size, dropout, sample_rate, window_size)

A simple convolutional -> recurrent neural network encoder for audio input.

Parameters:
  • num_layers (int) – number of encoder layers.
  • bidirectional (bool) – bidirectional encoder.
  • rnn_size (int) – size of hidden states of the rnn.
  • dropout (float) – dropout probability.
  • sample_rate (float) – input spec
  • window_size (int) – input spec
forward(input, lengths=None)

See onmt.modules.EncoderBase.forward()

onmt.modules.ImageEncoder

class onmt.modules.ImageEncoder(num_layers, bidirectional, rnn_size, dropout)

A simple convolutional -> recurrent neural network encoder for image input.

Parameters:
  • num_layers (int) – number of encoder layers.
  • bidirectional (bool) – bidirectional encoder.
  • rnn_size (int) – size of hidden states of the rnn.
  • dropout (float) – dropout probability.
forward(input, lengths=None)

See onmt.modules.EncoderBase.forward()

Copy Attention

class onmt.modules.CopyGenerator(input_size, tgt_dict)

Generator module that additionally considers copying words directly from the source.

The main idea is that we have an extended “dynamic dictionary”. It contains |tgt_dict| words plus an arbitrary number of additional words introduced by the source sentence. For each source sentence we have a src_map that maps each source word to an index in tgt_dict if it is known, or else to an extra word.

The copy generator is an extended version of the standard generator that computes three values.

  • \(p_{softmax}\), the standard softmax over tgt_dict
  • \(p(z)\), the probability of instead copying a word from the source, computed using a Bernoulli variable
  • \(p_{copy}\), the probability of copying each particular word, taken directly from the attention distribution

The model returns a distribution over the extended dictionary, computed as

\(p(w) = p(z=1) p_{copy}(w) + p(z=0) p_{softmax}(w)\)

graph BT A[input] S[src_map] B[softmax] BB[switch] C[attn] D[copy] O[output] A --> B A --> BB S --> D C --> D D --> O B --> O BB --> O
Parameters:
  • input_size (int) – size of input representation
  • tgt_dict (Vocab) – output target dictionary
forward(hidden, attn, src_map)

Compute a distribution over the target dictionary extended by the dynamic dictionary implied by copying source words.

Parameters:
  • hidden (FloatTensor) – hidden outputs [batch*tlen, input_size]
  • attn (FloatTensor) – attention over the source for each output step [batch*tlen, src_len]
  • src_map (FloatTensor) – a sparse indicator matrix mapping each source word to its index in the “extended” vocabulary [src_len, batch, extra_words]
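
A standalone sketch of the mixture \(p(w) = p(z=1) p_{copy}(w) + p(z=0) p_{softmax}(w)\) over the extended dictionary; the tensors and sizes below are illustrative stand-ins for the module's hidden/attn/src_map inputs:

    import torch

    batch_tlen, vocab, extra_words = 5, 100, 7

    p_softmax = torch.softmax(torch.randn(batch_tlen, vocab), dim=-1)
    p_copy = torch.softmax(torch.randn(batch_tlen, extra_words), dim=-1)  # from attention over src
    p_z = torch.sigmoid(torch.randn(batch_tlen, 1))                       # copy switch p(z=1)

    # Distribution over the extended dictionary (tgt_dict words + extra source words).
    p_w = torch.cat([(1 - p_z) * p_softmax, p_z * p_copy], dim=-1)
    assert torch.allclose(p_w.sum(-1), torch.ones(batch_tlen))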

Structured Attention

class onmt.modules.MatrixTree(eps=1e-05)

Implementation of the matrix-tree theorem for computing marginals of non-projective dependency parsing. This attention layer is used in the paper “Learning Structured Text Representations.”

[LL17]
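
A standalone sketch of the underlying weighted matrix-tree identity for an undirected graph (the onmt module handles the directed dependency-parsing case); edge marginals can be obtained from derivatives of the log-partition with respect to the edge weights:

    import torch

    # With symmetric non-negative edge weights A (zero diagonal) and Laplacian
    # L = diag(A.sum(0)) - A, the determinant of any principal minor of L equals
    # the total weight of all spanning trees (the partition function).
    n = 4
    A = torch.rand(n, n)
    A = (A + A.t()) / 2 * (1 - torch.eye(n))   # symmetric, zero diagonal
    L = torch.diag(A.sum(0)) - A
    partition = torch.det(L[1:, 1:])           # weighted count of spanning trees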