Embeddings

Word embeddings are learned using a lookup table. Each word is assigned a random vector in this table, which is then updated with the gradients coming from the network during training.
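
For illustration, here is a minimal Torch sketch of this mechanism using nn.LookupTable (the sizes and indices are arbitrary):

require 'nn'

local vocab_size = 10
local embedding_size = 4

-- One randomly initialized row per word.
local lut = nn.LookupTable(vocab_size, embedding_size)

-- A "sentence" of word indices; the forward pass returns their vectors.
local input = torch.LongTensor({3, 7, 3})
local vectors = lut:forward(input)  -- 3 x 4 tensor

-- During backpropagation, only the rows that were looked up receive
-- gradients and are updated like any other network parameter.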

Pretrained

When training with small amounts of data, performance can be improved by starting with pretrained embeddings. The -pre_word_vecs_enc and -pre_word_vecs_dec options are used to pass the pretrained embedding files for the encoder and decoder respectively.
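
For example, building on the quickstart demo data (the paths are illustrative, and the embedding files are assumed to have been prepared as described below):

th train.lua -data data/demo-train.t7 -save_model demo-model \
             -pre_word_vecs_enc enc_embeddings.t7 -pre_word_vecs_dec dec_embeddings.t7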

When built manually, the pretrained embeddings must be Torch serialized tensors whose rows correspond to the entries of the source and target dictionary files. For example:

require 'torch'

local vocab_size = 50004
local embedding_size = 500

-- One row per word in the dictionary, randomly initialized.
local embeddings = torch.Tensor(vocab_size, embedding_size):uniform()

torch.save('enc_embeddings.t7', embeddings)

where embeddings[i] is the embedding of the i-th word in the vocabulary.
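
As a rough sketch of how such a tensor can be aligned with a dictionary produced by preprocess.lua (this assumes the usual one "word id" entry per line in the .dict file, and a hypothetical word_to_vector table holding your pretrained vectors):

require 'torch'

local embedding_size = 500

-- Read the dictionary so that vocab[id] gives the word at index id.
local vocab = {}
for line in io.lines('data/demo.src.dict') do
  local word, id = line:match('^(%S+)%s+(%d+)$')
  vocab[tonumber(id)] = word
end

-- Start from random vectors and overwrite the rows we have pretrained values for.
-- word_to_vector is a hypothetical table mapping a word to a 500-dimensional tensor.
local embeddings = torch.Tensor(#vocab, embedding_size):uniform()
for i, word in ipairs(vocab) do
  if word_to_vector[word] then
    embeddings[i]:copy(word_to_vector[word])
  end
end

torch.save('enc_embeddings.t7', embeddings)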

To automate this process, OpenNMT provides the script tools/embeddings.lua, which can download pretrained embeddings from Polyglot or convert trained embeddings from word2vec, GloVe or FastText according to the word vocabularies generated by preprocess.lua. Supported formats are:

  • word2vec-bin (default): binary format generated by word2vec.
  • word2vec-txt: textual word2vec format - starts with a header line containing the number of words and the embedding size, followed by one line per embedding: the first token is the word and the remaining fields are the embedding values (see the sample after this list).
  • glove: text format - same as word2vec-txt but without the header line.
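
For reference, a tiny word2vec-txt file with 3 words and 4-dimensional embeddings could look like this (the values are made up; the glove format is identical minus the first line):

3 4
the 0.418 0.250 -0.412 0.122
of -0.254 0.164 0.704 -0.113
and 0.268 0.143 -0.279 0.017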

Note

The script requires the lua-zlib package.

For example, to generate pretrained English word embeddings:

th tools/embeddings.lua -lang en -dict_file data/demo.src.dict -save_data data/demo-src-emb

Note

Language codes are Polyglot's Wikipedia Language Codes.

Or to map pretrained word2vec vectors to the built vocabulary:

th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin -dict_file data/demo.src.dict \
                        -save_data data/demo-src-emb

Tip

If a vocabulary word is not found as-is in the embeddings file, the -approximate option also looks for uppercase variants and variants without possible joiner marks. Words that are still not found can be dumped by setting the -save_unknown_dict parameter.
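
For example, extending the word2vec command above (this is a sketch assuming both options are plain flags; check the script options if they expect a value):

th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin -dict_file data/demo.src.dict \
                        -save_data data/demo-src-emb -approximate -save_unknown_dict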

Fixed

By default these embeddings are updated during training, but they can be held fixed with the -fix_word_vecs_enc and -fix_word_vecs_dec options. These options can be enabled or disabled during a retraining.

Tip

When using pretrained word embeddings, if you declare a larger -word_vec_size than the pretrained embedding size, the extra dimensions are uniformly initialized; you can then use -fix_word_vecs_enc pretrained (or -fix_word_vecs_dec pretrained) to fix the pretrained part and optimize only the remaining part.
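
For instance, with 500-dimensional pretrained encoder embeddings (paths are illustrative), the following trains 600-dimensional word vectors while keeping the pretrained part fixed on the encoder side:

th train.lua -data data/demo-train.t7 -save_model demo-model \
             -pre_word_vecs_enc enc_embeddings.t7 -word_vec_size 600 \
             -fix_word_vecs_enc pretrained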

Extraction

The tools/extract_embeddings.lua script can be used to extract the model word embeddings into text files. They can then be easily transformed into another format for visualization or processing.
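
A typical invocation could look like the following; the model path is illustrative and the option names (notably -output_dir) are an assumption, so check the script's options for the exact interface:

th tools/extract_embeddings.lua -model model.t7 -output_dir embeddings/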