Word embeddings are learned using a lookup table. Each word is assigned a randomly initialized vector in this table, which is then updated with the gradients coming from the network.
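The lookup-and-update behavior described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Lua/Torch code OpenNMT runs; the function names (`make_lookup_table`, `lookup`, `sgd_update`), the initialization range, and the learning rate are assumptions made for the example.

```python
import random

def make_lookup_table(vocab_size, dim, seed=0):
    """Return vocab_size rows of dim small random values (assumed init range)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def lookup(table, word_ids):
    """Forward pass: select one row of the table per word index."""
    return [table[i] for i in word_ids]

def sgd_update(table, word_ids, grads, lr=0.1):
    """Backward pass: only the rows that were looked up receive a gradient step."""
    for i, grad in zip(word_ids, grads):
        row = table[i]
        for k in range(len(row)):
            row[k] -= lr * grad[k]

table = make_lookup_table(vocab_size=5, dim=3)
before = [row[:] for row in table]  # copy, to compare after the update

# Pretend the network sent gradients back for words 1 and 3 only.
sgd_update(table, word_ids=[1, 3], grads=[[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]])
```

Rows for words that did not appear in the batch (here 0, 2 and 4) are left untouched, which is why rare words benefit most from a good initialization.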
When training with small amounts of data, performance can be improved by starting with pretrained embeddings. The arguments
-pre_word_vecs_enc and -pre_word_vecs_dec can be used to specify these files.
The pretrained embeddings must be manually constructed Torch serialized tensors that correspond to the source and target dictionary files. For example:
local vocab_size = 50004
local embedding_size = 500
local embeddings = torch.Tensor(vocab_size, embedding_size):uniform()
torch.save('enc_embeddings.t7', embeddings)
embeddings[i] is the embedding of the i-th word in the vocabulary.
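The requirement that row i matches the i-th dictionary entry is the part that is easy to get wrong when building the tensor by hand. The sketch below shows the alignment step in plain Python, under stated assumptions: `align_embeddings`, the sample vocabulary, and the sample vectors are hypothetical, words without a pretrained vector fall back to a random initialization, and Python indexes from 0 where the Lua tensor indexes from 1.

```python
import random

def align_embeddings(vocab, pretrained, dim, seed=0):
    """Build a matrix whose row i is the vector for vocab[i].

    Words missing from `pretrained` get a random initialization instead.
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        vector = pretrained.get(word)
        if vector is None:
            vector = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        elif len(vector) != dim:
            raise ValueError("vector for %r has size %d, expected %d"
                             % (word, len(vector), dim))
        matrix.append(list(vector))
    return matrix

# Hypothetical dictionary order and pretrained vectors, for illustration only.
vocab = ["<unk>", "hello", "world"]
pretrained = {"hello": [0.1, 0.2], "world": [0.3, 0.4], "unused": [0.5, 0.6]}

matrix = align_embeddings(vocab, pretrained, dim=2)
```

Pretrained words that are absent from the dictionary (here `unused`) are simply dropped, so the matrix always has exactly one row per dictionary entry.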
To automate this process, OpenNMT provides a script
tools/embeddings.lua that can download pretrained embeddings from Polyglot or convert trained embeddings from word2vec or GloVe to match the word vocabularies generated by
The script requires the
For example, to generate pretrained English word embeddings:
th tools/embeddings.lua -lang en -dict_file data/demo.src.dict -save_data data/demo-src-emb
Language codes are Polyglot's Wikipedia language codes.
Or to map pretrained word2vec vectors to the built vocabulary:
th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin -dict_file data/demo.src.dict -save_data data/demo-src-emb
By default these embeddings will be updated during training, but they can be held fixed using the
-fix_word_vecs_enc and -fix_word_vecs_dec options. These options can be enabled or disabled when retraining a model.
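The difference between updated and fixed embeddings can be pictured as a switch around the update step. The following plain-Python sketch is only a conceptual illustration, not how the options are implemented in OpenNMT; `apply_gradients`, the `fixed` flag, and the sample values are hypothetical.

```python
def apply_gradients(embeddings, grads, lr=0.1, fixed=False):
    """Return the embeddings after one SGD step, or unchanged if held fixed."""
    if fixed:
        return embeddings  # pretrained vectors stay exactly as loaded
    return [
        [w - lr * g for w, g in zip(row, grad)]
        for row, grad in zip(embeddings, grads)
    ]

# Hypothetical pretrained vectors and gradients for two words.
pretrained = [[0.5, -0.5], [1.0, 2.0]]
grads = [[1.0, 1.0], [0.0, -1.0]]

frozen = apply_gradients(pretrained, grads, fixed=True)
tuned = apply_gradients(pretrained, grads, fixed=False)
```

Holding the vectors fixed keeps them identical to the pretrained values, which can help when the training set is too small to adjust them reliably.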
The tools/extract_embeddings.lua script can be used to extract the model word embeddings into text files. They can then be easily transformed into another format for visualization or processing.
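As an example of such a transformation, the sketch below converts text embeddings into two tab-separated files (vectors and labels), a layout that several visualization tools accept. It assumes each input line has the form `word v1 v2 ... vn`; check the files actually produced by the script before relying on this, since that layout is an assumption here. The helper names `parse_text_embeddings` and `write_tsv` are hypothetical, and in-memory buffers stand in for real files.

```python
import io

def parse_text_embeddings(lines):
    """Parse lines of the assumed form 'word v1 v2 ... vn' into (word, vector) pairs."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines
        pairs.append((parts[0], [float(x) for x in parts[1:]]))
    return pairs

def write_tsv(pairs, vectors_out, labels_out):
    """Write one tab-separated vector per line, and the matching word per line."""
    for word, vector in pairs:
        vectors_out.write("\t".join(repr(v) for v in vector) + "\n")
        labels_out.write(word + "\n")

# In-memory stand-ins for the extracted text file and the two output files.
sample = io.StringIO("hello 0.1 0.2\nworld 0.3 0.4\n")
vectors, labels = io.StringIO(), io.StringIO()

write_tsv(parse_text_embeddings(sample), vectors, labels)
```

With real files, `open(path)` objects can be passed in place of the `io.StringIO` buffers.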