Embeddings
Word embeddings are learned using a lookup table. Each word is assigned a random vector in this table, which is simply updated with the gradients coming from the network.
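As a minimal sketch of this mechanism (using Torch's nn.LookupTable directly, not the OpenNMT internals), the table is just a trainable matrix indexed by word ids:

```lua
require 'nn'

local vocab_size = 50004
local embedding_size = 500

-- Each row of the weight matrix is the vector of one word; it is initialized
-- randomly and updated by backpropagation like any other parameter.
local lut = nn.LookupTable(vocab_size, embedding_size)

-- Forward a batch of word indices to get their embeddings.
local indices = torch.LongTensor{3, 42, 7}
local vectors = lut:forward(indices)  -- a 3 x 500 tensor
```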
Pretrained
When training with small amounts of data, performance can be improved by starting with pretrained embeddings. The -pre_word_vecs_enc and -pre_word_vecs_dec arguments can be used to specify these files.
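For example, a training command could look like the following (the data and embedding file names are placeholders for files you have prepared):

```bash
th train.lua -data data/demo-train.t7 -save_model demo-model \
    -pre_word_vecs_enc data/enc_embeddings.t7 -pre_word_vecs_dec data/dec_embeddings.t7
```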
The pretrained embeddings must be manually constructed, Torch-serialized tensors that correspond to the source and target dictionary files. For example:
```lua
local vocab_size = 50004
local embedding_size = 500
local embeddings = torch.Tensor(vocab_size, embedding_size):uniform()
torch.save('enc_embeddings.t7', embeddings)
```
where embeddings[i] is the embedding of the i-th word in the vocabulary.
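As a sketch of how such a tensor might be built by hand (reusing vocab_size and embedding_size from the snippet above; the pretrained_vectors table and the dictionary parsing are simplified assumptions, not the OpenNMT API), each row can be filled with the pretrained vector of the corresponding dictionary word:

```lua
-- Sketch: build enc_embeddings.t7 so that row i matches word i of the dictionary.
-- pretrained_vectors is a hypothetical Lua table mapping word -> torch.Tensor.
local embeddings = torch.Tensor(vocab_size, embedding_size):uniform(-0.1, 0.1)
local i = 0
for line in io.lines('data/demo.src.dict') do
  i = i + 1
  local word = line:match('^(%S+)')  -- each dictionary line starts with the word
  if pretrained_vectors[word] then
    embeddings[i]:copy(pretrained_vectors[word])
  end
end
torch.save('enc_embeddings.t7', embeddings)
```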
To automate this process, OpenNMT provides the script tools/embeddings.lua that can download pretrained embeddings from Polyglot, or convert trained embeddings from word2vec, GloVe or FastText with regard to the word vocabularies generated by preprocess.lua. The supported formats are:
* word2vec-bin (default): binary format generated by word2vec.
* word2vec-txt: textual word2vec format; it starts with a header line containing the number of words and the embedding size, followed by one line per embedding where the first token is the word and the following fields are the embedding values.
* glove: text format; same as word2vec-txt but without the header line.
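For reference, a tiny word2vec-txt file with two 4-dimensional vectors could look like this (the values are purely illustrative); the glove format is identical minus the first line:

```
2 4
the 0.418 0.249 -0.412 0.121
of 0.708 0.570 -0.471 0.180
```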
Note
The script requires the lua-zlib package.
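If it is not installed yet, it can typically be obtained through LuaRocks:

```bash
luarocks install lua-zlib
```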
For example, to generate pretrained English word embeddings:
```bash
th tools/embeddings.lua -lang en -dict_file data/demo.src.dict -save_data data/demo-src-emb
```
Note
Language codes are Polyglot's Wikipedia language codes.
Or to map pretrained word2vec vectors to the built vocabulary:
```bash
th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin \
    -dict_file data/demo.src.dict -save_data data/demo-src-emb
```
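The script saves a Torch-serialized tensor that can then be passed to -pre_word_vecs_enc or -pre_word_vecs_dec. As a quick sanity check (the output file name below is illustrative; use whatever -save_data actually produced), the tensor's first dimension should match the dictionary size:

```lua
-- Sketch: load the generated embeddings and check their shape.
local emb = torch.load('data/demo-src-emb-embeddings-300.t7')  -- illustrative file name
print(emb:size(1), emb:size(2))  -- expected: vocabulary size x embedding size
```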
Tip
If vocabulary words are not found as-is in the embeddings file, you can use the -approximate option to also look for uppercase variants and variants without possible joiner marks. You can dump the vocabulary words that were not found by setting the -save_unknown_dict parameter.
Fixed
By default these embeddings will be updated during training, but they can be held fixed using the -fix_word_vecs_enc and -fix_word_vecs_dec options. These options can be enabled or disabled during a retraining.
Tip
When using pretrained word embeddings, if you declare a larger -word_vec_size then the extra dimensions are uniformly initialized, and you can use -fix_word_vecs_enc pretrained (or -fix_word_vecs_dec pretrained) to fix the pretrained part and only optimize the remaining part.
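For example, the following command (file names and sizes are placeholders) would keep 500 pretrained source dimensions fixed while training 100 extra dimensions:

```bash
th train.lua -data data/demo-train.t7 -save_model demo-model \
    -pre_word_vecs_enc data/demo-src-emb-embeddings-500.t7 -word_vec_size 600 \
    -fix_word_vecs_enc pretrained
```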
Extraction
The tools/extract_embeddings.lua script can be used to extract the model word embeddings into text files. They can then be easily transformed into another format for visualization or further processing.
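A typical invocation might look like the following (the -model option name and the checkpoint file name are assumptions; check the script's -h output for the exact options):

```bash
th tools/extract_embeddings.lua -model demo-model_epoch13_7.19.t7
```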