Word embeddings are learned using a lookup table. Each word is assigned a random vector in this table, and that vector is then updated with the gradients coming from the network.
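The lookup-and-update mechanism can be illustrated with a minimal pure-Lua sketch. This is not OpenNMT's code, which relies on Torch modules; the function names, the initialization range, and the plain SGD step are assumptions made for illustration.

```lua
-- Illustrative sketch only: a lookup table as a plain Lua table of rows.

-- Build `vocab_size` rows, each a random vector of length `dim`.
-- The [-0.1, 0.1) range is an arbitrary choice for this example.
local function new_lookup_table(vocab_size, dim)
  local weight = {}
  for i = 1, vocab_size do
    weight[i] = {}
    for j = 1, dim do
      weight[i][j] = math.random() * 0.2 - 0.1
    end
  end
  return weight
end

-- Forward pass: a word index simply selects its row.
local function lookup(weight, word_index)
  return weight[word_index]
end

-- Backward pass with plain SGD: only the selected row is modified.
local function sgd_update(weight, word_index, grad, learning_rate)
  local row = weight[word_index]
  for j = 1, #row do
    row[j] = row[j] - learning_rate * grad[j]
  end
end

local weight = new_lookup_table(5, 3)
local before = lookup(weight, 2)[1]
local untouched = lookup(weight, 3)[1]
sgd_update(weight, 2, {1.0, 0.0, 0.0}, 0.5)
print(lookup(weight, 2)[1] - before)  -- change applied to the updated row
```

Only the row of the word that was looked up changes; every other row keeps its value until its own word appears in a batch.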
When training with small amounts of data, performance can be improved by starting with pretrained embeddings. The arguments -pre_word_vecs_enc and -pre_word_vecs_dec can be used to specify these files.
The pretrained embeddings must be manually constructed Torch serialized tensors that correspond to the source and target dictionary files. For example:
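require 'torch'

-- vocab_size must match the number of entries in the dictionary file.
local vocab_size = 50004
local embedding_size = 500

-- One row per vocabulary entry, here filled with uniform random values.
local embeddings = torch.Tensor(vocab_size, embedding_size):uniform()
torch.save('enc_embeddings.t7', embeddings)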
Here, embeddings[i] is the embedding of the i-th word in the vocabulary.
To automate this process, OpenNMT provides a script, tools/embeddings.lua, that can download pretrained embeddings from Polyglot or convert trained embeddings from word2vec, GloVe or FastText to match the word vocabularies generated by preprocess.lua. Supported formats are:
word2vec-bin (default): binary format generated by word2vec.
word2vec-txt: textual word2vec format. It starts with a header line containing the number of words and the embedding size, followed by one line per embedding: the first token is the word, and the following fields are the embedding values.
glove: text format. Same format as word2vec-txt but without the header line.
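The two text formats described above differ only in the header line, which a small parser makes concrete. The following is a pure-Lua sketch, not the code used by tools/embeddings.lua; the function name `parse_text_embeddings` and its `has_header` flag are hypothetical.

```lua
-- Illustrative parser for the text formats described above.
-- `has_header` is true for word2vec-txt (first line holds the number of
-- words and the embedding size) and false for glove (no header line).
local function parse_text_embeddings(text, has_header)
  local vectors, count, dim = {}, 0, nil
  local first = true
  for line in text:gmatch("[^\r\n]+") do
    if first and has_header then
      -- Header line: "<number of words> <embedding size>".
      local d = line:match("^%d+%s+(%d+)$")
      dim = tonumber(d)
      assert(dim, "missing or malformed header line")
    else
      -- Data line: the word, then its embedding values.
      local fields = {}
      for field in line:gmatch("%S+") do
        fields[#fields + 1] = field
      end
      local word, vec = fields[1], {}
      for i = 2, #fields do
        vec[i - 1] = tonumber(fields[i])
      end
      dim = dim or #vec
      assert(#vec == dim, "inconsistent embedding size for " .. word)
      vectors[word] = vec
      count = count + 1
    end
    first = false
  end
  return vectors, count, dim
end

local w2v_text = "2 3\nthe 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n"
local glove_text = "the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n"
local a, n1, d1 = parse_text_embeddings(w2v_text, true)
local b, n2, d2 = parse_text_embeddings(glove_text, false)
print(n1, d1, a.cat[2])
print(n2, d2, b.the[3])
```

Both calls yield the same word-to-vector mapping; for glove input the embedding size is inferred from the first data line instead of being read from a header.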
The script requires the
For example, to generate pretrained English word embeddings:
th tools/embeddings.lua -lang en -dict_file data/demo.src.dict -save_data data/demo-src-emb
Language codes are Polyglot's Wikipedia Language Codes.
Or to map pretrained word2vec vectors to the built vocabulary:
th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin -dict_file data/demo.src.dict \
  -save_data data/demo-src-emb
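Conceptually, mapping pretrained vectors to a built vocabulary means producing one row per dictionary entry, in dictionary order. The pure-Lua sketch below illustrates that idea under stated assumptions: the function name `align_to_vocab` is hypothetical, the vocabulary is given as a plain list of words, and words without a pretrained vector receive a random one (how the actual script treats such words is not described here).

```lua
-- Illustrative sketch: build rows in vocabulary order from a
-- word -> vector mapping, tracking the words that were not found.
local function align_to_vocab(vocab, pretrained, dim)
  local embeddings, missing = {}, {}
  for i, word in ipairs(vocab) do
    local vec = pretrained[word]
    if vec then
      embeddings[i] = vec
    else
      -- Assumption: a missing word gets a random vector in [-1, 1).
      local row = {}
      for j = 1, dim do
        row[j] = math.random() * 2 - 1
      end
      embeddings[i] = row
      missing[#missing + 1] = word
    end
  end
  return embeddings, missing
end

local vocab = {"<unk>", "the", "cat"}
local pretrained = { the = {0.1, 0.2}, cat = {0.3, 0.4} }
local embeddings, missing = align_to_vocab(vocab, pretrained, 2)
print(#embeddings, #missing, missing[1])
```

The result has exactly one row per vocabulary entry, so row i is the embedding of the i-th word, which is the layout the pretrained tensor is expected to follow.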
If vocabulary entries are not found as-is in the embeddings file, you can use the -approximate option to also look for uppercase variants and variants without possible joiner marks. You can dump the vocabulary entries that were not found by setting
By default these embeddings will be updated during training, but they can be held fixed using the -fix_word_vecs_enc and -fix_word_vecs_dec options. These options can be enabled or disabled when retraining.
When using pretrained word embeddings, if you declare a -word_vec_size larger than the size of the pretrained embeddings, the extra dimensions are uniformly initialized. You can then use -fix_word_vecs_enc pretrained (or -fix_word_vecs_dec pretrained) to fix the pretrained part and optimize only the remaining part.
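The effect of fixing only the pretrained part can be sketched in pure Lua. This is an illustration rather than OpenNMT's implementation: the function names are hypothetical, and placing the pretrained values in the leading components, the initialization range, and the plain SGD step are all assumptions.

```lua
-- Illustrative sketch: extend a pretrained vector to `word_vec_size`.
-- Assumption: pretrained values occupy the leading components and the
-- extra components are drawn uniformly from [-0.1, 0.1).
local function extend(vec, word_vec_size)
  local out = {}
  for j = 1, word_vec_size do
    out[j] = vec[j] or (math.random() * 0.2 - 0.1)
  end
  return out
end

-- SGD step that leaves the first `pretrained_size` components untouched,
-- so only the extra part is optimized.
local function update_extra_part(vec, grad, pretrained_size, learning_rate)
  for j = pretrained_size + 1, #vec do
    vec[j] = vec[j] - learning_rate * grad[j]
  end
end

local pretrained = {0.5, -0.5, 0.25}
local vec = extend(pretrained, 5)
local last_before = vec[5]
update_extra_part(vec, {1, 1, 1, 1, 1}, #pretrained, 0.1)
print(vec[1], vec[2], vec[3])  -- pretrained components, left unchanged
```

After the update, the first three components still hold the pretrained values, while the two extra components have moved along the gradient.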
The tools/extract_embeddings.lua script can be used to extract the model's word embeddings into text files. They can then be easily converted into another format for visualization or processing.