opennmt.inputters.load_pretrained_embeddings

opennmt.inputters.load_pretrained_embeddings(embedding_file, vocabulary_file, num_oov_buckets=1, with_header=True, case_insensitive_embeddings=True)

Returns pretrained embeddings aligned with the entries of the vocabulary.

The embedding_file must have the following format:

N M
word1 val1 val2 ... valM
word2 val1 val2 ... valM
...
wordN val1 val2 ... valM

or if with_header is False:

word1 val1 val2 ... valM
word2 val1 val2 ... valM
...
wordN val1 val2 ... valM
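
As an illustration of the two layouts above, here is a minimal sketch of a parser for this text format. It is not the library's implementation: the helper name parse_embedding_lines and the sample data are hypothetical, and it assumes fields are separated by single spaces and that the header holds the entry count N and the vector size M.

```python
def parse_embedding_lines(lines, with_header=True):
    """Parse text-format embeddings into a {word: [float, ...]} dict.

    Hypothetical helper for illustration only; assumes space-separated fields.
    """
    rows = iter(lines)
    if with_header:
        # Skip the "N M" header line (assumed: entry count and vector size).
        next(rows)
    embeddings = {}
    for line in rows:
        fields = line.rstrip().split(" ")
        if len(fields) < 2:
            # Ignore blank or malformed lines.
            continue
        embeddings[fields[0]] = [float(value) for value in fields[1:]]
    return embeddings


# Made-up sample data: 2 entries of size 3.
lines_with_header = ["2 3", "the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
lines_without_header = lines_with_header[1:]

parsed = parse_embedding_lines(lines_with_header, with_header=True)
parsed_no_header = parse_embedding_lines(lines_without_header, with_header=False)
```

Both calls yield the same mapping, since the header carries no vector data of its own.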

This function iterates over each embedding in embedding_file and assigns the pretrained vector to the associated word in vocabulary_file, if found. Otherwise, the embedding is ignored.

If case_insensitive_embeddings is True, word embeddings are assumed to be trained on lowercase data. In that case, word alignments are case insensitive, meaning the pretrained word embedding for “the” will be assigned to “the”, “The”, “THE”, or any other case variant included in vocabulary_file.
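
The matching behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: the helper name assign_embeddings, the in-memory dict and list inputs, and the sample words are all hypothetical.

```python
def assign_embeddings(embeddings, vocabulary, case_insensitive=True):
    """Map each vocabulary word to its pretrained vector, when one exists.

    Hypothetical helper for illustration only.
    """
    # Group vocabulary words by lookup key; with case-insensitive matching,
    # several case variants share the same lowercase key.
    words_by_key = {}
    for word in vocabulary:
        key = word.lower() if case_insensitive else word
        words_by_key.setdefault(key, []).append(word)

    assigned = {}
    for word, vector in embeddings.items():
        # Embeddings whose word is not in the vocabulary are ignored.
        for match in words_by_key.get(word, []):
            assigned[match] = vector
    return assigned


# Made-up sample data: embeddings trained on lowercase text.
embeddings = {"the": [0.1, 0.2], "dog": [0.3, 0.4]}
vocabulary = ["the", "The", "THE", "cat"]

insensitive = assign_embeddings(embeddings, vocabulary, case_insensitive=True)
sensitive = assign_embeddings(embeddings, vocabulary, case_insensitive=False)
```

In this sketch, insensitive covers all three case variants of "the", while sensitive covers only the exact match. "dog" is dropped in both cases because it is absent from the vocabulary, and "cat" receives no pretrained vector.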

Parameters
  • embedding_file – Path to the embedding file. Entries will be matched against vocabulary_file.

  • vocabulary_file – The vocabulary file containing one word per line.

  • num_oov_buckets – The number of additional unknown tokens.

  • with_header – True if the embedding file starts with a header line like in GloVe embedding files.

  • case_insensitive_embeddings – True if embeddings are trained on lowercase data.

Returns

A NumPy array of shape [vocabulary_size + num_oov_buckets, embedding_size].
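
To make the returned shape concrete, the following sketch builds an equivalent table with nested Python lists standing in for the NumPy array. The helper name build_embedding_matrix and the sample data are hypothetical, and filling rows that have no pretrained vector with zeros is an assumption made here for simplicity; this page does not specify how the library initializes them.

```python
def build_embedding_matrix(embeddings, vocabulary, num_oov_buckets=1):
    """Return a [vocabulary_size + num_oov_buckets, embedding_size] table.

    Hypothetical helper for illustration only.
    """
    embedding_size = len(next(iter(embeddings.values())))
    num_rows = len(vocabulary) + num_oov_buckets
    # Assumption: rows without a pretrained vector are zero-filled placeholders.
    matrix = [[0.0] * embedding_size for _ in range(num_rows)]
    for index, word in enumerate(vocabulary):
        if word in embeddings:
            matrix[index] = list(embeddings[word])
    return matrix


# Made-up sample data: 3 vocabulary words, vectors of size 2.
embeddings = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
vocabulary = ["the", "cat", "sat"]

matrix = build_embedding_matrix(embeddings, vocabulary, num_oov_buckets=1)
shape = (len(matrix), len(matrix[0]))
```

With 3 vocabulary words, 1 OOV bucket, and vectors of size 2, shape is (4, 2): one row per vocabulary entry in file order, followed by the extra rows for unknown tokens.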