Unknown words

The default translation mode allows the model to produce the <unk> symbol when it is not sure of the specific target word.

Often times <unk> symbols will correspond to proper names that can be directly transposed between languages. The -replace_unk option will substitute <unk> with source words that have the highest attention weight. The -replace_unk_tagged option will do the same, but wrap the token in a ⦅unk:xxxxx⦆ tag.

Phrase table

Alternatively, advanced users may prefer to provide a pre-constructed phrase table from an external aligner (such as fast_align) using the -phrase_table option to allow for non-identity replacement.

Instead of copying the source token with the highest attention, it will lookup in the phrase table for a possible translation. If a valid replacement is not found only then the source token will be copied.

The phrase table is a file with one translation per line in the format:

source|||target

Where source and target are case sensitive and single tokens.

Workarounds

Several techniques exist to minimize the out-of-vocabulary issue:

  • sub-tokenization like BPE or "wordpiece" to simulate open vocabularies
  • mixed word/characters model as described in Wu et al. (2016)