The default translation mode allows the model to produce the
<unk> symbol when it is not sure of the specific target word.
Often, <unk> symbols correspond to proper names that can be directly transposed between languages. The
-replace_unk option substitutes each
<unk> with the source word that has the highest attention weight. The
-replace_unk_tagged option will do the same, but wrap the token in a ｟unk:xxxxx｠ tag.
Alternatively, advanced users may prefer to provide a pre-constructed phrase table from an external aligner (such as fast_align) using the
-phrase_table option to allow for non-identity replacement.
Instead of copying the source token with the highest attention, the translator looks that token up in the phrase table for a possible translation. The source token is copied only if no valid replacement is found.
The phrase table is a file with one translation per line in the format:
source|||target
where source and target are case-sensitive single tokens.
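
The replacement logic described above can be sketched in a few lines of Python. This is an illustration only, not the toolkit's actual implementation: the function names `load_phrase_table` and `replace_unk`, the attention layout (one row of source weights per predicted token), and the choice to put the replacement inside the ｟unk:…｠ tag are all assumptions made for the example. It also assumes each phrase-table line has the form source|||target.

```python
from typing import Dict, List, Optional, Sequence

UNK = "<unk>"


def load_phrase_table(lines: Sequence[str]) -> Dict[str, str]:
    """Parse 'source|||target' lines into a case-sensitive lookup table.

    Hypothetical helper: malformed or empty lines are silently skipped.
    """
    table: Dict[str, str] = {}
    for line in lines:
        source, sep, target = line.strip().partition("|||")
        if sep and source and target:
            table[source] = target
    return table


def replace_unk(
    source: Sequence[str],
    predicted: Sequence[str],
    attention: Sequence[Sequence[float]],
    phrase_table: Optional[Dict[str, str]] = None,
    tagged: bool = False,
) -> List[str]:
    """Replace each <unk> using the source token with the highest attention.

    attention[t][i] is assumed to be the weight of source token i when
    target token t was produced.
    """
    table = phrase_table or {}
    output: List[str] = []
    for token, weights in zip(predicted, attention):
        if token != UNK:
            output.append(token)
            continue
        # Index of the source token with the highest attention weight.
        best = max(range(len(source)), key=lambda i: weights[i])
        # Prefer a phrase-table translation; otherwise copy the source token.
        replacement = table.get(source[best], source[best])
        output.append(f"｟unk:{replacement}｠" if tagged else replacement)
    return output


source = ["Herr", "Müller", "wohnt", "in", "Köln"]
predicted = ["Mr", "<unk>", "lives", "in", "<unk>"]
attention = [
    [0.90, 0.05, 0.02, 0.02, 0.01],
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.02, 0.08, 0.80, 0.05, 0.05],
    [0.01, 0.02, 0.07, 0.80, 0.10],
    [0.02, 0.03, 0.05, 0.10, 0.80],
]
table = load_phrase_table(["Köln|||Cologne"])
print(replace_unk(source, predicted, attention, table))
# ['Mr', 'Müller', 'lives', 'in', 'Cologne']
```

Here "Müller" has no phrase-table entry and is copied verbatim, while "Köln" is replaced by its phrase-table translation.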
Several techniques exist to minimize the out-of-vocabulary issue:
- sub-tokenization like BPE or "wordpiece" to simulate open vocabularies
- mixed word/character models as described in Wu et al. (2016)
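
To illustrate why sub-tokenization reduces out-of-vocabulary tokens, the following is a simplified sketch of applying BPE merges to a single word. It is not the implementation of any particular BPE tool: the function name `apply_bpe` and the toy merge list are made up for the example, and details such as end-of-word markers are omitted.

```python
from typing import List, Sequence, Tuple


def apply_bpe(word: str, merges: Sequence[Tuple[str, str]]) -> List[str]:
    """Segment a word by applying merge rules in their learned order.

    The word starts as a list of characters; each rule (left, right)
    joins every adjacent occurrence of that pair into one symbol.
    """
    symbols = list(word)
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            is_pair = (
                i + 1 < len(symbols)
                and symbols[i] == left
                and symbols[i + 1] == right
            )
            if is_pair:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge list (hypothetical), in the order it would have been learned.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_bpe("lower", merges))   # ['low', 'er']
print(apply_bpe("lowest", merges))  # ['low', 'e', 's', 't']
```

A word never seen as a whole, such as "lowest" here, still decomposes into units that are in the vocabulary, so the model does not need to emit <unk> for it.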