Fairseq

CTranslate2 supports several Transformer models trained with Fairseq. The converter currently accepts the following model names:

  • bart

  • multilingual_transformer

  • transformer

  • transformer_align

  • transformer_lm

The conversion minimally requires the PyTorch model path and the Fairseq data directory that contains the vocabulary files:

pip install fairseq
ct2-fairseq-converter --model_path model.pt --data_dir data-bin/ --output_dir ct2_model

Beam search equivalence

The default beam search parameters in CTranslate2 are different from those used by Fairseq. Set the following parameters to match the Fairseq behavior:

translator.translate_batch(tokens, beam_size=5)
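
For context, here is a minimal end-to-end sketch using this setting. The model directory and input tokens are placeholders borrowed from the WMT16 example below:

import ctranslate2

# Placeholder model directory and pre-tokenized input (see the WMT16 example below).
translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")
tokens = [["H@@", "ello", "world@@", "!"]]

# beam_size=5 matches the Fairseq default beam width.
results = translator.translate_batch(tokens, beam_size=5)
print(results[0].hypotheses[0])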

WMT16 English-German

Download and convert the pretrained WMT16 English-German model:

wget https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2
tar xf wmt16.en-de.joined-dict.transformer.tar.bz2

ct2-fairseq-converter --model_path wmt16.en-de.joined-dict.transformer/model.pt \
    --data_dir wmt16.en-de.joined-dict.transformer \
    --output_dir ende_ctranslate2

The converted model can then be used on tokenized inputs:

import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")
results = translator.translate_batch([["H@@", "ello", "world@@", "!"]])

print(results[0].hypotheses[0])

Note

For simplicity, this example does not show how to tokenize the text. The tokens are obtained by running sacremoses and applying the BPE codes included in the model.
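
For reference, the preprocessing could look like the sketch below. It is only an illustration: it assumes the sacremoses and subword-nmt packages are installed, and that the BPE codes file inside the extracted archive is named bpecodes (check the archive contents if the name differs).

from sacremoses import MosesTokenizer
from subword_nmt.apply_bpe import BPE

# Moses tokenization for the English source text.
tokenizer = MosesTokenizer(lang="en")
tokens = tokenizer.tokenize("Hello world!")

# Apply the BPE codes shipped with the pretrained model.
# "bpecodes" is an assumed file name inside the extracted archive.
with open("wmt16.en-de.joined-dict.transformer/bpecodes") as codes:
    bpe = BPE(codes)

subwords = bpe.process_line(" ".join(tokens)).split()
print(subwords)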

WMT19 language model

The FAIR team published pretrained language models as part of the WMT19 news translation task. They can be converted to the CTranslate2 format:

wget https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz
tar xf wmt19.en.tar.gz

ct2-fairseq-converter --data_dir wmt19.en/ --model_path wmt19.en/model.pt --output_dir wmt19_en_ct2

The model can then be used to sample or score sequences of tokens. All inputs should start with the special token </s>:

import numpy as np
import ctranslate2

generator = ctranslate2.Generator("wmt19_en_ct2/", device="cpu")

# Sample from the language model.
results = generator.generate_batch([["</s>", "The"]], sampling_topk=10, max_length=50)
print(results[0].sequences[0])

# Compute the perplexity for a sentence.
outputs = generator.score_batch([["</s>", "The", "sky", "is", "blue", "."]])
perplexity = np.exp(-np.mean(outputs[0].log_probs))
print(perplexity)

Note

For simplicity, this example does not show how to tokenize the text. The tokens are obtained by running sacremoses and applying the BPE codes included in the model.

M2M-100

The pretrained multilingual model M2M-100 can also be used in CTranslate2. Since this model uses a single vocabulary file, the conversion requires the option --fixed_dictionary:

# 418M parameters:
wget https://dl.fbaipublicfiles.com/m2m_100/418M_last_checkpoint.pt

# 1.2B parameters:
wget https://dl.fbaipublicfiles.com/m2m_100/1.2B_last_checkpoint.pt

wget https://dl.fbaipublicfiles.com/m2m_100/model_dict.128k.txt
wget https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model

ct2-fairseq-converter --data_dir . --model_path 418M_last_checkpoint.pt \
    --fixed_dictionary model_dict.128k.txt \
    --output_dir m2m_100_418m_ct2

For translation, the language tokens should prefix the source and target sequences. Language tokens have the format __X__ where X is the language code. See the end of the fixed dictionary file for the list of accepted languages.

import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm.128k.model")

source = ["__en__"] + sp.encode("Hello world!", out_type=str)
target_prefix = ["__de__"]

translator = ctranslate2.Translator("m2m_100_418m_ct2")
result = translator.translate_batch([source], target_prefix=[target_prefix])

output = sp.decode(result[0].hypotheses[0][1:])
print(output)
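
As an optional extension of the example above, the same sentence can be translated into several languages in one batch by repeating the source with different target prefixes. The sketch below is an assumption-based illustration; the target language codes should be checked against the end of model_dict.128k.txt.

import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm.128k.model")
translator = ctranslate2.Translator("m2m_100_418m_ct2")

source = ["__en__"] + sp.encode("Hello world!", out_type=str)

# Target language codes are assumed to be listed in model_dict.128k.txt.
targets = ["__de__", "__fr__", "__es__"]
results = translator.translate_batch(
    [source] * len(targets),
    target_prefix=[[lang] for lang in targets],
)

for lang, result in zip(targets, results):
    # Drop the leading target language token before detokenizing.
    print(lang, sp.decode(result.hypotheses[0][1:]))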

MBART-50

MBART-50 is another pretrained multilingual translation model that can be converted in the same way:

wget https://dl.fbaipublicfiles.com/fairseq/models/mbart50/mbart50.ft.nn.tar.gz
tar xf mbart50.ft.nn.tar.gz

ct2-fairseq-converter --data_dir mbart50.ft.nn/ --model_path mbart50.ft.nn/model.pt \
    --output_dir mbart50_ct2

As with M2M-100, the language tokens should prefix the source and target sequences. The list of language tokens is defined in the file mbart50.ft.nn/ML50_langs.txt.

import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("mbart50.ft.nn/sentence.bpe.model")

# Encode the source with SentencePiece and prefix it with its language token.
source = sp.encode("UN Chief Says There Is No Military Solution in Syria", out_type=str)
source = ["[en_XX]"] + source
target_prefix = ["[ro_RO]"]

translator = ctranslate2.Translator("mbart50_ct2")
result = translator.translate_batch([source], target_prefix=[target_prefix])

# Drop the leading target language token before detokenizing.
output = sp.decode(result[0].hypotheses[0][1:])
print(output)
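
As a side note, the language token list can be built programmatically with a sketch like the one below. It assumes ML50_langs.txt contains one language code per line (e.g. en_XX) and that the converted vocabulary uses the bracketed form shown above; inspect the file before relying on this.

# Read the MBART-50 language codes and turn them into the bracketed tokens
# expected by the converted model. The file format (one code per line) is an
# assumption; inspect ML50_langs.txt to confirm.
lang_tokens = []
with open("mbart50.ft.nn/ML50_langs.txt") as langs_file:
    for line in langs_file:
        line = line.strip()
        if line:
            lang_tokens.append("[%s]" % line.split()[0])

print(lang_tokens)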