Transformers
CTranslate2 supports selected models from Hugging Face’s Transformers. The following models are currently supported:
BART
BERT
BLOOM
CodeGen
DistilBERT
Falcon
Llama
M2M100
MarianMT
MBART
MPT
NLLB
OpenAI GPT2
GPTBigCode
GPT-J
GPT-NeoX
OPT
Pegasus
T5
Whisper
XLM-RoBERTa
The converter takes as argument the pretrained model name or the path to a model directory:
pip install transformers[torch]
ct2-transformers-converter --model facebook/m2m100_418M --output_dir ct2_model
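The conversion can also be run from Python. Below is a minimal sketch using the ctranslate2.converters.TransformersConverter class, which the command line script wraps; treat the exact arguments as an assumption and check the converter documentation for the full list of options.
import ctranslate2
# Convert the Hugging Face model "facebook/m2m100_418M" to the CTranslate2
# format and write it to the "ct2_model" directory.
converter = ctranslate2.converters.TransformersConverter("facebook/m2m100_418M")
converter.convert("ct2_model")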
Special tokens in translation
For other frameworks, the Translator methods implicitly add special tokens to the source input when required. For example, models converted from Fairseq or Marian will implicitly append </s> to the source tokens.
However, these special tokens are not implicitly added for Transformers models since they are already returned by the corresponding tokenizer:
>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
>>> tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
['▁Hello', '▁world', '!', '</s>']
Important
If you are not using the Hugging Face tokenizers, make sure to add these special tokens when required.
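For example, here is a minimal sketch of tokenizing for the Marian model used later in this guide with SentencePiece directly instead of the Hugging Face tokenizer. The source.spm file name refers to the SentencePiece model shipped with the checkpoint and is an assumption about your local setup; the </s> token is appended manually since nothing else adds it.
import sentencepiece as spm
# Tokenize with SentencePiece directly and append the end-of-sentence token
# that the Hugging Face tokenizer would otherwise add.
sp = spm.SentencePieceProcessor(model_file="source.spm")
source = sp.encode("Hello world!", out_type=str) + ["</s>"]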
BART
This example uses the BART model that was fine-tuned on CNN Daily Mail for text summarization.
ct2-transformers-converter --model facebook/bart-large-cnn --output_dir bart-large-cnn
import ctranslate2
import transformers
translator = ctranslate2.Translator("bart-large-cnn")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
text = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. "
    "The aim is to reduce the risk of wildfires. "
    "Nearly 800 thousand customers were scheduled to be affected by the shutoffs which "
    "were expected to last through at least midday tomorrow."
)
source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target), skip_special_tokens=True))
BERT
BERT is a model pretrained on English using a masked language modeling objective.
CTranslate2 only implements the BertModel class from Transformers, which includes the Transformer encoder and the pooling layer. Task-specific layers should be run with PyTorch as shown in the example below.
ct2-transformers-converter --model textattack/bert-base-uncased-yelp-polarity --output_dir bert-base-uncased-yelp-polarity
import ctranslate2
import numpy as np
import torch
import transformers
device = "cuda"
encoder = ctranslate2.Encoder("bert-base-uncased-yelp-polarity", device=device)
model_name = "textattack/bert-base-uncased-yelp-polarity"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
classifier = transformers.AutoModelForSequenceClassification.from_pretrained(model_name).classifier
classifier.eval()
classifier.to(device)
inputs = ["It was good!", "Worst experience in my life.", "It was not good."]
tokens = tokenizer(inputs).input_ids
output = encoder.forward_batch(tokens)
pooler_output = output.pooler_output
if device == "cuda":
    pooler_output = torch.as_tensor(pooler_output, device=device)
else:
    pooler_output = np.array(pooler_output)
    pooler_output = torch.as_tensor(pooler_output)
logits = classifier(pooler_output)
predicted_class_ids = logits.argmax(1)
print(predicted_class_ids)
BLOOM
BLOOM is a collection of multilingual language models trained by the BigScience workshop.
This example uses the smallest model with 560M parameters.
ct2-transformers-converter --model bigscience/bloom-560m --output_dir bloom-560m
import ctranslate2
import transformers
generator = ctranslate2.Generator("bloom-560m")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigscience/bloom-560m")
text = "Hello, I am"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
results = generator.generate_batch([start_tokens], max_length=30, sampling_topk=10)
print(tokenizer.decode(results[0].sequences_ids[0]))
DistilBERT
DistilBERT is a small, fast, cheap and light Transformer Encoder model trained by distilling BERT base.
CTranslate2 only implements the DistilBertModel class from Transformers, which includes the Transformer encoder. Task-specific layers should be run with PyTorch, similar to the example for BERT.
ct2-transformers-converter --model distilbert-base-uncased --output_dir distilbert-base-uncased
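Below is a minimal sketch of running the converted encoder, assuming the model was converted with the command above. Since DistilBERT has no pooling layer, the output to read is last_hidden_state; the input handling mirrors the BERT example and is an illustration rather than a task-specific pipeline.
import ctranslate2
import numpy as np
import transformers
encoder = ctranslate2.Encoder("distilbert-base-uncased")
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokens = tokenizer(["Hello world!"]).input_ids
output = encoder.forward_batch(tokens)
# DistilBERT has no pooler, so use the token-level representations.
last_hidden_state = np.array(output.last_hidden_state)
print(last_hidden_state.shape)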
Falcon
Falcon is a collection of generative language models trained by TII. The example below uses “Falcon-7B-Instruct” which is based on “Falcon-7B” and finetuned on a mixture of chat/instruct datasets.
ct2-transformers-converter --model tiiuae/falcon-7b-instruct --quantization float16 --output_dir falcon-7b-instruct --trust_remote_code
import ctranslate2
import transformers
generator = ctranslate2.Generator("falcon-7b-instruct", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
prompt = (
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. "
    "Giraftron believes all other animals are irrelevant when compared to the glorious majesty "
    "of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], sampling_topk=10, max_length=200, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])
print(output)
Llama 2
Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.
The models with the suffix “-hf” such as meta-llama/Llama-2-7b-chat-hf can be converted with the Transformers converter. For example:
ct2-transformers-converter --model meta-llama/Llama-2-7b-chat-hf --quantization float16 --output_dir llama-2-7b-chat-ct2
Important
You need to request access to the Llama 2 models before you can download them from the Hugging Face Hub. See the instructions on the model page. Once you have access to the model, you should log in with huggingface-cli login before running the conversion command.
See also
The example Chat with Llama 2 which demonstrates how to implement an interactive chat session using CTranslate2.
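Once converted, generation follows the same pattern as the other decoder-only models in this document. Below is a minimal sketch, assuming the model was converted with the command above and using a simple single-turn [INST] prompt; see the chat example linked above for proper multi-turn prompt handling.
import ctranslate2
import transformers
generator = ctranslate2.Generator("llama-2-7b-chat-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# Single-turn prompt in the [INST] format used by the chat models.
prompt = "[INST] What is the capital of France? [/INST]"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], max_length=128, sampling_topk=10, include_prompt_in_result=False)
print(tokenizer.decode(results[0].sequences_ids[0]))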
MarianMT
This example uses the English-German model from MarianMT.
ct2-transformers-converter --model Helsinki-NLP/opus-mt-en-de --output_dir opus-mt-en-de
import ctranslate2
import transformers
translator = ctranslate2.Translator("opus-mt-en-de")
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
M2M-100
This example uses the M2M-100 multilingual model.
ct2-transformers-converter --model facebook/m2m100_418M --output_dir m2m100_418
import ctranslate2
import transformers
translator = ctranslate2.Translator("m2m100_418")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang = "en"
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
target_prefix = [tokenizer.lang_code_to_token["de"]]
results = translator.translate_batch([source], target_prefix=[target_prefix])
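# The hypothesis starts with the target prefix (the language token), so skip it before decoding.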
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
MPT
MPT-7B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code. This model was trained by MosaicML.
Note
The code is included in the model so you should pass --trust_remote_code to the conversion command.
ct2-transformers-converter --model mosaicml/mpt-7b --output_dir mpt-7b --quantization int8_float16 --trust_remote_code
import ctranslate2
import transformers
generator = ctranslate2.Generator("mpt-7b")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
prompt = "In a shocking finding, scientists"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], max_length=30, sampling_topk=10)
text = tokenizer.decode(results[0].sequences_ids[0])
print(text)
NLLB
NLLB is a collection of multilingual models trained by Meta and supporting 200 languages. See here for the list of accepted language codes.
The example below uses the smallest version with 600M parameters.
Important
Converting NLLB models requires transformers>=4.21.0.
ct2-transformers-converter --model facebook/nllb-200-distilled-600M --output_dir nllb-200-distilled-600M
import ctranslate2
import transformers
src_lang = "eng_Latn"
tgt_lang = "fra_Latn"
translator = ctranslate2.Translator("nllb-200-distilled-600M")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang)
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])
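# The hypothesis starts with the target language token from the prefix, so skip it before decoding.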
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
GPT-2
This example uses the small GPT-2 model.
ct2-transformers-converter --model gpt2 --output_dir gpt2_ct2
import ctranslate2
import transformers
generator = ctranslate2.Generator("gpt2_ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
# Unconditional generation.
start_tokens = [tokenizer.bos_token]
results = generator.generate_batch([start_tokens], max_length=30, sampling_topk=10)
print(tokenizer.decode(results[0].sequences_ids[0]))
# Conditional generation.
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("It is"))
results = generator.generate_batch([start_tokens], max_length=30, sampling_topk=10)
print(tokenizer.decode(results[0].sequences_ids[0]))
GPTBigCode
The GPTBigCode model was first proposed in the paper SantaCoder: don’t reach for the stars, and is used by models such as StarCoder.
ct2-transformers-converter --model bigcode/starcoder --revision main --quantization float16 --output_dir starcoder_ct2
import ctranslate2
import transformers
generator = ctranslate2.Generator("starcoder_ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigcode/starcoder")
prompt = "<fim_prefix>def print_hello_world():\n <fim_suffix>\n print('Hello world!')<fim_middle>"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], max_length=30, include_prompt_in_result=False)
text = tokenizer.decode(results[0].sequences_ids[0])
print(text)
GPT-J
GPT-J is a GPT-2-like language model trained on the Pile dataset. The example below uses the version with 6B parameters:
ct2-transformers-converter --model EleutherAI/gpt-j-6B --revision float16 --quantization float16 --output_dir gptj_ct2
Note
To reduce the memory usage during conversion, the command above uses the float16 branch of the model and saves the weights in FP16. Still, the conversion will use up to 24GB of memory.
import ctranslate2
import transformers
generator = ctranslate2.Generator("gptj_ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
prompt = "In a shocking finding, scientists"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], max_length=30, sampling_topk=10)
text = tokenizer.decode(results[0].sequences_ids[0])
print(text)
GPT-NeoX
The GPT-NeoX architecture was first introduced by EleutherAI with “GPT-NeoX-20B”, a 20 billion parameter autoregressive language model trained on the Pile.
ct2-transformers-converter --model EleutherAI/gpt-neox-20b --quantization float16 --output_dir gpt_neox_ct2
import ctranslate2
import transformers
generator = ctranslate2.Generator("gpt_neox_ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
prompt = "GPTNeoX20B is a 20B-parameter autoregressive Transformer model developed by EleutherAI."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch(
    [tokens],
    max_length=64,
    sampling_topk=20,
    sampling_temperature=0.9,
)
text = tokenizer.decode(results[0].sequences_ids[0])
print(text)
OPT
This example uses Meta’s OPT model with 350M parameters. The usage is similar to GPT-2, but all inputs should start with the special token </s>, which is automatically added by GPT2Tokenizer.
Important
Converting OPT models requires transformers>=4.20.1.
Tip
If you plan to quantize OPT models to 8-bit, it is recommended to download the corresponding activation scales from the SmoothQuant repository and pass them to the converter option --activation_scales. Some weights will be rescaled to smooth the intermediate activations and improve the quantization accuracy.
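For example, assuming the scales for this model were downloaded to a local file act_scales/opt-350m.pt (a hypothetical path), an 8-bit conversion could look like:
ct2-transformers-converter --model facebook/opt-350m --activation_scales act_scales/opt-350m.pt --quantization int8 --output_dir opt-350m-ct2-int8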
ct2-transformers-converter --model facebook/opt-350m --output_dir opt-350m-ct2
import ctranslate2
import transformers
tokenizer = transformers.GPT2Tokenizer.from_pretrained("facebook/opt-350m")
generator = ctranslate2.Generator("opt-350m-ct2")
prompt = "Hey, are you conscious? Can you talk to me?"
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([start_tokens], max_length=30)
output = tokenizer.decode(results[0].sequences_ids[0])
print(output)
T5
T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format.
The example below uses the t5-small model with a machine translation input.
Note
The variants T5v1.1, mT5, and FLAN-T5 are also supported.
ct2-transformers-converter --model t5-small --output_dir t5-small-ct2
import ctranslate2
import transformers
translator = ctranslate2.Translator("t5-small-ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained("t5-small")
input_text = "translate English to German: The house is wonderful."
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))
results = translator.translate_batch([input_tokens])
output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
print(output_text)
Whisper
Whisper is a multilingual speech recognition model published by OpenAI.
Important
Converting Whisper models requires transformers>=4.23.0.
The example below uses the smallest model with 39M parameters. Consider using a larger model to get better results.
ct2-transformers-converter --model openai/whisper-tiny --output_dir whisper-tiny-ct2
import ctranslate2
import librosa
import transformers
# Load and resample the audio file.
audio, _ = librosa.load("audio.wav", sr=16000, mono=True)
# Compute the features of the first 30 seconds of audio.
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-tiny")
inputs = processor(audio, return_tensors="np", sampling_rate=16000)
features = ctranslate2.StorageView.from_array(inputs.input_features)
# Load the model on CPU.
model = ctranslate2.models.Whisper("whisper-tiny-ct2")
# Detect the language.
results = model.detect_language(features)
language, probability = results[0][0]
print("Detected language %s with probability %f" % (language, probability))
# Describe the task in the prompt.
# See the prompt format in https://github.com/openai/whisper.
prompt = processor.tokenizer.convert_tokens_to_ids(
    [
        "<|startoftranscript|>",
        language,
        "<|transcribe|>",
        "<|notimestamps|>",  # Remove this token to generate timestamps.
    ]
)
# Run generation for the 30-second window.
results = model.generate(features, [prompt])
transcription = processor.decode(results[0].sequences_ids[0])
print(transcription)
Note
This example only transcribes the first 30 seconds of audio. To transcribe longer files, you need to call generate on each 30-second window and aggregate the results. See the faster-whisper project for a complete transcription example using CTranslate2.
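As an illustration, here is a minimal sketch of such a loop, reusing the audio, processor, model, and prompt variables from the example above and simply splitting the audio into consecutive 30-second windows (no overlap and no timestamp handling).
# Transcribe the whole file by sliding a 30-second window over the audio.
sampling_rate = 16000
window_size = 30 * sampling_rate
transcription = []
for start in range(0, len(audio), window_size):
    chunk = audio[start:start + window_size]
    inputs = processor(chunk, return_tensors="np", sampling_rate=sampling_rate)
    features = ctranslate2.StorageView.from_array(inputs.input_features)
    results = model.generate(features, [prompt])
    transcription.append(processor.decode(results[0].sequences_ids[0]))
print(" ".join(transcription))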