Decoding features

This page describes CTranslate2 decoding features.

Note

The text translation API is used for demonstration but most features are also available for text generation.

The examples use the following symbols that are left unspecified:

translator: a ctranslate2.Translator instance
tokenize: a function taking a string and returning a list of strings
detokenize: a function taking a list of strings and returning a string
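For readers who want something concrete to plug in, here is a minimal sketch of the two helper functions. It assumes plain whitespace tokenization, which is only a placeholder: a real model expects the same subword tokenizer it was trained with, so these bodies are illustrative and not part of CTranslate2.

```python
def tokenize(text: str) -> list[str]:
    """Split a string into tokens on whitespace (placeholder tokenizer)."""
    return text.split()


def detokenize(tokens: list[str]) -> str:
    """Join tokens back into a string with single spaces."""
    return " ".join(tokens)


tokens = tokenize("This project is geared towards efficient serving")
print(tokens)
print(detokenize(tokens))
```

Any pair of functions with these signatures works in the examples, as long as detokenize reverses what tokenize does.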

This input sentence will be used as an example:

This project is geared towards efficient serving of standard translation models but is also a place for experimentation around model compression and inference acceleration.

Greedy search

Greedy search is the most basic and fastest decoding strategy. It simply takes the token that has the highest probability at each timestep.

results = translator.translate_batch([tokenize(input)], beam_size=1)
print(detokenize(results[0].hypotheses[0]))

Dieses Projekt ist auf die effiziente Bedienung von Standard-Übersetzungsmodellen ausgerichtet, aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.
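To make the selection rule concrete, the following sketch runs greedy search over a toy model. It is not CTranslate2's implementation: toy_step, its probability table, and the EOS symbol are hypothetical names invented for the illustration.

```python
EOS = "</s>"  # hypothetical end-of-sequence symbol


def toy_step(prefix):
    """Hypothetical model: next-token probabilities given the tokens so far."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.55, "dog": 0.45},
        ("a",): {"dog": 0.9, "cat": 0.1},
    }
    # Any prefix not listed in the table ends the sequence.
    return table.get(tuple(prefix), {EOS: 1.0})


def greedy_decode(step, max_decoding_length=10):
    """Append the single most probable token at each timestep."""
    tokens = []
    while len(tokens) < max_decoding_length:
        probs = step(tokens)
        best = max(probs, key=probs.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens


print(greedy_decode(toy_step))
```

In this toy model greedy search commits to "the" at the first step, although the sequence "a dog" has a higher overall probability (0.36 against 0.33). This is the kind of local decision that beam search can avoid.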

Beam search

Beam search is a common decoding strategy for sequence models. The algorithm keeps the N best hypotheses at each timestep, where N is set by the beam_size argument. This costs decoding speed and memory but can find a better final hypothesis than greedy search.

results = translator.translate_batch([tokenize(input)], beam_size=4)
print(detokenize(results[0].hypotheses[0]))

Dieses Projekt ist auf die effiziente Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

Tip

More hypotheses can be returned by setting the num_hypotheses argument.
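The sketch below illustrates the idea on a toy model. It is a simplified beam search written for this page, not the CTranslate2 implementation: it ranks hypotheses by raw cumulative log-probability without length normalization, and toy_step, EOS, and max_steps are hypothetical names.

```python
import math

EOS = "</s>"  # hypothetical end-of-sequence symbol


def toy_step(prefix):
    """Hypothetical model: next-token probabilities given the tokens so far."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.55, "dog": 0.45},
        ("a",): {"dog": 0.9, "cat": 0.1},
    }
    return table.get(tuple(prefix), {EOS: 1.0})


def beam_search(step, beam_size, max_steps=10):
    """Return up to beam_size hypotheses, best first."""
    # Each entry is (cumulative log-probability, tokens, finished flag).
    beam = [(0.0, [], False)]
    for _ in range(max_steps):
        candidates = []
        for score, tokens, finished in beam:
            if finished:
                candidates.append((score, tokens, True))
                continue
            for token, prob in step(tokens).items():
                new_score = score + math.log(prob)
                if token == EOS:
                    candidates.append((new_score, tokens, True))
                else:
                    candidates.append((new_score, tokens + [token], False))
        # Keep only the beam_size best candidates for the next timestep.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
        if all(finished for _, _, finished in beam):
            break
    return [tokens for _, tokens, _ in beam]


print(beam_search(toy_step, beam_size=1))
print(beam_search(toy_step, beam_size=2))
```

With beam_size=1 the search behaves like greedy search and returns "the cat". With beam_size=2 the hypothesis starting with "a" survives the first step, and the more probable "a dog" ends up first.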

Length constraints

The arguments min_decoding_length and max_decoding_length set the minimum and maximum number of tokens generated by the decoder. These lengths do not count the end-of-sequence token:

results = translator.translate_batch([tokenize(input)], max_decoding_length=10)
assert len(results[0].hypotheses[0]) == 10

These length constraints do not apply to empty inputs. Empty inputs are not forwarded to the model and always return an empty output. This is why min_decoding_length defaults to 1: a non-empty input is expected to generate at least one token.

results = translator.translate_batch([[]], min_decoding_length=1)
assert len(results[0].hypotheses[0]) == 0
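The interaction between the two limits and the empty-input rule can be sketched with a toy greedy decoder. This is an assumption about how such constraints are typically enforced (the end-of-sequence token is masked until the minimum length is reached), not a description of CTranslate2 internals. The names eager_stop_step, EOS, and decode_with_length_limits are hypothetical.

```python
EOS = "</s>"  # hypothetical end-of-sequence symbol


def eager_stop_step(prefix):
    """Hypothetical model that always prefers to stop immediately."""
    return {EOS: 0.7, "ja": 0.3}


def decode_with_length_limits(source, step, *, max_decoding_length, min_decoding_length=1):
    """Greedy decoding with minimum and maximum length constraints."""
    # Empty inputs bypass the model, so the constraints never apply to them.
    if not source:
        return []
    tokens = []
    while len(tokens) < max_decoding_length:
        probs = dict(step(tokens))
        if len(tokens) < min_decoding_length:
            # Too short to stop: the end-of-sequence token is not allowed yet.
            probs.pop(EOS, None)
        best = max(probs, key=probs.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens


print(decode_with_length_limits(["hi"], eager_stop_step, max_decoding_length=10))
print(decode_with_length_limits([], eager_stop_step, max_decoding_length=10))
```

Even though the toy model wants to stop right away, the non-empty input yields one token because of the default minimum of 1, while the empty input yields nothing.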

Attention

By default, the input is truncated after 1024 tokens to limit the maximum memory usage of the model. See the option max_input_length.

Autocompletion

The target_prefix argument can be used to force the start of the translation. Let’s say we want to replace the first occurrence of die by das in the translation:

results = translator.translate_batch(
    [tokenize(input)],
    target_prefix=[tokenize("Dieses Projekt ist auf das")],
)

print(detokenize(results[0].hypotheses[0]))

The prefix effectively changes the target context and the rest of the translation:

Dieses Projekt ist auf das effiziente Servieren von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

Biased decoding

Instead of using Autocompletion to force a translation to start with a target_prefix argument, we can “bias” a translation towards a prefix by setting prefix_bias_beta to a value in (0, 1). The higher prefix_bias_beta is, the stronger the bias. A translation can diverge from a prefix when prefix_bias_beta is low and the translator is confident in decoding tokens that are different from the prefix’s tokens. See section 4.2 for more details on the biasing algorithm.

results = translator.translate_batch(
    [tokenize(input)],
    target_prefix=[tokenize("Dieses Projekt ist auf das")],
    prefix_bias_beta=0.5,
    beam_size=4,
)

print(detokenize(results[0].hypotheses[0]))

Setting prefix_bias_beta=0.5 effectively enforces the target_prefix and changes the rest of the translation:

Dieses Projekt ist auf das effiziente Servieren von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

results = translator.translate_batch(
    [tokenize(input)],
    target_prefix=[tokenize("Dieses Projekt ist auf das")],
    prefix_bias_beta=0.1,
    beam_size=4,
)

print(detokenize(results[0].hypotheses[0]))

Lowering the bias to prefix_bias_beta=0.1 lets the translation diverge from the prefix, and das reverts to die:

Dieses Projekt ist auf die effiziente Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.
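One way to picture the effect of prefix_bias_beta is as an interpolation between the model distribution and a distribution that puts all its mass on the prefix token. The sketch below assumes this formulation for illustration only; it is not taken from the CTranslate2 source, and bias_towards_prefix and the probabilities in model_probs are hypothetical.

```python
def bias_towards_prefix(probs, prefix_token, beta):
    """Mix the model distribution with a one-hot distribution on prefix_token."""
    # Scale every model probability by (1 - beta) ...
    biased = {token: (1.0 - beta) * p for token, p in probs.items()}
    # ... and give the remaining mass beta to the token requested by the prefix.
    biased[prefix_token] = biased.get(prefix_token, 0.0) + beta
    return biased


# Hypothetical model probabilities at the position where the prefix says "das".
model_probs = {"die": 0.8, "das": 0.15, "den": 0.05}

strong = bias_towards_prefix(model_probs, "das", beta=0.5)
weak = bias_towards_prefix(model_probs, "das", beta=0.1)

print(max(strong, key=strong.get))
print(max(weak, key=weak.get))
```

Under these made-up numbers, a bias of 0.5 is enough for das to overtake die, while a bias of 0.1 leaves the confident model choice die in place, which mirrors the two translations shown above.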

Alternatives at a position

Combining target_prefix with the return_alternatives flag returns alternative sequences just after the prefix:

results = translator.translate_batch(
    [tokenize(input)],
    target_prefix=[tokenize("Dieses Projekt ist auf die")],
    num_hypotheses=5,
    return_alternatives=True,
)

for hypothesis in results[0].hypotheses:
    print(detokenize(hypothesis))

Dieses Projekt ist auf die effiziente Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

Dieses Projekt ist auf die effektive Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

Dieses Projekt ist auf die effizientere Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

Dieses Projekt ist auf die effizienten Dienste von Standard-Übersetzungsmodellen ausgerichtet, aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

Dieses Projekt ist auf die Effizienz des Servierens von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

In practice, the decoding extracts the num_hypotheses tokens that are most likely to appear after the target prefix. These tokens are then included in the prefix and the decoding completes each hypothesis independently.

Tip

The parameter min_alternative_expansion_prob can be used to filter out alternatives that are very unlikely. The expansion probability corresponds to the probability of the tokens that immediately follow the prefix. Try setting a small value like min_alternative_expansion_prob=0.001 to filter out the most nonsensical alternatives.
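The expansion step and the probability filter can be sketched as follows. This is an illustration under assumptions, not the library code: expand_alternatives and the probabilities in next_token_probs are hypothetical, and whether the real threshold comparison is inclusive is not specified here.

```python
def expand_alternatives(probs, num_hypotheses, min_alternative_expansion_prob=0.0):
    """Pick the most likely tokens after the prefix, dropping very unlikely ones."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return [
        token
        for token, prob in ranked[:num_hypotheses]
        if prob >= min_alternative_expansion_prob
    ]


prefix = ["Dieses", "Projekt", "ist", "auf", "die"]

# Hypothetical probabilities of the token that follows the prefix.
next_token_probs = {
    "effiziente": 0.7,
    "effektive": 0.15,
    "effizientere": 0.1,
    "effizienten": 0.0495,
    "xyz": 0.0005,
}

kept = expand_alternatives(next_token_probs, 5, min_alternative_expansion_prob=0.001)

# Each kept token extends the prefix; decoding then completes each one independently.
extended_prefixes = [prefix + [token] for token in kept]
for extended in extended_prefixes:
    print(" ".join(extended))
```

With the threshold at 0.001, the fifth candidate is dropped and only four alternatives are completed.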

Random sampling

This decoding mode randomly samples tokens from the model output distribution. This strategy is frequently used in back-translation techniques (Edunov et al. 2018). The example below restricts the sampling to the best 10 candidates at each timestep and returns 3 random hypotheses:

results = translator.translate_batch(
    [tokenize(input)],
    beam_size=1,
    sampling_topk=10,
    num_hypotheses=3,
)

for hypothesis in results[0].hypotheses:
    print(detokenize(hypothesis))

Dieses Programm ist auf eine effiziente Bedienung von Standard-Übersetzungsmodellen ausgerichtet und ermöglicht gleichzeitig einen Einsatzort für Experimente rund um die Modellkompression oder das Beschleunigen der Schlussfolgerung.

Es dient dazu, die standardisierten Übersetzungsmodelle effizient zu bedienen, aber auch zur Erprobung um die Formkomprimierung und die Folgebeschleunigung.

Das Projekt richtet sich zwar auf den effizienten Service von Standard-Übersetzungen-Modellen, ist aber auch ein Ort für Experimente rund um Modellkomprimierung und ineffektive Beschleunigung.

Tip

You can increase the randomness of the generation by increasing the value of the argument sampling_temperature.
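The roles of the two sampling arguments can be sketched for a single timestep. This is a minimal illustration, assuming the usual definition of top-k sampling with a softmax temperature. The function sample_topk, its rng parameter, and the scores in logits are hypothetical and not part of the CTranslate2 API.

```python
import math
import random


def sample_topk(logits, topk, temperature=1.0, rng=random):
    """Sample one token among the topk highest-scoring candidates."""
    # Restrict the choice to the topk best candidates.
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:topk]
    tokens = [token for token, _ in top]
    # Dividing by the temperature flattens (> 1) or sharpens (< 1) the distribution.
    scaled = [score / temperature for _, score in top]
    # Softmax weights, shifted by the maximum for numerical stability.
    highest = max(scaled)
    weights = [math.exp(score - highest) for score in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical unnormalized scores for the next token.
logits = {"Projekt": 2.0, "Programm": 1.5, "Haus": -3.0}

rng = random.Random(0)  # seeded generator so that runs are repeatable
print([sample_topk(logits, topk=2, temperature=1.0, rng=rng) for _ in range(5)])
```

With topk=1 the sampling degenerates to greedy search, and a larger temperature moves the weights of the remaining candidates closer together.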