Tokenizer

class opennmt.tokenizers.Tokenizer[source]

Base class for tokenizers.

Inherits from: abc.ABC

Extended by:

property in_graph: Returns True if this tokenizer can be run in graph (i.e. uses TensorFlow ops).

export_assets(asset_dir, asset_prefix='')[source]

Exports assets for this tokenizer.

Parameters

asset_dir – The directory where assets can be written.
asset_prefix – The prefix to attach to asset filenames.

Returns

A dictionary containing additional assets used by the tokenizer.
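The export contract above can be illustrated with a minimal standalone sketch. Everything in it is an assumption for illustration: the function sketch_export_assets, the file name tokenizer_config.txt, the file contents, and the layout of the returned dictionary (file name mapped to path) are hypothetical and are not taken from the library.

```python
import os
import tempfile


def sketch_export_assets(asset_dir, asset_prefix=""):
    """Hypothetical stand-in for export_assets (not the library code)."""
    # The prefix is attached to the asset file name, as documented.
    filename = asset_prefix + "tokenizer_config.txt"  # assumed file name
    path = os.path.join(asset_dir, filename)
    with open(path, "w", encoding="utf-8") as config_file:
        config_file.write("mode: space\n")  # assumed contents
    # Assumed dictionary layout: asset file name -> path on disk.
    return {filename: path}


with tempfile.TemporaryDirectory() as asset_dir:
    assets = sketch_export_assets(asset_dir, asset_prefix="source_")
    exported_names = sorted(assets)
    all_exist = all(os.path.exists(path) for path in assets.values())
```

A prefix such as "source_" keeps assets of several tokenizers apart when they are written to the same directory.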

tokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ', training=True)[source]

Tokenizes a stream of sentences.

Parameters

input_stream – The input stream.
output_stream – The output stream.
delimiter – The token delimiter to use for text serialization.
training – Set to False to tokenize for inference.
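As a rough illustration of the stream contract described above, the following sketch reads one sentence per line and writes the tokens joined by the delimiter. It is a standalone approximation, not the library implementation: sketch_tokenize_stream is a hypothetical name, whitespace splitting stands in for a real tokenizer, and the training flag is omitted.

```python
import io


def sketch_tokenize_stream(input_stream, output_stream, delimiter=" "):
    """Hypothetical stand-in for tokenize_stream (not the library code)."""
    for line in input_stream:
        # Whitespace splitting is an assumed placeholder tokenization.
        tokens = line.strip().split()
        output_stream.write(delimiter.join(tokens))
        output_stream.write("\n")


# In-memory streams replace stdin/stdout for the example.
source = io.StringIO("Hello   world !\nHow are you ?\n")
target = io.StringIO()
sketch_tokenize_stream(source, target, delimiter="|")
result = target.getvalue()
print(result)
```

Any text file object can be passed in place of the in-memory streams, since only iteration over lines and write() are used.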

detokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ')[source]

Detokenizes a stream of sentences.

Parameters

input_stream – The input stream.
output_stream – The output stream.
delimiter – The token delimiter used for text serialization.
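The reverse direction can be sketched the same way: each input line is split on the delimiter that was used for serialization, and the tokens are merged back into a sentence. This is a standalone approximation under stated assumptions: sketch_detokenize_stream is a hypothetical name, and joining tokens with a single space stands in for a real detokenizer.

```python
import io


def sketch_detokenize_stream(input_stream, output_stream, delimiter=" "):
    """Hypothetical stand-in for detokenize_stream (not the library code)."""
    for line in input_stream:
        # Recover the tokens using the serialization delimiter.
        tokens = line.strip().split(delimiter)
        # Joining with a space is an assumed placeholder detokenization.
        output_stream.write(" ".join(tokens))
        output_stream.write("\n")


serialized = io.StringIO("Hello|world|!\nHow|are|you|?\n")
restored = io.StringIO()
sketch_detokenize_stream(serialized, restored, delimiter="|")
text = restored.getvalue()
print(text)
```

The delimiter has to match the one used when the tokens were written; otherwise each line is read back as a single token.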

tokenize(text, training=True)[source]

Tokenizes text.

Parameters

text – A string or batch of strings to tokenize as a tf.Tensor or Python values.
training – Set to False to tokenize for inference.

Returns

If text is a Python string, a list of Python strings.
If text is a list of Python strings, a list of lists of Python strings.
If text is a 0-D tf.Tensor, a 1-D tf.Tensor.
If text is a 1-D tf.Tensor, a 2-D tf.RaggedTensor.

Raises

ValueError – if the rank of text is greater than 1.
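The return-type rules for Python values can be summarized in a small standalone sketch. It only mirrors the documented input/output shapes: sketch_tokenize is a hypothetical function, whitespace splitting is an assumed placeholder for real tokenization, and the tf.Tensor cases are not covered.

```python
from typing import List, Union


def sketch_tokenize(text: Union[str, List[str]]):
    """Hypothetical stand-in mirroring the documented Python return shapes."""
    if isinstance(text, str):
        # A single string yields a flat list of tokens.
        return text.split()
    if isinstance(text, list) and all(isinstance(t, str) for t in text):
        # A batch of strings yields one token list per string.
        return [t.split() for t in text]
    # Deeper nesting is rejected, by analogy with the rank check above.
    raise ValueError("expected a string or a flat list of strings")


single = sketch_tokenize("Hello world !")
batch = sketch_tokenize(["Hello world !", "How are you ?"])
print(single)
print(batch)
```

The output always has one more level of nesting than the input, which matches the tensor cases where a 0-D input gives a 1-D result and a 1-D input gives a 2-D result.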

detokenize(tokens, sequence_length=None)[source]

Detokenizes tokens.

The Tensor version supports batches of tokens.

Parameters

tokens – Tokens or batch of tokens as a tf.Tensor, tf.RaggedTensor, or Python values.
sequence_length – The length of each sequence. Required if tokens is a dense 2-D tf.Tensor.

Returns

If tokens is a list of lists of Python strings, a list of Python strings.
If tokens is a list of Python strings, a Python string.
If tokens is an N-D tf.Tensor (or tf.RaggedTensor), an (N-1)-D tf.Tensor.

Raises

ValueError – if the rank of tokens is greater than 2.
ValueError – if tokens is a 2-D dense tf.Tensor and sequence_length is not set.
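The need for sequence_length with a dense 2-D batch can be shown with plain Python lists standing in for a padded tensor: rows are padded to a common width, so the true length of each row has to be supplied separately. This is a sketch under stated assumptions: sketch_detokenize_dense is a hypothetical function, the empty string is an assumed padding value, and joining with a space is a placeholder detokenization.

```python
from typing import List, Optional, Sequence


def sketch_detokenize_dense(
    tokens: Sequence[Sequence[str]],
    sequence_length: Optional[Sequence[int]] = None,
) -> List[str]:
    """Hypothetical stand-in for detokenizing a padded 2-D batch."""
    if sequence_length is None:
        # Padding cannot be told apart from real tokens without lengths.
        raise ValueError("sequence_length is required for a dense 2-D batch")
    # Keep only the first `length` tokens of each row, then merge them.
    return [
        " ".join(row[:length])
        for row, length in zip(tokens, sequence_length)
    ]


# The first row holds 3 real tokens followed by 1 padding entry.
dense_batch = [["Hello", "world", "!", ""], ["How", "are", "you", "?"]]
sentences = sketch_detokenize_dense(dense_batch, sequence_length=[3, 4])
print(sentences)
```

A ragged batch carries its own row lengths, which is why the length argument is only required for the dense case.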