Tokenizer
- class opennmt.tokenizers.Tokenizer
- Base class for tokenizers.
- Inherits from: abc.ABC
- Extended by:
 - property in_graph
- Returns
- True if this tokenizer can be run in graph (i.e. uses TensorFlow ops).
 - export_assets(asset_dir, asset_prefix='')
- Exports assets for this tokenizer.
- Parameters
- asset_dir – The directory where assets can be written.
- asset_prefix – The prefix to attach to asset filenames.
 
- Returns
- A dictionary containing additional assets used by the tokenizer. 
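As a rough illustration of the export contract described above, the sketch below uses a hypothetical stand-in class (`VocabTokenizer`, which is not part of opennmt and does not import it). It writes one asset file into `asset_dir`, prepends `asset_prefix` to the filename, and returns a dictionary describing the asset. The file name `vocab.txt` and the choice to map filenames to paths are assumptions made for this example; the documentation only states that a dictionary of additional assets is returned.

```python
import os
import tempfile


class VocabTokenizer:
    """Hypothetical stand-in; it does not inherit from opennmt.tokenizers.Tokenizer."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def export_assets(self, asset_dir, asset_prefix=""):
        # Assumption: the asset is a plain-text vocabulary, one token per line.
        filename = asset_prefix + "vocab.txt"
        path = os.path.join(asset_dir, filename)
        with open(path, "w", encoding="utf-8") as vocab_file:
            for token in self.vocabulary:
                vocab_file.write(token + "\n")
        # Assumption: the returned dictionary maps the asset filename to its path.
        return {filename: path}


with tempfile.TemporaryDirectory() as asset_dir:
    tokenizer = VocabTokenizer(["hello", "world"])
    assets = tokenizer.export_assets(asset_dir, asset_prefix="source_")
    exported_names = sorted(assets)
    exported_exists = all(os.path.exists(path) for path in assets.values())
```

The prefix keeps assets from several tokenizers (for example source and target) from colliding when they share one directory.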
 
 - tokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ', training=True)
- Tokenizes a stream of sentences.
- Parameters
- input_stream – The input stream.
- output_stream – The output stream.
- delimiter – The token delimiter to use for text serialization.
- training – Set to False to tokenize for inference.
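The stream behaviour can be pictured with a minimal sketch. It assumes each input line is one sentence and that the tokens of a sentence are joined with `delimiter` and written as one output line. The free function `tokenize_stream` and the toy `char_tokenize` below are hypothetical stand-ins written for this example, not the library's implementation, and in-memory `io.StringIO` objects replace the default standard streams.

```python
import io


def tokenize_stream(tokenize, input_stream, output_stream, delimiter=" ", training=True):
    """Hypothetical helper mirroring the documented parameters."""
    for line in input_stream:
        # Assumption: one sentence per line; the newline is not part of the text.
        tokens = tokenize(line.rstrip("\r\n"), training=training)
        output_stream.write(delimiter.join(tokens) + "\n")


def char_tokenize(text, training=True):
    # Toy tokenizer (assumption): one token per character; `training` is ignored.
    return list(text)


source = io.StringIO("abc\nde\n")
target = io.StringIO()
tokenize_stream(char_tokenize, source, target, delimiter=" ")
tokenized_text = target.getvalue()
```

Here `tokenized_text` holds one line of space-separated tokens per input sentence.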
 
 
 - detokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ')
- Detokenizes a stream of sentences.
- Parameters
- input_stream – The input stream. 
- output_stream – The output stream. 
- delimiter – The token delimiter used for text serialization. 
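A minimal sketch of the reverse direction, assuming each input line holds the tokens of one sentence separated by `delimiter`. The free function `detokenize_stream` and the toy `char_detokenize` are hypothetical stand-ins for illustration, not the library's code.

```python
import io


def detokenize_stream(detokenize, input_stream, output_stream, delimiter=" "):
    """Hypothetical helper mirroring the documented parameters."""
    for line in input_stream:
        # Split the serialized line back into tokens using the same delimiter.
        tokens = line.rstrip("\r\n").split(delimiter)
        output_stream.write(detokenize(tokens) + "\n")


def char_detokenize(tokens):
    # Toy detokenizer (assumption): tokens are single characters to concatenate.
    return "".join(tokens)


source = io.StringIO("a b c\nd e\n")
target = io.StringIO()
detokenize_stream(char_detokenize, source, target, delimiter=" ")
detokenized_text = target.getvalue()
```

The delimiter passed here has to match the one used when the tokens were serialized, otherwise the lines are split at the wrong places.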
 
 
 - tokenize(text, training=True)
- Tokenizes text.
- Parameters
- text – A string or batch of strings to tokenize as a tf.Tensor or Python values.
- training – Set to False to tokenize for inference.
 
- Returns
- If text is a Python string, a list of Python strings.
- If text is a list of Python strings, a list of lists of Python strings.
- If text is a 0-D tf.Tensor, a 1-D tf.Tensor.
- If text is a 1-D tf.Tensor, a 2-D tf.RaggedTensor.
 
- Raises
- ValueError – if the rank of text is greater than 1.
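The input-to-output mapping for Python values can be sketched with a hypothetical stand-in class. `WhitespaceTokenizer` below is written for this example only: it does not import opennmt, covers only Python strings and lists (not `tf.Tensor` inputs), and splits on whitespace as an assumed tokenization rule.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in covering only Python inputs, not tf.Tensor."""

    def tokenize(self, text, training=True):
        # `training` is accepted for signature parity; this toy tokenizer ignores it.
        if isinstance(text, str):
            # A single string yields a flat list of tokens.
            return text.split()
        if isinstance(text, (list, tuple)):
            if not all(isinstance(sentence, str) for sentence in text):
                # Mirrors the documented error for inputs of rank greater than 1.
                raise ValueError("text must be a string or a flat list of strings")
            # A batch of strings yields one token list per string.
            return [sentence.split() for sentence in text]
        raise TypeError("unsupported input type: %s" % type(text).__name__)


tokenizer = WhitespaceTokenizer()
single = tokenizer.tokenize("Hello world !")
batch = tokenizer.tokenize(["Hello world !", "How are you ?"])
```

The batch result is naturally ragged, since sentences differ in length; this is the same reason the tensor variant returns a `tf.RaggedTensor` for 1-D input.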
 
 - detokenize(tokens, sequence_length=None)
- Detokenizes tokens.
- The Tensor version supports batches of tokens.
- Parameters
- tokens – Tokens or batch of tokens as a tf.Tensor, tf.RaggedTensor, or Python values.
- sequence_length – The length of each sequence. Required if tokens is a dense 2-D tf.Tensor.
 
- Returns
- If tokens is a list of lists of Python strings, a list of Python strings.
- If tokens is a list of Python strings, a Python string.
- If tokens is an N-D tf.Tensor (or tf.RaggedTensor), an (N-1)-D tf.Tensor.
 
- Raises
- ValueError – if the rank of tokens is greater than 2.
- ValueError – if tokens is a 2-D dense tf.Tensor and sequence_length is not set.
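To make the role of `sequence_length` more concrete, here is a minimal sketch that uses plain Python lists in place of tensors. The function `detokenize` below is a hypothetical stand-in, and joining tokens with a space is an assumed detokenization rule. A padded list of lists plays the part of a dense 2-D batch, where the true length of each row cannot be recovered without `sequence_length`.

```python
def detokenize(tokens, sequence_length=None):
    """Hypothetical stand-in using Python lists in place of tensors."""
    if not tokens or isinstance(tokens[0], str):
        # One sequence of tokens: returns a single string.
        return " ".join(tokens)
    if not all(isinstance(token, str) for sequence in tokens for token in sequence):
        # Mirrors the documented error for inputs of rank greater than 2.
        raise ValueError("tokens must have rank 1 or 2")
    if sequence_length is None:
        # Ragged batch: every row already has its true length.
        return [" ".join(sequence) for sequence in tokens]
    # Padded (dense-like) batch: keep only the first `length` tokens of each row.
    return [" ".join(sequence[:length]) for sequence, length in zip(tokens, sequence_length)]


sentence = detokenize(["Hello", "world", "!"])
ragged = detokenize([["Hello", "world", "!"], ["Hi"]])
padded = detokenize([["Hello", "world", "!"], ["Hi", "", ""]], sequence_length=[3, 1])
```

Unlike this stand-in, the documented method raises `ValueError` when a dense 2-D `tf.Tensor` is passed without `sequence_length`, because padding cannot be told apart from real tokens.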