Tokenizer
- class opennmt.tokenizers.Tokenizer
Base class for tokenizers.
Inherits from:
abc.ABC
Extended by:
- property in_graph
Returns
True if this tokenizer can be run in graph (i.e. uses TensorFlow ops).
- export_assets(asset_dir, asset_prefix='')
Exports assets for this tokenizer.
- Parameters
asset_dir – The directory where assets can be written.
asset_prefix – The prefix to attach to asset filenames.
- Returns
A dictionary containing additional assets used by the tokenizer.
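As a rough illustration of how asset_dir and asset_prefix could interact, the sketch below writes a single file and returns a dictionary describing it. Everything in it is an assumption made for illustration: export_assets_sketch, the tokenizer_config.yml filename, and the choice of mapping filenames to paths are hypothetical and are not taken from the library.

```python
import os
import tempfile


def export_assets_sketch(config_text, asset_dir, asset_prefix=""):
    """Hypothetical stand-in: writes one asset file and reports it."""
    # The prefix is attached to the filename, not to the directory.
    filename = asset_prefix + "tokenizer_config.yml"
    path = os.path.join(asset_dir, filename)
    with open(path, "w", encoding="utf-8") as asset_file:
        asset_file.write(config_text)
    # Assumed convention: map each asset filename to where it was written.
    return {filename: path}


with tempfile.TemporaryDirectory() as asset_dir:
    assets = export_assets_sketch("mode: space\n", asset_dir, asset_prefix="source_")
    exported_names = sorted(assets)
    exported_exists = all(os.path.exists(path) for path in assets.values())
```

A prefix such as source_ keeps the assets of two tokenizers apart when both are exported into the same directory.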
- tokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ', training=True)
Tokenizes a stream of sentences.
- Parameters
input_stream – The input stream.
output_stream – The output stream.
delimiter – The token delimiter to use for text serialization.
training – Set to False to tokenize for inference.
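The stream interface can be pictured as a line-by-line loop: read a sentence, tokenize it, and write the tokens joined by the delimiter. The sketch below is a stand-in written in plain Python, not the library's implementation. tokenize_stream_sketch and its tokenize_fn argument are hypothetical names, str.split plays the role of the tokenizer, and the training flag is left out.

```python
import io


def tokenize_stream_sketch(tokenize_fn, input_stream, output_stream, delimiter=" "):
    """Hypothetical stand-in for a line-by-line stream tokenizer."""
    for line in input_stream:
        # Drop the trailing newline so it does not end up inside a token.
        tokens = tokenize_fn(line.rstrip("\r\n"))
        # Serialize the tokens of one sentence on one output line.
        output_stream.write(delimiter.join(tokens))
        output_stream.write("\n")


source = io.StringIO("Hello world!\nHow are you?\n")
target = io.StringIO()
tokenize_stream_sketch(str.split, source, target, delimiter="|")
tokenized_text = target.getvalue()
```

Passing in-memory streams as above is convenient for testing. With the documented defaults the same loop would read from standard input and write to standard output.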
- detokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ')
Detokenizes a stream of sentences.
- Parameters
input_stream – The input stream.
output_stream – The output stream.
delimiter – The token delimiter used for text serialization.
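Detokenizing a stream is the reverse operation: each input line is split on the delimiter used during serialization, and the resulting tokens are merged back into a sentence. The following sketch assumes a trivial detokenizer that joins tokens with spaces. detokenize_stream_sketch and join_with_spaces are hypothetical helpers and do not reflect how the library merges tokens.

```python
import io


def join_with_spaces(tokens):
    """Hypothetical detokenizer: merges tokens with single spaces."""
    return " ".join(tokens)


def detokenize_stream_sketch(detokenize_fn, input_stream, output_stream, delimiter=" "):
    """Hypothetical stand-in for a line-by-line stream detokenizer."""
    for line in input_stream:
        # Recover the token list from its serialized form.
        tokens = line.rstrip("\r\n").split(delimiter)
        output_stream.write(detokenize_fn(tokens))
        output_stream.write("\n")


source = io.StringIO("Hello|world!\nHow|are|you?\n")
target = io.StringIO()
detokenize_stream_sketch(join_with_spaces, source, target, delimiter="|")
detokenized_text = target.getvalue()
```

The delimiter has to match the one used when the tokens were written, otherwise each line would be treated as a single token.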
- tokenize(text, training=True)
Tokenizes text.
- Parameters
text – A string or batch of strings to tokenize as a tf.Tensor or Python values.
training – Set to False to tokenize for inference.
- Returns
If text is a Python string, a list of Python strings.
If text is a list of Python strings, a list of lists of Python strings.
If text is a 0-D tf.Tensor, a 1-D tf.Tensor.
If text is a 1-D tf.Tensor, a 2-D tf.RaggedTensor.
- Raises
ValueError – if the rank of text is greater than 1.
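The return types for Python inputs follow a simple pattern: the output has one more level of nesting than the input. The sketch below mirrors that pattern with a whitespace tokenizer in plain Python. tokenize_sketch is a hypothetical function, it does not cover the tf.Tensor cases, and it ignores the training flag.

```python
def tokenize_sketch(text):
    """Hypothetical whitespace tokenizer that mirrors the documented shapes."""
    if isinstance(text, str):
        # Rank 0: one string in, one list of tokens out.
        return text.split()
    if isinstance(text, (list, tuple)):
        if not all(isinstance(item, str) for item in text):
            # A nested list would be a rank 2 input, which is rejected.
            raise ValueError("expected a string or a flat list of strings")
        # Rank 1: one list of tokens per input string.
        return [item.split() for item in text]
    raise TypeError("unsupported input type: %s" % type(text).__name__)


single = tokenize_sketch("Hello world !")
batch = tokenize_sketch(["Hello world !", "Bye"])
```

The rows of the batched result can have different lengths, which is why the tensor counterpart of this case is documented as returning a tf.RaggedTensor rather than a dense tensor.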
- detokenize(tokens, sequence_length=None)
Detokenizes tokens.
The Tensor version supports batches of tokens.
- Parameters
tokens – Tokens or batch of tokens as a tf.Tensor, tf.RaggedTensor, or Python values.
sequence_length – The length of each sequence. Required if tokens is a dense 2-D tf.Tensor.
- Returns
If tokens is a list of lists of Python strings, a list of Python strings.
If tokens is a list of Python strings, a Python string.
If tokens is an N-D tf.Tensor (or tf.RaggedTensor), an (N-1)-D tf.Tensor.
- Raises
ValueError – if the rank of tokens is greater than 2.
ValueError – if tokens is a 2-D dense tf.Tensor and sequence_length is not set.
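The sequence_length requirement for dense batches is easier to see with a small example. A dense batch pads every row to the same width, so the detokenizer needs the true lengths to ignore the padding, whereas ragged rows already carry their own length. The sketch below imitates this with Python lists. detokenize_sketch, the space-joining rule, and the use of empty strings as padding are assumptions made for illustration only.

```python
def detokenize_sketch(tokens, sequence_length=None):
    """Hypothetical space-joining detokenizer for Python values."""
    if all(isinstance(token, str) for token in tokens):
        # Rank 1: a single sentence, optionally truncated to its true length.
        if sequence_length is not None:
            tokens = tokens[:sequence_length]
        return " ".join(tokens)
    if sequence_length is None:
        # Rank 2 with ragged rows: every row already has its own length.
        return [" ".join(row) for row in tokens]
    # Rank 2 with padded rows: keep only the first `length` tokens of each row.
    return [" ".join(row[:length]) for row, length in zip(tokens, sequence_length)]


sentence = detokenize_sketch(["Hello", "world", "!"])
ragged = detokenize_sketch([["Hello", "world", "!"], ["Bye"]])
padded = detokenize_sketch(
    [["Hello", "world", "!"], ["Bye", "", ""]], sequence_length=[3, 1]
)
```

In this sketch the ragged and padded calls produce the same sentences, because the lengths passed to the padded call remove exactly the padding that the ragged rows never had.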