Tokenizer
- class opennmt.tokenizers.Tokenizer
Base class for tokenizers.
Inherits from:
abc.ABC
Extended by:
- property in_graph
Returns True if this tokenizer can be run in graph (i.e. uses TensorFlow ops).
- export_assets(asset_dir, asset_prefix='')
Exports assets for this tokenizer.
- Parameters
asset_dir – The directory where assets can be written.
asset_prefix – The prefix to attach to asset filenames.
- Returns
A dictionary containing additional assets used by the tokenizer.
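As a rough illustration of this contract, the sketch below writes one configuration file into asset_dir, prepends asset_prefix to its name, and returns a dictionary describing it. ToyTokenizer, the tokenizer_config.json filename, and the filename-to-path layout of the returned dictionary are assumptions made for the example, not part of the documented API.

```python
import json
import os
import tempfile


class ToyTokenizer:
    """Hypothetical stand-in used for illustration; not an OpenNMT-tf class."""

    def __init__(self, delimiter=" "):
        self.delimiter = delimiter

    def export_assets(self, asset_dir, asset_prefix=""):
        # Build the asset filename by prepending the prefix (assumed naming scheme).
        filename = "%stokenizer_config.json" % asset_prefix
        path = os.path.join(asset_dir, filename)
        with open(path, "w", encoding="utf-8") as config_file:
            json.dump({"delimiter": self.delimiter}, config_file)
        # Assumed layout: map each asset filename to its path on disk.
        return {filename: path}


with tempfile.TemporaryDirectory() as asset_dir:
    assets = ToyTokenizer().export_assets(asset_dir, asset_prefix="source_")
    exported_names = sorted(assets)
    exported_exists = all(os.path.isfile(path) for path in assets.values())
```

A prefix such as "source_" keeps the assets of several tokenizers apart when they are exported into the same directory.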
- tokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ', training=True)
Tokenizes a stream of sentences.
- Parameters
input_stream – The input stream.
output_stream – The output stream.
delimiter – The token delimiter to use for text serialization.
training – Set to False to tokenize for inference.
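The stream behaviour described here can be pictured with a small pure-Python sketch: read one sentence per line, tokenize it, and write the tokens joined by delimiter. tokenize_stream_sketch is a hypothetical helper that uses whitespace splitting in place of a real tokenizer and leaves out the training flag; it is not the library's implementation.

```python
import io


def tokenize_stream_sketch(tokenize_fn, input_stream, output_stream, delimiter=" "):
    """Tokenizes each input line and writes the tokens joined by the delimiter."""
    for line in input_stream:
        tokens = tokenize_fn(line.strip())
        output_stream.write(delimiter.join(tokens))
        output_stream.write("\n")


# In-memory streams stand in for the stdin/stdout defaults.
source = io.StringIO("Hello world !\nHow are you ?\n")
target = io.StringIO()
tokenize_stream_sketch(str.split, source, target, delimiter="|")
tokenized_text = target.getvalue()
```

Choosing a delimiter that cannot occur inside a token keeps the serialized output unambiguous when it is read back.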
- detokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ')
Detokenizes a stream of sentences.
- Parameters
input_stream – The input stream.
output_stream – The output stream.
delimiter – The token delimiter used for text serialization.
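The reverse direction can be sketched the same way: split each serialized line on delimiter to recover the tokens, then detokenize them. detokenize_stream_sketch is a hypothetical helper that joins tokens with spaces, which is only one possible detokenization rule.

```python
import io


def detokenize_stream_sketch(detokenize_fn, input_stream, output_stream, delimiter=" "):
    """Splits each line on the delimiter and writes the detokenized sentence."""
    for line in input_stream:
        tokens = line.strip().split(delimiter)
        output_stream.write(detokenize_fn(tokens))
        output_stream.write("\n")


serialized = io.StringIO("Hello|world|!\nHow|are|you|?\n")
restored = io.StringIO()
# " ".join acts as a minimal detokenizer that separates tokens with spaces.
detokenize_stream_sketch(" ".join, serialized, restored, delimiter="|")
restored_text = restored.getvalue()
```

The delimiter passed here has to match the one used when the text was serialized, otherwise the token boundaries are lost.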
- tokenize(text, training=True)
Tokenizes text.
- Parameters
text – A string or batch of strings to tokenize as a tf.Tensor or Python values.
training – Set to False to tokenize for inference.
- Returns
If text is a Python string, a list of Python strings.
If text is a list of Python strings, a list of lists of Python strings.
If text is a 0-D tf.Tensor, a 1-D tf.Tensor.
If text is a 1-D tf.Tensor, a 2-D tf.RaggedTensor.
- Raises
ValueError – if the rank of text is greater than 1.
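For Python values, the relation between input and output shapes can be sketched without TensorFlow. tokenize_sketch is a hypothetical helper that splits on whitespace; it covers only the string and list-of-strings cases and treats a nested list as an input of rank greater than 1.

```python
def tokenize_sketch(text):
    """Whitespace tokenization of Python values (hypothetical helper)."""
    if isinstance(text, str):
        # A single string gives a flat list of tokens.
        return text.split()
    if any(not isinstance(sentence, str) for sentence in text):
        # A nested batch has rank 2 or more, which the documented contract rejects.
        raise ValueError("text must be a string or a flat list of strings")
    # A batch of strings gives one token list per sentence.
    return [sentence.split() for sentence in text]


single = tokenize_sketch("Hello world !")
batch = tokenize_sketch(["Hello world !", "How are you ?"])

try:
    tokenize_sketch([["Hello world !"]])
    rank_error_raised = False
except ValueError:
    rank_error_raised = True
```

The token lists in a batch may have different lengths, which is why the tensor version returns a tf.RaggedTensor for 1-D input.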
- detokenize(tokens, sequence_length=None)
Detokenizes tokens.
The Tensor version supports batches of tokens.
- Parameters
tokens – Tokens or batch of tokens as a tf.Tensor, tf.RaggedTensor, or Python values.
sequence_length – The length of each sequence. Required if tokens is a dense 2-D tf.Tensor.
- Returns
If tokens is a list of lists of Python strings, a list of Python strings.
If tokens is a list of Python strings, a Python string.
If tokens is an N-D tf.Tensor (or tf.RaggedTensor), an (N-1)-D tf.Tensor.
- Raises
ValueError – if the rank of tokens is greater than 2.
ValueError – if tokens is a 2-D dense tf.Tensor and sequence_length is not set.
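The role of sequence_length is easiest to see on a padded batch: a dense 2-D batch stores every row at the same width, so the true lengths are needed to drop the padding before joining. The sketch below emulates this with plain Python lists; detokenize_sketch is a hypothetical helper that joins tokens with spaces and, unlike the tensor case described above, treats sequence_length as optional.

```python
def detokenize_sketch(tokens, sequence_length=None):
    """Joins tokens with spaces, trimming padding when lengths are given."""
    if tokens and isinstance(tokens[0], (list, tuple)):
        if sequence_length is not None:
            # Keep only the first `length` tokens of each padded row.
            tokens = [row[:length] for row, length in zip(tokens, sequence_length)]
        # A batch of token lists gives one string per row.
        return [" ".join(row) for row in tokens]
    # A flat token list gives a single string.
    return " ".join(tokens)


sentence = detokenize_sketch(["Hello", "world", "!"])
padded_batch = [["Hello", "world", "!"], ["Hi", "", ""]]
sentences = detokenize_sketch(padded_batch, sequence_length=[3, 1])
```

Without the lengths, the padding entries of the second row would be joined into the output as trailing spaces.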