WordNoiser

class opennmt.data.WordNoiser(noises=None, subword_token='￭', is_spacer=None)[source]

Applies noise to word sequences.

Inherits from: builtins.object

__init__(noises=None, subword_token='￭', is_spacer=None)[source]

Initializes the noising class.

Parameters

noises – A list of opennmt.data.Noise instances to apply sequentially.
subword_token – The special token used by the subword tokenizer. This is required when the noise should be applied at the word level and not the subword level.
is_spacer – Whether subword_token is used as a spacer (as in SentencePiece) or a joiner (as in BPE). If None, this is inferred from subword_token.

See also

opennmt.data.tokens_to_words()
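
To illustrate why subword_token and is_spacer matter, the following minimal sketch groups subword tokens into words using plain Python lists. The helper group_into_words is hypothetical and is not the library's implementation (the library operates on tensors, see opennmt.data.tokens_to_words()). It assumes that a joiner can appear at the start or the end of a token, and that a spacer appears at the start of a word.

```python
def group_into_words(tokens, subword_token="￭", is_spacer=False):
    """Group subword tokens into words (hypothetical helper, not library code)."""
    words = []
    joined_to_next = False  # True when the previous token ended with a joiner
    for token in tokens:
        if is_spacer:
            # Spacer convention: a token starting with the marker opens a new word.
            starts_word = not words or token.startswith(subword_token)
        else:
            # Joiner convention: a marker at the start attaches the token to the
            # previous one; a marker at the end attaches the next token to it.
            attached = token.startswith(subword_token) or joined_to_next
            starts_word = not words or not attached
            joined_to_next = token.endswith(subword_token)
        if starts_word:
            words.append([token])
        else:
            words[-1].append(token)
    return words


# Joiner style: "￭" marks tokens that are glued to a neighbor.
joiner_words = group_into_words(["He", "￭llo", "wor", "￭ld"])
# Spacer style: "▁" marks the first token of each word.
spacer_words = group_into_words(["▁He", "llo", "▁world"], "▁", is_spacer=True)
```

Word-level noise then acts on whole groups, so dropping a word removes all of its subwords together rather than leaving a dangling fragment.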

add(noise)[source]: Adds a noise to apply.

__call__(tokens, sequence_length=None, keep_shape=False, probability=None)[source]

Applies noise to tokens.

Parameters

tokens – A string tf.Tensor, a batch of string tf.Tensor, or a string tf.RaggedTensor.
sequence_length – When tokens is a dense tensor, the length of each sequence in the batch.
keep_shape – If True, keep the original dense shape. Otherwise, fit the shape to the new lengths.
probability – Probability to apply noise on each example.

Returns

If tokens is a tf.RaggedTensor, the method returns the noisy tokens as a tf.RaggedTensor, otherwise it returns a tuple with the noisy tokens as a tf.Tensor and the new lengths.

Raises

ValueError – if tokens is a batch of strings but sequence_length is not passed.
ValueError – if keep_shape is True but tokens is a tf.RaggedTensor.
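
The call contract described above (per-example probability, noises applied in order, and the effect of keep_shape on the returned shape) can be sketched with plain Python lists. This is an illustrative stand-in, not the library code: apply_noises and drop_last are hypothetical names, padding with empty strings is an assumption, and the real method works on tensors.

```python
import random


def apply_noises(batch, lengths, noises, keep_shape=False, probability=None, rng=random):
    """Illustrative stand-in for the documented call contract (not library code)."""
    noisy = []
    for tokens, length in zip(batch, lengths):
        words = list(tokens[:length])  # ignore padding beyond the true length
        # With probability=None every example is noised; otherwise each example
        # is noised independently with the given probability.
        if probability is None or rng.random() < probability:
            for noise in noises:  # noises run in the order they were added
                words = noise(words)
        noisy.append(words)
    new_lengths = [len(words) for words in noisy]
    # keep_shape=True pads back to the original width; otherwise the width
    # shrinks to the longest noisy sequence.
    width = len(batch[0]) if keep_shape else max(new_lengths, default=0)
    padded = [words + [""] * (width - len(words)) for words in noisy]
    return padded, new_lengths


def drop_last(words):
    """Deterministic toy noise: remove the final word."""
    return words[:-1]


batch = [["a", "b", "c"], ["d", "e", ""]]
lengths = [3, 2]
fitted, fitted_lengths = apply_noises(batch, lengths, [drop_last])
kept, kept_lengths = apply_noises(batch, lengths, [drop_last], keep_shape=True)
```

In this sketch, fitted has two columns because the longest noisy sequence has two words, while kept retains the original three columns. The new lengths are the same in both cases.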