WordNoiser

class opennmt.data.WordNoiser(noises=None, subword_token='■', is_spacer=None)

Applies noise to word sequences.

Inherits from: builtins.object

__init__(noises=None, subword_token='■', is_spacer=None)

Initializes the noising class.

Parameters
  • noises – A list of opennmt.data.Noise instances to apply sequentially.

  • subword_token – The special token used by the subword tokenizer. This is required when the noise should be applied at the word level and not the subword level.

  • is_spacer – Whether subword_token is used as a spacer (as in SentencePiece) or a joiner (as in BPE). If None, this is inferred from subword_token.
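The spacer/joiner distinction decides where word boundaries fall, which matters when noise is applied to whole words. The following is a minimal pure-Python sketch of that grouping, not the library's implementation. The helper group_words is hypothetical, the rule used when is_spacer is None (only "▁" counts as a spacer) is an assumption, and spacers are only handled as token prefixes.

```python
def group_words(tokens, subword_token="■", is_spacer=None):
    """Group subword tokens into words (illustrative helper, not part of opennmt)."""
    if is_spacer is None:
        # Assumption: only the SentencePiece marker is treated as a spacer.
        is_spacer = subword_token == "▁"
    words = []
    prev_joins_right = False  # joiner mode: did the previous token end with the marker?
    for i, token in enumerate(tokens):
        if is_spacer:
            # A spacer prefix marks the start of a word; the first token always starts one.
            starts_word = i == 0 or token.startswith(subword_token)
        else:
            # A joiner marks attachment to the neighboring token on that side.
            starts_word = not (prev_joins_right or token.startswith(subword_token))
        if starts_word or not words:
            words.append([token])
        else:
            words[-1].append(token)
        prev_joins_right = token.endswith(subword_token)
    return words


# Joiner convention: the marker glues a token to its neighbor.
joined = group_words(["Hel■", "lo", "world", "■!"], subword_token="■")
# Spacer convention: the marker opens a new word.
spaced = group_words(["▁Hel", "lo", "▁world", "!"], subword_token="▁")
```

In both calls the four tokens collapse into two words, so a word-level noise would drop or move "Hel" and "lo" together rather than separately.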

add(noise)

Adds a noise to the list of noises to apply.
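Because noises are applied sequentially, the order in which they are added can change the result. The sketch below mirrors that pattern in plain Python. SequentialNoiser, drop_first, and reverse are hypothetical stand-ins written for illustration and are not opennmt classes or functions.

```python
class SequentialNoiser:
    """Illustrative stand-in for the add/apply pattern (not the opennmt class)."""

    def __init__(self, noises=None):
        # Copy the list so the caller's list is not mutated by add().
        self.noises = list(noises) if noises is not None else []

    def add(self, noise):
        self.noises.append(noise)

    def __call__(self, words):
        # Each noise receives the output of the previous one, in insertion order.
        for noise in self.noises:
            words = noise(words)
        return words


def drop_first(words):
    """Toy deterministic noise: remove the first word."""
    return words[1:]


def reverse(words):
    """Toy deterministic noise: reverse the word order."""
    return words[::-1]


noiser = SequentialNoiser([drop_first])
noiser.add(reverse)
result = noiser(["a", "b", "c"])
swapped = SequentialNoiser([reverse, drop_first])(["a", "b", "c"])
```

Here result and swapped differ even though the same two noises are used, which is why the order of the noises list and of subsequent add calls is worth choosing deliberately.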

__call__(tokens, sequence_length=None, keep_shape=False, probability=None)

Applies noise to tokens.

Parameters
  • tokens – A string tf.Tensor, a batch of string tf.Tensor, or a string tf.RaggedTensor.

  • sequence_length – When tokens is a dense tensor, the length of each sequence in the batch.

  • keep_shape – Whether to keep the original dense shape. If False, the shape is fit to the new lengths.

  • probability – The probability of applying noise to each example.

Returns

If tokens is a tf.RaggedTensor, the method returns the noisy tokens as a tf.RaggedTensor, otherwise it returns a tuple with the noisy tokens as a tf.Tensor and the new lengths.

Raises
  • ValueError – if tokens is a batch of strings but sequence_length is not passed.

  • ValueError – if keep_shape is True but tokens is a tf.RaggedTensor.
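The dense-input contract (noisy tokens plus new lengths, optionally padded back to the original width) can be illustrated without TensorFlow. This is a sketch under stated assumptions rather than the library's code: noisy_batch and its noise_fn and rng parameters are hypothetical, plain lists stand in for tensors, empty strings are assumed as padding, a probability of None is assumed to mean the noise is always applied, and the noise is assumed never to lengthen a sequence.

```python
import random


def noisy_batch(batch, lengths, noise_fn, keep_shape=False, probability=None, rng=None):
    """Apply noise_fn per example and return (padded tokens, new lengths)."""
    rng = rng or random.Random()
    width = max((len(row) for row in batch), default=0)
    outputs, new_lengths = [], []
    for tokens, length in zip(batch, lengths):
        tokens = list(tokens[:length])  # strip padding before applying noise
        # Assumption: probability=None means the noise is always applied.
        if probability is None or rng.random() < probability:
            tokens = noise_fn(tokens)
        outputs.append(tokens)
        new_lengths.append(len(tokens))
    # keep_shape pads back to the input width; otherwise fit to the longest output.
    target = width if keep_shape else max(new_lengths, default=0)
    padded = [seq + [""] * (target - len(seq)) for seq in outputs]
    return padded, new_lengths


def drop_first(words):
    """Toy deterministic noise: remove the first token."""
    return words[1:]


batch = [["a", "b", "c"], ["d", "e", ""]]
lengths = [3, 2]

fitted, fitted_lengths = noisy_batch(batch, lengths, drop_first)
kept, kept_lengths = noisy_batch(batch, lengths, drop_first, keep_shape=True)
untouched, untouched_lengths = noisy_batch(batch, lengths, drop_first, probability=0.0)
```

With keep_shape left at False the batch shrinks from width 3 to width 2, while keep_shape=True keeps width 3 and relies on the returned lengths to mark where each sequence ends.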