opennmt.optimizers.adafactor module


class opennmt.optimizers.adafactor.AdafactorOptimizer(multiply_by_parameter_scale=True, learning_rate=None, decay_rate=None, beta1=0.0, clipping_threshold=1.0, factored=True, use_locking=False, name='Adafactor', epsilon1=1e-30, epsilon2=0.001)


Optimizer that implements the Adafactor algorithm.

Adafactor is described in

Adafactor is most similar to Adam (Kingma and Ba). The major differences are:

  1. For a two-dimensional AxB weight matrix, Adafactor uses only A+B auxiliary parameters to maintain the second-moment estimator, instead of AB. This is advantageous on memory-limited systems. In addition, beta1 (momentum) is set to zero by default, saving an additional auxiliary parameter per weight. Variables with >=3 dimensions are treated as collections of two-dimensional matrices - factorization is over the final two dimensions.
  2. Adafactor incorporates “update-clipping”, a scale-invariant analog of gradient clipping. This adds stability.
  3. Adafactor does not require an external “learning rate”. By default, it incorporates a relative-update-scale schedule, corresponding to inverse-square-root learning-rate decay in Adam. We hope this works well for most applications.


parameter -= absolute_update_scale * clip(grad / grad_scale)

where:

  absolute_update_scale := relative_update_scale * parameter_scale
  relative_update_scale := min((step_num + 1)**-0.5, 1e-2)
  parameter_scale := max(rms(var), epsilon2)
  clip(x) := x / max(1.0, rms(x))
  grad_scale := tf.sqrt(v)   (v is the second-moment estimator)
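
The default update above can be traced by hand on a small 1-D parameter. The following is a minimal pure-Python sketch, not the library's TensorFlow code: the names `rms` and `adafactor_step` are hypothetical, and the second-moment estimator `v` is assumed to be already computed and passed in.

```python
import math


def rms(xs):
    """Root-mean-square of a flat list of floats."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def adafactor_step(var, grad, v, step_num, epsilon2=1e-3):
    """Apply one default-settings update to a 1-D parameter (hypothetical helper).

    var, grad and v are lists of equal length; v is the second-moment estimator.
    """
    relative_update_scale = min((step_num + 1) ** -0.5, 1e-2)
    parameter_scale = max(rms(var), epsilon2)
    absolute_update_scale = relative_update_scale * parameter_scale

    # grad / grad_scale, with grad_scale = sqrt(v)
    scaled = [g / math.sqrt(vi) for g, vi in zip(grad, v)]

    # clip(x) = x / max(1.0, rms(x)): only shrinks updates whose RMS exceeds 1
    denom = max(1.0, rms(scaled))
    clipped = [s / denom for s in scaled]

    return [p - absolute_update_scale * c for p, c in zip(var, clipped)]


new_var = adafactor_step([1.0, -1.0], [0.5, -0.5], [0.25, 0.25], step_num=0)
print(new_var)
```

Because the update is multiplied by `parameter_scale`, the step size is relative to the magnitude of the parameter itself, which is why no external learning rate is needed in this configuration.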

The second-moment estimator v is maintained in a manner similar to Adam. We initialize:

```
if var is 2-dimensional:
  v_r <- zeros([num_rows])
  v_c <- zeros([num_cols])
if var is 0-dimensional or 1-dimensional:
  v <- zeros(shape(var))
```


The update rule is as follows:

```
decay_rate = 1 - (step_num + 1) ^ -0.8
grad_squared = tf.square(grad) + epsilon1
if var is 2-dimensional:
  v_r <- decay_rate * v_r + (1 - decay_rate) * reduce_mean(grad_squared, 1)
  v_c <- decay_rate * v_c + (1 - decay_rate) * reduce_mean(grad_squared, 0)
  v = outer_prod(v_r, v_c) / reduce_mean(v_r)
if var is 0-dimensional or 1-dimensional:
  v <- decay_rate * v + (1 - decay_rate) * grad_squared
```


For variables with >=3 dimensions, we factorize the second-moment accumulator over the final 2 dimensions. See the code for details.
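
The factored update for a 2-dimensional variable can be illustrated with nested Python lists. This is a sketch under stated assumptions rather than the library's code: `decay_rate` and `update_factored` are hypothetical names, and the matrix is a plain list of rows instead of a tensor.

```python
def decay_rate(step_num):
    """Default decay schedule from the update rule: 1 - (step_num + 1) ^ -0.8."""
    return 1.0 - (step_num + 1) ** -0.8


def update_factored(v_r, v_c, grad, step_num, epsilon1=1e-30):
    """Update the row/column accumulators for a 2-D gradient (hypothetical helper).

    Returns the new v_r, v_c and the reconstructed full estimator v.
    """
    rows, cols = len(grad), len(grad[0])
    d = decay_rate(step_num)
    grad_squared = [[g * g + epsilon1 for g in row] for row in grad]

    # reduce_mean(grad_squared, 1): one value per row
    row_mean = [sum(row) / cols for row in grad_squared]
    # reduce_mean(grad_squared, 0): one value per column
    col_mean = [sum(grad_squared[i][j] for i in range(rows)) / rows
                for j in range(cols)]

    v_r = [d * a + (1 - d) * m for a, m in zip(v_r, row_mean)]
    v_c = [d * a + (1 - d) * m for a, m in zip(v_c, col_mean)]

    # v = outer_prod(v_r, v_c) / reduce_mean(v_r)
    mean_r = sum(v_r) / rows
    v = [[r * c / mean_r for c in v_c] for r in v_r]
    return v_r, v_c, v


grad = [[1.0, 2.0], [2.0, 4.0]]
v_r, v_c, v = update_factored([0.0, 0.0], [0.0, 0.0], grad, step_num=0)
print(v_r, v_c, v)
```

Only `v_r` and `v_c` need to persist between steps (A+B values for an AxB matrix); the full matrix `v` is rebuilt from them on each update.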

Several parts of this algorithm are configurable from the initializer.

  • multiply_by_parameter_scale – If True, then compute absolute_update_scale as described above. If False, let absolute_update_scale be the externally supplied learning_rate.
  • learning_rate – Represents relative_update_scale if multiply_by_parameter_scale==True, or absolute_update_scale if multiply_by_parameter_scale==False.
  • decay_rate – Decay rate of the second-moment estimator (varies by step_num). This should be set to a function such that: 1-1/(step_num + 1) <= decay_rate(step_num) < 1.0
  • beta1 – Enables momentum, as in Adam. Uses extra memory if nonzero.
  • clipping_threshold – Should be >= 1.0, or None for no update clipping.
  • factored – Whether to factor the second-moment estimator. True means less memory usage.

__init__(multiply_by_parameter_scale=True, learning_rate=None, decay_rate=None, beta1=0.0, clipping_threshold=1.0, factored=True, use_locking=False, name='Adafactor', epsilon1=1e-30, epsilon2=0.001)

Construct a new Adafactor optimizer.

See class comment.

Parameters:

  • multiply_by_parameter_scale – a boolean
  • learning_rate – an optional Scalar.
  • decay_rate – an optional Scalar.
  • beta1 – a float value between 0 and 1
  • clipping_threshold – an optional float >= 1
  • factored – a boolean - whether to use factored second-moment estimator for 2d variables
  • use_locking – If True use locks for update operations.
  • name – Optional name for the operations created when applying gradients. Defaults to “Adafactor”.
  • epsilon1 – Regularization constant for squared gradient.
  • epsilon2 – Regularization constant for parameter scale.

Raises: ValueError – if absolute_update_scale and relative_update_scale_fn are both present or both absent.


Second-moment decay rate like Adam, subsuming the correction factor.

Parameters: beta2 – a float between 0 and 1
Returns: a scalar

Second moment decay rate where memory-length grows as step_num^exponent.

Parameters: exponent – a float between 0 and 1
Returns: a scalar
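
The two decay schedules described above can be sketched in plain Python. This is an illustration, not the library's functions: `decay_rate_adam` and `decay_rate_pow` are hypothetical names, the step number is passed explicitly instead of being read from a global step, and the Adam variant assumes that "the correction factor" refers to Adam's usual bias correction folded into the decay.

```python
def decay_rate_adam(beta2, step_num):
    """Adam-like second-moment decay with bias correction folded in (assumed form)."""
    t = step_num + 1.0
    return beta2 * (1.0 - beta2 ** (t - 1.0)) / (1.0 - beta2 ** t)


def decay_rate_pow(exponent, step_num):
    """Decay whose memory length 1 / (1 - decay) grows as (step_num + 1) ** exponent."""
    return 1.0 - (step_num + 1.0) ** -exponent


for step in (0, 9, 99, 999):
    print(step, decay_rate_adam(0.999, step), decay_rate_pow(0.8, step))
```

Both schedules start at 0 on the first step, so the estimator is initially just the current squared gradient, and both approach but never reach 1 as training proceeds.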
opennmt.optimizers.adafactor.get_optimizer_from_params(optimizer_class, params, learning_rate=None)

Get the Adafactor optimizer from user parameters.

Parameters:

  • optimizer_class – The AdafactorOptimizer class.
  • params – A dictionary containing the user parameters for this optimizer.
  • learning_rate – Optional learning rate.

Returns: An Adafactor optimizer instance if learning_rate is set; otherwise a callable that takes the learning rate as argument and returns an instance.
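
The "instance or callable" return contract can be illustrated with a generic pattern. The sketch below does not reproduce the library's implementation or the parameter keys it accepts: `FakeOptimizer` and `build_optimizer` are hypothetical stand-ins, and `functools.partial` is one assumed way to defer construction until a learning rate is available.

```python
import functools


class FakeOptimizer:
    """Hypothetical stand-in for an optimizer class; not the OpenNMT class."""

    def __init__(self, learning_rate=None, **kwargs):
        self.learning_rate = learning_rate
        self.kwargs = kwargs


def build_optimizer(optimizer_class, params, learning_rate=None):
    """Return an instance if learning_rate is given, else a factory (assumed pattern)."""
    if learning_rate is not None:
        return optimizer_class(learning_rate=learning_rate, **params)
    # Defer construction: the caller supplies the learning rate later.
    return functools.partial(optimizer_class, **params)


optimizer = build_optimizer(FakeOptimizer, {"beta1": 0.0}, learning_rate=0.1)
factory = build_optimizer(FakeOptimizer, {"beta1": 0.0})
late_optimizer = factory(0.5)
print(optimizer.learning_rate, late_optimizer.learning_rate)
```

Returning a callable lets the training loop create the learning-rate schedule first and build the optimizer only once that value exists.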