train.lua options:

  • -h
    This help.
  • -md
    Dump help in Markdown format.
  • -config <string>
    Load options from this file.
  • -save_config <string>
    Save options to this file.
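Options can be collected in a file and loaded with -config instead of being passed on the command line. A minimal sketch (the exact `option = value` syntax is an assumption here, and the paths are placeholders):

```
data = data/demo-train.t7
save_model = demo-model
layers = 2
rnn_size = 500
```

Running with -save_config writes the currently resolved options in this form, which can then be reloaded with -config.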

Data options

  • -data <string>
    Path to the data package *-train.t7 generated by the preprocessing step.

Sampled dataset options

  • -sample <number> (default: 0)
    Number of instances to sample from train data in each epoch.
  • -sample_w_ppl
    If set, use perplexity as a probability distribution when sampling.
  • -sample_w_ppl_init <number> (default: 15)
    Start perplexity-based sampling when average train perplexity per batch falls below this value.
  • -sample_w_ppl_max <number> (default: -1.5)
    When greater than 0, instances with a perplexity above this value are considered noise and ignored; when less than 0, the threshold is set to mode + -sample_w_ppl_max * stdev.

Model options

  • -model_type <string> (accepted: lm, seq2seq, seqtagger; default: seq2seq)
    Type of model to train. This option affects all other option choices.
  • -param_init <number> (default: 0.1)
    Parameters are initialized from a uniform distribution with support (-param_init, param_init).
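The effect of -param_init can be sketched as follows (a minimal illustration, not the actual Torch initialization code):

```python
import random

def init_params(n, param_init=0.1, seed=3435):
    """Draw n parameters from a uniform distribution on (-param_init, param_init)."""
    rng = random.Random(seed)
    return [rng.uniform(-param_init, param_init) for _ in range(n)]

params = init_params(1000)
# every value lies within the (-0.1, 0.1) support
```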

Sequence to Sequence with Attention options

  • -enc_layers <number> (default: 0)
    If > 0, number of layers of the encoder. This overrides the global -layers option.
  • -dec_layers <number> (default: 0)
    If > 0, number of layers of the decoder. This overrides the global -layers option.
  • -word_vec_size <number> (default: 0)
    Shared word embedding size. If set, this overrides -src_word_vec_size and -tgt_word_vec_size.
  • -src_word_vec_size <string> (default: 500)
    Comma-separated list of source embedding sizes: word[,feat1[,feat2[,...] ] ].
  • -tgt_word_vec_size <string> (default: 500)
    Comma-separated list of target embedding sizes: word[,feat1[,feat2[,...] ] ].
  • -pre_word_vecs_enc <string>
    Path to pretrained word embeddings on the encoder side serialized as a Torch tensor.
  • -pre_word_vecs_dec <string>
    Path to pretrained word embeddings on the decoder side serialized as a Torch tensor.
  • -fix_word_vecs_enc <number> (accepted: 0, 1; default: 0)
    Fix word embeddings on the encoder side.
  • -fix_word_vecs_dec <number> (accepted: 0, 1; default: 0)
    Fix word embeddings on the decoder side.
  • -feat_merge <string> (accepted: concat, sum; default: concat)
    Merge action for the features embeddings.
  • -feat_vec_exponent <number> (default: 0.7)
    When features embedding sizes are not set and using -feat_merge concat, their dimension will be set to N^feat_vec_exponent where N is the number of values the feature takes.
  • -feat_vec_size <number> (default: 20)
    When features embedding sizes are not set and using -feat_merge sum, this is the common embedding size of the features.
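The two sizing rules above can be sketched like this (rounding to the nearest integer is an assumption; the actual code may floor or ceil):

```python
def feat_embedding_size(n_values, feat_merge="concat",
                        feat_vec_exponent=0.7, feat_vec_size=20):
    """Embedding size for one feature when no explicit size is given."""
    if feat_merge == "concat":
        # dimension grows with the number of values the feature takes
        return round(n_values ** feat_vec_exponent)
    # with -feat_merge sum, all features share one fixed size
    return feat_vec_size

feat_embedding_size(100)         # concat: 100^0.7, about 25
feat_embedding_size(100, "sum")  # sum: 20
```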
  • -layers <number> (default: 2)
    Number of recurrent layers of the encoder and decoder. See also -enc_layers, -dec_layers and -bridge to assign different layers to the encoder and decoder.
  • -rnn_size <number> (default: 500)
    Hidden size of the recurrent unit.
  • -rnn_type <string> (accepted: LSTM, GRU; default: LSTM)
    Type of recurrent cell.
  • -dropout <number> (default: 0.3)
    Dropout probability applied between recurrent layers.
  • -dropout_input
    Also apply dropout to the input of the recurrent module.
  • -residual
    Add residual connections between recurrent layers.
  • -bridge <string> (accepted: copy, dense, dense_nonlinear, none; default: copy)
    Define how to pass encoder states to the decoder. With copy, the encoder and decoder must have the same number of layers.
  • -input_feed <number> (accepted: 0, 1; default: 1)
    Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.
  • -brnn
    Use a bidirectional encoder.
  • -dbrnn
    Use a deep bidirectional encoder.
  • -pdbrnn
    Use a pyramidal deep bidirectional encoder.
  • -attention <string> (accepted: none, global; default: global)
    Attention model.
  • -brnn_merge <string> (accepted: concat, sum; default: sum)
    Merge action for the bidirectional states.
  • -pdbrnn_reduction <number> (default: 2)
    Time-reduction factor at each layer.
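With a pyramidal encoder (-pdbrnn), -pdbrnn_reduction shrinks the time dimension at each layer. A rough sketch of the resulting per-layer sequence lengths (integer division is an assumption):

```python
def pyramid_lengths(seq_len, layers, reduction=2):
    """Sequence length seen by each encoder layer when every layer
    after the first reduces the time dimension by `reduction`."""
    lengths = [seq_len]
    for _ in range(layers - 1):
        lengths.append(lengths[-1] // reduction)
    return lengths

pyramid_lengths(100, 3)  # [100, 50, 25]
```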

Global Attention Model options

  • -global_attention <string> (accepted: general, dot, concat; default: general)
    Global attention model type.

Trainer options

  • -save_every <number> (default: 5000)
    Save intermediate models every this many iterations within an epoch. If set to 0, intermediate models are not saved.
  • -report_every <number> (default: 50)
    Report progress every this many iterations within an epoch.
  • -async_parallel
    When training on multiple GPUs, update parameters asynchronously.
  • -async_parallel_minbatch <number> (default: 1000)
    In asynchronous training, minimum number of batches to process sequentially before switching to parallel mode.
  • -start_iteration <number> (default: 1)
    If loading from a checkpoint, the iteration from which to start.
  • -start_epoch <number> (default: 1)
    If loading from a checkpoint, the epoch from which to start.
  • -end_epoch <number> (default: 13)
    The final epoch of the training.
  • -curriculum <number> (default: 0)
    For this many epochs, order the minibatches by source length (from shortest to longest). Sometimes setting this to 1 will increase convergence speed.

Optimization options

  • -max_batch_size <number> (default: 64)
    Maximum batch size.
  • -uneven_batches
    If set, batches are filled up to max_batch_size even if source lengths are different. Slower but needed for some tasks.
  • -optim <string> (accepted: sgd, adagrad, adadelta, adam; default: sgd)
    Optimization method.
  • -learning_rate <number> (default: 1)
    Starting learning rate. If adagrad or adam is used, then this is the global learning rate. Recommended settings are: sgd = 1, adagrad = 0.1, adam = 0.0002.
  • -min_learning_rate <number> (default: 0)
    Do not continue the training past this learning rate value.
  • -max_grad_norm <number> (default: 5)
    Clip the gradients norm to this value.
  • -learning_rate_decay <number> (default: 0.7)
    Learning rate decay factor: learning_rate = learning_rate * learning_rate_decay.
  • -start_decay_at <number> (default: 9)
    In "default" decay mode, start decay after this epoch.
  • -start_decay_ppl_delta <number> (default: 0)
    Start decay when validation perplexity improvement is lower than this value.
  • -decay <string> (accepted: default, perplexity_only; default: default)
    When to apply learning rate decay. With default, decay after each epoch past -start_decay_at, or as soon as the validation perplexity improves by less than -start_decay_ppl_delta; with perplexity_only, decay only when the validation perplexity improves by less than -start_decay_ppl_delta.
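In the default decay mode, ignoring the perplexity condition, the schedule implied by the defaults above can be simulated as follows (whether decay starts at or strictly after epoch -start_decay_at is an assumption in this sketch):

```python
def lr_schedule(start_lr=1.0, decay=0.7, start_decay_at=9, end_epoch=13):
    """Per-epoch learning rates under the default epoch-based decay."""
    schedule = {}
    lr = start_lr
    for epoch in range(1, end_epoch + 1):
        schedule[epoch] = lr
        if epoch >= start_decay_at:
            lr *= decay  # learning_rate = learning_rate * learning_rate_decay
    return schedule

lr_schedule()  # epoch 9 still at 1.0, epoch 10 at 0.7, epoch 13 at 0.7**4
```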

Saver options

  • -save_model <string>
    Model filename; the model will be saved as <save_model>_epochN_PPL.t7, where N is the epoch number and PPL is the validation perplexity.
  • -train_from <string>
    Path to a checkpoint.
  • -continue
    If set, continue the training where it left off.

Crayon options

  • -exp_host <string>
    Crayon server IP.
  • -exp_port <string> (default: 8889)
    Crayon server port.
  • -exp <string>
    Crayon experiment name.

Cuda options

  • -gpuid <string> (default: 0)
    List of comma-separated GPU identifiers (1-indexed). CPU is used when set to 0.
  • -fallback_to_cpu
    If the GPU cannot be used, fall back to the CPU.
  • -fp16
    Use half-precision float on GPU.
  • -no_nccl
    Disable the use of NCCL in parallel mode.

Logger options

  • -log_file <string>
    Output logs to a file under this path instead of stdout.
  • -disable_logs
    If set, output nothing.
  • -log_level <string> (accepted: DEBUG, INFO, WARNING, ERROR; default: INFO)
    Output logs at this level and above.

Other options

  • -disable_mem_optimization
    Disable sharing of internal buffers between clones for visualization or development.
  • -profiler
    Generate profiling logs.
  • -seed <number> (default: 3435)
    Random seed.