tools/learn_bpe.lua

learn_bpe.lua options:

  • -h [<boolean>] (default: false)
    This help.
  • -md [<boolean>] (default: false)
    Dump help in Markdown format.
  • -config <string> (default: '')
    Load options from this file.
  • -save_config <string> (default: '')
    Save options to this file.

BPE options

  • -size <string> (default: 30000)
    The number of merge operations to learn.
  • -t [<boolean>] (default: false)
    Tokenize the input with the tokenizer before learning BPE. The tokenizer accepts the same options as tokenize.lua, but only -mode is taken into account for BPE training.
  • -mode <string> (accepted: conservative, aggressive; default: conservative)
    Define how aggressive the tokenization should be. aggressive only keeps sequences of letters/numbers; conservative allows a mix of alphanumeric characters, as in "2,000", "E65", "soft-landing", etc.
  • -segment_case [<boolean>] (default: false)
    Segment case feature: splits AbC into Ab C so that the case can be restored.
  • -lc [<boolean>] (default: false)
    Lowercase the output from the tokenizer before learning BPE.
  • -bpe_mode <string> (accepted: suffix, prefix, both, none; default: suffix)
    Define the BPE mode. prefix: prepend <w> to the beginning of each word to learn prefix-oriented pair statistics; suffix: append </w> to the end of each word to learn suffix-oriented pair statistics, as in the original Python script; both: use both the prefix and the suffix marker; none: use neither marker.
  • -save_bpe <string> (default: '')
    Path to save the output model.
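
The interaction between -size and -bpe_mode can be illustrated with a minimal Python sketch. It is not the code of learn_bpe.lua: the function names mark_word and learn_bpe are hypothetical, the word markers are treated as separate symbols for simplicity, and ties between equally frequent pairs are broken lexicographically, which may differ from the actual tool.

```python
from collections import Counter


def mark_word(word, bpe_mode="suffix"):
    """Split a word into symbols and add the marker(s) implied by bpe_mode.

    Assumption: markers are kept as standalone symbols; the real tool may
    attach them to the first or last character instead.
    """
    symbols = list(word)
    if bpe_mode in ("prefix", "both"):
        symbols.insert(0, "<w>")
    if bpe_mode in ("suffix", "both"):
        symbols.append("</w>")
    return tuple(symbols)


def learn_bpe(words, size, bpe_mode="suffix"):
    """Learn at most `size` merge operations from a list of words."""
    vocab = Counter(mark_word(w, bpe_mode) for w in words)
    merges = []
    for _ in range(size):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace each occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    for mode in ("suffix", "prefix", "both", "none"):
        print(mode, mark_word("low", mode))
    print(learn_bpe(["low", "low", "lower"], size=2, bpe_mode="none"))
```

In this sketch, size is an upper bound: learning stops early once no symbol pairs remain to merge.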

Logger options

  • -log_file <string> (default: '')
    Output logs to a file under this path instead of stdout.
  • -disable_logs [<boolean>] (default: false)
    If set, output nothing.
  • -log_level <string> (accepted: DEBUG, INFO, WARNING, ERROR; default: INFO)
    Output logs at this level and above.