OpenNMT provides native implementations of three scoring metrics: BLEU, TER, and DLRATIO.
Each metric can be used as a validation metric during training (see the
-validation_metric option) or standalone with:
$ th tools/score.lua REFERENCE [-sample SN] [-scorer bleu|ter|dlratio] PARAMS < OUT
The metric is selected with the
-scorer option. The output is a line with three tab-separated fields, for example:
34.73 +/-0.83 BLEU = 34.77, 79.8/49.1/29.6/17.6 (BP=0.919, ratio=0.922, hyp_len=26742, ref_len=28995)
54.77 TER = 54.77 (Ins 1.8, Del 4.4, Sub 9.6, Shft 1.9, WdSh 2.6)
The fields are:
- numeric value of the score
- 95% confidence error margin (1.96 * standard deviation), computed over k half-size samples
- formatted scorer output
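As an illustration, the three fields can be read with a few lines of Python. This is a minimal sketch, not part of OpenNMT: the function name `parse_score_line` is hypothetical, and it assumes the margin field carries the `+/-` prefix shown in the sample and may be empty.

```python
def parse_score_line(line):
    """Split one line of scorer output into (score, margin, details).

    Assumption: fields are tab-separated and the margin field looks like
    "+/-0.83" or is empty, as in the sample output.
    """
    fields = line.rstrip("\n").split("\t")
    score = float(fields[0])
    margin_text = fields[1].strip() if len(fields) > 1 else ""
    # Strip the "+/-" prefix before converting; None means "no margin reported".
    margin = float(margin_text.replace("+/-", "")) if margin_text else None
    details = fields[2] if len(fields) > 2 else ""
    return score, margin, details


example = ("34.73\t+/-0.83\tBLEU = 34.77, 79.8/49.1/29.6/17.6 "
           "(BP=0.919, ratio=0.922, hyp_len=26742, ref_len=28995)")
score, margin, details = parse_score_line(example)
print(score, margin, details)
```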
The error margin is a simple way to tell whether a difference between two scores falls within the normal variation of the metric calculation or is significant.
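The idea behind the margin can be sketched in Python. This is a hedged illustration, not the code used by tools/score.lua. It assumes that each sample is a random half of the evaluated items drawn without replacement and that the sample standard deviation is used. The names `sampled_margin` and `is_significant` are hypothetical, and the non-overlap test is only one conservative way of reading the margins.

```python
import random
import statistics


def sampled_margin(items, score_fn, k, seed=0):
    """Score k random half-size samples; return (mean score, 95% margin).

    Assumptions: sampling without replacement, sample standard deviation.
    """
    if k < 2:
        raise ValueError("at least two samples are needed for a deviation")
    rng = random.Random(seed)
    half = len(items) // 2
    scores = [score_fn(rng.sample(items, half)) for _ in range(k)]
    return statistics.mean(scores), 1.96 * statistics.stdev(scores)


def is_significant(score_a, margin_a, score_b, margin_b):
    """Conservative heuristic: the two intervals must not overlap."""
    return abs(score_a - score_b) > margin_a + margin_b


# Toy metric: percentage of hypotheses that match their reference exactly.
pairs = [("a b", "a b"), ("c d", "c e"), ("f", "f"), ("g h", "g h")] * 25
exact_match = lambda sample: 100.0 * sum(h == r for h, r in sample) / len(sample)
print(sampled_margin(pairs, exact_match, k=20))
```

Two scores whose difference is smaller than the margins should be treated as equivalent until a larger test set or more samples says otherwise.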
BLEU is a metric widely used for evaluation of machine translation output.
The syntax follows that of multi-bleu.perl:
$ th tools/score.lua REFERENCE [-sample SN] [-scorer bleu] [-order N] < OUT
[06/17/17 09:39:04 INFO] 4 references, 1002 sentences
BLEU = 34.77 +/- 0.43, 79.8/49.1/29.6/17.6 (BP=0.919, ratio=0.922, hyp_len=26742, ref_len=28995)
REFERENCE is either a single file or a prefix for multiple reference files.
-order is the BLEU n-gram order (default: 4).
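To show how the printed components relate to the final score, the sketch below recombines them with the standard BLEU formula (geometric mean of the n-gram precisions times the brevity penalty). It assumes that score.lua follows this formula, as multi-bleu.perl does. The function name `bleu_from_components` is hypothetical.

```python
import math


def bleu_from_components(precisions, hyp_len, ref_len):
    """Recombine n-gram precisions (in percent) and lengths into BLEU.

    len(precisions) plays the role of the n-gram order (4 by default).
    """
    if min(precisions) <= 0:
        return 0.0
    # Geometric mean of the precisions, computed in log space.
    log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_mean)


# Components taken from the sample output above.
bleu = bleu_from_components([79.8, 49.1, 29.6, 17.6], 26742, 28995)
print(round(bleu, 2))
```

The result lands within a few hundredths of the reported 34.77. The small gap is most likely due to the precisions being printed with only one decimal.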
TER is an error metric for machine translation that measures the number of edits required to change a system output into one of the references. It is generally preferred to BLEU for estimating sentence post-editing effort.
The Damerau-Levenshtein edit distance is an edit distance between two sentences. It is a simplified version of
TER; in particular,
TER also accounts for the number of sequence shifts.
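For reference, a word-level Damerau-Levenshtein distance can be sketched as follows. This is the common "optimal string alignment" variant, which counts insertions, deletions, substitutions, and swaps of adjacent tokens. It is not necessarily the exact variant used by score.lua, and the normalization that turns the distance into a ratio is not shown because the text above does not specify it. The function name `dl_distance` is hypothetical.

```python
def dl_distance(hyp, ref):
    """Optimal string alignment distance between two token lists."""
    rows, cols = len(hyp) + 1, len(ref) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all tokens of hyp[:i]
    for j in range(cols):
        d[0][j] = j  # insert all tokens of ref[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            # Transposition of two adjacent tokens counts as one edit.
            if (i > 1 and j > 1 and hyp[i - 1] == ref[j - 2]
                    and hyp[i - 2] == ref[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(dl_distance("the cat sat".split(), "cat the sat".split()))
```

Unlike a TER shift, which can move a whole block of words over any distance, the transposition here only swaps two neighbouring tokens.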