OpenNMT provides native implementations of the following scoring metrics: BLEU, TER, and DLRATIO.
All metrics can be used as a validation metric during training (see option -validation_metric) or standalone using tools/score.lua:
$ th tools/score.lua REFERENCE [-sample SN] [-scorer bleu|ter|dlratio] PARAMS < OUT
The metric is selected with the -scorer option, and the output is a single line with 3 tab-separated fields, like:
34.73 +/-0.83 BLEU = 34.77, 79.8/49.1/29.6/17.6 (BP=0.919, ratio=0.922, hyp_len=26742, ref_len=28995) 54.77 TER = 54.77 (Ins 1.8, Del 4.4, Sub 9.6, Shft 1.9, WdSh 2.6)
The fields are:
- numeric value of the score
- 95% confidence error margin (1.96*standard deviation) for k samples of half-size
- formatted scorer output
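As an illustration of the three fields listed above, here is a minimal Python sketch that splits one output line into its parts. It is not part of OpenNMT: the function name parse_score_line is hypothetical, and the sketch assumes the fields are separated by real tab characters and that the margin field starts with "+/-".

```python
def parse_score_line(line):
    """Split a scorer output line into (score, margin, details).

    Assumes three tab-separated fields, as described above:
    numeric score, "+/-" error margin, formatted scorer output.
    """
    score_field, margin_field, details = line.rstrip("\n").split("\t", 2)
    score = float(score_field)
    # Drop the "+/-" prefix before converting the margin to a number.
    margin = float(margin_field.replace("+/-", "").strip())
    return score, margin, details


# Example line built by hand with explicit tabs (an assumption about the format).
line = "34.73\t+/-0.83\tBLEU = 34.77, 79.8/49.1/29.6/17.6 (BP=0.919, ratio=0.922, hyp_len=26742, ref_len=28995)"
score, margin, details = parse_score_line(line)
print(score, margin, details)
```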
Tip
The error margin is a simple way to tell whether a score variation is significant or simply falls within the normal variation of the metric calculation.
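The following Python sketch shows one way to read the error margin. It is illustrative only and is not the code used by tools/score.lua. The function names are hypothetical, the use of the sample standard deviation is an assumption (the tool may use a different estimator), and treating two scores as different only when their intervals do not overlap is a deliberately conservative rule of thumb.

```python
import statistics


def error_margin(sample_scores):
    """95% margin as 1.96 * standard deviation of the per-sample scores.

    Assumption: sample (n-1) standard deviation; needs at least 2 scores.
    """
    return 1.96 * statistics.stdev(sample_scores)


def is_significant(score_a, margin_a, score_b, margin_b):
    """Conservative check: the two confidence intervals do not overlap."""
    return abs(score_a - score_b) > margin_a + margin_b


# Hypothetical scores measured on k = 3 samples of the test set.
print(error_margin([34.0, 35.0, 36.0]))
# Two hypothetical systems whose intervals overlap, then two that do not.
print(is_significant(34.73, 0.83, 35.00, 0.80))
print(is_significant(34.73, 0.83, 37.00, 0.80))
```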
By default, the scorer uses space tokenization, which is suited for evaluating tokenized translations. To evaluate non-tokenized translations,
you can use the -tokenizer max option, which applies the following tokenization options on the fly and is suited for most language pairs:
-mode=aggressive -segment_alphabet=Han,Kanbun,Katakana,Hiragana -segment_alphabet_change
Alternatively, you can tokenize both the translation output and the translation reference with your preferred tokenization options and score this corpus using the default space tokenization.
BLEU
BLEU is a metric widely used for evaluation of machine translation output.
Syntax follows multi-bleu.perl syntax:
$ th tools/score.lua REFERENCE [-sample SN] [-scorer bleu] [-order N] < OUT
generating:
[06/17/17 09:39:04 INFO] 4 references, 1002 sentences BLEU = 34.77 +/- 0.43, 79.8/49.1/29.6/17.6 (BP=0.919, ratio=0.922, hyp_len=26742, ref_len=28995)
where:
- REFERENCE is either a single file, or a prefix for multiple references REFERENCE0, REFERENCE1, ...
- -order is the BLEU n-gram order (default 4)
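To make the components of the sample output easier to read, the sketch below recombines them using the standard BLEU definition: the geometric mean of the n-gram precisions multiplied by the brevity penalty. It is a Python illustration under that assumption, not the OpenNMT implementation, and the function names are hypothetical.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1 if the hypothesis is long enough,
    otherwise exp(1 - ref_len / hyp_len)."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu_from_components(precisions, hyp_len, ref_len):
    """Combine n-gram precisions (fractions in (0, 1]) into a BLEU score
    on a 0-100 scale. Assumes every precision is strictly positive."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


# Values copied from the sample output above (precisions are rounded there).
precisions = [0.798, 0.491, 0.296, 0.176]
print(round(brevity_penalty(26742, 28995), 3))
print(round(bleu_from_components(precisions, 26742, 28995), 2))
```

Because the precisions printed in the log are rounded to one decimal, the recomputed score is close to, but not exactly, the reported 34.77.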
TER
TER is an error metric for machine translation that measures the number of edits required to change a system output into one of the references. It is generally preferred to BLEU for estimating sentence post-editing effort.
DLRATIO
The Damerau-Levenshtein edit distance is an edit distance between 2 sentences. It is a simplified version of TER (in particular, TER also integrates the number of sequence shifts).
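As an illustration of the distance itself, here is a minimal Python sketch of the restricted Damerau-Levenshtein distance (also known as optimal string alignment) computed over word tokens. It is not the OpenNMT scorer: the function name is hypothetical, the choice of the restricted variant is an assumption, and the way DLRATIO normalizes the distance into a ratio is not described here, so it is left out.

```python
def osa_distance(hyp_tokens, ref_tokens):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of
    two adjacent tokens, each with a cost of 1.
    """
    rows, cols = len(hyp_tokens) + 1, len(ref_tokens) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all tokens of the hypothesis prefix
    for j in range(cols):
        d[0][j] = j  # insert all tokens of the reference prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent tokens.
            if (i > 1 and j > 1
                    and hyp_tokens[i - 1] == ref_tokens[j - 2]
                    and hyp_tokens[i - 2] == ref_tokens[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


hyp = "the cat sat on the mat".split()
ref = "the sat cat on mat".split()
print(osa_distance(hyp, ref))  # one transposition plus one deletion: 2
```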