Basic Info:
id: P02-1040
title: Bleu: A Method For Automatic Evaluation Of Machine Translation
venue: ACL
year: 2002
pdf: link
Abstract
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
| Stat | Rank | Value |
|---|---|---|
| Incoming Citations | 5(5) | 272(270) |
| Outgoing Citations | 10583(9875) | 0(0) |
| PageRank | 57 | 1503 |
| PageRank per Year | 9 | 250.5 |
| Citing sentences |
|---|
| P07-1001 1 125:185 We measure translation performance by the BLEU score (Papineni et al. , 2002) and Translation Error Rate (TER) (Snover et al. , 2006) with one reference for each hypothesis. |
| P06-1090 2 89:135 We report results using the well-known automatic evaluation metrics Bleu (Papineni et al. , 2002). |
| P07-1039 3 95:170 The quality of the translation output is evaluated using BLEU (Papineni et al. , 2002). |
| C04-1168 4 73:197 The following four metrics were used specifically in this study: BLEU (Papineni et al. , 2002): A weighted geometric mean of the n-gram matches between test and reference sentences multiplied by a brevity penalty that penalizes short translation sentences. |
| W05-0828 5 44:60 3.2 Results and Discussion The BLEU scores (Papineni et al. , 2002) for 10 direct translations and 4 sets of heuristic selections 4Admittedly, in typical instances of such chains, English would appear earlier. |
| W05-1510 6 141:201 The accuracy of the generator outputs was evaluated by the BLEU score (Papineni et al. , 2001), which is commonly used for the evaluation of machine translation and recently used for the evaluation of generation (Langkilde-Geary, 2002; Velldal and Oepen, 2005). |
| C04-1015 7 100:201 BLEU: Automatic evaluation by BLEU score (Papineni et al. , 2002). |
| W08-0328 8 43:74 Table 1 shows the evaluation of all the systems in terms of BLEU score (Papineni et al., 2002) with the best score highlighted. |
| P07-1111 9 31:176 Since the introduction of BLEU (Papineni et al. , 2002) the basic n-gram precision idea has been augmented in a number of ways. |
| W07-0716 10 12:171 Och showed that system performance is best when parameters are optimized using the same objective function that will be used for evaluation; BLEU (Papineni et al. , 2002) remains common for both purposes and is often retained for parameter optimization even when alternative evaluation measures are used, e.g., (Banerjee and Lavie, 2005; Snover et al. , 2006). |
| W08-0320 11 73:89 We used these weights in a beam search decoder to produce translations for the test sentences, which we compared to the WMT07 gold standard using Bleu (Papineni et al., 2002). |
| H05-1117 12 51:168 3 Previous Work The idea of employing n-gram co-occurrence statistics to score the output of a computer system against one or more desired reference outputs was first successfully implemented in the BLEU metric for machine translation (Papineni et al. , 2002). |
| P07-1091 13 135:196 (Case-sensitive) BLEU-4 (Papineni et al. , 2002) is used as the evaluation metric. |
| W07-0704 14 71:182 We employ the phrase-based SMT framework (Koehn et al. , 2003), and use the Moses toolkit (Koehn et al. , 2007), and the SRILM language modelling toolkit (Stolcke, 2002), and evaluate our decoded translations using the BLEU measure (Papineni et al. , 2002), using a single reference translation. |
| W04-2203 15 50:184 3.1 Golden-standard-based criteria In the domain of machine translation systems, an increasingly accepted way to measure the quality of a system is to compare the outputs it produces with a set of reference translations, considered as an approximation of a golden standard (Papineni et al. , 2002; Hovy et al. , 2002). |
| C04-1064 16 77:172 As an example of its application, N-gram co-occurrence is used for evaluating machine translations (Papineni et al. , 2002). |
| W05-0820 17 69:91 Translation performance was measured using the BLEU score (Papineni et al. , 2002), which measures n-gram overlap with a reference translation. |
| N06-1003 18 58:146 To set the weights, m, we performed minimum error rate training (Och, 2003) on the development set using Bleu (Papineni et al. , 2002) as the objective function. |
| W06-3102 19 13:125 Although the BLEU (Papineni et al. , 2002) score from Finnish to English is 21.8, the score in the reverse direction is reported as 13.0 which is one of the lowest scores in 11 European languages scores (Koehn, 2005). |
| P07-1089 20 135:179 Our evaluation metric is BLEU-4 (Papineni et al. , 2002), as calculated by the script mteval-v11b.pl with its default setting except that we used case-sensitive matching of n-grams. |
| N07-1063 21 17:163 We present results in the form of search error analysis and translation quality as measured by the BLEU score (Papineni et al. , 2002) on the IWSLT 06 text translation task (Eck and Hori, 2005)1, comparing Cube Pruning with our two-pass approach. |
| P03-1040 22 75:201 Performance is also measured by the BLEU score (Papineni et al. , 2002), which measures similarity to the reference translation taken from the English side of the parallel corpus. |
| W06-3110 23 69:125 To measure the translation quality, we use the BLEU score (Papineni et al. , 2002) and the NIST score (Doddington, 2002). |
| P05-1018 24 131:224 Existing automatic evaluation measures such as BLEU (Papineni et al. , 2002) and ROUGE (Lin 2The collections are available from http://www.csail. |
| P06-2070 25 36:155 BLEU and NIST have been shown to correlate closely with human judgments in ranking MT systems with different qualities (Papineni et al. , 2002; Doddington, 2002). |
| P06-2070 26 33:155 2 Recap of BLEU, ROUGE-W and METEOR The most commonly used automatic evaluation metrics, BLEU (Papineni et al. , 2002) and NIST (Doddington, 2002), are based on the assumption that The closer a machine translation is to a professional human translation, the better it is (Papineni et al. , 2002). [Figure 1: Alignment Example for ROUGE-W. mt1: Life is like one nice chocolate in box; ref: Life is just like a box of tasty chocolate; mt2: Life is of one nice chocolate in box] |
| W06-1608 27 112:168 3.2 Translation quality Table 2 presents the impact of parse quality on a treelet translation system, measured using BLEU (Papineni et al. , 2002). |
| N03-2013 28 21:118 Expansion of the equivalent sentence set can be applied to automatic evaluation of machine translation quality (Papineni et al. , 2002; Akiba et al. , 2001), for example. |
| W08-0409 29 112:167 The translation output is measured using BLEU (Papineni et al., 2002). |
| P07-1038 30 56:203 The well-known BLEU (Papineni et al. , 2002) is based on the number of common n-grams between the translation hypothesis and human reference translations of the same sentence. |
| P07-1038 31 7:203 Reference-based metrics such as BLEU (Papineni et al. , 2002) have rephrased this subjective task as a somewhat more objective question: how closely does the translation resemble sentences that are known to be good translations for the same source? |
| W05-0831 32 141:215 5.2 Evaluation Criteria For the automatic evaluation, we used the criteria from the IWSLT evaluation campaign (Akiba et al. , 2004), namely word error rate (WER), position-independent word error rate (PER), and the BLEU and NIST scores (Papineni et al. , 2002; Doddington, 2002). |
| H05-2007 33 14:47 We can incorporate each model into the system in turn, and rank the results on a test corpus using BLEU (Papineni et al. , 2002). |
| H05-2007 34 4:47 1 Introduction Over the last few years, several automatic metrics for machine translation (MT) evaluation have been introduced, largely to reduce the human cost of iterative system evaluation during the development cycle (Papineni et al. , 2002; Melamed et al. , 2003). |
| P05-1069 35 165:243 Experimental results are reported in Table 2: here cased BLEU results are reported on MT03 Arabic-English test set (Papineni et al. , 2002). |
| P05-3026 36 89:104 METEOR was chosen since, unlike the more commonly used BLEU metric (Papineni et al. , 2002), it provides reasonably reliable scores for individual sentences. |
| W05-1203 37 7:105 Text similarity has been also used for relevance feedback and text classification (Rocchio, 1971), word sense disambiguation (Lesk, 1986), and more recently for extractive summarization (Salton et al. , 1997b), and methods for automatic evaluation of machine translation (Papineni et al. , 2002) or text summarization (Lin and Hovy, 2003). |
| W07-0703 38 21:186 BLEU (Papineni et al, 2002) was devised to provide automatic evaluation of MT output. |
| W05-0906 39 80:155 This idea of employing n-gram co-occurrence statistics to score the output of a computer system against one or more desired reference outputs has its roots in the BLEU metric for machine translation (Papineni et al. , 2002) and the ROUGE (Lin and Hovy, 2003) metric for summarization. |
| W06-3112 40 5:135 1 Introduction Since their appearance, BLEU (Papineni et al. , 2002) and NIST (Doddington, 2002) have been the standard tools used for evaluating the quality of machine translation. |
| W06-3112 41 27:135 Even the creators of BLEU point out that it may not correlate particularly well with human judgment at the sentence level (Papineni et al. , 2002), a problem also noted by (Och et al. , 2003) and (Russo-Lassner et al. , 2005). |
| N07-1005 42 116:194 In our research, 23 scores, namely BLEU (Papineni et al. , 2002) with maximum n-gram lengths of 1, 2, 3, and 4, NIST (NIST, 2002) with maximum n-gram lengths of 1, 2, 3, 4, and 5, GTM (Turian et al. , 2003) with exponents of 1.0, 2.0, and 3.0, METEOR (exact) (Banerjee and Lavie, 2005), WER (Niessen et al. , 2000), PER (Leusch et al. , 2003), and ROUGE (Lin, 2004) with n-gram lengths of 1, 2, 3, and 4 and 4 variants (LCS, S,SU, W-1.2), were used to calculate each similarity S i . Therefore, the value of m in Eq. |
| N07-1005 43 115:194 Many methods for calculating the similarity have been proposed (Niessen et al. , 2000; Akiba et al. , 2001; Papineni et al. , 2002; NIST, 2002; Leusch et al. , 2003; Turian et al. , 2003; Babych and Hartley, 2004; Lin and Och, 2004; Banerjee and Lavie, 2005; Gimenez et al. , 2005). |
| N07-1005 44 13:194 In recent years, many researchers have tried to automatically evaluate the quality of MT and improve the performance of automatic MT evaluations (Niessen et al. , 2000; Akiba et al. , 2001; Papineni et al. , 2002; NIST, 2002; Leusch et al. , 2003; Turian et al. , 2003; Babych and Hartley, 2004; Lin and Och, 2004; Banerjee and Lavie, 2005; Gimenez et al. , 2005) because improving the performance of automatic MT evaluation is expected to enable us to use and improve MT systems efficiently. |
| W03-1001 45 115:171 These blocks are used to compute the results in the fourth column: the BLEU score (Papineni et al. , 2002) with a153 reference translation using a153 -grams along with 95% confidence interval is reported 4. |
| D07-1055 46 67:198 A popular metric for evaluating machine translation quality is the Bleu score (Papineni et al. , 2002). |
| D07-1055 47 65:198 There exists a variety of different metrics, e.g., word error rate, position-independent word error rate, BLEU score (Papineni et al. , 2002), NIST score (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), GTM (Turian et al. , 2003). |
| P06-1002 48 9:186 Other metrics assess the impact of alignments externally, e.g., different alignments are tested by comparing the corresponding MT outputs using automated evaluation metrics (e.g. , BLEU (Papineni et al. , 2002) or METEOR (Banerjee and Lavie, 2005)). |
| P06-1002 49 93:186 MT output was evaluated using the standard evaluation metric BLEU (Papineni et al. , 2002).2 The parameters of the MT System were optimized for BLEU metric on NIST MTEval2002 test sets using minimum error rate training (Och, 2003), and the systems were tested on NIST MTEval2003 test sets for both languages. |
| C08-1014 50 152:197 Our evaluation metrics are BLEU (Papineni et al., 2002) and NIST, which are to perform case-insensitive matching of n-grams up to n = 4. |
| P04-1027 51 137:190 From this point of view, some of the measures used in the evaluation of Machine Translation systems, such as BLEU (Papineni et al. , 2002), have been imported into the summarization task. |
| W08-0309 52 169:288 The automatic metrics that were evaluated in this year's shared task were the following: Bleu (Papineni et al., 2002): Bleu remains the de facto standard in machine translation evaluation. |
| W08-0401 53 116:232 4 5 Experiments 5.1 Evaluation Measures We evaluated the proposed method using four evaluation measures, BLEU (Papineni et al., 2002), NIST (Doddington 2002), WER(word error rate), and PER(position independent word error rate). |
| W08-0903 54 88:138 Techniques that analyze n-gram precision such as BLEU score (Papineni et al., 2002) have been developed with the goal of comparing candidate translations against references provided by human experts in order to determine accuracy; although in our application the candidate translator is a student and not a machine, the principle is the same, and we wish to adapt their technique to our context. |
| W08-0903 55 108:138 A summary of the differences between our proposed approach and that of (Papineni et al., 2002) would include: The reliance of BLEU on the diversity of multiple reference translations in order to capture some of the acceptable alternatives in both word choice and word ordering that we have shown above. |
| P07-1108 56 97:179 The translation quality was evaluated using a well-established automatic measure: BLEU score (Papineni et al. , 2002). |
| P07-1108 57 19:179 Using BLEU (Papineni et al. , 2002) as a metric, our method achieves an absolute improvement of 0.06 (22.13% relative) as compared with the standard model trained with 5,000 L f -L e sentence pairs for French-Spanish translation. |
| P04-1079 58 121:149 A similar observation was made in (Papineni et al. , 2002: 313). |
| P04-1079 59 9:149 Some of them use human reference translations, e.g., the BLEU method (Papineni et al. , 2002), which is based on comparison of N-gram models in MT output and in a set of human reference translations. |
| P04-1079 60 28:149 Besides saving cost, the ability to dependably work with a single human translation has an additional advantage: it is now possible to create Recall-based evaluation measures for MT, which has been problematic for evaluation with multiple reference translations, since only one of the choices from the reference set is used in translation (Papineni et al. 2002:314). |
| P04-1079 61 109:149 On the one hand using 1 human reference with uniform results is essential for our methodology, since it means that there is no more trouble with Recall (Papineni et al. , 2002:314): a system's ability to avoid under-generation of N-grams can now be reliably measured. |
| P04-1079 62 124:149 Automatic evaluation methods such as BLEU (Papineni et al. , 2002), RED (Akiba et al. , 2001), or the weighted N-gram model proposed here may be more consistent in judging quality as compared to human evaluators, but human judgments remain the only criteria for metaevaluating the automatic methods. |
| P06-1011 63 131:172 Translation performance is measured using the automatic BLEU (Papineni et al. , 2002) metric, on one reference translation. |
| N04-4003 64 65:102 Word Error Rate (WER), which penalizes the edit distance against reference translations (Su et al. , 1992) BLEU: the geometric mean of n-gram precision for the translation results found in reference translations (Papineni et al. , 2002) Translation Accuracy (ACC): subjective evaluation ranks ranging from A to D (A: perfect, B: fair, C: acceptable and D: nonsense), judged blindly by a native speaker (Sumita et al. , 1999) In contrast to WER, higher BLEU and ACC scores indicate better translations. |
| W06-3101 65 17:92 2 Related Work There is a number of publications dealing with various automatic evaluation measures for machine translation output, some of them proposing new measures, some proposing improvements and extensions of the existing ones (Doddington, 2002; Papineni et al. , 2002; Babych and Hartley, 2004; Matusov et al. , 2005). |
| W06-3101 66 11:92 The most widely used are Word Error Rate (WER), Position Independent Word Error Rate (PER), the BLEU score (Papineni et al. , 2002) and the NIST score (Doddington, 2002). |
| P06-1091 67 137:210 We show translation results in terms of the automatic BLEU evaluation metric (Papineni et al. , 2002) on the MT03 Arabic-English DARPA evaluation test set consisting of a212a89a212a89a87 sentences with a98a89a212a161a213a89a214a89a215 Arabic words with a95 reference translations. |
| E06-1032 68 4:157 1 Introduction Over the past five years progress in machine translation, and to a lesser extent progress in natural language generation tasks such as summarization, has been driven by optimizing against n-grambased evaluation metrics such as Bleu (Papineni et al. , 2002). |
| W06-3122 69 16:91 Although Phramer provides decoding functionality equivalent to Pharaoh's, we preferred to use Pharaoh for this task because it is much faster than Phramer (between 2 and 15 times faster, depending on the configuration), and preliminary tests showed that there is no noticeable difference between the output of these two in terms of BLEU (Papineni et al. , 2002) score. |
| H05-1095 70 74:253 Unfortunately, this is not the case for such widely used MT evaluation metrics as BLEU (Papineni et al. , 2002) and NIST (Doddington, 2002). |
| W06-3121 71 15:69 The release has implementations for BLEU (Papineni et al. , 2002), WER and PER error criteria and it has decoding interfaces for Phramer and Pharaoh. |
| D07-1008 72 151:382 To counteract this, we introduce two brevity penalty measures (BP) inspired by BLEU (Papineni et al. , 2002) which we incorporate into the loss function, using a product, $\text{loss} = 1 - \text{Prec} \times \text{BP}$: $\text{BP}_1 = \exp(1 - \max(1, r/c))$ (6), $\text{BP}_2 = \exp(1 - \max(c/r, r/c))$, where r is the reference length and c is the candidate length. |
| H05-1098 73 42:140 The feature weights are learned by maximizing the BLEU score (Papineni et al. , 2002) on held-out data, using minimum-error-rate training (Och, 2003) as implemented by Koehn. |
| H05-1098 74 67:140 5 Analysis Over the last few years, several automatic metrics for machine translation evaluation have been introduced, largely to reduce the human cost of iterative system evaluation during the development cycle (Lin and Och, 2004; Melamed et al. , 2003; Papineni et al. , 2002). |
| N07-1006 75 6:159 The most commonly used metric, BLEU, correlates well over large test sets with human judgments (Papineni et al. , 2002), but does not perform as well on sentence-level evaluation (Blatz et al. , 2003). |
| N07-1006 76 39:159 2 Three New Features for MT Evaluation Since our source-sentence constrained n-gram precision and discriminative unigram precision are both derived from the normal n-gram precision, it is worth describing the original n-gram precision metric, BLEU (Papineni et al. , 2002). |
| P08-1064 77 130:210 The evaluation metric is case-sensitive BLEU-4 (Papineni et al., 2002). |
| P06-1067 78 20:241 This new model leads to significant improvements in MT quality as measured by BLEU (Papineni et al. , 2002). |
| J06-4004 79 204:388 Translation accuracy is measured in terms of the BLEU score (Papineni et al. 2002), which is computed here for translations generated by using the tuple n-gram model alone, in the case of Table 2, and by using the tuple n-gram model along with the additional four feature functions described in Section 3.2, in the case of Table 3. |
| J06-4004 80 245:388 In our SMT system implementation, this optimization procedure is performed by using a tool developed in-house, which is based on a simplex method (Press et al. 2002), and the BLEU score (Papineni et al. 2002) is used as a translation quality measurement. |
| H05-1023 81 203:217 We report case sensitive Bleu (Papineni et al. , 2002) score BleuC for all experiments. |
| W07-0403 82 201:234 Results on the provided 2000-sentence development set are reported using the BLEU metric (Papineni et al. , 2002). |
| W05-0823 83 59:86 This algorithm adjusts the log-linear weights so that BLEU (Papineni et al. , 2002) is maximized over a given development set. |
| W05-1204 84 16:174 Consequently, here we employ multiple references to evaluate MT systems like BLEU (Papineni et al. , 2002) and NIST (Doddington, 2002). |
| W07-0729 85 53:159 Translation scores are reported using case-insensitive BLEU (Papineni et al. , 2002) with a single reference translation. |
| C08-1128 86 148:207 We evaluated performance by measuring WER (word error rate), PER (position-independent word error rate), BLEU (Papineni et al., 2002) and TER (translation error rate) (Snover et al., 2006) using multiple references. |
| P06-1130 87 133:183 4.2 String-Based Evaluation We evaluate the output of our generation system against the raw strings of Section 23 using the Simple String Accuracy and BLEU (Papineni et al. , 2002) evaluation metrics. |
| N06-2029 88 59:90 For evaluation, we used the BLEU metrics, which calculates the geometric mean of n-gram precision for the MT outputs found in reference translations (Papineni et al. , 2002). |
| N06-1013 89 142:176 MT output is evaluated using the standard MT evaluation metric BLEU (Papineni et al. , 2002). |
| D07-1049 90 96:215 All evaluation is in terms of the BLEU score on our test set (Papineni et al. , 2002). |
| W08-0307 91 167:224 We use the standard four-reference NIST MTEval data sets for the years 2003, 2004 and 2005 (henceforth MT03, MT04 and MT05, respectively) for testing and the 2002 data set for tuning. BLEU4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and multiple-reference Word Error Rate scores are reported. [Footnote 5: http://opennlp.sourceforge.net/] |
| C04-1114 92 73:180 Both calculate the precision of a translation by comparing it to a reference translation and incorporating a length penalty (Doddington, 2001; Papineni et al. , 2002). |
| D07-1092 93 170:365 Results in terms of word-error-rate (WER) and BLEU score (Papineni et al. , 2002) are reported in Table 4 for those sentences that contain at least one unknown word. |
| W06-1112 94 17:168 They are a bit controversial in a proper machine translation, where the popular BLEU score (Papineni et al. , 2002), although widely accepted as a measure of translation accuracy, seems to favor stochastic approaches based on an n-gram model over other MT methods (see the results in (Nist, 2001)). |
| W03-0501 95 141:177 5.2 Bleu: Automatic Evaluation BLEU (Papineni et al, 2002) is a system for automatic evaluation of machine translation. |
| W07-0411 96 8:166 1 Introduction Since their appearance, string-based evaluation metrics such as BLEU (Papineni et al. , 2002) and NIST (Doddington, 2002) have been the standard tools used for evaluating MT quality. |
| W07-0411 97 51:166 Even the creators of BLEU point out that it may not correlate particularly well with human judgment at the sentence level (Papineni et al. , 2002). |
| C08-1138 98 142:198 The evaluation metric is case-sensitive BLEU-4 (Papineni et al., 2002). |
| N07-2006 99 66:89 The baseline score using all phrase pairs was 59.11 (BLEU, Papineni et al. , 2002) with a 95% confidence interval of [57.13, 61.09]. |
| D07-1030 100 22:173 In our experiments using BLEU (Papineni et al. , 2002) as the metric, the interpolated synthetic model achieves a relative improvement of 11.7% over the best RBMT system that is used to produce the synthetic bilingual corpora. |
| D07-1030 101 90:173 The translation quality is evaluated using a well-established automatic measure: BLEU score (Papineni et al. , 2002). |
| W05-0822 102 37:90 Once this is accomplished, a variant of Powell's algorithm is used to find weights that optimize BLEU score (Papineni et al, 2002) over these hypotheses, compared to reference translations. |
| P03-1057 103 17:230 Another current topic of machine translation is automatic evaluation of MT quality (Papineni et al. , 2002; Yasuda et al. , 2001; Akiba et al. , 2001). |
| P03-1057 104 46:230 3 Automatic Evaluation of MT Quality We utilize BLEU (Papineni et al. , 2002) for the automatic evaluation of MT quality in this paper. |
| I05-2021 105 16:135 However, recent progress in machine translation and the continuous improvement on evaluation metrics such as BLEU (Papineni et al. , 2002) suggest that SMT systems are already very good at choosing correct word translations. |
| W08-0317 106 67:95 De-En En-De Baseline 26.95 20.16 Factored baseline 27.43 20.27 Submitted system 27.63 20.46 Table 1: Bleu scores for Europarl (test2007) De-En En-De Baseline 19.54 14.31 Factored baseline 20.16 14.37 Submitted system 20.61 14.77 Table 2: Bleu scores for News Commentary (nc-test2007) 5 Results Case-sensitive Bleu scores4 (Papineni et al., 2002) for the Europarl devtest set (test2007) are shown in table 1. |
| W05-0712 107 132:175 of Words Person names 803 1749 Organization names 312 867 Location names 345 614 The BLEU score (Papineni et al. , 2002) with a single reference translation was deployed for evaluation. |
| N07-1046 108 171:233 Therefore, having correct transliterations would give only small improvements in terms of BLEU (Papineni et al. , 2002) and NIST scores. |
| P08-1022 109 17:157 Moreover, the overall BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007) scores, as well as numbers of exact string matches (as measured against to the original sentences in the CCGbank) are higher for the hypertagger-seeded realizer than for the preexisting realizer. |
| D08-1028 110 204:215 There are however other similarity metrics (e.g. BLEU (Papineni et al., 2002)) which could be used equally well. |
| N07-1007 111 25:188 We also show that integrating our case prediction model improves the quality of translation according to BLEU (Papineni et al. , 2002) and human evaluation. |
| P08-1086 112 108:148 Instead we report BLEU scores (Papineni et al., 2002) of the machine translation system using different combinations of word- and class-based models for translation tasks from English to Arabic and Arabic to English. |
| P07-2045 113 28:103 It also contains tools for tuning these models using minimum error rate training (Och 2003) and evaluating the resulting translations using the BLEU score (Papineni et al. 2002). |
| D08-1060 114 145:222 We also report the result of our translation quality in terms of both BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) against four human reference translations. |
| N06-1031 115 36:157 In the final step, we score our translations with 4-gram BLEU (Papineni et al. , 2002). |
| P08-1011 116 145:191 In addition to precision and recall, we also evaluate the Bleu score (Papineni et al., 2002) changes before and after applying our measure word generation method to the SMT output. |
| E06-1031 117 15:291 State-of-the-art measures such as BLEU (Papineni et al. , 2002) or NIST (Doddington, 2002) aim at measuring the translation quality rather on the document level1 than on the level of single sentences. |
| W08-0306 118 92:125 BLEU For all translation tasks, we report case-insensitive NIST BLEU scores (Papineni et al., 2002) using 4 references per sentence. |
| I05-2042 119 9:168 Although, there are various manual/automatic evaluation methods for these systems, e.g., BLEU (Papineni et al. 2002), these methods are basically incapable of dealing with an MTsystem and a w/p-MT-system at the same time, as they have different output forms. |
| W08-0321 120 27:99 Table 2 shows results in lowercase BLEU (Papineni et al., 2002) for both the baseline (B) and the improved baseline systems (B5) on development and held-out evaluation sets. |
| C08-1141 121 7:183 The state-of-the-art methods for automatic MT evaluation are using an n-gram based metric represented by BLEU (Papineni et al., 2002) and its variants. |
| C04-1030 122 157:215 This score measures the precision of unigrams, bigrams, trigrams and fourgrams with respect to a reference translation with a penalty for too short sentences (Papineni et al. , 2002). |
| J06-4002 123 19:281 For instance, several studies have shown that BLEU correlates with human ratings on machine translation quality (Papineni et al. 2002; Doddington 2002; Coughlin 2003). |
| J06-4002 124 24:281 However, they can be usefully employed during system development, for example, for quickly assessing modeling ideas or for comparing across different system configurations (Papineni et al. 2002; Bangalore, Rambow, and Whittaker 2000). |
| D08-1090 125 154:231 All conditions were optimized using BLEU (Papineni et al., 2002) and evaluated using both BLEU and Translation Edit Rate (TER) (Snover et al., 2006). |
| W05-0833 126 52:152 In order to create the necessary SMT language and translation models, they used: Giza++ (Och & Ney, 2003); the CMU-Cambridge statistical toolkit; the ISI ReWrite Decoder. Translation was performed from English-French and French-English, and the resulting translations were evaluated using a range of automatic metrics: BLEU (Papineni et al. , 2002), Precision and Recall (Turian et al. , 2003), and Word- and Sentence Error Rates. [Footnotes: 2 http://www.isi.edu/och/Giza++.html; 3 http://mi.eng.cam.ac.uk/prc14/toolkit.html; 4 http://www.isi.edu/licensed-sw/rewrite-decoder/] |
| W05-0833 127 17:152 We provide results using a range of automatic evaluation metrics: BLEU (Papineni et al. , 2002), Precision and Recall (Turian et al. , 2003), and Word- and Sentence Error Rates. |
| W07-0711 128 156:235 5.2 Machine translation on Europarl corpus We further tested our WDHMM on a phrase-based machine translation system to see whether our improvement on word alignment can also improve MT accuracy measured by BLEU score (Papineni et al. , 2002). |
| W03-1612 129 11:155 BLEU (Papineni et al. , 2002b) is one of the methods for automatic evaluation of translation quality. |
| W03-1612 130 13:155 High correlation is reported between the BLEU score and human evaluations for translations from Arabic, Chinese, French, and Spanish to English (Papineni et al. , 2002a). |
| W03-1612 131 66:155 Empirically the BLEU score has a high correlation with human evaluation when N = 4 for English translation evaluations (Papineni et al. , 2002b). |
| W03-1612 132 34:155 2 Background: Overview of BLEU This section briefly describes the original BLEU (Papineni et al. , 2002b)1, which was designed for English translation evaluation, so English sentences are used as examples in this section. |
| H05-1109 133 80:233 For extrinsic evaluation of machine translation, we use the BLEU metric (Papineni et al. , 2002). |
| H05-1049 134 33:196 3 Semantic Representation 3.1 The Need for Dependencies Perhaps the most common representation of text for assessing content is Bag-Of-Words or Bag-of-NGrams (Papineni et al. , 2002). |
| P06-2101 135 11:219 The ongoing evaluation literature is perhaps most obvious in the machine translation community's efforts to better BLEU (Papineni et al. , 2002). |
| P07-1092 136 106:201 The parameters, j, were trained using minimum error rate training (Och, 2003) to maximise the BLEU score (Papineni et al. , 2002) on a 150 sentence development set. |
| P06-1119 137 147:233 First, we compared our system output to human reference translations using Bleu (Papineni et al., 2002), a widely accepted objective metric for evaluation of machine translations. |
| W08-0405 138 169:208 Translation results are given in terms of the automatic BLEU evaluation metric (Papineni et al., 2002) as well as the TER metric (Snover et al., 2006). |
| P08-1112 139 91:190 Unfortunately, as was shown by Fraser and Marcu (2007) AER can have weak correlation with translation performance as measured by BLEU score (Papineni et al., 2002), when the alignments are used to train a phrase-based translation system. |
| W08-0308 140 127:159 The Chinese sentence from the selected pair is used as the single reference to tune and evaluate the MT system with word-based BLEU-4 (Papineni et al., 2002). |
| P07-1005 141 114:177 Following (Chiang, 2005), we used the version 11a NIST BLEU script with its default settings to calculate the BLEU scores (Papineni et al., 2002) based on case-insensitive n-gram matching, where n is up to 4. |
| I08-1030 142 168:242 The translations are evaluated in terms of BLEU score (Papineni et al., 2002). |
| W05-0909 143 48:179 2 The METEOR Metric 2.1 Weaknesses in BLEU Addressed in METEOR The main principle behind IBM's BLEU metric (Papineni et al, 2002) is the measurement of the overlap in unigrams (single words) and higher order n-grams of words, between a translation being evaluated and a set of one or more reference translations. |
| W05-0909 144 10:179 1 Introduction Automatic Metrics for machine translation (MT) evaluation have been receiving significant attention in the past two years, since IBM's BLEU metric was proposed and made available (Papineni et al 2002). |
| P06-2124 145 121:213 For word alignment accuracy, F-measure is reported, i.e., the harmonic mean of precision and recall against a gold-standard reference set; for translation quality, Bleu (Papineni et al. , 2002) and its variation of NIST scores are reported. |
| N06-1004 146 33:208 2 Disperp and Distortion Corpora 2.1 Defining Disperp The ultimate reason for choosing one SCM over another will be the performance of an MT system containing it, as measured by a metric like BLEU (Papineni et al. , 2002). |
| C04-1016 147 7:139 Automated metrics such as BLEU (Papineni et al. , 2002), RED (Akiba et al, 2001), Weighted N-gram model (WNM) (Babych, 2004), syntactic relation / semantic vector model (Rajman and Hartley, 2001) have been shown to correlate closely with scoring or ranking by different human evaluation parameters. |
| C04-1016 148 89:139 It was found to produce automated scores, which strongly correlate with human judgements about translation fluency (Papineni et al. , 2002). |
| W05-0904 149 7:146 The most commonly used automatic evaluation metrics, BLEU (Papineni et al., 2002) and NIST (Doddington, 2002), are based on the assumption that "The closer a machine translation is to a professional human translation, the better it is" (Papineni et al., 2002). |
| W05-0904 150 11:146 BLEU and NIST have been shown to correlate closely with human judgments in ranking MT systems with different qualities (Papineni et al. , 2002; Doddington, 2002). |
| P07-1044 151 12:186 BLEU (Papineni et al. , 2002) is a canonical example: in matching n-grams in a candidate translation text with those in a reference text, the metric measures faithfulness by counting the matches, and fluency by implicitly using the reference n-grams as a language model. |
| P06-2005 152 17:209 We use IBM's BLEU score (Papineni et al., 2002) to measure the performance of SMS text normalization. |
| P06-2005 153 148:209 For evaluation, we use IBM's BLEU score (Papineni et al., 2002) to measure the performance of the SMS normalization. |
| D08-1011 154 125:215 In the following experiments, the NIST BLEU score is used as the evaluation metric (Papineni et al., 2002), which is reported as a percentage in the following sections. |
| H05-1019 155 29:179 They reported that their method is superior to BLEU (Papineni et al. , 2002) in terms of the correlation between human assessment and automatic evaluation. |
| P07-2026 156 38:101 3.3 BLEU Score The BLEU score (Papineni et al., 2002) measures the agreement between a hypothesis $e_1^I$ generated by the MT system and a reference translation $e_1^I$. |
| P06-1077 157 110:252 We evaluated the translation quality using the BLEU metric (Papineni et al. , 2002), as calculated by mteval-v11b.pl with its default setting except that we used case-sensitive matching of n-grams. |
| D07-1007 158 174:218 In addition to the widely used BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) scores, we also evaluate translation quality with the recently proposed Meteor (Banerjee and Lavie, 2005) and four edit-distance style metrics, Word Error Rate (WER), Position-independent word Error Rate (PER) (Tillmann et al., 1997), CDER, which allows block reordering (Leusch et al., 2006), and Translation Edit Rate (TER) (Snover et al., 2006). |
| W06-3103 159 136:183 5.2 Evaluation Metrics The commonly used criteria to evaluate the translation results in the machine translation community are: WER (word error rate), PER (position-independent word error rate), BLEU (Papineni et al., 2002), and NIST (Doddington, 2002). |
| W07-0401 160 159:352 6.2 Translation Results For the translation experiments, we report the two accuracy measures BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) as well as the two error rates word error rate (WER) and position-independent word error rate (PER). |
| W08-2118 161 115:194 To optimize the parameters of the decoder, we performed minimum error rate training on IWSLT04 optimizing for the IBM-BLEU metric (Papineni et al., 2002). |
| W08-0301 162 126:195 (Case-insensitive) BLEU-4 (Papineni et al., 2002) is used as the evaluation metric. |
| P02-1039 163 137:224 As an overall decoding performance measure, we used the BLEU metric (Papineni et al. , 2002). |
| C08-1038 164 93:151 5.2 Experimental Results Following (Langkilde, 2002) and other work on general-purpose generators, BLEU score (Papineni et al., 2002), average NIST simple string accuracy (SSA) and percentage of exactly matched sentences are adopted as evaluation metrics. |
| P05-1009 165 133:155 We evaluate accuracy performance using two automatic metrics: an identity metric, ID, which measures the percent of sentences recreated exactly, and BLEU (Papineni et al. , 2002), which gives the geometric average of the number of uni-, bi-, tri-, and four-grams recreated exactly. |
| N07-1021 166 95:166 BLEU (Papineni et al., 2002) is a precision metric that assesses the quality of a translation in terms of the proportion of its word n-grams (n ≤ 4 has become standard) that it shares with several reference translations. |
| I08-2088 167 89:145 The horizontal axis represents the weight for the out-of-domain translation model, and the vertical axis represents the automatic metric of translation quality (BLEU score (Papineni et al., 2002) in Fig. [Figure 2: Results of data selection and linear interpolation (BLEU); x-axis: weight for out-of-domain translation model, 0.0 to 1.0; y-axis: BLEU score, 15% to 25%; series: 400 K, 800 K, 1.2 M, 1.6 M, 2.5 M] |
| W07-0409 168 47:156 2.2 Weight optimization A common criterion to optimize the coefficients of the log-linear combination of feature functions is to maximize the BLEU score (Papineni et al. , 2002) on a development set (Och and Ney, 2002). |
| N06-1058 169 194:238 Our scores fall within the range of previous researchers (Papineni et al. , 2002; Lin and Och, 2004). |
| N06-1058 170 65:238 Automatic Evaluation Measures A variety of automatic evaluation methods have been recently proposed in the machine translation community (NIST, 2002; Melamed et al. , 2003; Papineni et al. , 2002). |
| N06-1058 171 184:238 The Pearson correlation is calculated over these ten pairs (Papineni et al. , 2002; Stent et al. , 2005). |
| N06-1058 172 164:238 4.2 Impact of Paraphrases on Machine Translation Evaluation The standard way to analyze the performance of an evaluation metric in machine translation is to compute the Pearson correlation between the automatic metric and human scores (Papineni et al. , 2002; Koehn, 2004; Lin and Och, 2004; Stent et al. , 2005). |
| N06-1058 173 168:238 This strategy is commonly used in MT evaluation, because of BLEU's well-known problems with documents of small size (Papineni et al., 2002; Koehn, 2004). |
| W07-0707 174 21:251 The BLEU metric (Papineni et al., 2002) and the closely related NIST metric (Doddington, 2002) along with WER and PER have been widely used by many machine translation researchers. |
| W07-0707 175 10:251 The most widely used are Word Error Rate (WER), Position independent word Error Rate (PER), the BLEU score (Papineni et al. , 2002) and the NIST score (Doddington, 2002). |
| W06-3106 176 135:161 Two error rates: the sentence error rate (SER) and the word error rate (WER) that we seek to minimize, and BLEU (Papineni et al. , 2002), that we seek to maximize. |
| W07-0713 177 11:228 The most widely known are the Word Error Rate (WER), the Position independent word Error Rate (PER), the NIST score (Doddington, 2002) and, especially in recent years, the BLEU score (Papineni et al. , 2002) and the Translation Error Rate (TER) (Snover et al. , 2005). |
| W07-0737 178 14:228 We further assume that the degree of difficulty of a phrase is directly correlated with the quality of the translation produced by the MT system, which can be approximated using an automatic evaluation metric, such as BLEU (Papineni et al. , 2002). |
| C08-1064 179 118:260 Optimization and measurement were done with the NIST implementation of case-insensitive BLEU 4n4r (Papineni et al., 2002).4 4.1 Baseline We compared translation by pattern matching with a conventional exact model representation using external prefix trees (Zens and Ney, 2007). |
| W08-0324 180 55:79 3 Evaluation We trained our model parameters on a subset of the provided dev2006 development set, optimizing for case-insensitive IBM-style BLEU (Papineni et al., 2002) with several iterations of minimum error rate training on n-best lists. |
| W08-0324 181 59:79 We report case-insensitive scores for version 0.6 of METEOR (Lavie and Agarwal, 2007) with all modules enabled, version 1.04 of IBM-style BLEU (Papineni et al., 2002), and version 5 of TER (Snover et al., 2006). |
| N07-1022 182 156:209 [Figure 4: Sample partial system output in the ROBOCUP domain. Reference: "If our player 2, 3, 7 or 5 has the ball and the ball is close to our goal line"; PHARAOH++: "If player 3 has the ball is in 2 5 the ball is in the area near our goal line"; WASP1++: "If players 2, 3, 7 and 5 has the ball and the ball is near our goal line"] [Table 1: Results of automatic evaluation; bold type indicates the best performing system (or systems) for a given domain-metric pair (p < 0.05). ROBOCUP (BLEU, NIST): PHARAOH 0.3247, 5.0263; WASP1 0.4357, 5.4486; PHARAOH++ 0.4336, 5.9185; WASP1++ 0.6022, 6.8976. GEOQUERY (BLEU, NIST): PHARAOH 0.2070, 3.1478; WASP1 0.4582, 5.9900; PHARAOH++ 0.5354, 6.3637; WASP1++ 0.5370, 6.4808] 5.1 Automatic Evaluation We performed 4 runs of 10-fold cross validation, and measured the performance of the learned generators using the BLEU score (Papineni et al., 2002) and the NIST score (Doddington, 2002). |
| N07-1029 183 47:215 The NIST BLEU-4 is a variant of BLEU (Papineni et al., 2002) and is computed as $\mathrm{BLEU}(E, E_r) = \exp\left(\frac{1}{4}\sum_{n=1}^{4}\log p_n(E, E_r)\right)\gamma(E, E_r)$ (2) where $p_n(E, E_r)$ is the precision of $n$-grams in the hypothesis $E$ given the reference $E_r$ and $\gamma(E, E_r) \le 1$ is a brevity penalty. |
| W08-1113 184 13:155 Such metrics have been introduced in other fields, including PARADISE (Walker et al., 1997) for spoken dialogue systems, BLEU (Papineni et al., 2002) for machine translation,1 and ROUGE (Lin, 2004) for summarisation. |
| W08-0312 185 37:85 3 Extending Bleu and Ter with Flexible Matching Many widely used metrics like Bleu (Papineni et al., 2002) and Ter (Snover et al., 2006) are based on measuring string level similarity between the reference translation and translation hypothesis, just like Meteor . Most of them, however, depend on finding exact matches between the words in two strings. |
| W08-0312 186 7:85 The most commonly used MT evaluation metric in recent years has been IBM's Bleu metric (Papineni et al., 2002). |
| W03-2804 187 27:186 BLEU Score: BLEU is an automatic metric designed by IBM, which uses several references (Papineni et al., 2002). |
| W07-0701 188 123:168 6 Experiments We evaluated the translation quality of the system using the BLEU metric (Papineni et al. , 2002). |
| D08-1010 189 131:200 The translation quality is evaluated by BLEU metric (Papineni et al., 2002), as calculated by mteval-v11b.pl with case-insensitive matching of n-grams, where n = 4. |
| D08-1023 190 147:221 Model performance is evaluated using the standard BLEU metric (Papineni et al., 2002) which measures average n-gram precision, n ≤ 4, and we use the NIST definition of the brevity penalty for multiple reference test sets. |
| W08-0302 191 119:197 Evaluation We evaluate translation output using three automatic evaluation measures: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), and METEOR (Banerjee and Lavie, 2005, version 0.6).5 All measures used were the case-sensitive, corpus-level versions. |
| W08-0302 192 13:197 The weights $\lambda_1, \dots, \lambda_M$ are typically learned to directly minimize a standard evaluation criterion on development data (e.g., the BLEU score; Papineni et al., (2002)) using numerical search (Och, 2003). |
| P08-2020 193 14:80 1.2 Evaluation In this paper we report results using the BLEU metric (Papineni et al., 2002), however as the evaluation criterion in GALE is HTER (Snover et al., 2006), we also report in TER (Snover et al., 2005). |
| W08-0509 194 152:208 To compare the performance of system, we recorded the total training time and the BLEU score, which is a standard automatic measurement of the translation quality (Papineni et al., 2002). |
| D08-1033 195 213:234 scored with lowercased, tokenized NIST BLEU, and exact match METEOR (Papineni et al., 2002; Lavie and Agarwal, 2007). |
| P08-1071 196 41:207 For example, in machine translation, BLEU score (Papineni et al., 2002) is developed to assess the quality of machine translated sentences. |
| W08-0304 197 7:189 This approach attempts to improve translation quality by optimizing an automatic translation evaluation metric, such as the BLEU score (Papineni et al., 2002). |
| J05-3002 198 200:557 This restriction is necessary because the problem of optimizing many-to-many alignments 5 Our preliminary experiments with n-gram-based overlap measures, such as BLEU (Papineni et al. 2002) and ROUGE (Lin and Hovy 2003), show that these metrics do not correlate with human judgments on the fusion task, when tested against two reference outputs. |
| W04-1014 199 33:144 To evaluate sentence automatically generated with taking consideration word concatenation into by using references varied among humans, various metrics using n-gram precision and word accuracy have been proposed: word string precision (Hori and Furui, 2000b) for summarization through word extraction, ROUGE (Lin and Hovy, 2003) for abstracts, and BLEU (Papineni et al. , 2002) for machine translation. |
| P05-1074 200 15:147 Examples of monolingual parallel corpora that have been used are multiple translations of classical French novels into English, and data created for machine translation evaluation methods such as Bleu (Papineni et al. , 2002) which use multiple reference translations. |
| N07-1061 201 36:313 To set the weights, $\lambda_m$, we carried out minimum error rate training (Och, 2003) using BLEU (Papineni et al., 2002) as the objective function. |
| W04-1016 202 14:184 Work in this area includes that of Lin and Hovy (2003) and Pastra and Saggion (2003), both of whom inspect the use of Bleu-like metrics (Papineni et al. , 2002) in summarization. |
| N03-1003 203 25:203 This could, for example, aid machine-translation evaluation, where it has become common to evaluate systems by comparing their output against a bank of several reference translations for the same sentences (Papineni et al. , 2002). |
| W05-0836 204 103:153 5.3 Evaluation Metric This paper focuses on the BLEU metric as presented in (Papineni et al. , 2002). |
| W05-0836 205 46:153 The piecewise linearity observation made in (Papineni et al. , 2002) is no longer applicable since we cannot move the log operation into the expected value. |
| I05-2039 206 76:91 Because it is not feasible here to have humans judge the quality of many sets of translated data, we rely on an array of well known automatic evaluation measures to estimate translation quality : BLEU (Papineni et al. 2002) is the geometric mean of the n-gram precisions in the output with respect to a set of reference translations. |
| P06-1139 207 126:231 When evaluated against the state-of-the-art, phrase-based decoder Pharaoh (Koehn, 2004), using the same experimental conditions translation table trained on the FBIS corpus (7.2M Chinese words and 9.2M English words of parallel text), trigram language model trained on 155M words of English newswire, interpolation weights a65 (Equation 2) trained using discriminative training (Och, 2003) (on the 2002 NIST MT evaluation set), probabilistic beam a90 set to 0.01, histogram beam a58 set to 10 and BLEU (Papineni et al. , 2002) as our metric, the WIDL-NGLM-Aa86 a129 algorithm produces translations that have a BLEU score of 0.2570, while Pharaoh translations have a BLEU score of 0.2635. |
| P08-2015 208 90:126 We use the standard NIST MTEval data sets for the years 2003, 2004 and 2005 (henceforth MT03, MT04 and MT05, respectively).6 We report results in terms of case-insensitive 4-gram BLEU (Papineni et al., 2002) scores. |
| W07-0710 209 8:214 1 Introduction In recent years, statistical machine translation have experienced a quantum leap in quality thanks to automatic evaluation (Papineni et al., 2002) and error-based optimization (Och, 2003). |
| W07-0710 210 24:214 2.2.1 BLEU Evaluation The BLEU score (Papineni et al. , 2002) was defined to measure overlap between a hypothesized translation and a set of human references. |
| D07-1080 211 100:227 The algorithm is slightly different from other online training algorithms (Tillmann and Zhang, 2006; Liang et al. , 2006) in that we keep and update oracle translations, which is a set of good translations reachable by a decoder according to a metric, i.e. BLEU (Papineni et al. , 2002). |
| D07-1080 212 161:227 The translation quality is evaluated by case-sensitive NIST (Doddington, 2002) and BLEU (Papineni et al. , 2002)2. |
| D07-1080 213 134:227 4.2 Approximated BLEU We used the BLEU score (Papineni et al., 2002) as the loss function computed by: $\mathrm{BLEU}(E; \hat{E}) = \exp\left(\frac{1}{N}\sum_{n=1}^{N}\log p_n(E, \hat{E})\right)\cdot \mathrm{BP}(E, \hat{E})$ (7) where $p_n(\cdot)$ is the n-gram precision of hypothesized translations $\hat{E} = \{\hat{e}^t\}_{t=1}^{T}$ given reference translations $E = \{e^t\}_{t=1}^{T}$ and $\mathrm{BP}(\cdot) \le 1$ is a brevity penalty. |
| N04-1019 214 7:245 In machine translation, the rankings from the automatic BLEU method (Papineni et al. , 2002) have been shown to correlate well with human evaluation, and it has been widely used since and has even been adapted for summarization (Lin and Hovy, 2003). |
| P08-1009 215 158:223 4.2 Automatic Evaluation We first present our soft cohesion constraint's effect on BLEU score (Papineni et al., 2002) for both our dev-test and test sets. |
| W04-1708 216 25:164 The core technology of the proposed method, i.e., the automatic evaluation of translations, was developed in research aiming at the efficient development of Machine Translation (MT) technology (Su et al. , 1992; Papineni et al. , 2002; NIST, 2002). |
| W04-1708 217 50:164 The unit of utterance corresponds to the unit of segment in the original BLEU and NIST studies (Papineni et al. , 2002; NIST, 2002). |
| W07-0410 218 87:193 On the other hand, both BLEU (Papineni et al. , 2002) and NIST (Doddington 2002) scores are higher for the baseline system (mteval-v11b.pl). |
| W06-3108 219 151:203 5.3 Translation Results For the translation experiments on the BTEC task, we report the two accuracy measures BLEU (Papineni et al. , 2002) and NIST (Doddington, 2002) as well as the two error rates: word error rate (WER) and position-independent word error rate (PER). |
| P07-1004 220 124:233 Evaluation Metrics We evaluated the generated translations using three different evaluation metrics: BLEU score (Papineni et al., 2002), mWER (multi-reference word error rate), and mPER (multi-reference position-independent word error rate) (Nießen et al., 2000). |
| P06-2109 221 119:230 4 Experiment 4.1 Evaluation Method We evaluated each sentence compression method using word F-measures, bigram F-measures, and BLEU scores (Papineni et al. , 2002). |
| C08-1041 222 133:197 The translation quality is evaluated by BLEU metric (Papineni et al., 2002), as calculated by mteval-v11b.pl with case-insensitive matching of n-grams, where n = 4. |
| P06-2103 223 87:271 One of the most successful metrics for judging machine-generated text is BLEU (Papineni et al. , 2002). |
| N06-2051 224 45:59 We optimized separately for both the NIST (Doddington, 2002) and the BLEU metrics (Papineni et al. , 2002). |
| W07-0735 225 17:302 To further emphasize the importance of morphology in MT to Czech, we compare the standard BLEU (Papineni et al., 2002) of a baseline phrase-based translation with BLEU which disregards word forms (lemmatized MT output is compared to lemmatized reference translation). |
| N07-2013 226 72:110 BLEU score In order to measure the extent to which whole chunks of text from the prompt are reproduced in the student essays, we used the BLEU score, known from studies of machine translation (Papineni et al. 2002). |
| D08-1063 227 120:196 7 Experiments To show the effectiveness of cross-language mention propagation information in improving mention detection system performance in Arabic, Chinese and Spanish, we use three SMT systems with very competitive performance in terms of BLEU11 (Papineni et al., 2002). |
| N04-1008 228 81:175 4.4.1 N-gram Co-Occurrence Statistics for Answer Extraction N-gram co-occurrence statistics have been successfully used in automatic evaluation (Papineni et al. 2002, Lin and Hovy 2003), and more recently as training criteria in statistical machine translation (Och 2003). |
| D07-1036 229 143:249 The translation quality is evaluated by BLEU metric (Papineni et al. , 2002), as calculated by mteval-v11b.pl 6 with case-sensitive matching of n-grams. |
| W06-3508 230 140:178 What, therefore, has to be explored are various similarity metrics, defining similarity in a concrete way and evaluate the results against human annotations (see Papineni et al. , 2002). |
| D08-1078 231 165:241 6.1 Evaluation of Translation Performance We use the BLEU score (Papineni et al., 2002) to evaluate our systems. |
| D08-1066 232 202:243 Performance is measured by computing the BLEU scores (Papineni et al., 2002) of the system's translations, when compared against a single reference translation per sentence. |
| P07-1040 233 51:212 2 Evaluation Metrics Currently, the most widely used automatic MT evaluation metric is the NIST BLEU-4 (Papineni et al. , 2002). |
| E06-1040 234 11:162 While studies have shown that ratings of MT systems by BLEU and similar metrics correlate well with human judgments (Papineni et al. , 2002; Doddington, 2002), we are not aware of any studies that have shown that corpus-based evaluation metrics of NLG systems are correlated with human judgments; correlation studies have been made of individual components (Bangalore et al. , 2000), but not of systems. |
| E06-1040 235 26:162 The BLEU metric (Papineni et al. , 2002) in MT has been particularly successful; for example MT05, the 2005 NIST MT evaluation exercise, used BLEU-4 as the only method of evaluation. |
| E06-1040 236 7:162 Some NLG researchers are impressed by the success of the BLEU evaluation metric (Papineni et al. , 2002) in Machine Translation (MT), which has transformed the MT field by allowing researchers to quickly and cheaply evaluate the impact of new ideas, algorithms, and data sets. |
| E06-1040 237 30:162 Properly calculated BLEU scores have been shown to correlate reliably with human judgments (Papineni et al. , 2002). |
| P05-1066 238 145:229 We use BLEU scores (Papineni et al. , 2002) to measure translation accuracy. |
| D07-1077 239 133:287 The reordered sentence is then re-tokenized to be consistent with the baseline system, which uses a different tokenization scheme that is more friendly to the MT system.3 We use BLEU scores as the performance measure in our evaluation (Papineni et al. , 2002). |
| N03-1010 240 167:227 distance (MSD) and the maximum swap segment size (MSSS) ranging from 0 to 10 and evaluated the translations with the BLEU7 metric (Papineni et al. , 2002). |
| P03-1039 241 128:155 BLEU: BLEU score, which computes the ratio of n-gram for the translation results found in reference translations (Papineni et al. , 2002). |
| W07-0714 242 5:140 1 Introduction Since the creation of BLEU (Papineni et al. , 2002) and NIST (Doddington, 2002), the subject of automatic evaluation metrics for MT has been given quite a lot of attention. |
| W07-0714 243 49:140 Even the creators of BLEU point out that it may not correlate particularly well with human judgment at the sentence level (Papineni et al., 2002). Footnote 3: A demo of the parser can be found at http://lfgdemo.computing.dcu.ie/lfgparser.html |
| W08-0329 244 72:107 The translation quality is measured by three MT evaluation metrics: TER (Snover et al., 2006), BLEU (Papineni et al., 2002), and METEOR (Lavie and Agarwal, 2007). |
| N03-2016 245 39:92 For the evaluation of translation quality, we used the BLEU metric (Papineni et al. , 2002), which measures the n-gram overlap between the translated output and one or more reference translations. |
| D08-1051 246 166:207 EsEn 63.00.9 59.20.9 6.01.4 EnEs 63.80.9 60.51.0 5.21.6 DeEn 71.60.8 69.00.9 3.61.3 EnDe 75.90.8 73.50.9 3.21.2 FrEn 62.90.9 59.21.0 5.91.6 EnFr 63.40.9 60.00.9 5.41.4 bined in a log-linear fashion by adjusting a weight for each of them by means of the MERT (Och, 2003) procedure, optimising the BLEU (Papineni et al., 2002) score obtained on the development partition. |
| D08-1051 247 38:207 To this purpose, different authors (Papineni et al., 1998; Och and Ney, 2002) propose the use of the so-called log-linear models, where the decision rule is given by the expression $\hat{y} = \arg\max_{y} \sum_{m=1}^{M} \lambda_m h_m(x, y)$ (3) where $h_m(x, y)$ is a score function representing an important feature for the translation of $x$ into $y$, $M$ is the number of models (or features) and $\lambda_m$ are the weights of the log-linear combination. |
| N03-2036 248 57:78 The third column reports the BLEU score (Papineni et al. , 2002) along with 95% confidence interval. |
| P04-1063 249 45:208 Regressive FLM (rFLM) h(FLM(e,j)) = w1 FLM(e,j)+b Regressive ALM (rALM) h(ALM(e,j)) = w1 ALM(e,j)+b Notice that h() here is supposed to relate FLM or ALM to some independent evaluation metric such as BLEU (Papineni et al. , 2002), not the log likelihood of a translation. |
| P05-1033 250 116:249 Our evaluation metric was BLEU (Papineni et al. , 2002), as calculated by the NIST script (version 11a) with its default settings, which is to perform case-insensitive matching of n-grams up to n = 4, and to use the shortest (as opposed to nearest) reference sentence for the brevity penalty. |
| W07-0734 251 9:94 The most commonly used MT evaluation metric in recent years has been IBM's Bleu metric (Papineni et al., 2002). |
| W08-0322 252 57:68 The results evaluated by BLEU score (Papineni et al., 2002) is shown in Table 2. |
| P08-2040 253 87:107 Our evaluation metric is BLEU (Papineni et al., 2002). |
| P05-1048 254 98:160 Using our WSD model to constrain the translation candidates given to the decoder hurts translation quality, as measured by the automated BLEU metric (Papineni et al. , 2002). |
| C08-1144 255 102:207 Translation quality is automatically evaluated by the IBM-BLEU metric (Papineni et al., 2002) (case-sensitive, using length of the closest reference translation) on the following publicly 1148 Ch.-En. |
| N04-4015 256 56:115 Translation qualities are measured by uncased BLEU (Papineni et al. 2002) with 4 reference translations, sysids: ahb, ahc, ahd, ahe. |
| P05-1067 257 186:217 Our MT system was evaluated using the n-gram based Bleu (Papineni et al. , 2002) and NIST machine translation evaluation software. |
| W02-1022 258 199:207 While recent proposals for evaluation of MT systems have involved multi-parallel corpora (Thompson and Brew, 1996; Papineni et al. , 2002), statistical MT algorithms typically only use one-parallel data. |
| N03-1013 259 93:158 7For details about the Bleu evaluation metric, see (Papineni et al. , 2002). |
| P04-1078 260 77:122 For comparison purposes, we also computed the value of $R^2$ for adequacy using the BLEU score formula given in (Papineni et al., 2002), for the 7 systems using the same one reference, and we obtain a similar value, 83.91%; computing the value of $R^2$ for adequacy using the BLEU scores computed with all 4 references available also yielded a lower value for $R^2$, 62.21%. |
| P04-1078 261 73:122 For comparison purposes, we also computed the value of $R^2$ for fluency using the BLEU score formula given in (Papineni et al., 2002), for the 7 systems using the same one reference, and we obtained a similar value, 78.52%; computing the value of $R^2$ for fluency using the BLEU scores computed with all 4 references available yielded a lower value for $R^2$, 64.96%, although BLEU scores obtained with multiple references are usually considered more reliable. |
| P04-1078 262 4:122 1 Introduction With the introduction of the BLEU metric for machine translation evaluation (Papineni et al, 2002), the advantages of doing automatic evaluation for various NLP applications have become increasingly appreciated: they allow for faster implement-evaluate cycles (by by-passing the human evaluation bottleneck), less variation in evaluation performance due to errors in human assessor judgment, and, not least, the possibility of hill-climbing on such metrics in order to improve system performance (Och 2003). |
| H05-1005 263 113:209 The table also shows the popular BLEU (Papineni et al. , 2002) and NIST2 MT metrics. |
| D07-1090 264 160:316 For each training data size, we report the size of the resulting language model, the fraction of 5-grams from the test data that is present in the language model, and the BLEU score (Papineni et al. , 2002) obtained by the machine translation system. |
| W08-1112 265 104:133 Following (Langkilde, 2002) and other work on general-purpose generators, we adopt BLEU score (Papineni et al., 2002), average simple string accuracy (SSA) and percentage of exactly matched sentences for accuracy evaluation.6 For coverage evaluation, we measure the percentage of input fstructures that generate a sentence. |
| N07-2037 266 122:135 We measure translation performance by the BLEU score (Papineni et al, 2002) with one reference for each hypothesis. |
| P05-1032 267 63:147 They used the Bleu evaluation metric (Papineni et al. , 2002), but capped the n-gram precision at 4-grams. |
| P05-1032 268 128:147 We calculated the translation quality using Bleu's modified n-gram precision metric (Papineni et al., 2002) for n-grams of up to length four. |
| W05-0806 269 87:124 4.2 Translation Results The evaluation metrics used in our experiments are WER (Word Error Rate), PER (Position-independent word Error Rate) and BLEU (BiLingual Evaluation Understudy) (Papineni et al., 2002). |
| W08-0402 270 146:177 As shown in Table 1, the JAVA decoder (without explicit parallelization) is 22 times faster than the PYTHON decoder, while achieving slightly better translation quality as measured by BLEU-4 (Papineni et al., 2002). |
| P08-1010 271 141:204 We measure translation performance by the BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) scores with multiple translation references. |
| P08-1010 272 51:204 Other possibilities for the weighting include assigning constant one or the exponential of the final score etc. One of the advantages of the proposed phrase training algorithm is that it is a parameterized procedure that can be optimized jointly with the translation engine to minimize the final translation errors measured by automatic metrics such as BLEU (Papineni et al., 2002). |
| D07-1054 273 19:267 This approach gave an improvement of 2.7 in BLEU (Papineni et al. , 2002) score on the IWSLT05 Japanese to English evaluation corpus (improving the score from 52.4 to 55.1). |
| W06-3111 274 107:194 (Papineni et al. , 2002). |
| P07-1066 275 129:180 During evaluation two performance metrics, BLEU (Papineni et al. , 2002) and NIST, were computed. |
| P08-1007 276 8:175 Among all the automatic MT evaluation metrics, BLEU (Papineni et al., 2002) is the most widely used. |
| P08-1007 277 31:175 2.1 BLEU BLEU (Papineni et al., 2002) is essentially a precision-based metric and is currently the standard metric for automatic evaluation of MT performance. |
| W07-0715 278 105:155 Since this trade-off is also affected by the settings of various pruning parameters, we compared decoding time and translation quality, as measured by BLEU score (Papineni et al, 2002), for the two models on our first test set over a broad range of settings for the decoder pruning parameters. |
| E06-1005 279 113:186 3.2 Evaluation Criteria Well-established objective evaluation measures like the word error rate (WER), position-independent word error rate (PER), and the BLEU score (Papineni et al., 2002) were used to assess the translation quality. |
| J05-4003 280 271:416 5 Translation performance was measured using the automatic BLEU evaluation metric (Papineni et al. 2002) on four reference translations. |
| I05-5008 281 9:211 Automatic measures like BLEU (PAPINENI et al. , 2001) or NIST (DODDINGTON, 2002) do so by counting sequences of words in such paraphrases. |