Basic Info:
id: P03-1021
title: Minimum Error Rate Training In Statistical Machine Translation
venue: ACL
year: 2003
pdf: link
Abstract
Often, the training procedure for statistical machine translation models is based on maximum likelihood or related criteria. A general problem of this approach is that there is only a loose relation to the final translation quality on unseen text. In this paper, we analyze various training criteria which directly optimize translation quality. These training criteria make use of recently proposed automatic evaluation metrics. We describe a new algorithm for efficient training an unsmoothed error count. We show that significantly better results can often be obtained if the final evaluation criterion is taken directly into account as part of the training procedure.
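The abstract's "unsmoothed error count" can be illustrated with a small sketch. It assumes each development sentence comes with an n-best list of candidates, each carrying a feature vector and an error count against the reference, and that the search moves along a single direction in weight space. The names `project`, `best_candidate`, `total_error`, and `line_search` are hypothetical, and the search below probes every interval between pairwise line crossings; it is exact along the line but is a brute-force stand-in, not the efficient algorithm the paper describes.

```python
from itertools import combinations


def project(features, weights, direction):
    """Return (a, b) so that the candidate's score at step gamma is a + gamma * b."""
    a = sum(w * h for w, h in zip(weights, features))
    b = sum(d * h for d, h in zip(direction, features))
    return a, b


def best_candidate(cands, gamma):
    """Index of the top-scoring candidate; each candidate is (a, b, err)."""
    return max(range(len(cands)), key=lambda k: cands[k][0] + gamma * cands[k][1])


def total_error(nbest, gamma):
    """Unsmoothed error count: sum the errors of each sentence's top candidate."""
    return sum(cands[best_candidate(cands, gamma)][2] for cands in nbest)


def line_search(nbest):
    """Minimize the error count along one direction (brute force, assumption)."""
    # A sentence's top candidate can only change where two score lines cross.
    points = set()
    for cands in nbest:
        for (a1, b1, _), (a2, b2, _) in combinations(cands, 2):
            if b1 != b2:
                points.add((a2 - a1) / (b1 - b2))
    points = sorted(points)
    if not points:
        return 0.0, total_error(nbest, 0.0)
    # Probe one step size inside every interval between consecutive crossings.
    probes = [points[0] - 1.0]
    probes += [(p + q) / 2.0 for p, q in zip(points, points[1:])]
    probes.append(points[-1] + 1.0)
    return min(((g, total_error(nbest, g)) for g in probes), key=lambda t: t[1])


# Toy data (made up for illustration): two sentences, two candidates each.
nbest = [
    [(0.0, 1.0, 2), (1.0, 0.0, 0)],
    [(0.0, -1.0, 1), (-2.0, 0.0, 3)],
]
gamma, err = line_search(nbest)
```

Because the error count is piecewise constant in the step size, evaluating one point per interval is enough to find the minimum along the line; the cost is quadratic in the n-best size per sentence.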
| Stat | Rank | Value |
|---|---|---|
| Incoming Citations | 10(11) | 218(209) |
| Outgoing Citations | 2997(4869) | 6(3) |
| PageRank | 140 | 831 |
| PageRank per Year | 23 | 166.2 |
| Citing sentences |
|---|
| W06-1607 1 32:162 To model $p(t,a \mid s)$, we use a standard loglinear approach: $p(t,a \mid s) \propto \exp\left[\sum_i \lambda_i f_i(s,t,a)\right]$ where each $f_i(s,t,a)$ is a feature function, and weights $\lambda_i$ are set using Och's algorithm (Och, 2003) to maximize the system's BLEU score (Papineni et al., 2001) on a development corpus. |
| W06-1607 2 65:162 In fact, a limitation of the experiments described in this paper is that the loglinear weights for the glass-box techniques were optimized for BLEU using Och's algorithm (Och, 2003), while the linear weights for black-box techniques were set heuristically. |
| P06-1028 3 115:242 Several non-linear objective functions, such as F-score for text classification (Gao et al. , 2003), and BLEU-score and some other evaluation measures for statistical machine translation (Och, 2003), have been introduced with reference to the framework of MCE criterion training. |
| P04-1059 4 57:233 For each feature function, there is a model parameter $\lambda_i$. The best word segmentation $W^*$ is determined by the decision rule as $W^* = \arg\max_W \mathrm{Score}(S, W, \lambda_0^M) = \arg\max_W \sum_{i=0}^{M} \lambda_i f_i(S, W)$ (2) Below we describe how to optimize the $\lambda$s. Our method is a discriminative approach inspired by the Minimum Error Rate Training method proposed in Och (2003). |
| P04-1059 5 219:233 An alternative to linear models is the log-linear models suggested by Och (2003). |
| P04-1059 6 55:233 It is also related to loglinear models for machine translation (Och, 2003). |
| P07-1039 7 103:170 4.3 Baseline We use a standard log-linear phrase-based statistical machine translation system as a baseline: GIZA++ implementation of IBM word alignment model 4 (Brown et al., 1993; Och and Ney, 2003),8 the refinement and phrase-extraction heuristics described in (Koehn et al., 2003), minimum-error-rate training [Footnote 7: More specifically, we choose the first English reference from the 7 references and the Chinese sentence to construct new sentence pairs.] |
| P07-1039 8 26:170 To quickly (and approximately) evaluate this phenomenon, we trained the statistical IBM word-alignment model 4 (Brown et al., 1993),1 using the GIZA++ software (Och and Ney, 2003) for the following language pairs: Chinese-English, Italian-English, and Dutch-English, using the IWSLT-2006 corpus (Takezawa et al., 2002; Paul, 2006) for the first two language pairs, and the Europarl corpus (Koehn, 2005) for the last one. |
| P07-1039 9 89:170 [Figure 2: Examples of entries from the manually developed dictionary (glosses: there is; want to; need not; in front of; as soon as; look at)] 4 Experimental Setting 4.1 Evaluation The intrinsic quality of word alignment can be assessed using the Alignment Error Rate (AER) metric (Och and Ney, 2003), that compares a system's alignment output to a set of gold-standard alignment. |
| P07-1039 10 109:170 [Table 2: Chinese-English corpus statistics (running words: 1,864 and 14,437; vocabulary size: 569 and 1,081)] (Och, 2003) using Phramer (Olteanu et al., 2006), a 3-gram language model with Kneser-Ney smoothing trained with SRILM (Stolcke, 2002) on the English side of the training data and Pharaoh (Koehn, 2004) with default settings to decode. |
| C04-1168 11 80:197 The Powell's algorithm used in this work is similar to the one from (Press et al., 2000) but we modified the line optimization codes, a subroutine of Powell's algorithm, with reference to (Och, 2003). |
| C04-1168 12 102:197 The training of IBM model 4 was implemented by the GIZA++ package (Och and Ney, 2003). |
| C04-1168 13 119:197 We adopted an N-best hypothesis approach (Och, 2003) to train. |
| C04-1168 14 184:197 Indeed, the proposed speech translation paradigm of log-linear models has been shown effective in many applications (Beyerlein, 1998) (Vergyri, 2000) (Och, 2003). |
| D08-1093 15 190:194 Moreover, rather than predicting an intrinsic metric such as the PARSEVAL F-score, the metric that the predictor learns to predict can be chosen to better fit the final metric on which an end-to-end system is measured, in the style of (Och, 2003). |
| W08-0328 16 40:74 This set of 800 sentences was used for Minimum Error Rate Training (Och, 2003) to tune the weights of our system with respect to BLEU score. |
| W08-0328 17 7:74 This setup provides an elegant solution to the fairly complex task of integrating multiple MT results that may differ in word order using only standard software modules, in particular GIZA++ (Och and Ney, 2003) for the identification of building blocks and Moses for the recombination, but the authors were not able to observe improvements in terms of BLEU score. [Footnote 1: see http://www.statmt.org/moses/] |
| D07-1103 18 26:214 To model $p(t,a \mid s)$, we use a standard loglinear approach: $p(t,a \mid s) \propto \exp\left[\sum_i \lambda_i f_i(s,t,a)\right]$ where each $f_i(s,t,a)$ is a feature function, and weights $\lambda_i$ are set using Och's algorithm (Och, 2003) to maximize the system's BLEU score (Papineni et al. |
| P07-1111 19 164:176 We want to avoid training a metric that assigns a higher than deserving score to a sentence that just happens to have many n-gram matches against the target-language reference corpus. [Footnote 5: Or, in a less adversarial setting, a system may be performing minimum error-rate training (Och, 2003).] |
| P07-1111 20 32:176 Metrics in the Rouge family allow for skip n-grams (Lin and Och, 2004a); Kauchak and Barzilay (2006) take paraphrasing into account; metrics such as METEOR (Banerjee and Lavie, 2005) and GTM (Melamed et al. , 2003) calculate both recall and precision; METEOR is also similar to SIA (Liu and Gildea, 2006) in that word class information is used. |
| W07-0716 21 10:171 Och (2003) introduced minimum error rate training (MERT), a technique for optimizing log-linear model parameters relative to a measure of translation quality. |
| W07-0716 22 36:171 Initial phrase pairs are identified following the procedure typically employed in phrase based systems (Koehn et al., 2003; Och and Ney, 2004). |
| W07-0716 23 41:171 Once training has taken place, minimum error rate training (Och, 2003) is used to tune the parameters $\lambda_i$. Finally, decoding in Hiero takes place using a CKY synchronous parser with beam search, augmented to permit efficient incorporation of language model scores (Chiang, 2007). |
| W08-0320 24 72:89 We set the feature weights by optimizing the Bleu score directly using minimum error rate training (Och, 2003) on the development set. |
| P07-1091 25 113:196 All the feature weights ($\lambda$s) were trained using our implementation of Minimum Error Rate Training (Och, 2003). |
| P07-1091 26 144:196 We use the Stanford parser (Klein and Manning, 2003) with its default Chinese grammar, the GIZA++ (Och and Ney, 2000) alignment package with its default settings, and the ME tool developed by (Zhang, 2004). |
| D07-1056 27 26:196 There have been considerable amount of efforts to improve the reordering model in SMT systems, ranging from the fundamental distance-based distortion model (Och and Ney, 2004; Koehn et al. , 2003), flat reordering model (Wu, 1996; Zens et al. , 2004; Kumar et al. , 2005), to lexicalized reordering model (Tillmann, 2004; Kumar et al. , 2005; Koehn et al. , 2005), hierarchical phrase-based model (Chiang, 2005), and maximum entropy-based phrase reordering model (Xiong et al. , 2006). |
| D07-1056 28 100:196 3.3 Features Similar to the default features in Pharaoh (Koehn, Och and Marcu 2003), we used following features to estimate the weight of our grammar rules. |
| D07-1056 29 152:196 Based on the word alignment results, if the aligned target words of any two adjacent foreign linguistic phrases can also be formed into two valid adjacent phrase according to constraints proposed in the phrase extraction algorithm by Och (2003a), they will be extracted as a reordering training sample. |
| D07-1056 30 149:196 6 Training Similar to most state-of-the-art phrase-based SMT systems, we use the SRI toolkit (Stolcke, 2002) for language model training and Giza++ toolkit (Och and Ney, 2003) for word alignment. |
| D07-1056 31 120:196 We just assign these rules a constant score trained using our implementation of Minimum Error Rate Training (Och, 2003b), which is 0.7 in our system. |
| W05-0820 32 61:91 (2004)), better language-specific preprocessing (Koehn and Knight, 2003) and restructuring (Collins et al. , 2005), additional feature functions such as word class language models, and minimum error rate training (Och, 2003) to optimize parameters. |
| W05-0820 33 31:91 The field of statistical machine translation has been blessed with a long tradition of freely available software tools such as GIZA++ (Och and Ney, 2003) and parallel corpora such as the Canadian Hansards2. |
| W05-0820 34 46:91 In addition, we also made a word alignment available, which was derived using a variant of the current default method for word alignment, Och and Ney (2003)'s refined method. |
| I08-4028 35 29:103 The decision rule here is: $W_0 = \arg\max_W \{\Pr(W \mid C)\} = \arg\max_W \left\{\sum_{m=1}^{M} \lambda_m h_m(W, C)\right\}$ (3) The parameters $\lambda_1^M$ of this model can be optimized by standard approaches, such as the Minimum Error Rate Training used in machine translation (Och, 2003). |
| N06-1003 36 19:146 2 The Problem of Coverage in SMT Statistical machine translation made considerable advances in translation quality with the introduction of phrase-based translation (Marcu and Wong, 2002; Koehn et al. , 2003; Och and Ney, 2004). |
| N06-1003 37 58:146 To set the weights, $\lambda_m$, we performed minimum error rate training (Och, 2003) on the development set using Bleu (Papineni et al., 2002) as the objective function. |
| W06-3119 38 32:125 Given a source sentence f, the preferred translation output is determined by computing the lowest-cost derivation (combination of hierarchical and glue rules) yielding f as its source side, where the cost of a derivation $R_1 \circ \cdots \circ R_n$ with respective feature vectors $v_1, \dots, v_n \in \mathbb{R}^m$ is given by $\sum_{i=1}^{m} \lambda_i \sum_{j=1}^{n} (v_j)_i$. Here, $\lambda_1, \dots, \lambda_m$ are the parameters of the loglinear model, which we optimize on a held-out portion of the training set (2005 development data) using minimum-error-rate training (Och, 2003). |
| P07-1089 39 136:179 To perform minimum error rate training (Och, 2003) to tune the feature weights to maximize the system's BLEU score on development set, we used the script optimizeV5IBMBLEU.m (Venugopal and Vogel, 2005). |
| P07-1089 40 137:179 We ran GIZA++ (Och and Ney, 2000) on the training corpus in both directions using its default setting, and then applied the refinement rule diag-and described in (Koehn et al., 2003) to obtain a single many-to-many word alignment for each sentence pair. |
| W06-1615 41 10:260 Furthermore, end-to-end systems like speech recognizers (Roark et al. , 2004) and automatic translators (Och, 2003) use increasingly sophisticated discriminative models, which generalize well to new data that is drawn from the same distribution as the training data. |
| N07-1063 42 125:163 Parameters used to calculate P(D) are trained using MER training (Och, 2003) on development data. |
| W06-3110 43 36:125 The model scaling factors $\lambda_1^M$ are trained with respect to the final translation quality measured by an error criterion (Och, 2003). |
| P07-1059 44 15:239 We present two approaches to SMT-based query expansion, both of which are implemented in the framework of phrase-based SMT (Och and Ney, 2004; Koehn et al. , 2003). |
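| P07-1059 45 61:239 4 SMT-Based Query Expansion Our SMT-based query expansion techniques are based on a recent implementation of the phrase-based SMT framework (Koehn et al., 2003; Och and Ney, 2004). |
| P07-1059 46 88:239 as follows: $p(\mathrm{syn}_1^I \mid \mathrm{trg}_1^I) = \left(\prod_{i=1}^{I} p_{\phi}(\mathrm{syn}_i \mid \mathrm{trg}_i)^{\lambda_{\phi}} \times p_{\phi'}(\mathrm{trg}_i \mid \mathrm{syn}_i)^{\lambda_{\phi'}} \times p_w(\mathrm{syn}_i \mid \mathrm{trg}_i)^{\lambda_w} \times p_{w'}(\mathrm{trg}_i \mid \mathrm{syn}_i)^{\lambda_{w'}} \times p_d(\mathrm{syn}_i, \mathrm{trg}_i)^{\lambda_d}\right) \times l_w(\mathrm{syn}_1^I)^{\lambda_l} \times c(\mathrm{syn}_1^I)^{\lambda_c} \times p_{LM}(\mathrm{syn}_1^I)^{\lambda_{LM}}$ (4) For estimation of the feature weights vector defined in equation (4) we employed minimum error rate (MER) training under the BLEU measure (Och, 2003). |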
| W08-0127 47 202:205 We also plan to employ this evaluation metric as feedback in building dialogue coherence models as is done in machine translation (Och, 2003). |
| W06-1608 48 42:168 The weights for these models are determined using the method described in (Och, 2003). |
| W08-0409 49 103:167 4.3 Baselines 4.3.1 Word Alignment We used the GIZA++ implementation of IBM word alignment model 4 (Brown et al., 1993; Och and Ney, 2003) for word alignment, and the heuristics described in (Och and Ney, 2003) to derive the intersection and refined alignment. |
| W08-0409 50 108:167 refinement and phrase-extraction heuristics described in (Koehn et al., 2003), minimum-error-rate training (Och, 2003), a trigram language model with Kneser-Ney smoothing trained with SRILM (Stolcke, 2002) on the English side of the training data, and Moses (Koehn et al., 2007) to decode. |
| W08-0409 51 110:167 Slightly differently from (Och and Ney, 2003), we use possible alignments in computing recall. |
| W08-0409 52 91:167 Since manual word alignment is an ambiguous task, we also explicitly allow for ambiguous alignments, i.e. the links are marked as sure (S) or possible (P) (Och and Ney, 2003). |
| P05-1069 53 21:243 Instead of directly minimizing error as in earlier work (Och, 2003), we decompose the decoding process into a sequence of local decision steps based on Eq. |
| P05-1069 54 30:243 2 Block Orientation Bigrams This section describes a phrase-based model for SMT similar to the models presented in (Koehn et al. , 2003; Och et al. , 1999; Tillmann and Xia, 2003). |
| P05-1069 55 228:243 As far as the log-linear combination of float features is concerned, similar training procedures have been proposed in (Och, 2003). |
| N07-1064 56 95:182 Feature function weights in the loglinear model are set using Och's minimum error rate algorithm (Och, 2003). |
| W07-0703 57 70:186 Weights on the loglinear features are set using Och's algorithm (Och, 2003) to maximize the system's BLEU score on a development corpus. |
| H05-1027 58 74:258 3.3 Grid Line Search Our implementation of a grid search is a modified version of that proposed in (Och 2003). |
| H05-1027 59 245:258 The line search is an extension of that described in (Och 2003; Quirk et al. 2005). |
| H05-1027 60 75:258 The modifications are made to deal with the efficiency issue due to the fact that there is a very large number of features and training samples in our task, compared to only 8 features used in (Och 2003). |
| N07-1005 61 14:194 For example, Och reported that the quality of MT results was improved by using automatic MT evaluation measures for the parameter tuning of an MT system (Och, 2003). |
| N07-1005 62 115:194 Many methods for calculating the similarity have been proposed (Niessen et al. , 2000; Akiba et al. , 2001; Papineni et al. , 2002; NIST, 2002; Leusch et al. , 2003; Turian et al. , 2003; Babych and Hartley, 2004; Lin and Och, 2004; Banerjee and Lavie, 2005; Gimenez et al. , 2005). |
| N07-1005 63 13:194 In recent years, many researchers have tried to automatically evaluate the quality of MT and improve the performance of automatic MT evaluations (Niessen et al. , 2000; Akiba et al. , 2001; Papineni et al. , 2002; NIST, 2002; Leusch et al. , 2003; Turian et al. , 2003; Babych and Hartley, 2004; Lin and Och, 2004; Banerjee and Lavie, 2005; Gimenez et al. , 2005) because improving the performance of automatic MT evaluation is expected to enable us to use and improve MT systems efficiently. |
| P06-1002 64 93:186 MT output was evaluated using the standard evaluation metric BLEU (Papineni et al. , 2002).2 The parameters of the MT System were optimized for BLEU metric on NIST MTEval2002 test sets using minimum error rate training (Och, 2003), and the systems were tested on NIST MTEval2003 test sets for both languages. |
| P06-1002 65 24:186 2 Related Work Starting with the IBM models (Brown et al. , 1993), researchers have developed various statistical word alignment systems based on different models, such as hidden Markov models (HMM) (Vogel et al. , 1996), log-linear models (Och and Ney, 2003), and similarity-based heuristic methods (Melamed, 2000). |
| D07-1055 66 95:198 However, as pointed out in (Och, 2003), there is no reason to believe that the resulting parameters are optimal with respect to translation quality measured with the Bleu score. |
| D07-1055 67 56:198 The current state-of-the-art is to use minimum error rate training (MERT) as described in (Och, 2003). |
| D07-1055 68 29:198 The current state-of-the-art is to optimize these parameters with respect to the final evaluation criterion; this is the so-called minimum error rate training (Och, 2003). |
| D07-1055 69 92:198 Therefore, (Och and Ney, 2002; Och, 2003) defined the translation candidate with the minimum word-error rate as pseudo reference translation. |
| D07-1055 70 13:198 We will show that some achieve significantly better results than the standard minimum error rate training of (Och, 2003). |
| D07-1055 71 117:198 Note that the minimum error rate training (Och, 2003) uses only the target sentence with the maximum posterior probability whereas, here, the whole probability distribution is taken into account. |
| W05-0814 72 50:74 We wish to minimize this error function, so we select $\lambda$ accordingly: $\operatorname{argmin}_{\lambda} \sum_{\tilde{a}} E(\tilde{a})\, \delta(\tilde{a}, (\operatorname{argmax}_{a} p_{\lambda}(a, f \mid e)))$ (4) Maximizing performance for all of the weights at once is not computationally tractable, but (Och, 2003) has described an efficient one-dimensional search for a similar problem. |
| W05-0814 73 56:74 The discriminative training regimen is otherwise similar to (Och, 2003). |
| W05-0814 74 20:74 We applied the union, intersection and refined symmetrization metrics (Och and Ney, 2003) to the final alignments output from training, as well as evaluating the two final alignments directly. |
| W05-0814 75 8:74 For symmetrization, we found that Och and Ney's refined technique described in (Och and Ney, 2003) produced the best AER for this data set under all experimental conditions. |
| W05-0814 76 7:74 The system used for baseline experiments is two runs of IBM Model 4 (Brown et al. , 1993) in the GIZA++ (Och and Ney, 2003) implementation, which includes smoothing extensions to Model 4. |
| C08-1014 77 36:197 By introducing the hidden word alignment variable a (Brown et al., 1993), the optimal translation can be searched for based on the following criterion: $e^* = \arg\max_{e,a} \left(\sum_{m=1}^{M} \lambda_m h_m(e, f, a)\right)$ (1) where $e$ is a string of phrases in the target language, $f$ is the source language string of phrases, $h_m(e, f, a)$ are feature functions, and weights $\lambda_m$ are typically optimized to maximize the scoring function (Och, 2003). |
| C08-1014 78 6:197 1 Introduction State-of-the-art Statistical Machine Translation (SMT) systems usually adopt a two-pass search strategy (Och, 2003; Koehn, et al., 2003) as shown in Figure 1. |
| C08-1014 79 37:197 Our MT baseline system is based on Moses decoder (Koehn et al., 2007) with word alignment obtained from GIZA++ (Och et al., 2003). |
| N06-3004 80 9:68 This is also true for reranking and discriminative training, where the k-best list of candidates serves as an approximation of the full set (Collins, 2000; Och, 2003; McDonald et al. , 2005). |
| W08-0309 81 104:288 The word alignments were created with Giza++ (Och and Ney, 2003) applied to a parallel corpus containing the complete Europarl training data, plus sets of 4,051 sentence pairs created by pairing the test sentences with the reference translations, and the test sentences paired with each of the system translations. |
| W08-0309 82 262:288 A large database of human judgments might also be useful as an objective function for minimum error rate training (Och, 2003) or in other system development tasks. |
| W08-0401 83 139:232 For tuning of decoder parameters, we conducted minimum error training (Och 2003) with respect to the BLEU score using 916 development sentence pairs. |
| W08-0401 84 14:232 One of the popular statistical machine translation paradigms is the phrase-based model (PBSMT) (Marcu et al., 2002; Koehn et al., 2003; Och et al., 2004). |
| W08-0401 85 136:232 For phrase-based translation model training, we used the GIZA++ toolkit (Och et al., 2003). |
| P07-1108 86 8:179 1 Introduction For statistical machine translation (SMT), phrase-based methods (Koehn et al., 2003; Och and Ney, 2004) and syntax-based methods (Wu, 1997; Alshawi et al., 2000; Yamada and Knight, 2001; Melamed, 2004; Chiang, 2005; Quirk et al., 2005; Mellebeek et al., 2006) outperform word-based methods (Brown et al., 1993). |
| P07-1108 87 96:179 We run the decoder with its default settings and then use Koehn's implementation of minimum error rate training (Och, 2003) to tune the feature weights on the development set. |
| W04-1513 88 14:222 By having the advantage of leveraging large parallel corpora, the statistical MT approach outperforms the traditional transfer based approaches in tasks for which adequate parallel corpora is available (Och, 2003). |
| N04-1023 89 161:201 In our experiments, we will use 4 different kinds of feature combinations: • Baseline: The 6 baseline features used in (Och, 2003), such as cost of word penalty, cost of aligned template penalty. |
| N04-1023 90 148:201 The minimum error training (Och, 2003) was used on the development data for parameter estimation. |
| N04-1023 91 153:201 Six features from (Och, 2003) were used as baseline features. |
| N04-1023 92 42:201 SMT Team (2003) also used minimum error training as in Och (2003), but used a large number of feature functions. |
| N04-1023 93 44:201 By reranking a 1000-best list generated by the baseline MT system from Och (2003), the BLEU (Papineni et al. , 2001) score on the test dataset was improved from 31.6% to 32.9%. |
| N04-1023 94 39:201 Och (2003) described the use of minimum error training directly optimizing the error rate on automatic MT evaluation metrics such as BLEU. |
| N04-1023 95 11:201 Recently so-called reranking techniques, such as maximum entropy models (Och and Ney, 2002) and gradient methods (Och, 2003), have been applied to machine translation (MT), and have provided significant improvements. |
| D07-1091 96 104:185 The feature weights $\lambda_i$ in the log-linear model are determined using a minimum error rate training method, typically Powell's method (Och, 2003). |
| W07-0717 97 30:158 To model $p(t,a \mid s)$, we use a standard loglinear approach: $p(t,a \mid s) \propto \exp\left[\sum_i \lambda_i f_i(s,t,a)\right]$ (1) where each $f_i(s,t,a)$ is a feature function, and weights $\lambda_i$ are set using Och's algorithm (Och, 2003) to maximize the system's BLEU score (Papineni et al., 2001) on a development corpus. |
| P08-1049 98 31:184 Moreover, our approach integrates the abbreviation translation component into the baseline system in a natural way, and thus is able to make use of the minimum-error-rate training (Och, 2003) to automatically adjust the model parameters to reflect the change of the integrated system over the baseline system. |
| P08-1049 99 104:184 Once we obtain the augmented phrase table, we should run the minimum-error-rate training (Och, 2003) with the augmented phrase table such that the model parameters are properly adjusted. |
| P08-1049 100 111:184 The feature functions are combined under a log-linear framework, and the weights are tuned by the minimum-error-rate training (Och, 2003) using BLEU (Papineni et al., 2002) as the optimization metric. |
| P08-1049 101 155:184 4.5.2 BLEU on NIST MT Test Sets We use MT02 as the development set4 for minimum error rate training (MERT) (Och, 2003). |
| N06-1002 102 167:217 Word alignments were produced by GIZA++ (Och and Ney 2003) with a standard training regimen of five iterations of Model 1, five iterations of the HMM Model, and five iterations of Model 4, in both directions. |
| N06-1002 103 171:217 Finally we trained model weights by maximizing BLEU (Och 2003) and set decoder optimization parameters (n-best list size, timeouts, etc.) on a development test set of 200 held-out sentences each with a single reference translation. |
| N06-1002 104 174:217 We used the heuristic combination described in (Och and Ney 2003) and extracted phrasal translation pairs from this combined alignment as described in (Koehn et al. , 2003). |
| N06-1002 105 176:217 Model weights were also trained following Och (2003). |
| P06-1091 106 27:210 The current approach does not use specialized probability features as in (Och, 2003) in any stage during decoder parameter training. |
| P06-1091 107 192:210 While error-driven training techniques are commonly used to improve the performance of phrase-based translation systems (Chiang, 2005; Och, 2003), this paper presents a novel block sequence translation approach to SMT that is similar to sequential natural language annotation problems such as part-of-speech tagging or shallow parsing, both in modeling and parameter training. |
| P06-1091 108 30:210 The novel algorithm differs computationally from earlier work in discriminative training algorithms for SMT (Och, 2003) as follows: • No computationally expensive N-best lists are generated during training: for each input sentence a single block sequence is generated on each iteration over the training data. |
| P06-1091 109 59:210 Although the training algorithm can handle real-valued features as used in (Och, 2003; Tillmann and Zhang, 2005) the current paper intentionally excludes them. |
| W07-0733 110 16:94 are combined in a log-linear model to obtain the score for the translation e for an input sentence f: $\mathrm{score}(e,f) = \exp \sum_i \lambda_i h_i(e,f)$ (1) The weights of the components $\lambda_i$ are set by a discriminative training method on held-out development data (Och, 2003). |
| E06-1032 111 118:157 The remaining six entries were all fully automatic machine translation systems; in fact, they were all phrase-based statistical machine translation systems that had been trained on the same parallel corpus and most used Bleu-based minimum error rate training (Och, 2003) to optimize the weights of their log linear models' feature functions (Och and Ney, 2002). |
| E06-1032 112 5:157 The statistical machine translation community relies on the Bleu metric for the purposes of evaluating incremental system changes and optimizing systems through minimum error rate training (Och, 2003). |
| E06-1032 113 156:157 For example, work which failed to detect improvements in translation quality with the integration of word sense disambiguation (Carpuat and Wu, 2005), or work which attempted to integrate syntactic information but which failed to improve Bleu (Charniak et al. , 2003; Och et al. , 2004) may deserve a second look with a more targeted manual evaluation. |
| P08-1059 114 63:185 The features are similar to the ones used in phrasal systems, and their weights are trained using max-BLEU training (Och, 2003). |
| W06-3122 115 15:91 It generates a vector of 5 numeric values for each phrase pair: phrase translation probability: $\phi(\bar{f} \mid \bar{e}) = \frac{\mathrm{count}(\bar{f}, \bar{e})}{\mathrm{count}(\bar{e})}$, $\phi(\bar{e} \mid \bar{f}) = \frac{\mathrm{count}(\bar{f}, \bar{e})}{\mathrm{count}(\bar{f})}$; lexical weighting (Koehn et al., 2003): $\mathrm{lex}(\bar{f} \mid \bar{e}, a) = \prod_{i=1}^{n} \frac{1}{\lvert\{j \mid (i,j) \in a\}\rvert} \sum_{(i,j) \in a} w(f_i \mid e_j)$, $\mathrm{lex}(\bar{e} \mid \bar{f}, a) = \prod_{j=1}^{m} \frac{1}{\lvert\{i \mid (i,j) \in a\}\rvert} \sum_{(i,j) \in a} w(e_j \mid f_i)$; phrase penalty: $\phi(\bar{f} \mid \bar{e}) = e$; $\log(\phi(\bar{f} \mid \bar{e})) = 1$. [Footnotes: 2 http://www.phramer.org/ Java-based open-source phrase based SMT system; 3 http://www.isi.edu/licensed-sw/carmel/; 4 http://www.speech.sri.com/projects/srilm/; 5 http://www.iccs.inf.ed.ac.uk/pkoehn/training.tgz] 2.2 Decoding We used the Pharaoh decoder for both the Minimum Error Rate Training (Och, 2003) and test dataset decoding. |
| W06-3122 116 91:91 The size of the development set used to generate 1 and 2 (1000 sentences) compensates the tendency of the unsmoothed MERT algorithm to overfit (Och, 2003) by providing a high ratio between number of variables and number of parameters to be estimated. |
| H05-1095 117 7:253 1 Introduction Possibly the most remarkable evolution of recent years in statistical machine translation is the step from word-based models to phrase-based models (Och et al. , 1999; Marcu and Wong, 2002; Yamada and Knight, 2002; Tillmann and Xia, 2003). |
| H05-1095 118 68:253 Instead, and as suggested by Och (2003), we chose to maximize directly the quality of the translations produced by the system, as measured with a machine translation evaluation metric. |
| H05-1095 119 116:253 A first family of libraries was based on a word alignment A, produced using the Refined method described in (Och and Ney, 2003) (combination of two IBM-Viterbi alignments): we call these the A libraries. |
| H05-1095 120 42:253 The first is to align the words using a standard word alignment technique, such as the Refined Method described in (Och and Ney, 2003) (the intersection of two IBM Viterbi alignments, forward and reverse, enriched with alignments from the union) and then generate bi-phrases by combining together individual alignments that co-occur in the same pair of sentences. |
| H05-1095 121 43:253 This is the strategy that is usually adopted in other phrase-based MT approaches (Zens and Ney, 2003; Och and Ney, 2004). |
| W06-3121 122 5:69 In this paper, we present Phramer, an open-source system that embeds a phrase-based decoder, a minimum error rate training (Och, 2003) module and various tools related to Machine Translation (MT). |
| W06-3121 123 24:69 The software also required GIZA++ word alignment tool (Och and Ney, 2003). |
| W06-3121 124 14:69 The MERT module is a highly modular, efficient and customizable implementation of the algorithm described in (Och, 2003). |
| H05-1098 125 42:140 The feature weights are learned by maximizing the BLEU score (Papineni et al., 2002) on held-out data, using minimum-error-rate training (Och, 2003) as implemented by Koehn. |
| H05-1098 126 67:140 5 Analysis Over the last few years, several automatic metrics for machine translation evaluation have been introduced, largely to reduce the human cost of iterative system evaluation during the development cycle (Lin and Och, 2004; Melamed et al. , 2003; Papineni et al. , 2002). |
| N04-1033 127 192:290 The model scaling factors are optimized on the development corpus with respect to mWER similar to (Och, 2003). |
| N04-1033 128 198:290 This method has the advantage that it is not limited to the model scaling factors as the method described in (Och, 2003). |
| N04-1033 129 22:290 Alternatively, one can train them with respect to the final translation quality measured by some error criterion (Och, 2003). |
| D07-1105 130 126:270 on test BLEU BP BLEU BP pair-CI 95% BLEU BP 3 01 03 32.98 0.92 33.03 0.93 [ -0.23, +0.34] 33.60 0.93 4 01 04 33.44 0.93 33.46 0.93 [ -0.26, +0.29] 34.97 0.94 5 01 05 33.07 0.92 33.14 0.93 [ -0.29, +0.43] 34.33 0.93 6 01 06 32.86 0.92 33.53 0.93 [+0.26, +1.08] 34.43 0.93 7 01 07 33.08 0.93 33.51 0.93 [+0.04, +0.82] 34.49 0.93 8 01 08 33.12 0.93 33.47 0.93 [ -0.06, +0.75] 34.50 0.94 9 01 09 33.15 0.93 33.22 0.93 [ -0.35, +0.51] 34.68 0.93 10 01 10 33.01 0.93 33.59 0.94 [+0.18, +0.96] 34.79 0.94 11 01 11 32.84 0.94 33.40 0.94 [+0.13, +0.98] 34.76 0.94 12 01 12 32.73 0.93 33.49 0.94 [+0.34, +1.18] 34.83 0.94 13 01 13 32.71 0.93 33.54 0.94 [+0.39, +1.26] 34.91 0.94 14 01 14 32.66 0.93 33.69 0.94 [+0.58, +1.47] 34.97 0.94 15 01 15 32.47 0.93 33.57 0.94 [+0.63, +1.57] 34.99 0.94 16 01 16 32.51 0.93 33.62 0.94 [+0.62, +1.59] 35.00 0.94 3.2 Non-Uniform System Prior Weights As pointed out in Section 2.1, a useful property of the MBR-like system selection method is that system prior weights can easily be trained using the Minimum Error Rate Training (Och, 2003). |
| D07-1105 131 53:270 Using the components of the row-vector $b_m$ as feature function values for the candidate translation $e_m$ ($m = 1, \dots, M$), the system prior weights can easily be trained using the Minimum Error Rate Training described in (Och, 2003). |
| D07-1105 132 16:270 For instance, word alignment models are often trained using the GIZA++ toolkit (Och and Ney, 2003); error minimizing training criteria such as the Minimum Error Rate Training (Och, 2003) are employed in order to learn feature function weights for log-linear models; and translation candidates are produced using phrase-based decoders (Koehn et al. , 2003) in combination with n-gram language models (Brants et al. , 2007). |
| D07-1105 133 158:270 Note that all systems were optimized using a non-deterministic implementation of the Minimum Error Rate Training described in (Och, 2003). |
| D07-1105 134 170:270 For instance, changing the training procedure for word alignment models turned out to be most beneficial; for details see (Och and Ney, 2003). |
| P07-2046 135 45:108 The weighting parameters of these features were optimized in terms of BLEU by the approach of minimum error rate training (Och, 2003). |
| P07-2046 136 8:108 1 Introduction Raw parallel data need to be preprocessed in the modern phrase-based SMT before they are aligned by alignment algorithms, one of which is the well-known tool, GIZA++ (Och and Ney, 2003), for training IBM models (1-4). |
| N07-1006 137 34:159 This type of direct optimization is known as Minimum Error Rate Training (Och, 2003) in the MT community, and is an essential component in building the state-of-art MT systems. |
| P08-1064 138 132:210 For the MER training (Och, 2003), we modified Koehn's MER trainer (Koehn, 2004) for our tree sequence-based system. |
| P08-1064 139 8:210 1 Introduction Phrase-based modeling method (Koehn et al., 2003; Och and Ney, 2004a) is a simple, but powerful mechanism to machine translation since it can model local reorderings and translations of multiword expressions well. |
| W08-0310 140 17:96 These fourteen scores are weighted and linearly combined (Och and Ney, 2002; Och, 2003); their respective weights are learned on development data so as to maximize the BLEU score. |
| W08-0310 141 15:96 translation systems (Och and Ney, 2004; Koehn et al., 2003) and use Moses (Koehn et al., 2007) to search for the best target sentence. |
| N06-1032 142 119:167 Minimum-error-rate training was done using Koehn's implementation of Och's (2003) minimum-error-rate model. |
| N06-1032 143 100:167 number of words in target string These statistics are combined into a log-linear model whose parameters are adjusted by minimum error rate training (Och, 2003). |
| N06-1032 144 18:167 (2003), and component weights are adjusted by minimum error rate training (Och, 2003). |
| N06-1032 145 5:167 1 Introduction Recent approaches to statistical machine translation (SMT) piggyback on the central concepts of phrase-based SMT (Och et al., 1999; Koehn et al., 2003) and at the same time attempt to improve some of its shortcomings by incorporating syntactic knowledge in the translation process. |
| W07-0403 146 26:234 The surface heuristic can define consistency according to any word alignment; but most often, the alignment is provided by GIZA++ (Och and Ney, 2003). |
| W07-0403 147 200:234 Weights for the log-linear model are set using the 500-sentence tuning set provided for the shared task with minimum error rate training (Och, 2003) as implemented by Venugopal and Vogel (2005). |
| W07-0403 148 29:234 Many-to-many alignments can be created by combining two GIZA++ alignments, one where English generates Foreign and another with those roles reversed (Och and Ney, 2003). |
| W07-0403 149 178:234 We report precision, recall and balanced F-measure (Och and Ney, 2003). |
| D08-1089 150 130:184 Parameters were tuned with minimum error-rate training (Och, 2003) on the NIST evaluation set of 2006 (MT06) for both C-E and A-E. |
| D08-1089 151 7:184 1 Introduction Statistical phrase-based systems (Och and Ney, 2004; Koehn et al., 2003) have consistently delivered state-of-the-art performance in recent machine translation evaluations, yet these systems remain weak at handling word order changes. |
| D08-1076 152 13:206 A class of training criteria that provides a tighter connection between the decision rule and the final error metric is known as Minimum Error Rate Training (MERT) and has been suggested for SMT in (Och, 2003). |
| D08-1076 153 45:206 The upper envelope is a convex hull and can be inscribed with a convex polygon whose edges are the segments of a piecewise linear function in $\gamma$ (Papineni, 1999; Och, 2003): $\mathrm{Env}(f) = \max_{e \in C} \{ a(e,f) + \gamma \cdot b(e,f) : \gamma \in \mathbb{R} \}$ (6) Figure 1: The upper envelope (bold, red curve) for a set of lines is the convex hull which consists of the topmost line segments. |
| D08-1076 154 19:206 Assuming that the corpus-based error count for some translations $e_1^S$ is additively decomposable into the error counts of the individual sentences, i.e., $E(r_1^S, e_1^S) = \sum_{s=1}^{S} E(r_s, e_s)$, the MERT criterion is given as: $\hat{\lambda}_1^M = \arg\min_{\lambda_1^M} \{ \sum_{s=1}^{S} E(r_s, \hat{e}(f_s; \lambda_1^M)) \}$ (3) $= \arg\min_{\lambda_1^M} \{ \sum_{s=1}^{S} \sum_{k=1}^{K} E(r_s, e_{s,k}) \, \delta(\hat{e}(f_s; \lambda_1^M), e_{s,k}) \}$ with $\hat{e}(f_s; \lambda_1^M) = \arg\max_{e} \{ \sum_{m=1}^{M} \lambda_m h_m(e, f_s) \}$ (4) In (Och, 2003), it was shown that linear models can effectively be trained under the MERT criterion using a special line optimization algorithm. |
| D08-1076 155 41:206 Starting from an initial point $\lambda_1^M$, computing the most probable sentence hypothesis out of a set of K candidate translations $C_s = \{e_1, \dots, e_K\}$ along the line $\lambda_1^M + \gamma \cdot d_1^M$ results in the following optimization problem (Och, 2003): $\hat{e}(f_s; \gamma) = \arg\max_{e \in C_s} \{ (\lambda_1^M + \gamma \cdot d_1^M)^{\top} \cdot h_1^M(e, f_s) \} = \arg\max_{e \in C_s} \{ \underbrace{\sum_m \lambda_m h_m(e, f_s)}_{= a(e, f_s)} + \gamma \cdot \underbrace{\sum_m d_m h_m(e, f_s)}_{= b(e, f_s)} \} = \arg\max_{e \in C_s} \{ \underbrace{a(e, f_s) + \gamma \cdot b(e, f_s)}_{(\ast)} \}$ (5) Hence, the total score $(\ast)$ for any candidate translation corresponds to a line in the plane with $\gamma$ as the independent variable. |
| D08-1076 156 150:206 6 Related Work As suggested in (Och, 2003), an alternative method for the optimization of the unsmoothed error count is Powell's algorithm combined with a grid-based line optimization (Press et al., 2007, p. 509). |
| W07-0729 157 52:159 Feature weight tuning was carried out using minimum error rate training, maximizing BLEU scores on a held-out development set (Och, 2003). |
| N06-1013 158 8:176 Maximum entropy (ME) models have been used in bilingual sense disambiguation, word reordering, and sentence segmentation (Berger et al. , 1996), parsing, POS tagging and PP attachment (Ratnaparkhi, 1998), machine translation (Och and Ney, 2002), and FrameNet classification (Fleischman et al. , 2003). |
| N06-1013 159 145:176 The parameters of the MT system were optimized on MTEval02 data using minimum error rate training (Och, 2003). |
| N06-1013 160 155:176 In a later study, Och and Ney (2003) present a loglinear combination of the HMM and IBM Model 4 that produces better alignments than either of those. |
| N06-1013 161 7:176 1 Introduction Word alignment (detection of corresponding words between two sentences that are translations of each other) is usually an intermediate step of statistical machine translation (MT) (Brown et al., 1993; Och and Ney, 2003; Koehn et al., 2003), but also has been shown useful for other applications such as construction of bilingual lexicons, word-sense disambiguation, projection of resources, and cross-language information retrieval. |
| W08-0316 162 38:76 Word alignments were generated using GIZA++ (Och and Ney, 2003) over a stemmed version of the parallel text. |
| W08-0316 163 41:76 3.1 System Tuning Minimum error training (Och, 2003) under BLEU (Papineni et al., 2001) was used to optimise the feature weights of the decoder with respect to the dev2006 development set. |
| P08-2010 164 11:99 This shows that hypothesis features are either not discriminative enough, or that the reranking model is too weak. This performance gap can be mainly attributed to two problems: optimization error and modeling error (see Figure 1). Much work has focused on developing better algorithms to tackle the optimization problem (e.g. MERT (Och, 2003)), since MT evaluation metrics such as BLEU and PER are riddled with local minima and are difficult to differentiate with respect to re-ranker parameters. |
| W06-3115 165 16:84 Feature function scaling factors m are optimized based on a maximum likelihood approach (Och and Ney, 2002) or on a direct error minimization approach (Och, 2003). |
| W06-3115 166 23:84 First, many-to-many word alignments are induced by running a one-to-many word alignment model, such as GIZA++ (Och and Ney, 2003), in both directions and by combining the results based on a heuristic (Och and Ney, 2004). |
| W06-3115 167 50:84 For each differently tokenized corpus, we computed word alignments by a HMM translation model (Och and Ney, 2003) and by a word alignment refinement heuristic of grow-diag-final (Koehn et al., 2003). |
| P08-1087 168 101:143 A Greek model was trained on 440,082 aligned sentences of Europarl v.3, tuned with Minimum Error Training (Och, 2003). |
| P07-1037 169 96:160 The bidirectional word alignment is used to obtain lexical phrase translation pairs using heuristics presented in (Och & Ney, 2003) and (Koehn et al., 2003). |
| P07-1037 170 117:160 The NIST MT03 test set is used for development, particularly for optimizing the interpolation weights using Minimum Error Rate training (Och, 2003). |
| P07-1037 171 41:160 Firstly, rather than induce millions of xRS rules from parallel data, we extract phrase pairs in the standard way (Och & Ney, 2003) and associate with each phrase-pair a set of target language syntactic structures based on supertag sequences. |
| P07-1037 172 50:160 The bidirectional word alignment is used to obtain phrase translation pairs using heuristics presented in (Och & Ney, 2003) and (Koehn et al., 2003), and the Moses decoder was used for phrase extraction and decoding. Let t and s be the target and source language sentences respectively. [Footnote 2: http://www.fjoch.com/GIZA++.html] |
| C08-1125 173 117:218 Then we use both Moses decoder and its suppo We run the decoder with its d then use Moses' implementation of minimum error rate training (Och, 2003) to tune the feature weights on the development set. |
| C08-1125 174 55:218 3 Baseline MT System The phrase-based SMT system used in our experiments is Moses, phrase translation pro ing probabilities, and languag ties are combined in the log-linear model to obtain the best translation $e_{best}$ of the source sentence $f$: $e_{best} = \arg\max_{e} p(e \mid f) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)$ (2) The weights are set by a discriminative training method using a held-out data set as described in (Och, 2003). |
| N07-2022 175 14:92 In order to improve translation quality, this tuning can be effectively performed by minimizing translation error over a development corpus for which manually translated references are available (Och, 2003). |
| N07-2022 176 17:92 Unsupervised systems (Och and Ney, 2003; Liang et al. , 2006) are based on generative models trained with the EM algorithm. |
| H05-1087 177 15:127 This is analogous, and in a certain sense equivalent, to empirical risk minimization, which has been used successfully in related areas, such as speech recognition (Rahim and Lee, 1997), language modeling (Paciorek and Rosenfeld, 2000), and machine translation (Och, 2003). |
| W08-0336 178 49:196 We tuned the parameters of these features with Minimum Error Rate Training (MERT) (Och, 2003) on the NIST MT03 Evaluation data set (919 sentences), and then test the MT performance on NIST MT03 and MT05 Evaluation data (878 and 1082 sentences, respectively). |
| W08-0336 179 46:196 We build phrase translations by first acquiring bidirectional GIZA++ (Och and Ney, 2003) alignments, and using Moses grow-diag alignment symmetrization heuristic. We set the maximum phrase length to a large value (10), because some segmenters described later in this paper will result in shorter [Footnote 1: In our experiments, this heuristic consistently performed better than the default, grow-diag-final.] |
| D07-1030 180 87:173 We run the decoder with its default settings (maximum phrase length 7) and then use Koehn's implementation of minimum error rate training (Och, 2003) to tune the feature weights on the de [Footnote 2: The full name of HTRDP is National High Technology Research and Development Program of China, also named as 863 Program.] |
| D07-1030 181 34:173 SMT has evolved from the original word-based approach (Brown et al., 1993) into phrase-based approaches (Koehn et al., 2003; Och and Ney, 2004) and syntax-based approaches (Wu, 1997; Alshawi et al., 2000; Yamada and Knight, 2001; Chiang, 2005). |
| W05-0822 182 35:90 To set weights on the components of the loglinear model, we implemented Och's algorithm (Och, 2003). |
| W07-0730 183 81:108 Unfortunately, longer sentences (up to 100 tokens, rather than 40), longer phrases (up to 10 tokens, rather than 7), two LMs (rather than just one), higher-order LMs (order 7, rather than 3), multiple higher-order lexicalized re-ordering models (up to 3), etc. all contributed to increased system's complexity, and, as a result, time limitations prevented us from performing minimum-error-rate training (MERT) (Och, 2003) for ucb3, ucb4 and ucb5. |
| N07-1007 184 48:188 (2003), a trigram target language model, an order model, word count, phrase count, average phrase size functions, and whole-sentence IBM Model 1 logprobabilities in both directions (Och et al. 2004). |
| N07-1007 185 49:188 The weights of these models are determined using the max-BLEU method described in Och (2003). |
| N07-1007 186 8:188 Most state-of-the-art SMT systems treat grammatical elements in exactly the same way as content words, and rely on general-purpose phrasal translations and target language models to generate these elements (e.g., Och and Ney, 2002; Koehn et al., 2003; Quirk et al., 2005; Chiang, 2005; Galley et al., 2006). |
| P06-1066 187 10:243 One is distortion model (Och and Ney, 2004; Koehn et al. , 2003) which penalizes translations according to their jump distance instead of their content. |
| P06-1066 188 86:243 The k-best list is very important for the minimum error rate training (Och, 2003a) which is used for tuning the weights for our model. |
| P06-1066 189 130:243 Line 4 and 5 are similar to the phrase extraction algorithm by Och (2003b). |
| P08-1086 190 120:148 The weights are trained using minimum error rate training (Och, 2003) with BLEU score as the objective function. |
| P07-2045 191 29:103 Moses uses standard external tools for some of the tasks to avoid duplication, such as GIZA++ (Och and Ney 2003) for word alignments and SRILM for language modeling. |
| P07-2045 192 28:103 It also contains tools for tuning these models using minimum error rate training (Och 2003) and evaluating the resulting translations using the BLEU score (Papineni et al. 2002). |
| D08-1060 193 119:222 The standard Minimum Error Rate training (Och, 2003) was applied to tune the weights for all feature types. |
| D08-1060 194 129:222 We use MER (Och, 2003) to tune the decoder's parameters using a development data set. |
| W07-0731 195 44:149 The feature weights i are trained in concert with the LM weight via minimum error rate (MER) training (Och, 2003). |
| P08-1114 196 36:172 Each i is a weight associated with feature i, and these weights are typically optimized using minimum error rate training (Och, 2003). |
| P06-1032 197 42:159 N-best results for phrasal alignment and ordering models in the decoder were optimized by lambda training via Maximum Bleu, along the lines described in (Och, 2003). |
| P05-1057 198 46:247 In order to incorporate a new dependency which contains extra information other than the bilingual sentence pair, we modify Eq.2 by adding a new variable v: $\Pr(a \mid e,f,v) = \frac{\exp[\sum_{m=1}^{M} \lambda_m h_m(a,e,f,v)]}{\sum_{a'} \exp[\sum_{m=1}^{M} \lambda_m h_m(a',e,f,v)]}$ (4) Accordingly, we get a new decision rule: $\hat{a} = \arg\max_{a} \{ \sum_{m=1}^{M} \lambda_m h_m(a,e,f,v) \}$ (5) Note that our log-linear models are different from Model 6 proposed by Och and Ney (2003), which defines the alignment problem as finding the alignment a that maximizes $\Pr(f, a \mid e)$ given e. 3 Feature Functions In this paper, we use IBM translation Model 3 as the base feature of our log-linear models. |
| P05-1057 199 135:247 After that, we used three types of methods for performing a symmetrization of IBM models: intersection, union, and refined methods (Och and Ney, 2003). |
| P05-1057 200 22:247 Och and Ney (2003) proposed Model 6, a log-linear combination of IBM translation models and HMM model. |
| P05-1057 201 132:247 We used GIZA++ package (Och and Ney, 2003) to train IBM translation models. |
| P05-1057 202 168:247 It is promising to optimize the model parameters directly with respect to AER as suggested in statistical machine translation (Och, 2003). |
| P05-1057 203 14:247 Studies reveal that statistical alignment models outperform the simple Dice coefficient (Och and Ney, 2003). |
| W08-0306 204 84:125 After maximum BLEU tuning (Och, 2003a) on a held-out tuning set, we evaluate translation quality on a held-out test set. |
| W08-0306 205 6:125 GIZA++ (Och and Ney, 2003), an implementation of the IBM (Brown et al., 1993) and HMM (?) |
| W08-0306 206 18:125 We show that link [Footnote 1: For a complete discussion of alignment symmetrization heuristics, including union, intersection, and refined, refer to (Och and Ney, 2003).] |
| W08-0306 207 86:125 3.2 Evaluation Metrics AER (Alignment Error Rate) (Och and Ney, 2003) is the most widely used metric of alignment quality, but requires gold-standard alignments labelled with sure/possible annotations to compute; lacking such annotations, we can compute alignment F-measure instead. |
| W08-0306 208 9:125 GIZA++ refined alignments have been used in state-of-the-art phrase-based statistical MT systems such as (Och, 2004); variations on the refined heuristic have been used by (Koehn et al., 2003) (diag and diag-and) and by the phrase-based system Moses (grow-diag-final) (Koehn et al., 2007). |
| W08-0306 209 99:125 The feature weights are tuned using minimum error rate training (Och and Ney, 2003) to optimize BLEU score on a held-out development set. |
| W06-1606 210 101:175 The weights of the models are computed automatically using a variant of the Maximum Bleu training procedure proposed by Och (2003). |
| W06-1606 211 114:175 We concatenate the lists and we learn a new combination of weights that maximizes the Bleu score of the combined nbest list using the same development corpus we used for tuning the individual systems (Och, 2003). |
| W06-1606 212 108:175 The decoder is capable of producing nbest derivations and nbest lists (Knight and Graehl, 2005), which are used for Maximum Bleu training (Och, 2003). |
| W06-1606 213 4:175 1 Introduction During the last four years, various implementations and extensions to phrase-based statistical models (Marcu and Wong, 2002; Koehn et al., 2003; Och and Ney, 2004) have led to significant increases in machine translation accuracy. |
| W08-0321 214 24:99 Instead of interpolating the two language models, we explicitly used them in the decoder and optimized their weights via minimum-error-rate (MER) training (Och, 2003). |
| W08-0321 215 15:99 Following the guidelines of the workshop we built baseline systems, using the lower-cased Europarl parallel corpus (restricting sentence length to 40 words), GIZA++ (Och and Ney, 2003), Moses (Koehn et al., 2007), and the SRI LM toolkit (Stolcke, 2002) to build 5-gram LMs. |
| W08-0321 216 33:99 For example, in IBM Model 1 the lexicon probability of source word f given target word e is calculated as (Och and Ney, 2003): $p(f \mid e) = \frac{\sum_{k} c(f \mid e; e^k, f^k)}{\sum_{k,f} c(f \mid e; e^k, f^k)}$ (1) $c(f \mid e; e^k, f^k) = \sum_{e^k, f^k} P(e^k, f^k) \sum_{a} P(a \mid e^k, f^k) \sum_{j} \delta(f, f^k_j) \, \delta(e, e^k_{a_j})$ (2) Therefore, the distribution of $P(e^k, f^k)$ will affect the alignment results. |
| J06-4002 217 253:281 Furthermore, statistical generation systems (Lapata 2003; Barzilay and Lee 2004; Karamanis and Manurung 2002; Mellish et al. 1998) could use as a means of directly optimizing information ordering, much in the same way MT systems optimize model parameters using BLEU as a measure of translation quality (Och 2003). |
| C04-1030 218 21:215 Alternatively, one can train them with respect to the final translation quality measured by some error criterion (Och, 2003). |
| W05-0833 219 52:152 In order to create the necessary SMT language and translation models, they used: Giza++ (Och & Ney, 2003) [http://www.isi.edu/och/Giza++.html]; the CMU-Cambridge statistical toolkit [http://mi.eng.cam.ac.uk/prc14/toolkit.html]; the ISI ReWrite Decoder [http://www.isi.edu/licensed-sw/rewrite-decoder/]. Translation was performed from English-French and French-English, and the resulting translations were evaluated using a range of automatic metrics: BLEU (Papineni et al., 2002), Precision and Recall (Turian et al., 2003), and Word- and Sentence Error Rates. |
| W05-0833 220 77:152 Accordingly, in this section we describe a set of experiments which extends the work of (Way and Gough, 2005) by evaluating the Marker-based EBMT system of (Gough & Way, 2004b) against a phrase-based SMT system built using the following components: Giza++, to extract the word-level correspondences; The Giza++ word alignments are then refined and used to extract phrasal alignments ((Och & Ney, 2003); or (Koehn et al. , 2003) for a more recent implementation); Probabilities of the extracted phrases are calculated from relative frequencies; The resulting phrase translation table is passed to the Pharaoh phrase-based SMT decoder which along with SRI language modelling toolkit5 performs translation. |
| W05-0833 221 47:152 (Koehn et al. , 2003); (Och, 2003)). |
| W07-0711 222 172:235 In the experiment, only the first 500 sentences were used to train the log-linear model weight vector, where minimum error rate (MER) training was used (Och, 2003). |
| W08-0510 223 47:155 GIZA++ (Och and Ney 2003) is a very popular system within SMT for creating word alignment from parallel corpus, in fact, the Moses training scripts uses it. |
| W08-0510 224 138:155 These include scripts for creating alignments from a parallel corpus, creating phrase tables and language models, binarizing phrase tables, scripts for weight optimization using MERT (Och 2003), and testing scripts. |
| W08-0319 225 28:77 We use the minimum-error rate training procedure by Och (2003) as implemented in the Moses toolkit to set the weights of the various translation and language models, optimizing for BLEU. |
| D08-1088 226 11:219 This operation can be used in applications like Minimum Error Rate Training (Och, 2003), or optimizing system combination as described by Hillard et al. |
| P06-2101 227 115:219 To find the optimal coefficients for a loglinear combination of these experts, we use separate development data, using the following procedure due to Och (2003): 1. |
| P06-2101 228 44:219 Och (2003) found that such smoothing during training gives almost identical results on translation metrics. |
| P06-2101 229 40:219 Och (2003) observed, however, that the piecewise-constant property could be exploited to characterize the function exhaustively along any line in parameter space, and hence to minimize it globally along that line. |
| P06-2101 230 18:219 Despite these difficulties, some work has shown it worthwhile to minimize error directly (Och, 2003; Bahl et al. , 1988). |
| P07-1092 231 106:201 The parameters, j, were trained using minimum error rate training (Och, 2003) to maximise the BLEU score (Papineni et al. , 2002) on a 150 sentence development set. |
| P07-1092 232 108:201 The translation models and lexical scores were estimated on the training corpus which was automatically aligned using Giza++ (Och et al., 1999) in both directions between source and target and symmetrised using the growing heuristic (Koehn et al., 2003). |
| P07-1092 233 94:201 As an alternative to linear interpolation, we also employ a weighted product for phrase-table combination: $p(s \mid t) \propto \prod_{j} p_j(s \mid t)^{\lambda_j}$ (3) This has the same form used for log-linear training of SMT decoders (Och, 2003), which allows us to treat each distribution as a feature, and learn the mixing weights automatically. |
| P07-1092 234 34:201 A single translation is then selected by finding the candidate that yields the best overall score (Och and Ney, 2001; Utiyama and Isahara, 2007) or by cotraining (Callison-Burch and Osborne, 2003). |
| P07-1005 235 120:177 6.1 Hiero Results Using the MT 2002 test set, we ran the minimum-error rate training (MERT) (Och, 2003) with the decoder to tune the weights for each feature. |
| P07-1005 236 12:177 To perform translation, state-of-the-art MT systems use a statistical phrase-based approach (Marcu and Wong, 2002; Koehn et al. , 2003; Och and Ney, 2004) by treating phrases as the basic units of translation. |
| I08-1030 237 37:242 2 Phrase-based statistical machine translation Phrase-based SMT uses a framework of log-linear models (Och, 2003) to integrate multiple features. |
| I08-1030 238 9:242 In the training phase, bilingual parallel sentences are preprocessed and aligned using alignment algorithms or tools such as GIZA++ (Och and Ney, 2003). |
| N06-1004 239 8:208 1 Introduction: Defining SCMs The work presented here was done in the context of phrase-based MT (Koehn et al. , 2003; Och and Ney, 2004). |
| N06-1004 240 166:208 Weights on the components were assigned using the (Och, 2003) method for max-BLEU training on the development set. |
| J04-4002 241 144:482 A comparison of the two approaches can be found in Koehn, Och, and Marcu (2003). |
| J04-4002 242 78:482 (1993) and Och and Ney (2003). |
| J04-4002 243 84:482 The alignment $\hat{a}_1^J$ that has the highest probability (under a certain model) is also called the Viterbi alignment (of that model): $\hat{a}_1^J = \arg\max_{a_1^J} p(f_1^J, a_1^J \mid e_1^I)$ (8) A detailed comparison of the quality of these Viterbi alignments for various statistical alignment models compared to human-made word alignments can be found in Och and Ney (2003). |
| J04-4002 244 68:482 An alternative training criterion therefore directly optimizes translation quality as measured by an automatic evaluation criterion (Och 2003). |
| J04-4002 245 37:482 Looking at the results of the recent machine translation evaluations, this approach seems currently to give the best results, and an increasing number of researchers are working on different methods for learning phrase translation lexica for machine translation purposes (Marcu and Wong 2002; Venugopal, Vogel, and Waibel 2003; Tillmann 2003; Koehn, Och, and Marcu 2003). |
| J04-4002 246 197:482 An efficient algorithm for performing this tuning for a larger number of model parameters can be found in Och (2003). |
| W05-0904 247 85:146 The translations were generated by the alignment template system of Och (2003). |
| P07-1024 248 133:195 This setting is reminiscent of the problem of optimizing feature weights for reranking of candidate machine translation outputs, and we employ an optimization technique similar to that used by Och (2003) for machine translation. |
| P07-2026 249 37:101 The model scaling factors $\lambda_1^M$ are optimized with respect to the BLEU score as described in (Och, 2003). |
| W07-0702 250 42:308 The factored translation model combines features in a log-linear fashion (Och, 2003). |
| P06-1077 251 111:252 5.1 Pharaoh The baseline system we used for comparison was Pharaoh (Koehn et al., 2003; Koehn, 2004), a freely available decoder for phrase-based translation models: $p(e \mid f) = p_{\phi}(f \mid e)^{\lambda_{\phi}} \times p_{LM}(e)^{\lambda_{LM}} \times p_{D}(e,f)^{\lambda_{D}} \times \omega^{\mathrm{length}(e) \lambda_{W(e)}}$ (10) We ran GIZA++ (Och and Ney, 2000) on the training corpus in both directions using its default setting, and then applied the refinement rule diag-and described in (Koehn et al., 2003) to obtain a single many-to-many word alignment for each sentence pair. |
| P06-1077 252 114:252 To perform minimum error rate training (Och, 2003) to tune the feature weights to maximize the system's BLEU score on development set, we used optimizeV5IBMBLEU.m (Venugopal and Vogel, 2005). |
| P06-1077 253 7:252 1 Introduction Phrase-based translation models (Marcu and Wong, 2002; Koehn et al., 2003; Och and Ney, 2004), which go beyond the original IBM translation models (Brown et al., 1993) by modeling translations of phrases rather than individual words, have been suggested to be the state-of-the-art in statistical machine translation by empirical evaluations. |
| D07-1007 254 131:218 The loglinear model weights are learned using Chiang's implementation of the maximum BLEU training algorithm (Och, 2003), both for the baseline, and the WSD-augmented system. |
| D07-1007 255 129:218 The phrase bilexicon is derived from the intersection of bidirectional IBM Model 4 alignments, obtained with GIZA++ (Och and Ney, 2003), augmented to improve recall using the grow-diag-final heuristic. |
| W06-3103 256 40:183 The model scaling factors $\lambda_1^M$ are trained with respect to the final translation quality measured by an error criterion (Och, 2003). |
| N07-2053 257 46:103 Finally, to estimate the parameters i of the weighted linear model, we adopt the popular minimum error rate training procedure (Och, 2003) which directly optimizes translation quality as measured by the BLEU metric. |
| W07-0401 258 98:352 Here, we train word alignments in both directions with GIZA++ (Och and Ney, 2003). |
| W07-0401 259 61:352 Alternatively, one can train them with respect to the final translation quality measured by an error criterion (Och, 2003). |
| D07-1006 260 163:193 (Och and Ney, 2003) invented heuristic symmetrization of the output of a 1-to-N model and a M-to-1 model resulting in a M-to-N alignment, this was extended in (Koehn et al., 2003). [Interleaved table residue, Table 3: Experimental Results. Columns: system; French/English F-measure ( = 0.4) and BLEU; Arabic/English F-measure ( = 0.1) and BLEU. GIZA++: 73.5, 30.63, 75.8, 51.55. (Fraser and Marcu, 2006b): 74.1, 31.40, 79.1, 52.89. LEAF unsupervised: 74.5, 72.3. LEAF semi-supervised: 76.3, 31.86, 84.5, 54.34.] |
| D07-1006 261 86:193 (Och and Ney, 2003) discussed efficient implementation. |
| D07-1006 262 112:193 For all non-LEAF systems, we take the best performing of the union, refined and intersection symmetrization heuristics (Och and Ney, 2003) to combine the 1-to-N and M-to-1 directions resulting in a M-to-N alignment. |
| D07-1006 263 69:193 2.2 Unsupervised Parameter Estimation We can perform maximum likelihood estimation of the parameters of this model in a similar fashion to that of Model 4 (Brown et al. , 1993), described thoroughly in (Och and Ney, 2003). |
| D07-1006 264 111:193 4.2 Experiments To build all alignment systems, we start with 5 iterations of Model 1 followed by 4 iterations of HMM (Vogel et al. , 1996), as implemented in GIZA++ (Och and Ney, 2003). |
| D07-1006 265 70:193 We use Viterbi training (Brown et al. , 1993) but neighborhood estimation (Al-Onaizan et al. , 1999; Och and Ney, 2003) or pegging (Brown et al. , 1993) could also be used. |
| D07-1006 266 61:193 (Och and Ney, 2003) presented results suggesting that the additional parameters required to ensure that a model is not deficient result in inferior performance, but we plan to study whether this is the case for our generative model in future work. |
| D07-1006 267 128:193 For French/English translation we use a state of the art phrase-based MT system similar to (Och and Ney, 2004; Koehn et al. , 2003). |
| D07-1006 268 181:193 Our work is most similar to work using discriminative log-linear models for alignment, which is similar to discriminative log-linear models used for the SMT decoding (translation) problem (Och and Ney, 2002; Och, 2003). |
| H05-1034 269 69:160 MSR thus adopts the method proposed by Och (2003). |
| D07-1038 270 119:170 We obtain weights for the combinations of the features by performing minimum error rate training (Och, 2003) on held-out data. |
| P08-2041 271 69:104 We perform minimum-error-rate training (Och, 2003) to tune the feature weights of the translation model to maximize the BLEU score on development set. |
| I08-2088 272 68:145 3.2.2 Features We used eight features (Och and Ney, 2003; Koehn et al., 2003) and their weights for the translations. |
| I08-2088 273 67:145 We used the preprocessed data to train the phrase-based translation model by using GIZA++ (Och and Ney, 2003) and the Pharaoh tool kit (Koehn et al., 2003). |
| I08-2088 274 77:145 Target language model probability (weight = 0.5) According to a previous study, the minimum error rate training (MERT) (Och, 2003), which is the optimization of feature weights by maximizing the BLEU score on the development set, can improve the performance of a system. |
| W05-1506 275 23:254 For example, Och (2003) shows how to train a log-linear translation model not by maximizing the likelihood of training data, but maximizing the BLEU score (among other metrics) of the model on the data. |
| W08-0305 276 26:200 The de-facto answer came during the 1990s from the research community on Statistical Machine Translation, who made use of statistical tools based on a noisy channel model originally developed for speech recognition (Brown et al., 1994; Och and Weber, 1998; R.Zens et al., 2002; Och and Ney, 2001; Koehn et al., 2003). |
| W08-0305 277 62:200 These models can be tuned using minimum error rate training (Och, 2003). |
| W08-0305 278 63:200 Moses uses standard external tools for some of these tasks, such as GIZA++ (Och and Ney, 2003) for word alignments and SRILM (Stolcke, 2002) for language modeling. |
| W05-0834 279 136:242 More details on these standard criteria can be found for instance in (Och, 2003). |
| W05-0834 280 66:242 The model scaling factors are optimized with respect to some evaluation criterion (Och, 2003). |
| W05-0834 281 33:242 (Och et al. , 2003). |
| W08-0326 282 36:80 Assuming that the parameters P(etk|fsk) are known, the most likely alignment is computed by a simple dynamic-programming algorithm.1 Instead of using an Expectation-Maximization algorithm to estimate these parameters, as commonly done when performing word alignment (Brown et al., 1993; Och and Ney, 2003), we directly compute these parameters by relying on the information contained within the chunks. |
| W08-0326 283 64:80 We tuned our system on the development set devtest2006 for the EuroParl tasks and on nc-test2007 for Czech-English, using minimum error-rate training (Och, 2003) to optimise BLEU score. |
| W08-0326 284 20:80 For example, our system configuration for the shared task incorporates a wrapper around GIZA++ (Och and Ney, 2003) for word alignment and a wrapper around Moses (Koehn et al., 2007) for decoding. |
| W07-0713 285 144:228 Still, a confidence range for BLEU can be estimated by bootstrapping (Och, 2003; Zhang and Vogel, 2004). |
| P06-1001 286 147:238 Decoding weights are optimized using Och's algorithm (Och, 2003) to set weights for the four components of the loglinear model: language model, phrase translation model, distortion model, and word-length feature. |
| C08-1064 287 35:260 Our baseline uses Giza++ alignments (Och and Ney, 2003) symmetrized with the grow-diag-final-and heuristic (Koehn et al., 2003). |
| C08-1064 288 130:260 This may be because their system was not tuned using minimum error rate training (Och, 2003). |
| C08-1064 289 155:260 We use deterministic sampling, which is useful for reproducibility and for minimum error rate training (Och, 2003). |
| C08-1064 290 117:260 In all experiments that follow, each system configuration was independently optimized on the NIST 2003 Chinese-English test set (919 sentences) using minimum error rate training (Och, 2003) and tested on the NIST 2005 Chinese-English task (1082 sentences). |
| C08-1064 291 116:260 Except where noted, each system was trained on 27 million words of newswire data, aligned with GIZA++ (Och and Ney, 2003) and symmetrized with the grow-diag-final-and heuristic (Koehn et al., 2003). |
| H05-1096 292 44:156 The model scaling factors $\lambda_1, \ldots, \lambda_5$ and the word and phrase penalties are optimized with respect to some evaluation criterion (Och, 2003), e.g. BLEU score. |
| H05-1096 293 29:156 Nowadays, most of the state-of-the-art SMT systems are based on bilingual phrases (Bertoldi et al. , 2004; Koehn et al. , 2003; Och and Ney, 2004; Tillmann, 2003; Vogel et al. , 2004; Zens and Ney, 2004). |
| N07-1022 294 77:209 In WASP, GIZA++ (Och and Ney, 2003) is used to obtain the best alignments from the training examples. |
| N07-1022 295 139:209 The model parameters are trained using minimum error-rate training (Och, 2003). |
| N07-1029 296 56:215 The modified Powell's method has been previously used in optimizing the weights of a standard feature-based MT decoder in (Och, 2003) where a more efficient algorithm for log-linear models was proposed. |
| N07-1029 297 96:215 If the alignments are not available, they can be automatically generated; e.g., using GIZA++ (Och and Ney, 2003). |
| D07-1029 298 60:279 (3) s in Equation 1 are the weights of different feature functions, learned to maximize development set BLEU scores using a method similar to (Och, 2003). |
| W08-0312 299 8:85 Bleu is fast and easy to run, and it can be used as a target function in parameter optimization training procedures that are commonly used in state-of-the-art statistical MT systems (Och, 2003). |
| C08-1127 300 143:196 For the efficiency of minimum-error-rate training (Och, 2003), we built our development set (580 sentences) using sentences not exceeding 50 characters from the NIST MT-02 evaluation test data. |
| C08-1127 301 172:196 This wrong translation of content words is similar to the incorrect omission reported in (Och et al., 2003), which both hurt translation adequacy. |
| C08-1127 302 72:196 Firstly, we run GIZA++ (Och and Ney, 2000) on the training corpus in both directions and then apply the "grow-diag-final" refinement rule (Koehn et al., 2003) to obtain many-to-many word alignments. |
| I08-1067 303 55:124 The weights for the various components of the model (phrase translation model, language model, distortion model etc.) are set by minimum error rate training (Och, 2003). |
| P08-1023 304 99:135 We use the standard minimum error-rate training (Och, 2003) to tune the feature weights to maximize the system's BLEU score on the dev set. |
| C08-5001 305 200:227 The k-best list is also frequently used in discriminative learning to approximate the whole set of candidates which is usually exponentially large (Och, 2003; McDonald et al., 2005). |
| W07-0701 306 139:168 The comparison phrasal system was constructed using the same GIZA++ alignments and the heuristic combination described in (Och & Ney, 2003). |
| W07-0701 307 142:168 Model weights were trained separately for all 3 systems using minimum error rate training to maximize BLEU (Och, 2003) on the development set (dev). |
| D08-1010 308 148:200 We perform minimum error rate training (Och, 2003) to tune the feature weights for the log-linear model to maximize the system's BLEU score on the development set. |
| D08-1023 309 188:221 We benchmark our results against a model (Hiero) which was directly trained to optimise BLEUNIST using the standard MERT algorithm (Och, 2003) and the full set of translation and lexical weight features described for the Hiero model (Chiang, 2007). |
| D08-1023 310 7:221 Most work on discriminative training for SMT has focussed on linear models, often with margin based algorithms (Liang et al., 2006; Watanabe et al., 2006), or rescaling a product of sub-models (Och, 2003; Ittycheriah and Roukos, 2007). |
| W08-0302 311 54:197 To combine the many differently-conditioned features into a single model, we provide them as features to the linear model (Equation 2) and use minimum error-rate training (Och, 2003) to obtain interpolation weights $\lambda_m$. This is similar to an interpolation of backed-off estimates, if we imagine that all of the different contexts are differently-backed-off estimates of the complete context. |
| W08-0302 312 114:197 Baseline We use the Moses MT system (Koehn et al., 2007) as a baseline and closely follow the example training procedure given for the WMT-07 and WMT-08 shared tasks. In particular, we perform word alignment in each direction using GIZA++ (Och and Ney, 2003), apply the grow-diag-final-and heuristic for symmetrization and use a maximum phrase length of 7. |
| W08-0302 313 13:197 The weights $\lambda_1, \ldots, \lambda_M$ are typically learned to directly minimize a standard evaluation criterion on development data (e.g., the BLEU score; Papineni et al., (2002)) using numerical search (Och, 2003). |
| W08-0302 314 22:197 The mixture coefficients are trained in the usual way (minimum error-rate training, Och, 2003), so that the additional context is exploited when it is useful and ignored when it isn't. The paper proceeds as follows. |
| W08-0302 315 116:197 Minimum error-rate (MER) training (Och, 2003) was applied to obtain weights ($\lambda_m$ in Equation 2) for these features. |
| W07-0726 316 32:61 3see http://www.statmt.org/moses/ 194 4 Implementation Details 4.1 Alignment of MT output The input text and the output text of the MT systems was aligned by means of GIZA++ (Och and Ney, 2003), a tool with which statistical models for alignment of parallel texts can be trained. |
| W07-0726 317 40:61 The optimal weights for the different columns can then be assigned with the help of minimum error rate training (Och, 2003). |
| P08-1066 318 195:243 Following (Och, 2003), the k-best results are accumulated as the input of the optimizer. |
| P08-1066 319 211:243 Hierarchical rules were extracted from a subset which has about 35M/41M words, and the rest of the training data were used to extract phrasal rules as in (Och, 2003; Chiang, 2005). |
| P08-1066 320 141:243 Given sentence-aligned bi-lingual training data, we first use GIZA++ (Och and Ney, 2003) to generate word level alignment. |
| D08-1033 321 207:234 The parameters for each phrase table were tuned separately using minimum error rate training (Och, 2003). |
| D08-1033 322 202:234 5.1 Baseline System We trained Moses on all Spanish-English Europarl sentences up to length 20 (177k sentences) using GIZA++ Model 4 word alignments and the grow-diag-final-and combination heuristic (Koehn et al., 2007; Och and Ney, 2003; Koehn, 2002), which performed better than any alternative combination heuristic. The baseline estimates (Heuristic) come from extracting phrases up to length 7 from the word alignment. |
| H05-1022 323 93:196 Alignment performance is measured by the Alignment Error Rate (AER) (Och and Ney, 2003) $\mathrm{AER}(B; B') = 1 - 2 \times \lvert B \cap B' \rvert / (\lvert B \rvert + \lvert B' \rvert)$ where $B$ is a set of reference word links, and $B'$ are the word links generated automatically. |
| H05-1022 324 124:196 5 Phrase Pair Induction A common approach to phrase-based translation is to extract an inventory of phrase pairs (PPI) from bitext (Koehn et al., 2003). For example, in the phrase-extract algorithm (Och, 2002), a word alignment $a_1^m$ is generated over the bitext, and all word subsequences $e_{i_1}^{i_2}$ and $f_{j_1}^{j_2}$ are found that satisfy: $a_1^m : a_j \in [i_1, i_2]$ iff $j \in [j_1, j_2]$. |
| H05-1022 325 187:196 Pooling the sets to form two large CE and AE test sets, the AE system improvements are significant at a 95% level (Och, 2003); the CE systems are only equivalent. |
| H05-1022 326 43:196 The hallucination process is motivated by the use of NULL alignments into Markov alignment models as done by (Och and Ney, 2003). |
| P08-2038 327 79:101 For the efficiency of minimum-error-rate training (Och, 2003), we built our development set (580 sentences) using sentences not exceeding 50 characters from the NIST MT-02 evaluation test data. |
| W08-0304 328 161:189 (2003) of running GIZA++ (Och & Ney, 2003) in both directions and then merging the alignments using the grow-diag-final heuristic. |
| W08-0304 329 124:189 Och (2003) claimed that this approximation achieved essentially equivalent performance to that obtained when directly using the loss as the objective, $O = \ell$. |
| W08-0304 330 82:189 The first, Powell's method, was advocated by Och (2003) when MERT was first introduced for statistical machine translation. |
| W08-0304 331 79:189 This is seen in that each time we check for the nearest intersection to the current 1-best for some n-best list $l$, we must calculate its intersection with all other candidate translations that have yet to be selected as the 1-best. [Algorithm 1: Och (2003)'s line search method to find the global minimum in the loss, $\ell$, when starting at the point $w$ and searching along the direction $d$ using the candidate translations given in the collection of n-best lists $L$. Input: $L, w, d, \ell$. $I \leftarrow \{\}$; for $l \in L$ do: for $e \in l$ do: $m\{e\} \leftarrow e.\mathrm{features} \cdot d$; $b\{e\} \leftarrow e.\mathrm{features} \cdot w$; end for; $\mathrm{best}_n \leftarrow \operatorname{argmax}_{e \in l} m\{e\}$ ($b\{e\}$ breaks ties); loop: $\mathrm{best}_{n+1} = \operatorname{argmin}_{e \in l} \max\left(0, \frac{b\{\mathrm{best}_n\} - b\{e\}}{m\{e\} - m\{\mathrm{best}_n\}}\right)$; $\mathrm{intercept} \leftarrow \max\left(0, \frac{b\{\mathrm{best}_n\} - b\{\mathrm{best}_{n+1}\}}{m\{\mathrm{best}_{n+1}\} - m\{\mathrm{best}_n\}}\right)$; if $\mathrm{intercept} > 0$ then add($I$, intercept) else break; end if; end loop; end for; add($I$, $\max(I) + 2\epsilon$); $i_{\mathrm{best}} = \operatorname{argmin}_{i \in I} \mathrm{eval}_{\ell}(L, w + (i - \epsilon) \cdot d)$; return $w + (i_{\mathrm{best}} - \epsilon) \cdot d$.] |
| W08-0304 332 51:189 However, by exploiting the fact that the underlying scores assigned to competing hypotheses, w(e,h,f), vary linearly w.r.t. changes in the weight vector, w, Och (2003) proposed a strategy for finding the global minimum along any given search direction. |
| W08-0304 333 183:189 The first is a novel stochastic search strategy that appears to make better use of Och (2003)'s algorithm for finding the global minimum along any given search direction than either coordinate descent or Powell's method. |
| W08-0304 334 9:189 While the former is piecewise constant and thus cannot be optimized using gradient techniques, Och (2003) provides an approach that performs such training efficiently. |
| W08-0304 335 6:189 1 Introduction Och (2003) introduced minimum error rate training (MERT) as an alternative training regime to the conditional likelihood objective previously used with log-linear translation models (Och & Ney, 2002). |
| E06-2002 336 31:77 Starting from the parallel training corpus, provided with direct and inverted alignments, the so-called union alignment (Och and Ney, 2003) is computed. |
| E06-2002 337 30:77 This preprocessing step can be accomplished by applying the GIZA++ toolkit (Och and Ney, 2003) that provides Viterbi alignments based on IBM Model-4. |
| D07-1005 338 47:211 (2) We note that these posterior probabilities can be computed efficiently for some alignment models such as the HMM (Vogel et al. , 1996; Och and Ney, 2003), Models 1 and 2 (Brown et al. , 1993). |
| D07-1005 339 100:211 Minimum Error Rate Training (MERT) (Och, 2003) under BLEU criterion is used to estimate 20 feature function weights over the larger development set (dev1). |
| D07-1005 340 8:211 High quality word alignments can yield more accurate phrase-pairs which improve quality of a phrase-based SMT system (Och and Ney, 2003; Fraser and Marcu, 2006b). |
| D07-1005 341 189:211 Such an approach contrasts with the log-linear HMM/Model-4 combination proposed by Och and Ney (2003). |
| D07-1005 342 9:211 Much of the recent work in word alignment has focussed on improving the word alignment quality through better modeling (Och and Ney, 2003; Deng and Byrne, 2005; Martin et al. , 2005) or alternative approaches to training (Fraser and Marcu, 2006b; Moore, 2005; Ittycheriah and Roukos, 2005). |
| D07-1005 343 36:211 2 Word Alignment Framework A statistical translation model (Brown et al. , 1993; Och and Ney, 2003) describes the relationship between a pair of sentences in the source and target languages (f = fJ1,e = eI1) using a translation probability P(f|e). |
| D07-1005 344 109:211 Our human word alignments do not distinguish between Sure and Probable links (Och and Ney, 2003). |
| N07-1061 345 28:313 2 Phrase-based SMT We use a phrase-based SMT system, Pharaoh, (Koehn et al. , 2003; Koehn, 2004), which is based on a log-linear formulation (Och and Ney, 2002). |
| N07-1061 346 27:313 This is the shared task baseline system for the 2006 NAACL/HLT workshop on statistical machine translation (Koehn and Monz, 2006) and consists of the Pharaoh decoder (Koehn, 2004), SRILM (Stolcke, 2002), GIZA++ (Och and Ney, 2003), mkcls (Och, 1999), Carmel,1 and a phrase model training code. |
| N07-1061 347 36:313 To set the weights, $\lambda_m$, we carried out minimum error rate training (Och, 2003) using BLEU (Papineni et al., 2002) as the objective function. |
| E06-1006 348 95:157 The score combination weights are trained by a minimum error rate training procedure similar to (Och and Ney, 2003). |
| E06-1006 349 89:157 Phrases are then extracted from the word alignments using the method described in (Och and Ney, 2003). |
| P08-1024 350 11:227 However, while discriminative models promise much, they have not been shown to deliver significant gains 1We class approaches using minimum error rate training (Och, 2003) frequency count based as these systems re-scale a handful of generative features estimated from frequency counts and do not support large sets of non-independent features. |
| W05-0836 351 7:153 As discussed in (Och, 2003), the direct translation model represents the probability of target sentence English $e = e_1 \ldots e_I$ being the translation for a source sentence French $f = f_1 \ldots f_J$ through an exponential, or log-linear model $p(e \mid f) = \frac{\exp\left(\sum_{k=1}^{m} \lambda_k h_k(e,f)\right)}{\sum_{e' \in E} \exp\left(\sum_{k=1}^{m} \lambda_k h_k(e',f)\right)}$ (1) where $e$ is a single candidate translation for $f$ from the set of all English translations $E$, $\lambda$ is the parameter vector for the model, and each $h_k$ is a feature function of $e$ and $f$. In practice, we restrict $E$ to the set Gen($f$) which is a set of highly likely translations discovered by a decoder (Vogel et al., 2003). |
| W05-0836 352 25:153 In the following, we summarize the optimization algorithm for the unsmoothed error counts presented in (Och, 2003) and the implementation detailed in (Venugopal and Vogel, 2005). |
| W05-0836 353 19:153 2.1 Minimum Error Rate Training The predominant approach to reconciling the mismatch between the MAP decision rule and the evaluation metric has been to train the parameters of the exponential model to correlate the MAP choice with the maximum score as indicated by the evaluation metric on a development set with known references (Och, 2003). |
| W05-0836 354 15:153 In this paper we will compare and evaluate several aspects of these techniques, focusing on Minimum Error Rate (MER) training (Och, 2003) and Minimum Bayes Risk (MBR) decision rules, within a novel training environment that isolates the impact of each component of these methods. |
| I05-2039 355 78:91 It has a lower bound of 0, no upper bound, better scores indicate better translations, and it tends to be highly correlated with the adequacy of outputs ; mWER (Och 2003) or Multiple Word Error Rate is the edit distance in words between the system output and the closest reference translation in a set. |
| P06-1098 356 11:74 Feature function scaling factors $\lambda_m$ are optimized based on a maximum likely approach (Och and Ney, 2002) or on a direct error minimization approach (Och, 2003). |
| P06-1098 357 65:74 Many-to-many word alignments are induced by running a one-to-many word alignment model, such as GIZA++ (Och and Ney, 2003), in both directions and by combining the results based on a heuristic (Koehn et al. , 2003). |
| P06-1139 358 126:231 When evaluated against the state-of-the-art, phrase-based decoder Pharaoh (Koehn, 2004), using the same experimental conditions, namely translation table trained on the FBIS corpus (7.2M Chinese words and 9.2M English words of parallel text), trigram language model trained on 155M words of English newswire, interpolation weights (Equation 2) trained using discriminative training (Och, 2003) (on the 2002 NIST MT evaluation set), probabilistic beam set to 0.01, histogram beam set to 10 and BLEU (Papineni et al., 2002) as our metric, the WIDL-NGLM-A* algorithm produces translations that have a BLEU score of 0.2570, while Pharaoh translations have a BLEU score of 0.2635. |
| P06-1139 359 96:231 The interpolation weights (Equation 2) are trained using discriminative training (Och, 2003) using ROUGE as the objective function, on the development set. |
| D08-1065 360 148:259 We then train word alignment models (Och and Ney, 2003) using 6 Model-1 iterations and 6 HMM iterations. |
| D08-1065 361 141:259 For each language pair, we use two development sets: one for Minimum Error Rate Training (Och, 2003; Macherey et al., 2008), and the other for tuning the scale factor for MBR decoding. |
| I08-2087 362 37:170 (2003), bilingual sentences are trained by GIZA++ (Och and Ney 2003) in two directions (from source to target and target to source). |
| I08-2087 363 107:170 The corresponding weight is trained through minimum error rate method (Och 2003). |
| W08-0403 364 168:207 Minimum-error-rate training (Och, 2003) is conducted on dev-set to optimize feature weights maximizing the BLEU score up to 4-grams, and the obtained feature weights are blindly applied on the test-set. |
| W07-0710 365 106:214 We use the n-best generation scheme interleaved with optimization as described in (Och, 2003). |
| W07-0710 366 8:214 1 Introduction In recent years, statistical machine translation has experienced a quantum leap in quality thanks to automatic evaluation (Papineni et al., 2002) and error-based optimization (Och, 2003). |
| W07-0710 367 40:214 2.2.4 Minimum Error Rate Training A good way of training is to minimize empirical top-1 error on training data (Och, 2003). |
| D08-1022 368 130:171 These parameters $\lambda_1, \ldots, \lambda_8$ are tuned by minimum error rate training (Och, 2003) on the dev sets. |
| D07-1080 369 36:227 2 Statistical Machine Translation We use a log-linear approach (Och, 2003) in which a foreign language sentence $f$ is translated into another language, for example English, $e$, by seeking a maximum solution: $\hat{e} = \operatorname{argmax}_{e} \mathbf{w}^{\top} \mathbf{h}(f, e)$ (1) where $\mathbf{h}(f, e)$ is a large-dimension feature vector. |
| D07-1080 370 6:227 1 Introduction The recent advances in statistical machine translation have been achieved by discriminatively training a small number of real-valued features based either on (hierarchical) phrase-based translation (Och and Ney, 2004; Koehn et al. , 2003; Chiang, 2005) or syntax-based translation (Galley et al. , 2006). |
| D07-1080 371 176:227 The baseline hierarchical phrase-based system is trained using standard max-BLEU training (MERT) without sparse features (Och, 2003). |
| D07-1080 372 149:227 The hierarchical phrase translation pairs are extracted in a standard way (Chiang, 2005): First, the bilingual data are word alignment annotated by running GIZA++ (Och and Ney, 2003) in two directions. |
| P08-1009 373 22:223 Candidate translations are scored by a linear combination of models, weighted according to Minimum Error Rate Training or MERT (Och, 2003). |
| P08-1009 374 27:223 Early experiments with syntactically-informed phrases (Koehn et al., 2003), and syntactic reranking of K-best lists (Och et al., 2004) produced mostly negative results. |
| P08-1009 375 146:223 Word alignments are provided by GIZA++ (Och and Ney, 2003) with grow-diag-final combination, with infrastructure for alignment combination and phrase extraction provided by the shared task. |
| W07-0410 376 140:193 Different optimization techniques are available, like the Simplex algorithm or the special Minimum Error Training as described in (Och 2003). |
| W06-3108 377 91:203 Then the alignments are symmetrized using a refined heuristic as described in (Och and Ney, 2003). |
| W06-3108 378 90:203 We train IBM Model 4 with GIZA++ (Och and Ney, 2003) in both translation directions. |
| W06-3108 379 40:203 The model scaling factors $\lambda_1^M$ are trained with respect to the final translation quality measured by an error criterion (Och, 2003). |
| P07-1004 380 31:233 Their weights are optimized w.r.t. BLEU score using the algorithm described in (Och, 2003). |
| C08-1041 381 132:197 We use minimum error rate training (Och, 2003) to tune the feature weights for the log-linear model. |
| W06-2606 382 38:179 Alternatively, one can train them with respect to the final translation quality measured by an error criterion (Och, 2003). |
| P06-2103 383 77:271 There are two necessary ingredients to implement Och's (2003) training procedure. |
| P06-2103 384 9:271 In contrast, more recent research has focused on stochastic approaches that model discourse coherence at the local lexical (Lapata, 2003) and global levels (Barzilay and Lee, 2004), while preserving regularities recognized by classic discourse theories (Barzilay and Lapata, 2005). |
| P06-2103 385 75:271 The solution we employ here is the discriminative training procedure of Och (2003). |
| P08-1012 386 167:198 We also trained a baseline model with GIZA++ (Och and Ney, 2003) following a regimen of 5 iterations of Model 1, 5 iterations of HMM, and 5 iterations of Model 4. |
| P08-1012 387 178:198 Minimum Error Rate training (Och, 2003) over BLEU was used to optimize the weights for each of these models over the development test data. |
| J05-4005 388 336:855 It is also related to (log-)linear models described in Berger, Della Pietra, and Della Pietra (1996), Xue (2003); Och (2003), and Peng, Feng, and McCallum (2004). |
| W07-0727 389 46:145 To optimize the system towards a maximal BLEU or NIST score, we use Minimum Error Rate (MER) Training as described in (Och, 2003). |
| W06-3602 390 163:191 The real-valued features include the following: a block translation score derived from phrase occurrence statistics, a trigram language model to predict target words, a lexical weighting score for the block internal words, a distortion model as well as the negative target phrase length. The transition cost is computed as [formula lost in extraction], where [symbol lost] is a weight vector that sums up to [value lost]. The weights are trained using a procedure similar to (Och, 2003) on held-out test data. |
| W07-0735 391 56:302 3.1 Evaluation Measure and MERT We evaluate our experiments using the (lowercase, tokenized) BLEU metric and estimate the empirical confidence using the bootstrapping method described in Koehn (2004b). We report the scores obtained on the test section with model parameters tuned using the tuning section for minimum error rate training (MERT, (Och, 2003)). |
| W07-0735 392 52:302 In all experiments, word alignment was obtained using the grow-diag-final heuristic for symmetrizing GIZA++ (Och and Ney, 2003) alignments. |
| P06-1097 393 38:187 We use the union, refined and intersection heuristics defined in (Och and Ney, 2003) which are used in conjunction with IBM Model 4 as the baseline in virtually all recent work on word alignment. |
| P06-1097 394 4:187 1 Introduction The most widely applied training procedure for statistical machine translation, IBM model 4 (Brown et al., 1993) unsupervised training followed by post-processing with symmetrization heuristics (Och and Ney, 2003), yields low quality word alignments. |
| P06-1097 395 133:187 We run Maximum BLEU (Och, 2003) for 25 iterations individually for each system. |
| P06-1097 396 28:187 An additional translation set called the Maximum BLEU set is employed by the SMT system to train the weights associated with the components of its log-linear model (Och, 2003). |
| P06-1097 397 48:187 Och (2003) has described an efficient exact one-dimensional error minimization technique for a similar search problem in machine translation. |
| P06-1097 398 33:187 For each training direction, we run GIZA++ (Och and Ney, 2003), specifying 5 iterations of Model 1, 4 iterations of the HMM model (Vogel et al. , 1996), and 4 iterations of Model 4. |
| P06-1097 399 157:187 However, union and refined alignments, which are many-to-many, are what are used to build competitive phrasal SMT systems, because intersection performs poorly, despite having been shown to have the best AER scores for the French/English corpus we are using (Och and Ney, 2003). |
| C04-1072 400 57:189 A natural fit to the existing statistical machine translation framework A metric that ranks a good translation high in an n-best list could be easily integrated in a minimal error rate statistical machine translation training framework (Och 2003). |
| C04-1072 401 48:189 For example, a statistical machine translation system such as ISI's AlTemp SMT system (Och 2003) can generate a list of n-best alternative translations given a source sentence. |
| C04-1072 402 118:189 To simulate real world scenario, we use n-best lists from ISI's state-of-the-art statistical machine translation system, AlTemp (Och 2003), and the 2002 NIST Chinese-English evaluation corpus as the test corpus. |
| C08-1074 403 5:167 1 Introduction Och (2003) introduced minimum error rate training (MERT) for optimizing feature weights in statistical machine translation (SMT) models, and demonstrated that it produced higher translation quality scores than maximizing the conditional likelihood of a maximum entropy model using the same features. |
| C08-1074 404 1:167 Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 585–592, Manchester, August 2008. Random Restarts in Minimum Error Rate Training for Statistical Machine Translation. Robert C. Moore and Chris Quirk, Microsoft Research, Redmond, WA 98052, USA, bobmoore@microsoft.com, chrisq@microsoft.com. Abstract: Och's (2003) minimum error rate training (MERT) procedure is the most commonly used method for training feature weights in statistical machine translation (SMT) models. |
| N04-1008 405 81:175 4.4.1 N-gram Co-Occurrence Statistics for Answer Extraction N-gram co-occurrence statistics have been successfully used in automatic evaluation (Papineni et al. 2002, Lin and Hovy 2003), and more recently as training criteria in statistical machine translation (Och 2003). |
| D07-1036 406 133:249 For the log-linear model training, we take minimum-error-rate training method as described in (Och, 2003). |
| P06-2061 407 34:217 For instance, the resulting word graph can be used in the prediction engine of a CAT system (Och et al. , 2003). |
| P06-2061 408 13:217 In the post-editing step, a prediction engine helps to decrease the amount of human interaction (Och et al. , 2003). |
| P06-2061 409 11:217 A statistical prediction engine provides the completions to what a human translator types (Foster et al. , 1997; Och et al. , 2003). |
| P06-2061 410 51:217 The model scaling factors $\lambda_1^M$ are trained on a development corpus according to the final recognition quality measured by the word error rate (WER) (Och, 2003). |
| N07-2047 411 65:128 Whilst, the parameters for the maximum entropy model are developed based on the minimum error rate training method (Och, 2003). |
| W07-0724 412 18:91 Their weights are optimized w.r.t. BLEU score using the algorithm described in (Och, 2003). |
| P08-1102 413 128:149 To obtain their corresponding weights, we adapted the minimum-error-rate training algorithm (Och, 2003) to train the outside-layer model. |
| P06-1096 414 206:222 The first approach is to reuse the components of a generative model, but tune their relative weights in a discriminative fashion (Och and Ney, 2002; Och, 2003; Chiang, 2005). |
| P06-1096 415 16:222 Unlike minimum error rate training (Och, 2003), our system is able to exploit large numbers of specific features in the same manner as static reranking systems (Shen et al. , 2004; Och et al. , 2004). |
| P06-1096 416 200:222 We tuned Pharaoh's four parameters using minimum error rate training (Och, 2003) on DEV.12 We obtained an increase of 0.8 9As in the POS features, we map each phrase pair to its majority constellation. |
| D08-1066 417 54:243 These heuristics define a phrase pair to consist of a source and target ngrams of a word-aligned source-target sentence pair such that if one end of an alignment is in the one ngram, the other end is in the other ngram (and there is at least one such alignment) (Och and Ney, 2004; Koehn et al., 2003). |
| D08-1066 418 27:243 For evaluation we use a state-of-the-art baseline system (Moses) (Hoang and Koehn, 2008) which works with a log-linear interpolation of feature functions optimized by MERT (Och, 2003). |
| D08-1066 419 175:243 The f are optimized by Minimum-Error Training (MERT) (Och, 2003). |
| D08-1066 420 13:243 The heuristic estimator employs word-alignment (Giza++) (Och and Ney, 2003) and a few thumb rules for defining phrase pairs, and then extracts a multi-set of phrase pairs and estimates their conditional probabilities based on the counts in the multi-set. |
| D08-1066 421 53:243 (Koehn et al., 2003; Och and Ney, 2004)). |
| P07-1040 422 149:212 The same Powell's method has been used to estimate feature weights of a standard feature-based phrasal MT decoder in (Och, 2003). |
| P07-1040 423 22:212 In (Matusov et al. , 2006), different word orderings are taken into account by training alignment models by considering all hypothesis pairs as a parallel corpus using GIZA++ (Och and Ney, 2003). |
| P05-1066 424 103:229 In practice, when training the parameters of an SMT system, for example using the discriminative methods of (Och, 2003), the cost for skips of this kind is typically set to a very high value. |
| P05-1066 425 46:229 Reranking methods have also been proposed as a method for using syntactic information (Koehn and Knight, 2003; Och et al. , 2004; Shen et al. , 2004). |
| P05-1066 426 13:229 For this reason there is currently a great deal of interest in methods which incorporate syntactic information within statistical machine translation systems (e.g. , see (Alshawi, 1996; Wu, 1997; Yamada and Knight, 2001; Gildea, 2003; Melamed, 2004; Graehl and Knight, 2004; Och et al. , 2004; Xia and McCord, 2004)). |
| P05-1066 427 8:229 1 Introduction Recent research on statistical machine translation (SMT) has led to the development of phrase-based systems (Och et al. , 1999; Marcu and Wong, 2002; Koehn et al. , 2003). |
| P05-1066 428 31:229 More recently, phrase-based models (Och et al. , 1999; Marcu and Wong, 2002; Koehn et al. , 2003) have been proposed as a highly successful alternative to the IBM models. |
| W08-0335 429 71:228 The feature weights were optimized against the BLEU scores (Och, 2003). |
| W05-0908 430 11:148 In the area of statistical machine translation (SMT), recently a combination of the BLEU evaluation metric (Papineni et al. , 2001) and the bootstrap method for statistical significance testing (Efron and Tibshirani, 1993) has become popular (Och, 2003; Kumar and Byrne, 2004; Koehn, 2004b; Zhang et al. , 2004). |
| W05-0908 431 32:148 Our system is a re-implementation of the phrase-based system described in Koehn (2003), and uses publicly available components for word alignment (Och and Ney, 2003), decoding (Koehn, 2004a), language modeling (Stolcke, 2002) and finite-state processing (Knight and Al-Onaizan, 1999). |
| D08-1051 432 132:207 In the present work, we decided to use WSR instead of Key Stroke Ratio (KSR), which is used in other works on IMT such as (Och et al., 2003). |
| D08-1051 433 166:207 combined in a log-linear fashion by adjusting a weight for each of them by means of the MERT (Och, 2003) procedure, optimising the BLEU (Papineni et al., 2002) score obtained on the development partition. |
| D08-1051 434 12:207 An important contribution to interactive CAT technology was carried out around the TransType (TT) project (Langlais et al., 2002; Foster et al., 2002; Foster, 2002; Och et al., 2003). |
| D08-1051 435 78:207 This tolerant search uses the well known concept of Levenshtein distance in order to obtain the most similar string for the given prefix (see (Och et al., 2003) for more details). |
| D08-1051 436 63:207 In (Och et al., 2003), the use of a word graph is proposed as interface between an alignment-template SMT model and the IMT engine. |
| N06-2013 437 85:113 Decoding weights are optimized using Och's algorithm (Och, 2003) to set weights for the four components of the log-linear model: language model, phrase translation model, distortion model, and word-length feature. |
| P05-1033 438 69:249 To do this, we first identify initial phrase pairs using the same criterion as previous systems (Och and Ney, 2004; Koehn et al. , 2003): Definition 1. |
| P05-1033 439 56:249 For our experiments we used the following features, analogous to Pharaoh's default feature set: P( | ) and P( | ), the latter of which is not found in the noisy-channel model, but has been previously found to be a helpful feature (Och and Ney, 2002); the lexical weights Pw( | ) and Pw( | ) (Koehn et al. , 2003), which estimate how well the words in translate the words in ; a phrase penalty exp(1), which allows the model to learn a preference for longer or shorter derivations, analogous to Koehn's phrase penalty (Koehn, 2003). |
| P05-1033 440 120:249 We ran the trainer with its default settings (maximum phrase length 7), and then used Koehn's implementation of minimum-error-rate training (Och, 2003) to tune the feature weights to maximize the system's BLEU score on our development set, yielding the values shown in Table 2. |
| P05-1033 441 66:249 (2003), which is based on that of Och and Ney (2004). |
| P05-1033 442 20:249 Above the phrase level, these models typically have a simple distortion model that reorders phrases independently of their content (Och and Ney, 2004; Koehn et al. , 2003), or not at all (Zens and Ney, 2004; Kumar et al. , 2005). |
| N04-1022 443 104:155 For all performance metrics, we show the 70% confidence interval with respect to the MAP baseline computed using bootstrap resampling (Press et al. , 2002; Och, 2003). |
| N04-1022 444 138:155 Och (2003) developed a training procedure that incorporates various MT evaluation criteria in the training procedure of log-linear MT models. |
| W07-0734 445 10:94 Bleu is fast and easy to run, and it can be used as a target function in parameter optimization training procedures that are commonly used in state-of-the-art statistical MT systems (Och, 2003). |
| C08-1005 446 154:188 minimum error rate training (MERT) (Och, 2003) to maximize BLEU score (Papineni et al., 2002). |
| C08-1144 447 25:207 2 Summary of approaches Given a source language sentence f, statistical machine translation defines the translation task as selecting the most likely target translation e under a model P(e \| f), i.e.: $\hat{e}(f) = \arg\max_e P(e \mid f) = \arg\max_e \sum_{i=1}^{m} h_i(e,f)\,\lambda_i$, where the argmax operation denotes a search through a structured space of translation outputs in the target language, $h_i(e,f)$ are bilingual features of e and f and monolingual features of e, and weights $\lambda_i$ are trained discriminatively to maximize translation quality (based on automatic metrics) on held out data (Och, 2003). |
| C08-1144 448 14:207 Starting with bilingual phrase pairs extracted from automatically aligned parallel text (Och and Ney, 2004; Koehn et al., 2003), these PSCFG approaches augment each contiguous (in source and target words) phrase pair with a left-hand-side symbol (like the VP in the example above), and perform a generalization procedure to form rules that include nonterminal symbols. |
| N04-1021 449 22:293 However, certain properties of the BLEU metric can be exploited to speed up search, as described in detail by Och (2003). |
| D08-1012 450 172:215 When different decoder settings are applied to the same model, MERT weights (Och, 2003) from the unprojected single pass setup are used and are kept constant across runs. |
| H05-1012 451 14:201 Although there is a modest cost associated with annotating data, we show that a reduction of 40% relative in alignment error (AER) is possible over the GIZA++ aligner (Och and Ney, 2003). |
| H05-1012 452 10:201 Current state of the art machine translation systems (Och, 2003) use phrasal (n-gram) features extracted automatically from parallel corpora. |
| N07-1008 453 46:189 The f are trained using a held-out corpus using maximum BLEU training (Och, 2003). |
| N07-1008 454 33:189 Unlike MaxEnt training, the method (Och, 2003) used for estimating the weight vector for BLEU maximization is not computationally scalable for a large number of feature functions. |
| W08-0404 455 120:179 The decision rule was based on the standard log-linear interpolation of several models, with weights tuned by MERT on the development set (Och, 2003). |
| W08-0404 456 9:179 While minimum error training (Och, 2003) has by now become a standard tool for interpolating a small number of aggregate scores, it is not well suited for learning in high-dimensional feature spaces. |
| W07-0706 457 209:240 We selected 580 short sentences of length at most 50 characters from the 2002 NIST MT Evaluation test set as our development corpus and used it to tune s by maximizing the BLEU score (Och, 2003), and used the 2005 NIST MT Evaluation test set as our test corpus. |
| P04-1078 458 4:122 1 Introduction With the introduction of the BLEU metric for machine translation evaluation (Papineni et al, 2002), the advantages of doing automatic evaluation for various NLP applications have become increasingly appreciated: they allow for faster implement-evaluate cycles (by by-passing the human evaluation bottleneck), less variation in evaluation performance due to errors in human assessor judgment, and, not least, the possibility of hill-climbing on such metrics in order to improve system performance (Och 2003). |
| W08-0334 459 104:154 Decoding Conditions For tuning of the decoder's parameters, minimum error training (Och 2003) with respect to the BLEU score using was conducted using the respective development corpus. |
| H05-1021 460 116:173 For the combined set (ALL), we also show the 95% BLEU confidence interval computed using bootstrap resampling (Och, 2003). |
| H05-1021 461 142:173 Finally we use Minimum Error Training (MET) (Och, 2003) to train log-linear scaling factors that are applied to the WFSTs in Equation 1. |
| D07-1079 462 167:289 Tuning was done using Maximum BLEU hill-climbing (Och, 2003). |
| D07-1079 463 9:289 Approaches include word substitution systems (Brown et al. , 1993), phrase substitution systems (Koehn et al. , 2003; Och and Ney, 2004), and synchronous context-free grammar systems (Wu and Wong, 1998; Chiang, 2005), all of which train on string pairs and seek to establish connections between source and target strings. |
| D07-1079 464 83:289 A superset of the parallel data was word aligned by GIZA union (Och and Ney, 2003) and EMD (Fraser and Marcu, 2006). |
| W08-0402 465 70:177 Furthermore, techniques such as iterative minimum error-rate training (Och et al., 2003) as well as web-based MT services require the decoder to translate a large number of source-language sentences per unit time. |
| W08-0402 466 141:177 We use the GIZA toolkit (Och and Ney, 2000), a suffix-array architecture (Lopez, 2007), the SRILM toolkit (Stolcke, 2002), and minimum error rate training (Och et al., 2003) to obtain word alignments, a translation model, language models, and the optimal weights for combining these models, respectively. |
| D07-1054 467 147:267 The translation models were phrase-based (Zen et al. , 2002) created using the GIZA++ toolkit (Och et al. , 2003). |
| D07-1054 468 153:267 For tuning of the decoder's parameters, including the language model weight, minimum error training (Och 2003) with respect to the BLEU score using was conducted using the development corpus. |
| N07-1062 469 153:255 The model scaling factors are optimized using minimum error rate training (Och, 2003). |
| D08-1024 470 25:186 2 Learning algorithm The translation model is a standard linear model (Och and Ney, 2002), which we train using MIRA (Crammer and Singer, 2003; Crammer et al., 2006), following Watanabe et al. |
| D08-1024 471 8:186 1 Introduction Since its introduction by Och (2003), minimum error rate training (MERT) has been widely adopted for training statistical machine translation (MT) systems. |
| W06-3601 472 55:298 2 Previous Work It is helpful to compare this approach with recent efforts in statistical MT. Phrase-based models (Koehn et al. , 2003; Och and Ney, 2004) are good at learning local translations that are pairs of (consecutive) sub-strings, but often insufficient in modeling the reorderings of phrases themselves, especially between language pairs with very different word-order. |
| W06-3601 473 149:298 Feature weights of both systems are tuned on the same data set. For Pharaoh, we use the standard minimum error-rate training (Och, 2003); and for our system, since there are only two independent features (as we always fix = 1), we use a simple grid-based line-optimization along the language-model weight axis. |
| W07-0715 474 85:155 The feature weights for the overall translation models were trained using Och's (2003) minimum-error-rate training procedure. |
| J05-4003 475 146:416 Using this alignment strategy, we follow (Och and Ney 2003) and compute one alignment for each translation direction (f → e and e → f), and then combine them. |
| J05-4003 476 267:416 All our MT systems were trained using a variant of the alignment template model described in (Och 2003). |