Paper: A Maximum Entropy Approach To Natural Language Processing

Basic Info:

id: J96-1002
title: A Maximum Entropy Approach To Natural Language Processing
authors: Berger, Adam L. (Columbia University, New York NY), Della Pietra, Vincent J. (Renaissance Technologies, Stony Brook NY), Della Pietra, Stephen A. (Renaissance Technologies, Stony Brook NY)
venue: CL
year: 1996


Incoming Citations
ID Title
C96-1040Bilingual Knowledge Acquisition From Korean-English Parallel Corpus Using Alignment
W96-0213A Maximum Entropy Model For Part-Of-Speech Tagging
A97-1053Learning Probabilistic Subcategorization Preference By Identifying Case Dependencies And Optimal Noun Class Generalization Level
A97-1056Sequential Model Selection For Word Sense Disambiguation
P97-1048A Model Of Lexical Attraction And Repulsion
W97-0121Collocation Lattices And Maximum Entropy Models
W97-0123Maximum Entropy Model Learning Of Subcategorization Preference
W97-0301A Linear Observed Time Statistical Parser Based On Maximum Entropy Models
W97-0304Text Segmentation Using Exponential Models
W97-0319Probabilistic Coreference In Information Extraction
W97-1005A Statistical Decision Making Method: A Case Study On Prepositional Phrase Attachment
C98-1077Tagging Inflective Languages: Prediction of Morphological Categories for a Rich Structured Tagset
C98-1078Improving Data Driven Wordclass Tagging by System Combination
C98-2135Feature Lattices for Maximum Entropy Modelling
C98-2186Maximum Entropy Model Learning of the Translation Rules
C98-2209General-to-Specific Model Selection for Subcategorization Preference
P98-1080Tagging Inflective Languages: Prediction Of Morphological Categories For A Rich Structured Tagset
P98-2140Feature Lattices For Maximum Entropy Modelling
P98-2191Maximum Entropy Model Learning Of The Translation Rules
P98-2214General-To-Specific Model Selection For Subcategorization Preference
W98-0701General Word Sense Disambiguation Method Based On A Full Sentential Context
W98-1117A Maximum-Entropy Partial Parser For Unrestricted Text
W98-1118Exploiting Diverse Knowledge Sources Via Maximum Entropy In Named Entity Recognition
X98-1013Information Extraction Research And Applications: Current Progress And Future Directions
E99-1026Japanese Dependency Structure Analysis Based On Maximum Entropy Models
J99-1004Statistical Properties Of Probabilistic Context-Free Grammars
J99-2002Decomposable Modeling In Natural Language Processing
P99-1030Analysis System Of Speech Acts And Discourse Structures Using Maximum Entropy Model
P99-1069Estimators For Stochastic Unification-Based Grammars
A00-1019Unit Completion For A Computer-Aided Translation Typing System
A00-2018A Maximum-Entropy-Inspired Parser
A00-2026Trainable Methods For Surface Natural Language Generation
A00-2031Assigning Function Tags To Parsed Text
C00-1060A Hybrid Japanese Parser With Hand-Crafted Grammar And Statistics
C00-1061English-To-Korean Transliteration Using Multiple Unbounded Overlapping Phoneme Chunks
C00-1064Structural Feature Selection For English-Korean Statistical Machine Translation
C00-1082Bunsetsu Identification Using Category-Exclusive Rules
C00-2124Applying System Combination To Base Noun Phrase Identification
C00-2126Word Order Acquisition From Corpora
J00-3003Dialogue Act Modeling For Automatic Tagging And Recognition Of Conversational Speech
P00-1006A Maximum Entropy/Minimum Divergence Translation Model
P00-1042Named Entity Extraction Based On A Maximum Entropy Model And Transformation Rules
W00-0704The Role Of Algorithm Bias Vs. Information Source In Learning Algorithms For Morphosyntactic Disambiguation
W00-0707Incorporating Position Information Into A Maximum Entropy/Minimum Divergence Translation Model
W00-0714Using Perfect Sampling In Parameter Estimation Of A Whole Sentence Maximum Entropy Language Model
W00-0729Chunking With Maximum Entropy Models
W00-1308Enriching The Knowledge Sources Used In A Maximum Entropy Part-Of-Speech Tagger
J01-2002Improving Accuracy In Word Class Tagging Through The Combination Of Machine Learning Systems
N01-1005Question Answering Using Maximum-Entropy Components
P01-1003Improvement Of A Whole Sentence Maximum Entropy Language Model Using Grammatical Features
P01-1027Refined Lexicon Models For Statistical Machine Translation Using A Maximum Entropy Approach
P01-1042Joint And Conditional Estimation Of Tagging And Parsing Models
W01-0505Improving Lexical Mapping Model Of English-Korean Bitext Using Structural Features
W01-0510Information Extraction Using The Structured Language Model
W01-0512The Unknown Word Problem: A Morphological Analysis Of Japanese Using Maximum Entropy Aided By A Dictionary
W01-0712Learning Computational Grammars
W01-1007The Form Is The Substance: Classification Of Genres In Text
C02-1032Improving Alignment Quality In Statistical Machine Translation Using Context-Dependent Maximum Entropy Models
C02-1064Text Generation From Keywords
C02-1119SVM Answer Selection For Open-Domain Question Answering
C02-1143Simple Features For Chinese Word Sense Disambiguation
C02-1168Maximum Entropy Models For Word Sense Disambiguation
C02-2019Morphological Analysis Of The Spontaneous Speech Corpus
P02-1002Sequential Conditional Generalized Iterative Scaling
P02-1021Semi-Supervised Maximum Entropy Based Approach To Acronym And Abbreviation Normalization In Medical Texts
P02-1025A Study On Richer Syntactic Dependencies For Structured Language Modeling
P02-1038Discriminative Training And Maximum Entropy Models For Statistical Machine Translation
P02-1063Revision Learning And Its Application To Part-Of-Speech Tagging
W02-0301Tuning Support Vector Machines For Biomedical Named Entity Recognition
W02-0401Using Maximum Entropy For Sentence Extraction
W02-0811Combining Heterogeneous Classifiers For Word Sense Disambiguation
W02-0813Combining Contextual Features For Word Sense Disambiguation
W02-1002Conditional Structure Versus Conditional Estimation In NLP Models
W02-1011Thumbs Up? Sentiment Classification Using Machine Learning Techniques
W02-2018A Comparison Of Algorithms For Maximum Entropy Parameter Estimation
W02-2019Markov Models For Language-Independent Named Entity Recognition
E03-1007Using POS Information For SMT Into Morphologically Rich Languages
E03-1055Comparison Of Alignment Templates And Maximum Entropy Models For NLP
E03-1071Investigating GIS And Smoothing For Maximum Entropy Taggers
J03-1004A Machine Learning Approach To Modeling Scope Preferences
N03-1004In Question Answering Two Heads Are Better Than One
N03-1028Shallow Parsing With Conditional Random Fields
N03-2008A Maximum Entropy Approach To FrameNet Tagging
P03-1012A Probability Model To Improve Word Alignment
P03-1015Combining Deep And Shallow Approaches In Parsing German
P03-1040Feature-Rich Statistical Translation Of Noun Phrases
P03-1055Deep Syntactic Processing By Combining Shallow Methods
P03-1061Morphological Analysis Of A Large Spontaneous Speech Corpus In Japanese
W03-0318Input Sentence Splitting And Translating
W03-0401A Model Of Syntactic Disambiguation Based On Lexicalized Grammars
W03-0417Training A Naive Bayes Classifier Via The EM Algorithm With A Class Distribution Constraint
W03-0420Maximum Entropy Models For Named Entity Recognition
W03-0425Named Entity Recognition Through Classifier Combination
W03-0505Summarising Legal Texts: Sentential Tense And Argumentative Roles
W03-1007Maximum Entropy Models For FrameNet Classification
W03-1013Log-Linear Models For Wide-Coverage CCG Parsing
W03-1018Evaluation And Extension Of Maximum Entropy Models With Inequality Constraints
W03-1020A Fast Algorithm For Feature Selection In Conditional Maximum Entropy Modeling
W03-1021Training Connectionist Models For The Structured Language Model
W03-1025A Maximum Entropy Chinese Character-Based Parser
W03-1607Criterion For Judging Request Intention In Response Texts Of Open-Ended Questionnaires
W03-1703Utterance Segmentation Using Combined Approach Based On Bi-Directional N-Gram And Maximum Entropy
W03-1718Single Character Chinese Named Entity Recognition
C04-1017Splitting Input Sentence For Machine Translation Using Language Model With Sentence Similarity
C04-1067Chinese And Japanese Word Segmentation Using Word-Level And Character-Level Information
C04-1112A Lemma-Based Approach To A Maximum Entropy Word Sense Disambiguation System For Dutch
C04-1179FrameNet-Based Semantic Parsing Using Maximum Entropy Models
C04-1204Deep Linguistic Analysis For The Accurate Identification Of Predicate-Argument Relations
J04-4002The Alignment Template Approach To Statistical Machine Translation
N04-1001A Statistical Model For Multilingual Entity Detection And Tracking
N04-1034Improved Machine Translation Performance Via Parallel Sentence Extraction From Comparable Corpora
N04-1037The (Non)Utility Of Predicate-Argument Frequencies For Pronoun Interpretation
N04-1039Exponential Priors For Maximum Entropy Models
N04-2003Maximum Entropy Modeling In Sparse Semantic Tagging
N04-4009Competitive Self-Trained Pronoun Interpretation
P04-1014Parsing The WSJ Using CCG And Log-Linear Models
P04-1018A Mention-Synchronous Coreference Resolution Algorithm Based On The Bell Tree
P04-1020Learning Noun Phrase Anaphoricity To Improve Coreference Resolution: Issues In Representation And Optimization
P04-1028Mining Metalinguistic Activity In Corpora To Create Lexical Resources Using Information Extraction Techniques: The MOP System
P04-1085Identifying Agreement And Disagreement In Conversational Speech: Use Of Bayesian Networks To Model Pragmatic Dependencies
W04-0701Multi-Document Person Name Resolution
W04-0832Senseval Automatic Labeling Of Semantic Roles Using Maximum Entropy Models
W04-0859The University Of Alicante Systems At Senseval-3
W04-0860The R2D2 Team At Senseval-3
W04-1007A Rhetorical Status Classifier For Legal Text Summarisation
W04-1108Combining Neural Networks And Statistics For Chinese Word Sense Disambiguation
W04-1802Metalinguistic Information Extraction For Terminology
W04-2419Semantic Role Labeling Using Maximum Entropy Model
W04-2501Strategies For Advanced Question Answering
W04-2502Answering Questions Using Advanced Semantics And Probabilistic Inference
W04-3209Comparing And Combining Generative And Posterior Probability Models: Some Advances In Sentence Boundary Detection In Speech
W04-3225Adaptive Language And Translation Models For Interactive Machine Translation
W04-3233NP Bracketing By Maximum Entropy Tagging And SVM Reranking
W04-3235Error Measures And Bayes Decision Rules Revisited With Applications To POS Tagging
W04-3237Adaptation Of Maximum Entropy Capitalizer: Little Data Can Help A Lot
W04-3248A New Approach For English-Chinese Named Entity Alignment
H05-1012A Maximum Entropy Word Aligner For Arabic-English Machine Translation
H05-1022HMM Word And Phrase Alignment For Statistical Machine Translation
H05-1059Bidirectional Inference With The Easiest-First Strategy For Tagging Sequence Data
H05-1083Multi-Lingual Coreference Resolution With Syntactic Features
H05-1087Maximum Expected F-Measure Training Of Logistic Regression Models
H05-1097Word-Sense Disambiguation For Machine Translation
I05-1004Automatic Image Annotation Using Maximum Entropy Model
I05-1010Automatic Discovery of Attribute Words from Web Documents
I05-1015High Efficiency Realization for a Wide-Coverage Unification Grammar
I05-1018Adapting a Probabilistic Disambiguation Model of an HPSG Parser to a New Domain
I05-1040An Ensemble of Grapheme and Phoneme for Machine Transliteration
I05-1045Exploring Syntactic Relation Patterns for Question Answering
I05-1068Semantic Role Labelling of Prepositional Phrases
I05-1081Towards Robust High Performance Word Sense Disambiguation of English Verbs Using Rich Linguistic Features
I05-2046Using Maximum Entropy to Extract Biomedical Named Entities without Dictionaries
I05-3031Two-Phase LMR-RC Tagging for Chinese Word Segmentation
J05-1003Discriminative Reranking For Natural Language Parsing
J05-4005Chinese Word Segmentation And Named Entity Recognition: A Pragmatic Approach
P05-1017Extracting Semantic Orientations Of Words Using Spin Model
P05-1020Machine Learning For Coreference Resolution: From Local Classification To Global Ranking
P05-1027Question Answering As Question-Biased Term Extraction: A New Approach Toward Multilingual QA
P05-1031Towards Finding And Fixing Fragments: Using ML To Identify Non-Sentential Utterances And Their Antecedents In Multi-Party Dialogue
P05-1037Digesting Virtual Geek Culture: The Summarization Of Technical Internet Relay Chats
P05-1057Log-Linear Models For Word Alignment
P05-1061Simple Algorithms For Complex Relation Extraction With Applications To Biomedical IE
P05-1066Clause Restructuring For Statistical Machine Translation
P05-2024Corpus-Oriented Development Of Japanese HPSG Parsers
W05-0509Climbing The Path To Grammar: A Maximum Entropy Model Of Subject/Object Learning
W05-0612An Expectation Maximization Approach To Pronoun Resolution
W05-0627Semantic Role Labeling System Using Maximum Entropy Classifier
W05-0709The Impact Of Morphological Stemming On Arabic Mention Detection And Coreference Resolution
W05-1304A Machine Learning Approach To Acronym Generation
W05-1505Corrective Modeling For Non-Projective Dependency Parsing
W05-1510Probabilistic Models For Disambiguation Of An HPSG-Based Chart Generator
W05-1511Efficacy Of Beam Thresholding Unification Filtering And Hybrid Parsing In Probabilistic HPSG Parsing
W05-1514Chunk Parsing Revisited
W05-1520Statistical Shallow Semantic Parsing Despite Little Training Data
E06-2002A Web-Based Demonstrator Of A Multi-Lingual Phrase-Based Translation System
E06-2015Semantic Role Labeling For Coreference Resolution
J06-4004N-gram-based Machine Translation
N06-1013A Maximum Entropy Approach To Combining Word Alignments
N06-1025Exploiting Semantic Role Labeling WordNet And Wikipedia For Coreference Resolution
N06-1026Identifying And Analyzing Judgment Opinions
N06-2036Word Domain Disambiguation Via Word Sense Disambiguation
N06-4011Automated Quality Monitoring For Call Centers Using Speech And NLP Technologies
P06-1026Learning The Structure Of Task-Driven Human-Human Dialogs
P06-1042Error Mining In Parsing Results
P06-1071A Progressive Feature Selection Algorithm For Ultra Large Feature Spaces
P06-1073Maximum Entropy Based Restoration Of Arabic Diacritics
P06-1089Guessing Parts-Of-Speech Of Unknown Words Using Global Information
P06-1112Exploring Correlation Of Dependency Relation Paths For Answer Extraction
P06-1129Exploring Distributional Similarity Based Models For Query Spelling Correction
P06-2018Using Machine-Learning To Assign Function Labels To Parser Output For Spanish
P06-2063Automatic Identification Of Pro And Con Reasons In Online Reviews
P06-2089A Best-First Probabilistic Shift-Reduce Parser
P06-2093Continuous Space Language Models For Statistical Machine Translation
P06-2109Trimming CFG Parse Trees For Sentence Compression Using Machine Learning Approaches
W06-0122On Using Ensemble Methods For Chinese Named Entity Recognition
W06-0126Word Segmentation And Named Entity Recognition For SIGHAN Bakeoff3
W06-0130Chinese Named Entity Recognition With Conditional Probabilistic Models
W06-0301Extracting Opinions Opinion Holders And Topics Expressed In Online News Media Text
W06-1314Automatically Detecting Action Items In Audio Meeting Recordings
W06-1617Semantic Role Labeling Of NomBank: A Maximum Entropy Approach
W06-1619Extremely Lexicalized Models For Accurate And Fast HPSG Parsing
W06-1633BESTCUT: A Graph Algorithm For Coreference Resolution
W06-1643A Skip-Chain Conditional Random Field For Ranking Meeting Utterances By Importance
W06-2601Maximum Entropy Tagging With Binary And Real-Valued Features
W06-2922Experiments With A Multilanguage Non-Projective Dependency Parser
W06-2928Dependency Parsing With Reference To Slovene Spanish And Swedish
W06-3108Discriminative Reordering Models For Statistical Machine Translation
D07-1019Improving Query Spelling Correction Using Web Search Results
D07-1051An Approach to Text Corpus Construction which Cuts Annotation Costs and Maintains Reusability of Annotated Data
D07-1053Methods to Integrate a Language Model with Semantic Information for a Word Prediction Component
D07-1077Chinese Syntactic Reordering for Statistical Machine Translation
D07-1082Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem
D07-1111Dependency Parsing and Domain Adaptation with LR Models and Parser Ensembles
N07-1001Exploiting Acoustic and Syntactic Features for Prosody Labeling in a Maximum Entropy Framework
N07-1009Structured Local Training and Biased Potential Functions for Conditional Random Fields with Application to Coreference Resolution
N07-1010Coreference or Not: A Twin Model for Coreference Resolution
N07-1030Joint Determination of Anaphoricity and Coreference Resolution using Integer Programming
N07-1046A Log-Linear Block Transliteration Model based on Bi-Stream HMMs
N07-2037Joint Morphological-Lexical Language Modeling for Machine Translation
N07-2043A High Accuracy Method for Semi-Supervised Information Extraction
P07-1020Statistical Machine Translation through Global Lexical Selection and Sentence Reconstruction
P07-1079HPSG Parsing with Shallow Dependency Constraints
P07-1096Guided Learning for Bidirectional Sequence Classification
P07-1113A Sequencing Model for Situation Entity Classification
P07-2056Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario
W07-0401Chunk-Level Reordering of Source Language Sentences with Automatically Learned Rules for Statistical Machine Translation
W07-0413Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction
W07-0604High-accuracy Annotation and Parsing of CHILDES Transcripts
W07-0610The Topology of Synonymy and Homonymy Networks
W07-0710Training Non-Parametric Features for Statistical Machine Translation
W07-1027Developing Feature Types for Classifying Clinical Notes
W07-1033Reranking for Biomedical Named-Entity Recognition
W07-1110Semantic Labeling of Compound Nominalization in Chinese
W07-1709The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech
W07-2057PNNL: A Supervised Maximum Entropy Approach to Word Sense Disambiguation
W07-2059PU-BCD: Exponential Family Models for the Coarse- and Fine-Grained All-Words Tasks
W07-2202Evaluating Impact of Re-training a Lexical Disambiguation Model on Domain Adaptation of an HPSG Parser
W07-2208A log-linear model with an n-gram reference distribution for accurate HPSG parsing
C08-1041Improving Statistical Machine Translation using Lexicalized Rule Selection
C08-1079Exploring Domain Differences for the Design of a Pronoun Resolution System for Biomedical Text
C08-1083A Discriminative Alignment Model for Abbreviation Recognition
C08-1142Multi-Criteria-Based Strategy to Stop Active Learning for Data Annotation
C08-1143Active Learning with Sampling by Uncertainty and Density for Word Sense Disambiguation and Text Classification
C08-2016Exact Inference for Multi-label Classification using Sparse Graphical Models
D08-1047A Discriminative Candidate Generator for String Transformations
D08-1057Seed and Grow: Augmenting Statistically Generated Summary Sentences using Schematic Word Patterns
D08-1063Mention Detection Crossing the Language Barrier
D08-1097Question Classification using Head Words and their Hypernyms
D08-1101Relative Rank Statistics for Dialog Analysis
I08-1008Name Origin Recognition Using Maximum Entropy Model and Diverse Features
I08-1048Learning a Stopping Criterion for Active Learning for Word Sense Disambiguation and Text Classification
I08-1060Bilingual Synonym Identification with Spelling Variations
I08-2122Towards Data and Goal Oriented Analysis: Tool Inter-operability and Combinatorial Comparison
I08-4011A Two-Stage Approach to Chinese Part-of-Speech Tagging
I08-4012NOKIA Research Center Beijing Chinese Word Segmentation System for the SIGHAN Bakeoff 2007
I08-4014BUPT Systems in the SIGHAN Bakeoff 2007
I08-4026A Study of Chinese Lexical Analysis Based on Discriminative Models
I08-4030CRF-based Hybrid Model for Word Segmentation NER and even POS Tagging
L08-1019Computational Models for Event Type Classification in Context
L08-1133Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
L08-1208Approximating Learning Curves for Active-Learning-Driven Annotation
L08-1227Relationships between Nursing Conversations and Activities
P08-1002Distributional Identification of Non-Referential Pronouns
P08-1033Hedge Classification in Biomedical Texts with a Weakly Supervised Selection of Keywords
P08-1056Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER
P08-1115Generalizing Word Lattice Translation
P08-2001Language Dynamics and Capitalization using Maximum Entropy
W08-0206The Evolution of a Statistical NLP Course
W08-0404Generalizing Local Translation Models
W08-0504Evaluating the Effects of Treebank Size in a Practical Application for Parsing




Top Similar Papers
By Title
ID Title
I05-3025A Maximum Entropy Approach to Chinese Word Segmentation
N06-1013A Maximum Entropy Approach To Combining Word Alignments
W02-1116A Maximum Entropy Approach To HowNet-Based Chinese Word Sense Disambiguation
N03-2008A Maximum Entropy Approach To FrameNet Tagging
W03-0423Named Entity Recognition With A Maximum Entropy Approach
W96-0213A Maximum Entropy Model For Part-Of-Speech Tagging
A97-1004A Maximum Entropy Approach To Identifying Sentence Boundaries
W03-1025A Maximum Entropy Chinese Character-Based Parser
C04-1179FrameNet-Based Semantic Parsing Using Maximum Entropy Models
W02-0401Using Maximum Entropy For Sentence Extraction


By Full Text
ID Title
W97-0304Text Segmentation Using Exponential Models
H94-1028The Candide System For Machine Translation
J93-2003The Mathematics Of Statistical Machine Translation: Parameter Estimation
W05-0835A Recursive Statistical Translation Model
P97-1048A Model Of Lexical Attraction And Repulsion
J01-2004Probabilistic Top-Down Parsing And Language Modeling
W00-1320A Statistical Model For Parsing And Word-Sense Disambiguation
D08-1023Probabilistic Inference for Machine Translation
W08-0333Fast Easy and Cheap: Construction of Statistical Machine Translation Models with MapReduce
W02-1020User-Friendly Text Prediction For Translators


By Co-citation
ID Title Num Co-citations
J93-2004Building A Large Annotated Corpus Of English: The Penn Treebank 30
J93-2003The Mathematics Of Statistical Machine Translation: Parameter Estimation 29
W96-0213A Maximum Entropy Model For Part-Of-Speech Tagging 25
P02-1038Discriminative Training And Maximum Entropy Models For Statistical Machine Translation 22
W97-0301A Linear Observed Time Statistical Parser Based On Maximum Entropy Models 20
P02-1040Bleu: A Method For Automatic Evaluation Of Machine Translation 17
W02-2018A Comparison Of Algorithms For Maximum Entropy Parameter Estimation 17
J03-1002A Systematic Comparison Of Various Statistical Alignment Models 15
P03-1021Minimum Error Rate Training In Statistical Machine Translation 15
A97-1004A Maximum Entropy Approach To Identifying Sentence Boundaries 15


Citation Summary
Citing sentences
P08-1056 1 12:204 These belong to two main categories based on machine learning (Bikel et al., 1997; Borthwick, 1999; McCallum and Li, 2003) and language or domain specific rules (Grishman, 1995; Wakao et al., 1996).
P08-1056 2 36:204 Given a set of features and a training corpus, the MaxEnt estimation process produces a model in which every feature $f_i$ has a weight $\alpha_i$. We can compute the conditional probability as (Berger et al., 1996): $p(o|h) = \frac{1}{Z(h)} \prod_i \alpha_i^{f_i(h,o)}$ (1), $Z(h) = \sum_o \prod_i \alpha_i^{f_i(h,o)}$ (2). The conditional probability of the outcome is the product of the weights of all active features, normalized over the products of all the features.
P06-1129 3 89:161 The maximum entropy model (Berger et al., 1996) provides us with a well-founded framework for this purpose, which has been extensively used in natural language processing tasks ranging from part-of-speech tagging to machine translation.
W08-0206 4 47:179 For instance, for Maximum Entropy, I picked (Berger et al., 1996; Ratnaparkhi, 1997) for the basic theory, (Ratnaparkhi, 1996) for an application (POS tagging in this case), and (Klein and Manning, 2003) for more advanced topics such as optimization and smoothing.
N04-1039 5 6:180 1 Introduction Conditional Maximum Entropy (maxent) models have been widely used for a variety of tasks, including language modeling (Rosenfeld, 1994), part-of-speech tagging, prepositional phrase attachment, and parsing (Ratnaparkhi, 1998), word selection for machine translation (Berger et al. , 1996), and finding sentence boundaries (Reynar and Ratnaparkhi, 1997).
W05-1510 6 21:201 The forest representation was obtained by adopting chart generation (Kay, 1996; Carroll et al., 1999) where ambiguous candidates are packed into an equivalence class and mapping a chart into a forest in the same way as parsing.
W05-1510 7 80:201 2.3 Probabilistic models for generation with HPSG Some existing studies on probabilistic models for HPSG parsing (Malouf and van Noord, 2004; Miyao and Tsujii, 2005) adopted log-linear models (Berger et al. , 1996).
W03-1021 8 52:164 We should note from equation 4 that the neural network model is similar in functional form to the maximum entropy model (Berger et al. , 1996) except that the neural network learns the feature functions by itself from the training data.
P03-1061 9 32:260 One is to find unknown words from corpora and put them into a dictionary (e.g. , (Mori and Nagao, 1996)), and the other is to estimate a model that can identify unknown words correctly (e.g. , (Kashioka et al. , 1997; Nagata, 1999)).
P03-1061 10 82:260 We implemented this model within an ME modeling framework (Jaynes, 1957; Jaynes, 1979; Berger et al. , 1996).
J99-1004 11 71:455 Among the most widely studied is the Gibbs distribution (Mark, Miller, and Grenander 1996; Mark et al. 1996; Mark 1997; Abney 1997).
J99-1004 12 50:455 The theory has been applied in probabilistic language modeling (Mark, Miller, and Grenander 1996; Mark et al. 1996; Johnson 1998), natural language processing (Berger, Della Pietra, and Della Pietra 1996; Della Pietra, Della Pietra, and Lafferty 1997), as well as computational vision (Zhu, Wu, and Mumford 1997).
P98-2214 13 25:151 As a model learning method, we adopt the maximum entropy model learning method (Della Pietra et al. , 1997; Berger et al. , 1996).
C08-1079 14 70:160 3 Implementation 3.1 Pronoun resolution model We built a machine learning based pronoun resolution engine using a Maximum Entropy ranker model (Berger et al., 1996), similar to Denis and Baldridge's model (Denis and Baldridge, 2007).
W03-1007 15 67:203 3.2 Maximum Entropy ME models implement the intuition that the best model will be the one that is consistent with the set of constrains imposed by the evidence, but otherwise is as uniform as possible (Berger et al. , 1996).
W97-1005 16 54:223 The approach made use of a maximum entropy model (Berger et al. , 1996) formulated from frequency information for various combinations of the observed features.
W97-1005 17 46:223 Statistical and information theoretic approaches (Hindle and Rooth, 1993), (Ratnaparkhi et al. , 1994),(Collins and Brooks, 1995), (Franz, 1996) Using lexical collocations to determine PPA with statistical techniques was first proposed by (Hindle and Rooth, 1993).
W06-1643 18 151:186 We performed feature selection by incrementally growing a log-linear model with order0 features f(x,yt) using a forward feature selection procedure similar to (Berger et al. , 1996).
P03-1055 19 107:180 The other main difference is the apparently nonlocal nature of the problem, which motivates our choice of a Maximum Entropy (ME) model for the tagging task (Berger et al. , 1996).
I05-3031 20 7:141 As the task is an important precursor to many natural language processing systems, it receives a lot of attentions in the literature for the past decade (Wu and Tseng, 1993; Sproat et al., 1996).
W07-0604 21 100:187 (2006), but we use a maximum entropy classifier (Berger et al. , 1996) to determine parser actions, which makes parsing extremely fast.
N07-1030 22 40:194 Model parameters are estimated using maximum entropy (Berger et al. , 1996).
W03-0401 23 17:185 Recently used machine learning methods including maximum entropy models (Berger et al. , 1996) and support vector machines (Vapnik, 1995) provide grounds for this type of modeling, because it allows various dependent features to be incorporated into the model without the independence assumption.
W03-0401 24 111:185 A possible solution to this problem is to directly estimate p(A|w) by applying a maximum entropy model (Berger et al. , 1996).
W03-0401 25 160:185 The parsing algorithm was CKY-style parsing with beam thresholding, which was similar to ones used in (Collins, 1996; Clark et al. , 2002).
W07-2202 26 50:258 The disambiguation model of Enju is based on a feature forest model (Miyao and Tsujii, 2002), which is a log-linear model (Berger et al. , 1996) on packed forest structure.
W03-0420 27 16:80 Thus, we obtain the following second-order model: $Pr(c_1^N | w_1^N) = \prod_{n=1}^{N} Pr(c_n | c_1^{n-1}, w_1^N) = \prod_{n=1}^{N} p(c_n | c_{n-2}^{n-1}, w_{n-2}^{n+2})$, where the second equality is the model assumption. A well-founded framework for directly modeling the posterior probability $p(c_n | c_{n-2}^{n-1}, w_{n-2}^{n+2})$ is maximum entropy (Berger et al., 1996).
W03-0420 28 4:80 1 Introduction In this paper, we present an approach for extracting the named entities (NE) of natural language inputs which uses the maximum entropy (ME) framework (Berger et al. , 1996).
P01-1042 29 16:167 In statistical computational linguistics, maximum conditional likelihood estimators have mostly been used with general exponential or maximum entropy models because standard maximum likelihood estimation is usually computationally intractable (Berger et al. , 1996; Della Pietra et al. , 1997; Jelinek, 1997).
W03-1020 30 54:154 The goal of each selection stage is to select the feature f that maximizes the gain of the log likelihood, where the $\alpha$ and gain of f are derived through following steps: Let the log likelihood of the model be $L(p) \equiv \sum_{x,y} \tilde{p}(x,y) \log(\mathrm{sum}(y|x)/Z(x))$ and the empirical expectation of feature f be $E_{\tilde{p}}(f) = \sum_{x,y} \tilde{p}(x,y) f(x,y)$. With the approximation assumption in Berger et al (1996)'s paper, the un-normalized component and the normalization factor of the model have the following recursive forms: $\mathrm{sum}^{\alpha}_{S \cup f}(y|x) = \mathrm{sum}_S(y|x)\, e^{\alpha f(x,y)}$ [recursion for $Z^{\alpha}_{S \cup f}(x)$ illegible in source]. The approximate gain of the log likelihood is computed by $G_{S \cup f}(\alpha) \equiv L(p^{\alpha}_{S \cup f}) - L(p_S) = -\sum_x \tilde{p}(x) \log(Z_{S \cup f,\alpha}(x)/Z_S(x)) + \alpha E_{\tilde{p}}(f)$ (1). The maximum approximate gain and its corresponding $\alpha$ are represented as: $\Delta L(S,f) = \max_{\alpha} G_{S \cup f}(\alpha)$, $\tilde{\alpha}(S,f) = \arg\max_{\alpha} G_{S \cup f}(\alpha)$. 3 A Fast Feature Selection Algorithm The inefficiency of the IFS algorithm is due to the following reasons.
W03-1020 31 8:154 1 Introduction Maximum Entropy (ME) modeling has received a lot of attention in language modeling and natural language processing for the past few years (e.g. , Rosenfeld, 1994; Berger et al 1996; Ratnaparkhi, 1998; Koeling, 2000).
W03-1020 32 134:154 A more refined algorithm, the incremental feature selection algorithm by Berger et al. (1996), allows one feature to be added at each selection stage and at the same time keeps the estimated parameter values for the features selected in the previous stages.
W03-1020 33 51:154 In contrast to what is shown in Berger et al. (1996)'s paper, here is how the different values in this variant of the IFS algorithm are computed.
C00-1061 34 13:187 1, 2 show the examples of various transliterations in KTSET 2.0 (Park et al., 1996).
A00-2026 35 149:168 Our approach differs from the corpus-based surface generation approaches of (Langkilde and Knight, 1998) and (Berger et al. , 1996).
A00-2026 36 18:168 There are more sophisticated surface generation packages, such as FUF/SURGE (Elhadad and Robin, 1996), KPML (Bateman, 1996), MUMBLE (Meteer et al. , 1987), and RealPro (Lavoie and Rambow, 1997), which produce natural language text from an abstract semantic representation.
A00-2026 37 20:168 The only trainable approaches (known to the author) to surface generation are the purely statistical machine translation (MT) systems such as (Berger et al. , 1996) and the corpus-based generation system described in (Langkilde and Knight, 1998).
A00-2026 38 44:168 The form of the maximum entropy probability model is identical to the one used in (Berger et al., 1996; Ratnaparkhi, 1998): $p(w_i \mid w_{i-1}, w_{i-2}, attr_i) = \frac{\prod_{j=1}^{k} \alpha_j^{f_j(w_i, w_{i-1}, w_{i-2}, attr_i)}}{Z(w_{i-1}, w_{i-2}, attr_i)}$, with $Z(w_{i-1}, w_{i-2}, attr_i) = \sum_{w'} \prod_{j=1}^{k} \alpha_j^{f_j(w', w_{i-1}, w_{i-2}, attr_i)}$, where $w_i$ ranges over $V \cup \{*stop*\}$.
A00-2026 39 47:168 The features used in NLG2 are described in the next section, and the feature weights $\alpha_j$, obtained from the Improved Iterative Scaling algorithm (Berger et al., 1996), are set to maximize the likelihood of the training data.
A00-2026 40 21:168 The MT systems of (Berger et al. , 1996) learn to generate text in the target language straight from the source language, without the aid of an explicit semantic representation.
W02-0811 41 81:196 For the maximum entropy classifier, we estimate the weights by maximizing the likelihood of a heldout set, using the standard IIS algorithm (Berger et al. , 1996).
C00-1064 42 9:273 Thus, a lot of alignment techniques have been suggested at the sentence (Gale et al., 1993), phrase (Shin et al., 1996), noun phrase (Kupiec, 1993), word (Brown et al., 1993; Berger et al., 1996; Melamed, 1997), collocation (Smadja et al., 1996) and terminology level.
C00-1064 43 18:273 Wu (1996) adopted channels that eliminate syntactically unlikely alignments and Wang et al.
C00-1064 44 107:273 4 Maximum Entropy To explain our method, we briefly describe the concept of maximum entropy. Recently, many approaches based on the maximum entropy model have been applied to natural language processing (Berger et al., 1994; Berger et al., 1996; Pietra et al., 1997).
C00-1064 45 151:273 We referred to the studies of (Berger et al., 1996; Pietra et al., 1997).
W04-0860 46 18:118 The supervised methods are based on Maximum Entropy (ME) (Lau et al. , 1993; Berger et al. , 1996; Ratnaparkhi, 1998), neural network using the Learning Vector Quantization algorithm (Kohonen, 1995) and Specialized Hidden Markov Models (Pla, 2000).
W03-1018 47 7:182 1 Introduction The maximum entropy model (Berger et al. , 1996; Pietra et al. , 1997) has attained great popularity in the NLP field due to its power, robustness, and successful performance in various NLP tasks (Ratnaparkhi, 1996; Nigam et al. , 1999; Borthwick, 1999).
W00-0729 48 14:54 In the last few years there has been an increasing interest in applying MaxEnt models for NLP applications (Ratnaparkhi, 1998; Berger et al. , 1996; Rosenfeld, 1994; Ristad, 1998).
W08-1302 49 23:177 2 Background: MaxEnt Models Maximum Entropy (MaxEnt) models are widely used in Natural Language Processing (Berger et al., 1996; Ratnaparkhi, 1997; Abney, 1997).
P08-2001 50 42:91 The modeling approach here described is discriminative, and is based on maximum entropy (ME) models, first applied to natural language problems in (Berger et al., 1996).
P05-1017 51 126:221 Another interesting point is the relation to maximum entropy model (Berger et al. , 1996), which is popular in the natural language processing community.
W02-2018 52 17:125 A conditional maximum entropy model $q(x \mid w)$ for $p$ has the parametric form (Berger et al., 1996; Chi, 1998; Johnson et al., 1999): $q(x \mid w) = \frac{\exp(\theta^T f(x))}{\sum_{y \in Y(w)} \exp(\theta^T f(y))}$ (1) where $\theta$ is a d-dimensional parameter vector and $\theta^T f(x)$ is the inner product of the parameter vector and a feature vector.
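The parametric form quoted above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited papers; the function name `maxent_probs`, the toy candidate labels, the feature vectors, and the weights are all hypothetical.

```python
import math

def maxent_probs(theta, feats):
    """Return q(y|w) for every candidate y in Y(w).

    theta: list of d weights (the parameter vector).
    feats: dict mapping each candidate y to its d-dimensional feature vector f(y).
    """
    # Unnormalized log-scores: the inner product theta^T f(y).
    scores = {y: sum(t * v for t, v in zip(theta, f)) for y, f in feats.items()}
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())  # the denominator of the parametric form
    return {y: e / z for y, e in exps.items()}

# Hypothetical toy setup: two features, three candidate outcomes.
theta = [1.0, -0.5]
feats = {"NOUN": [1.0, 0.0], "VERB": [0.0, 1.0], "ADJ": [0.0, 0.0]}
probs = maxent_probs(theta, feats)
```

The denominator sums over every candidate in Y(w), which is the step the optimizations mentioned in the surrounding citations aim to speed up.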
W02-2018 53 7:125 In natural language processing, recent years have seen ME techniques used for sentence boundary detection, part of speech tagging, parse selection and ambiguity resolution, and stochastic attribute-value grammars, to name just a few applications (Abney, 1997; Berger et al. , 1996; Ratnaparkhi, 1998; Johnson et al. , 1999).
W02-2018 54 79:125 Finally, it should be noted that in the current implementation, we have not applied any of the possible optimizations that appear in the literature (Lafferty and Suhm, 1996; Wu and Khudanpur, 2000; Lafferty et al., 2001) to speed up normalization of the probability distribution q. These improvements take advantage of a model's structure to simplify the evaluation of the denominator in (1).
P02-1063 55 6:161 Various learning models have been studied such as Hidden Markov models (HMMs) (Rabiner and Juang, 1993), decision trees (Breiman et al. , 1984) and maximum entropy models (Berger et al. , 1996).
P98-2140 56 82:149 We also do not require a newly added feature to be either atomic or a collocation of an atomic feature with a feature already included into the model as it was proposed in (Della Pietra et al. , 1995) (Berger et al. , 1996).
P98-2140 57 110:149 We adopted the stop condition suggested in (Berger et al., 1996): the maximization of the likelihood on a cross-validation set of samples which is unseen at the parameter estimation.
P98-2140 58 20:149 To make feature ranking computationally tractable in (Della Pietra et al., 1995) and (Berger et al., 1996) a simplified process was proposed: at the feature ranking stage when adding a new feature to the model, all previously computed parameters are kept fixed and, thus, we have to fit only one new constraint imposed by the candidate feature.
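The simplified ranking step described above (all existing parameters fixed, only the candidate's weight fitted) can be sketched as follows. This is a minimal sketch under stated assumptions: the gain formula is the approximate gain from Berger et al. (1996), $G(\alpha) = -\sum_x \tilde{p}(x) \log Z_\alpha(x) + \alpha \tilde{p}(f)$ with $Z_\alpha(x) = \sum_y p_S(y \mid x) e^{\alpha f(x, y)}$, while the function names, the ternary search, and the toy data are hypothetical choices made for illustration.

```python
import math

def approx_gain(alpha, p_x, p_model, f, emp_f):
    """Approximate log-likelihood gain of adding feature f with weight alpha.

    p_x: empirical distribution over contexts x.
    p_model: current model, p_model[x][y] = p_S(y|x), held fixed.
    emp_f: empirical expectation of the candidate feature f.
    """
    gain = alpha * emp_f
    for x, px in p_x.items():
        z = sum(p * math.exp(alpha * f(x, y)) for y, p in p_model[x].items())
        gain -= px * math.log(z)
    return gain

def best_alpha(p_x, p_model, f, emp_f, lo=-10.0, hi=10.0, iters=100):
    """Ternary search for the maximizing alpha; the gain is concave in alpha."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if approx_gain(a, p_x, p_model, f, emp_f) < approx_gain(b, p_x, p_model, f, emp_f):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2

# Hypothetical toy problem: one context, a uniform current model over two outcomes,
# and a candidate feature that fires on outcome "a", observed 80% of the time.
p_x = {"ctx": 1.0}
p_model = {"ctx": {"a": 0.5, "b": 0.5}}
f = lambda x, y: 1.0 if y == "a" else 0.0
alpha = best_alpha(p_x, p_model, f, emp_f=0.8)
gain = approx_gain(alpha, p_x, p_model, f, emp_f=0.8)
```

Ranking candidates then amounts to computing this one-dimensional maximum for each candidate and keeping the feature with the largest gain.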
W06-2922 59 10:94 Using Maximum Entropy (Berger, et al. 1996) classifiers I built a parser that achieves a throughput of over 200 sentences per second, with a small loss in accuracy of about 23 %.
W97-0319 60 41:269 and Itai, 1990; Dagan et al., 1995; Kennedy and Boguraev, 1996a; Kennedy and Boguraev, 1996b).
W97-0319 61 8:269 Figure 1 exhibits this scenario with a typical IE system such as SRI's FASTUS system (Hobbs et al. , 1996).
W97-0319 62 79:269 (1996) show that this model is a member of an exponential family with one parameter for each constraint, specifically a model of the form $p(y \mid x) = \frac{1}{Z(x)} e^{\sum_i \lambda_i f_i(x, y)}$, in which $Z(x) = \sum_y e^{\sum_i \lambda_i f_i(x, y)}$. The parameters $\lambda_1, \ldots, \lambda_n$ are Lagrange multipliers that impose the constraints corresponding to the chosen features $f_1, \ldots, f_n$. The term $Z(x)$ normalizes the probabilities by summing over all possible outcomes y. Berger et al.
W07-2059 63 6:119 Exponential family models are a mainstay of modern statistical modeling (Brown, 1986) and they are widely and successfully used for example in text classification (Berger et al. , 1996).
N07-1010 64 84:197 Once the set of features functions are selected, algorithm such as improved iterative scaling (Berger et al. , 1996) or sequential conditional generalized iterative scaling (Goodman, 2002) can be used to find the optimal parameter values of fkg and fig.
N07-1010 65 81:197 3 Implementation 3.1 Feature Structure To implement the twin model, we adopt the log linear or maximum entropy (MaxEnt) model (Berger et al. , 1996) for its flexibility of combining diverse sources of information.
W05-1520 66 14:40 2.2 Maximum Entropy Our next approach is the Maximum Entropy (Berger et al. , 1996) classification approach.
P06-2093 67 64:213 Several algorithms have been proposed in the literature that try to find the best splits, see for instance (Berger et al. , 1996).
I08-2122 68 98:113 Uses Maximum Entropy (Berger et al., 1996) classification, trained on JNLPBA (Kim et al., 2004) (NER).
W03-0417 69 7:168 State-of-the-art machine learning techniques including Support Vector Machines (Vapnik, 1995), AdaBoost (Schapire and Singer, 2000) and Maximum Entropy Models (Ratnaparkhi, 1998; Berger et al., 1996) provide high performance classifiers if one has abundant correctly labeled examples.
C04-1067 70 37:135 The candidates of unknown words can be generated by heuristic rules (Matsumoto et al., 2001) or statistical word models which predict the probabilities for any strings to be unknown words (Sproat et al., 1996; Nagata, 1999).
C04-1067 71 72:135 In the above equation, $P(t_i)$ and $P(w_i, t)$ are estimated by the maximum-likelihood method, and the probability of a POC tag $t_i$, given a character $w_i$ ($P(t_i \mid w_i, t_i \in T_{POC})$) is estimated using ME models (Berger et al., 1996).
W07-1027 72 13:39 Maximum Entropy Modeling (MaxEnt) (Berger et al. , 1996) and Support Vector Machine (SVM) (Vapnik, 1995) were used to build the classifiers in our solution.
J00-3003 73 542:607 Suhm and Waibel (1994) and Eckert, Gallwitz, and Niemann (1996) each condition a recognizer LM on left-to-right DA predictions and are able to show reductions in word error rate of 1% on task-oriented corpora.
J00-3003 74 536:607 Computational approaches to prosodic modeling of DAs have aimed to automatically extract various prosodic parameters--such as duration, pitch, and energy patterns--from the speech signal (Yoshimura et al. [1996]; Taylor et al. [1997]; Kompe [1997], among others).
J00-3003 75 92:607 Automatic segmentation of spontaneous speech is an open research problem in its own right (Mast et al. 1996; Stolcke and Shriberg 1996).
J00-3003 76 473:607 (1996), Warnke et al.
J00-3003 77 518:607 The idea caught on very quickly: Suhm and Waibel (1994), Mast et al. (1996), Warnke et al.
P08-1002 78 163:251 For classification, we use a maximum entropy model (Berger et al., 1996), from the logistic regression package in Weka (Witten and Frank, 2005), with all default parameter settings.
W02-0813 79 24:111 Under the maximum entropy framework (Berger et al. , 1996), evidence from different features can be combined with no assumptions of feature independence.
W97-0121 80 227:275 We adopted the stop condition suggested in Berger et al. 1996: the maximization of the likelihood on a cross-validation set of samples which is unseen at the parameter estimation.
W97-0121 81 51:275 Our method uses assumptions similar to Berger et al. 1996 but is naturally suitable for distributed parallel computations.
W97-0121 82 91:275 Berger et al. 1996 presented a way of computing conditional maximum entropy models directly by modifying equation 6 as follows (now instead of w we will explicitly use (x, y) ): i ~Cx~) = ~ f~(~, y) * ~(~, y) ~ ~ .~(~, y) * ~(~) * pCy I ~) = p(xk) (9) x6X yEY xEX yEY where ~(x, y) is an empirical probability of a joint configuration (w) of certain instantiated factor I variables with certain instantiated behavior variables.
W97-0121 83 199:275 First as the configuration space we can use only the reference nodes (w) from the lattice which makes it similar to the method of Berger et al. 1996 described in section 2.1.
W97-0121 84 109:275 To make feature ranking computationally tractable in Della Pietra et al. 1995 and Berger et al. 1996 a simplified process proposed: at the feature ranking stage when adding a new feature to the model all previously computed parameters are kept fixed and, thus, we have to fit only one new constraint imposed by a candidate feature.
P06-2089 85 78:169 One such approach is maximum entropy classification (Berger et al., 1996), which we use in the form of a library implemented by Tsuruoka and used in his classifier-based parser (Tsuruoka and Tsujii, 2005).
W98-1118 86 188:231 Other recent work has applied M.E. to language modeling (Rosenfeld, 1994), machine translation (Berger et al. , 1996), and reference resolution (Kehler, 1997).
W98-1118 87 30:231 More complete discussions of M.E. as applied to computational linguistics, including a description of the M.E. estimation procedure can be found in (Berger et al. , 1996) and (Della Pietra et al. , 1995).
W98-1118 88 28:231 This allows us to compute the conditional probability as follows (Berger et al., 1996): $P(f \mid h) = \frac{\prod_i \alpha_i^{g_i(h, f)}}{Z_\alpha(h)}$ (2), $Z_\alpha(h) = \sum_f \prod_i \alpha_i^{g_i(h, f)}$ (3). The maximum entropy estimation technique guarantees that for every feature $g_i$, the expected value of $g_i$ according to the M.E. model will equal the empirical expectation of $g_i$ in the training corpus.
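The product form of the conditional probability quoted above, with one multiplicative weight per feature, can be sketched as below. This is only an illustration: the function name `prob_product_form`, the two toy features, and the weight values are hypothetical, and the history is represented as a plain dictionary for convenience.

```python
import math

def prob_product_form(alphas, features, h, futures):
    """P(f|h) as a normalized product of alpha_i ** g_i(h, f) over all features."""
    unnorm = {
        f: math.prod(a ** g(h, f) for a, g in zip(alphas, features))
        for f in futures
    }
    z = sum(unnorm.values())  # Z_alpha(h): sum over all futures
    return {f: u / z for f, u in unnorm.items()}

# Hypothetical binary features g_i(h, f) over a history h and a future (label) f.
features = [
    lambda h, f: int(f == "person" and h["capitalized"]),
    lambda h, f: int(f == "other"),
]
alphas = [3.0, 2.0]
futures = ["person", "other"]
p = prob_product_form(alphas, features, {"capitalized": True}, futures)
```

Setting $\lambda_i = \log \alpha_i$ turns the same model into the exponential form $\exp(\sum_i \lambda_i g_i(h, f)) / Z(h)$ used in several of the other citing passages, so the two notations describe the same family.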
W98-1118 89 142:231 Clearly a more sophisticated feature selection routine such as the ones in (Berger et al. , 1996), or (Berger and Printz, 1998) would be required in this case.
W02-1002 90 65:249 Unconstrained CL corresponds exactly to a conditional maximum entropy model (Berger et al. , 1996; Lafferty et al. , 2001).
N07-2043 91 49:82 For the classifier, we used the OpenNLP MaxEnt implementation (maxent.sourceforge.net) of the maximum entropy classification algorithm (Berger et al. 1996).
N07-2043 92 12:82 To reduce the knowledge engineering burden on the user in constructing and porting an IE system, unsupervised learning has been utilized, e.g. Riloff (1996), Yangarber et al.
N06-1013 93 8:176 Maximum entropy (ME) models have been used in bilingual sense disambiguation, word reordering, and sentence segmentation (Berger et al. , 1996), parsing, POS tagging and PP attachment (Ratnaparkhi, 1998), machine translation (Och and Ney, 2002), and FrameNet classification (Fleischman et al. , 2003).
N06-1013 94 24:176 Given a collection of facts, ME chooses a model consistent with all the facts, but otherwise as uniform as possible (Berger et al. , 1996).
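The idea of being consistent with the facts but otherwise as uniform as possible can be checked numerically on a small case. The sketch below assumes a hypothetical fact over five outcomes (outcomes "a" and "b" together carry 30% of the probability mass) and compares two distributions that both satisfy it; the outcome names and numbers are illustrative only.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def satisfies(p):
    """Check the hypothetical fact p(a) + p(b) = 0.3 and that p sums to one."""
    return abs(p["a"] + p["b"] - 0.3) < 1e-12 and abs(sum(p.values()) - 1.0) < 1e-12

# Mass spread evenly inside each constrained group: the maximum entropy choice
# when the only facts are the group totals.
maxent = {"a": 0.15, "b": 0.15, "c": 0.7 / 3, "d": 0.7 / 3, "e": 0.7 / 3}
# Another distribution that satisfies the same fact but commits to more than it states.
skewed = {"a": 0.30, "b": 0.00, "c": 0.70, "d": 0.00, "e": 0.00}

h_maxent = entropy(maxent)
h_skewed = entropy(skewed)
```

Both distributions agree with the stated fact, but the evenly spread one has the higher entropy, which is why it is the one the maximum entropy principle selects.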
N04-1037 95 51:189 Maximum Entropy Modeling As previously indicated, the weight-based scheme of L&L suggests MaxEnt modeling (Berger et al. , 1996) as a particularly natural choice for a machine learning approach.
P07-1079 96 5:260 1 Introduction Several efficient, accurate and robust approaches to data-driven dependency parsing have been proposed recently (Nivre and Scholz, 2004; McDonald et al. , 2005; Buchholz and Marsi, 2006) for syntactic analysis of natural language using bilexical dependency relations (Eisner, 1996).
P07-1079 97 34:260 The disambiguation model of this parser is based on a maximum entropy model (Berger et al. , 1996).
C04-1017 98 17:160 In previous research on splitting sentences, many methods have been based on word-sequence characteristics like N-gram (Lavie et al. , 1996; Berger et al. , 1996; Nakajima and Yamamoto, 2001; Gupta et al. , 2002).
W02-0301 99 103:341 We use the maximum entropy tagging method described in (Kazama et al. , 2001) for the experiments, which is a variant of (Ratnaparkhi, 1996) modified to use HMM state features.
W02-0301 100 23:341 Support Vector Machines (SVMs) (Vapnik, 1995) and Maximum Entropy (ME) method (Berger et al. , 1996) are powerful learning methods that satisfy such requirements, and are applied successfully to other NLP tasks (Kudo and Matsumoto, 2000; Nakagawa et al. , 2001; Ratnaparkhi, 1996).
P02-1038 101 36:155 An especially well-founded framework for doing this is maximum entropy (Berger et al. , 1996).
P08-1115 102 15:179 Formally, the approach we take can be thought of as a noisier channel, where an observed signal o gives rise to a set of source-language strings $f' \in F(o)$ and we seek $\hat{e} = \arg\max_e \max_{f' \in F(o)} \Pr(e, f' \mid o)$ (2) $= \arg\max_e \max_{f' \in F(o)} \Pr(e) \Pr(f' \mid e, o)$ (3) $= \arg\max_e \max_{f' \in F(o)} \Pr(e) \Pr(f' \mid e) \Pr(o \mid f')$ (4). Following Och and Ney (2002), we use the maximum entropy framework (Berger et al., 1996) to directly model the posterior $\Pr(e, f' \mid o)$ with parameters tuned to minimize a loss function representing the quality only of the resulting translations.
P05-1027 103 46:223 The Maximum Entropy Principle (Berger et al., 1996) is to find a model $p^* = \arg\max_{p \in C} H(p)$, which means a probability model p(y|x) that maximizes entropy H(p).
W08-1130 104 2:37 We use discourse-level feature predicates in a maximum entropy classifier (Berger et al., 1996) with binary and n-class classification to select referring expressions from a list.
W08-1130 105 12:37 These feature functions $f_i$ were used to train a maximum entropy classifier (Berger et al., 1996) (Le, 2004) that assigns a probability to a RE re given context cx as follows: $p(re \mid cx) = \frac{1}{Z(cx)} \exp \sum_{i=1}^{n} \lambda_i f_i(cx, re)$ where Z(cx) is a normalizing sum and the $\lambda_i$ are the parameters (feature weights) learned.
P07-1020 106 109:189 This logistic regression is also called Maxent as it finds the distribution with maximum entropy that properly estimates the average of each feature over the training data (Berger et al. , 1996).
W00-0707 107 16:89 In previous work (Foster, 2000), I described a Maximum Entropy/Minimum Divergence (MEMD) model (Berger et al., 1996) for $p(w \mid h_i, s)$ which incorporates a trigram language model and a translation component which is an analog of the well-known IBM translation model 1 (Brown et al., 1993).
W00-0707 108 26:89 For a given choice of q and f, the IIS algorithm (Berger et al. , 1996) can be used to find maximum likelihood values for the parameters ~.
W06-1314 109 64:173 We apply a maximum entropy (maxent) model (Berger et al. , 1996) to this task.
W06-1314 110 22:173 Research on DA classification initially focused on two-party conversational speech (Mast et al., 1996; Stolcke et al., 1998; Shriberg et al., 1998) and, more recently, has extended to multi-party audio recordings like the ICSI corpus (Shriberg et al., 2004).
N03-2008 111 34:81 (Berger et al. , 1996).
N07-1046 112 63:233 This sequential property is well suited to HMMs (Vogel et al. , 1996), in which the jumps from the current aligned position can only be forward.
N07-1046 113 128:233 With hand-labeled data, {m} can be learnt via generalized iterative scaling algorithm (GIS) (Darroch and Ratcliff, 1972) or improved iterative scaling (IIS) (Berger et al., 1996).
W06-1617 114 36:156 Since its introduction to the Natural Language Processing (NLP) community (Berger et al. , 1996), ME-based classifiers have been shown to be effective in various NLP tasks.
W05-1505 115 123:197 For a more detailed introduction to maximum entropy estimation see (Berger et al. , 1996).
W04-1802 116 87:141 Figures 1 and 2 present best results in the learning experiments for the complete set of patterns used in the collocation approach, over two of our evaluation corpora.
Type  Positions  Tags/Words  Features  Accuracy  Precision  Recall
GIS   1          W           1254      0.97      0.96       0.98
IIS   1          T           136       0.95      0.96       0.94
NB    1          T           136       0.88      0.97       0.84
Footnote 9: see Rish, 2001, Ratnaparkhi, 1997 and Berger et al., 1996 for a formal description of these algorithms.
P04-1018 117 68:216 Effective training algorithm exists (Berger et al. , 1996) once the set of features a42 a57 a16 a1a33a8 a71a54a8 a71a100a85a68a5 a53 is selected.
P04-1018 118 67:216 We use maximum entropy model (Berger et al. , 1996) for both the mention-pair model (9) and the entity-mention model (8): a83a84a1a86a85a88a87 a43 a44 a71 a43 a16 a5a13a7 a55a35a34a23a36 a6a35a37 a6a39a38a40a6a42a41 a31a44a43a3a45a31 a6 a45a46a48a47a24a49 a50 a1 a43 a44 a71 a43 a16 a5 a71 (10) a83a84a1a4a85 a87 a55 a81 a71 a43 a16 a5a13a7 a55a35a34 a36 a6 a37 a6a39a38a40a6a42a41 a11a7a32 a45a31 a6 a45a46a48a47 a49 a50 a1 a55a39a81 a71 a43 a16 a5 a71 (11) wherea57 a16 a1a51a8 a71a52a8 a71a90a85a73a5 is a feature and a53 a16 is its weight; a50 a1a33a8 a71a54a8a5 is a normalizing factor to ensure that (10) or (11) is a probability.
P05-1057 119 12:247 Heuristic approaches obtain word alignments by using various similarity functions between the types of the two languages (Smadja et al. , 1996; Ker and Chang, 1997; Melamed, 2000).
P05-1057 120 39:247 An especially well-founded framework is maximum entropy (Berger et al. , 1996).
P05-1057 121 11:247 Statistical approaches, which depend on a set of unknown parameters that are learned from training data, try to describe the relationship between a bilingual sentence pair (Brown et al. , 1993; Vogel and Ney, 1996).
H05-1083 122 27:218 2 Statistical Coreference Resolution Model Our coreference system uses a binary entity-mention model $P_L(\cdot \mid e, m)$ (henceforth link model) to score the action of linking a mention m to an entity e. In our implementation, the link model is computed as $P_L(L = 1 \mid e, m) \approx \max_{m' \in e} P_L(L = 1 \mid e, m', m)$ (1), where $m'$ is one mention in entity e, and the basic model building block $P_L(L = 1 \mid e, m', m)$ is an exponential or maximum entropy model (Berger et al., 1996): $P_L(L \mid e, m', m) = \frac{\exp\{\sum_i \lambda_i g_i(e, m', m, L)\}}{Z(e, m', m)}$ (2), where $Z(e, m', m)$ is a normalizing factor to ensure that $P_L(\cdot \mid e, m', m)$ is a probability, $\{g_i(e, m', m, L)\}$ are features and $\{\lambda_i\}$ are feature weights.
C00-2124 123 86:195 For every class the weights of the active features are combined and the best scoring class is chosen (Berger et al. , 1996).
D07-1051 124 25:189 optimization approaches which aim at selecting those examples that optimize some (algorithm-dependent) objective function, such as prediction variance (Cohn et al. , 1996), and heuristic methods with uncertainty sampling (Lewis and Catlett, 1994) and query-by-committee (QBC) (Seung et al. , 1992) just to name the most prominent ones.
D07-1051 125 80:189 4.2 Classifier and Features For our AL framework we decided to employ a Maximum Entropy (ME) classifier (Berger et al. , 1996).
D07-1051 126 26:189 AL has already been applied to several NLP tasks, such as document classification (Schohn and Cohn, 2000), POS tagging (Engelson and Dagan, 1996), chunking (Ngai and Yarowsky, 2000), statistical parsing (Thompson et al. , 1999; Hwa, 2000), and information extraction (Lewis and Catlett, 1994; Thompson et al. , 1999).
W03-0425 127 20:71 The model weights are trained using the improved iterative scaling algorithm (Berger et al. , 1996).
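Iterative scaling training of the model weights can be sketched for a simple special case. The sketch below assumes binary features with exactly C active features for every (context, label) pair; in that case the improved iterative scaling update has the closed form $\Delta\lambda_i = \frac{1}{C} \log \frac{\tilde{p}(f_i)}{p(f_i)}$, which coincides with the generalized iterative scaling update. The function names, the feature indexing, and the toy data are hypothetical and are not taken from any cited system.

```python
import math

def predict(lam, x, labels, feature_fn):
    """Conditional distribution p(y|x) of the log-linear model with weights lam."""
    scores = {y: sum(lam[i] for i in feature_fn(x, y)) for y in labels}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train_iterative_scaling(data, labels, feature_fn, n_feats, C, iters=20):
    """Fit weights so that model feature expectations match empirical ones.

    data: list of (x, y) pairs; feature_fn(x, y) returns the indices of the
    active binary features and is assumed to return exactly C indices.
    """
    n = len(data)
    emp = [0.0] * n_feats  # empirical expectation of each feature
    for x, y in data:
        for i in feature_fn(x, y):
            emp[i] += 1.0 / n
    lam = [0.0] * n_feats
    for _ in range(iters):
        model = [0.0] * n_feats  # expectation under the current model
        for x, _ in data:
            p = predict(lam, x, labels, feature_fn)
            for y in labels:
                for i in feature_fn(x, y):
                    model[i] += p[y] / n
        for i in range(n_feats):
            if emp[i] > 0 and model[i] > 0:
                lam[i] += math.log(emp[i] / model[i]) / C
    return lam

# Hypothetical toy data: one indicator feature per (context, label) pair, so C = 1.
contexts, labels = ["the", "to"], ["N", "V"]
feature_fn = lambda x, y: [2 * contexts.index(x) + labels.index(y)]
data = [("the", "N")] * 3 + [("the", "V")] + [("to", "V")] * 3 + [("to", "N")]
lam = train_iterative_scaling(data, labels, feature_fn, n_feats=4, C=1)
p_the = predict(lam, "the", labels, feature_fn)
```

With one indicator feature per pair the fitted model simply reproduces the empirical conditional frequencies; with overlapping features the same loop needs more iterations, and a varying active-feature count requires the general IIS update rather than this closed form.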
W03-0425 128 4:71 (1999), a robust risk minimization classifier, based on a regularized winnow method (Zhang et al. , 2002) (henceforth RRM) and a maximum entropy classifier (Darroch and Ratcliff, 1972; Berger et al. , 1996; Borthwick, 1999) (henceforth MaxEnt).
C04-1112 129 71:246 The statistical classifier used in the experiments reported in this paper is a maximum entropy classifier (Berger et al. , 1996; Ratnaparkhi, 1997b).
C04-1112 130 87:246 Furthermore, good results have been produced in other areas of NLP research using maximum entropy techniques (Berger et al. , 1996; Koeling, 2001; Ratnaparkhi, 1997a).
A00-1019 131 42:153 Techniques for weakening the independence assumptions made by the IBM models 1 and 2 have been proposed in recent work (Brown et al. , 1993; Berger et al. , 1996; Och and Weber, 98; Wang and Waibel, 98; Wu and Wong, 98).
J04-4002 132 79:482 Here, we use the hidden Markov model (HMM) alignment model (Vogel, Ney, and Tillmann 1996) and Model 4 of Brown et al.
W03-1718 133 132:221 The training algorithm we used is the improved iterative scaling (IIS) described in (Berger et al., 1996).
W98-1117 134 7:161 Its applications range from sentence boundary disambiguation (Reynar and Ratnaparkhi, 1997) to part-of-speech tagging (Ratnaparkhi, 1996), parsing (Ratnaparkhi, 1997) and machine translation (Berger et al. , 1996).
W06-2928 135 74:104 4 The Dependency Labeler 4.1 Classifier We used a maximum entropy classifier (Berger et al. , 1996) to assign labels to the unlabeled dependencies produced by the Bayes Point Machine.
H05-1059 136 91:175 3 Maximum Entropy Classifier For local classifiers, we used a maximum entropy model which is a common choice for incorporating various types of features for classification problems in natural language processing (Berger et al. , 1996).
H05-1059 137 31:175 A common choice for the local probabilistic classifier is maximum entropy classifiers (Berger et al. , 1996).
W96-0213 138 12:123 Previous uses of this model include language modeling (Lau et al., 1993), machine translation (Berger et al., 1996), prepositional phrase attachment (Ratnaparkhi et al., 1994), and word morphology (Della Pietra et al., 1995).
W07-1033 139 64:179 is the previous BIO tag, S is the target sentence, and $f_j$ and $\lambda_j$ are feature functions and parameters of a log-linear model (Berger et al., 1996).
P05-1037 140 126:218 5.4 Maximum Entropy Maximum entropy has been proven to be an effective method in various natural language processing applications (Berger et al. , 1996).
W08-0504 141 56:116 (2006), but we use a maximum entropy classifier (Berger et al., 1996) to determine parser actions, which makes parsing considerably faster.
W07-0401 142 9:352 Many reordering constraints have been used for word reorderings, such as ITG constraints (Wu, 1996), IBM constraints (Berger et al. , 1996) and local constraints (Kanthak et al. , 2005).
N07-1001 143 22:201 We report results on the Boston University (BU) Radio Speech Corpus (Ostendorf et al. , 1995) and Boston Directions Corpus (BDC) (Hirschberg and Nakatani, 1996), two publicly available speech corpora with manual ToBI annotations intended for experiments in automatic prosody labeling.
N07-1001 144 112:201 The best prosodic label sequence is then, L = argmax L nproductdisplay i P(li|) (6) To estimate the conditional distribution P(li|) we use the general technique of choosing the maximum entropy (maxent) distribution that estimates the average of each feature over the training data (Berger et al. , 1996).
C08-1143 145 109:207 6.2 Experimental Settings We utilize a maximum entropy (ME) model (Berger et al., 1996) to design the basic classifier for WSD and TC tasks.
W05-0509 146 61:209 It can be proven that the probability distribution p satisfying the above assumption is the one with the highest entropy, is unique and has the following exponential form (Berger et al. 1996): $p(a \mid c) = \frac{1}{Z(c)} \prod_{j=1}^{k} \alpha_j^{f_j(a, c)}$ (1) where Z(c) is a normalization factor, $f_j(a, c)$ are the values of k features of the pair (a,c) and correspond to the linguistic cues of c that are relevant to predict the outcome a. Features are extracted from the training data and define the constraints that the probabilistic model p must satisfy.
C00-2126 147 69:259 This allows us to compute the conditional probability as follows (Berger et al., 1996): $P(f \mid h) = \frac{\prod_i \alpha_i^{g_i(h, f)}}{Z_\alpha(h)}$ (2)
D07-1019 148 85:176 One is how to learn a statistical model to estimate the conditional probability $P(c \mid q)$, and the other is how to generate confusion set C of a given query q. 4.1 Maximum Entropy Model for Query Spelling Correction We take a feature-based approach to model the posterior probability $P(c \mid q)$. Specifically we use the maximum entropy model (Berger et al., 1996) for this task: $P(c \mid q) = \frac{\exp(\sum_{i=1}^{N} \lambda_i f_i(q, c))}{\sum_{c} \exp(\sum_{i=1}^{N} \lambda_i f_i(q, c))}$ (2) where $\sum_{c} \exp(\sum_{i=1}^{N} \lambda_i f_i(q, c))$ is the normalization factor; $f_i(q, c)$ is a feature function defined over query q and correction candidate c, while $\lambda_i$ is the corresponding feature weight.
P06-2063 149 73:265 Maximum Entropy models implement the intuition that the best model is the one that is consistent with the set of constraints imposed by the evidence but otherwise is as uniform as possible (Berger et al. , 1996).
P04-1020 150 74:216 (In our experiments, we use maximum entropy classification (MaxEnt) (Berger et al. , 1996) to train this probability model).
W07-2057 151 16:99 We utilize the OpenNLP MaxEnt implementation of the maximum entropy classification algorithm (Berger et al., 1996) to train classification models for each lemma and part-of-speech combination in the training corpus.
C08-2016 152 32:111 When we have a junction tree for each document, we can efficiently perform belief propagation in order to compute argmax in Equation (1), or the marginal probabilities of cliques and labels, necessary for the parameter estimation of machine learning classifiers, including perceptrons (Collins, 2002), and maximum entropy models (Berger et al., 1996).
C08-2016 153 67:111 In the following experiments, we run two machine learning classifiers: Bayes Point Machines (BPM) (Herbrich et al., 2001), and the maximum entropy model (ME) (Berger et al., 1996).
P05-1031 154 107:254 MAXENT, Zhang Le's C++ implementation of maximum entropy modelling (Berger et al., 1996).
P08-1033 155 81:191 2.4 Maximum Entropy Classifier Maximum Entropy Models (Berger et al., 1996) seek to maximise the conditional probability of classes, given certain observations (features).
D08-1097 156 60:239 2.2 Maximum Entropy Models Maximum entropy (ME) models (Berger et al., 1996; Manning and Klein, 2003), also known as log-linear and exponential learning models, provide a general purpose machine learning technique for classification and prediction which has been successfully applied to natural language processing including part of speech tagging, named entity recognition etc. Maximum entropy models can integrate features from many heterogeneous information sources for classification.
P02-1025 157 159:161 One solution would be to apply the maximum entropy estimation technique (MaxEnt (Berger et al. , 1996)) to all of the three components of the SLM, or at least to the CONSTRUCTOR.
P06-2018 158 91:184 4.2 Cast3LB Function Tagging For the task of Cast3LB function tag assignment we experimented with three generic machine learning algorithms: a memory-based learner (Daelemans and van den Bosch, 2005), a maximum entropy classifier (Berger et al. , 1996) and a Support Vector Machine classifier (Vapnik, 1998).
C00-1060 159 39:217 We report that our parsing framework achieved high accuracy (88.6%) in dependency analysis of Japanese with a combination of an underspecified HPSG-based Japanese grammar, SLUNG (Mitsuishi et al. , 1998) and the maximum entropy method (Berger et al. , 1996).
C00-1060 160 64:217 2.2 Statistical Approaches with a grammar There have been many proposals for statistical frameworks particularly designed for parsers with hand-crafted grammars (Schabes, 1992; Briscoe and Carroll, 1993; Abney, 1996; Inui et al., 1997).
P01-1027 161 53:148 In this work we use the following contextual information: Target context: As in (Berger et al., 1996) we consider a window of 3 words to the left and to the right of the target word considered.
P01-1027 162 43:148 The resulting model has an exponential form with free parameters a102 a91 a24a94a93 a8 a87 a24 a10a11a10a11a10 a24a46a95 . The parameter values which maximize the likelihood for a given training corpus can be computed with the socalled GIS algorithm (general iterative scaling) or its improved version IIS (Pietra et al. , 1997; Berger et al. , 1996).
P01-1027 163 23:148 (Berger et al. , 1996) applies this approach to the so-called IBM Candide system to build context dependent models, compute automatic sentence splitting and to improve word reordering in translation.
P01-1027 164 26:148 Other authors have applied this approach to language modeling (Rosenfeld, 1996; Martin et al. , 1999; Peters and Klakow, 1999).
P01-1027 165 24:148 Similar techniques are used in (Papineni et al., 1996; Papineni et al., 1998) for so-called direct translation models instead of those proposed in (Brown et al., 1993).
P06-1089 166 147:221 p0(t|w) is calculated by ME models as follows (Berger et al., 1996): $p_0(t \mid w) = \frac{1}{Y(w)} \exp\left\{\sum_{h=1}^{H} \lambda_h g_h(w, t)\right\}$ (20) Language Features English Prefixes of 0 up to four characters, suffixes of 0 up to four characters, 0 contains Arabic numerals, 0 contains uppercase characters, 0 contains hyphens.
P06-1089 167 12:221 There have been many studies on POS guessing of unknown words (Mori and Nagao, 1996; Mikheev, 1997; Chen et al. , 1997; Nagata, 1999; Orphanos and Christodoulakis, 1999).
P06-1089 168 155:221 The features we use are shown in Table 2, which are based on the features used by Ratnaparkhi (1996) and Uchimoto et al.
W03-1025 169 57:199 Each component model takes the exponential form: a37a55a38a57a56 a51 a42a6a44a59a58a60a56 a61 a51a64a63a65a53a67a66 a53 a45a46a70 a71a16a72a21a73a75a74a77a76a79a78a81a80 a78a16a82a11a78 a38a83a44a59a58a60a56a84a61 a51a64a63a65a53a67a66 a53 a58a60a56 a51 a45a86a85 a87 a38a83a44a59a58a60a56a84a61 a51a64a63a65a53a67a66 a53 a45 a58 (2) where a87 a38a83a44a59a58a60a56 a61 a51a41a63a65a53a67a66 a53 a45 is a normalization term to ensure that a37a55a38a57a56 a51a42a6a44a88a58a60a56a62a61 a51a41a63a65a53a67a66 a53 a45 is a probability, a82a11a78 a38a83a44a59a58a60a56 a61 a51a64a63a65a53a67a66 a53 a58a60a56 a51 a45 is a feature function (often binary) and a80 a78 is the weight ofa82a21a78 . Given a set of features and a corpus of training data, there exist ef cient training algorithms (Darroch and Ratcliff, 1972; Berger et al. , 1996) to nd the optimal parameters a89 a80 a78a14a90 . The art of building a maximum entropy parser then reduces to choosing good features.
W03-1025 170 17:199 There are multiple studies (Wu and Fung, 1994; Sproat et al. , 1996; Luo and Roukos, 1996) showing that the agreement between two (untrained) native speakers is about upper [figure garbled] to lower [figure garbled].
W03-1025 171 175:199 Chinese word segmentation is a well-known problem that has been studied extensively (Wu and Fung, 1994; Sproat et al. , 1996; Luo and Roukos, 1996) and it is known that human agreement is relatively low.
D07-1111 172 62:130 The first LR model for each language uses maximum entropy classification (Berger et al. , 1996) to determine possible parser actions and their probabilities.
P03-1012 173 99:202 It has been observed that words close to each other in the source language tend to remain close to each other in the translation (Vogel et al. , 1996; Ker and Change, 1997).
P03-1012 174 187:202 Maximum entropy can be used to improve IBM-style translation probabilities by using features, such as improvements to P(f|e) in (Berger et al. , 1996).
P03-1012 175 8:202 For example, alignments can be used to learn translation lexicons (Melamed, 1996), transfer rules (Carbonell et al. , 2002; Menezes and Richardson, 2001), and classifiers to find safe sentence segmentation points (Berger et al. , 1996).
W05-1514 176 99:207 6 Phrase Recognition with a Maximum Entropy Classifier For the candidates which are not filtered out in the above two phases, we perform classification with maximum entropy classifiers (Berger et al. , 1996).
N03-1004 177 51:187 These distributions are modeled using a maximum entropy formulation (Berger et al. , 1996), using training data which consists of human judgments of question answer pairs.
P06-1112 178 151:260 (Berger et al. , 1996) gave a good description of ME model.
P06-1042 179 105:147 However, the algorithms shares many common points with iterative algorithm that are known to converge and that have been proposed to find maximum entropy probability distributions under a set of constraints (Berger et al. , 1996).
N07-1009 180 17:194 But without the global normalization, the maximum-likelihood criterion motivated by the maximum entropy principle (Berger et al. , 1996) is no longer a feasible option as an optimization criterion.
W00-0714 181 27:82 We have used the Improved Iterative Scaling algorithm (IIS) (Berger et al. , 1996).
N04-1001 182 46:189 Algorithm 1: The RRM Decoding Algorithm [pseudocode garbled in extraction]. Somewhat similarly, the MaxEnt algorithm has an associated set of weights, which are estimated during the training phase so as to maximize the likelihood of the data (Berger et al. , 1996).
N04-1001 183 21:189 For mention detection we use approaches based on Maximum Entropy (MaxEnt henceforth) (Berger et al. , 1996) and Robust Risk Minimization (RRM henceforth) [Footnote 1: For a description of the ACE program see http://www.nist.gov/speech/tests/ace/.]
J05-1003 184 166:603 Feature selection methods have been proposed in the maximum-entropy literature by several authors (Ratnaparkhi, Roukos, and Ward 1994; Berger, Della Pietra, and Della Pietra 1996; Della Pietra, Della Pietra, and Lafferty 1997; Papineni, Roukos, and Ward 1997, 1998; McCallum 2003; Zhou et al. 2003; Riezler and Vasserman 2004).
J05-1003 185 543:603 More recent work (McCallum 2003; Zhou et al. 2003; Riezler and Vasserman 2004) has considered methods for speeding up the feature selection methods described in Berger, Della Pietra, and Della Pietra (1996), Ratnaparkhi (1998), and Della Pietra, Della Pietra, and Lafferty (1997).
J05-1003 186 537:603 6.4 Feature Selection Methods A number of previous papers (Berger, Della Pietra, and Della Pietra 1996; Ratnaparkhi 1998; Della Pietra, Della Pietra, and Lafferty 1997; McCallum 2003; Zhou et al. 2003; Riezler and Vasserman 2004) describe feature selection approaches for log-linear models applied to NLP problems.
N03-1028 187 51:169 (2001) used iterative scaling algorithms for CRF training, following earlier work on maximum-entropy models for natural language (Berger et al. , 1996; Della Pietra et al. , 1997).
N03-1028 188 20:169 The sequential classification approach can handle many correlated features, as demonstrated in work on maximum-entropy (McCallum et al. , 2000; Ratnaparkhi, 1996) and a variety of other linear classifiers, including winnow (Punyakanok and Roth, 2001), AdaBoost (Abney et al. , 1999), and support-vector machines (Kudo and Matsumoto, 2001).
H05-1022 189 64:196 The bigram translation probability relies on word context, known to be helpful in translation (Berger et al. , 1996), to improve the identification of target phrases.
H05-1022 190 39:196 The bigram translation probability t2(f|f',e) specifies the likelihood that target word f is to follow f' in a phrase generated by source word e. 2.1 Properties of the Model and Prior Work: The formulation of the WtoP alignment model was motivated by both the HMM word alignment model (Vogel et al. , 1996) and IBM Model-4 with the goal of building on the strengths of each.
H05-1022 191 47:196 In fact, the WtoP model is a segmental Hidden Markov Model (Ostendorf et al. , 1996), in which states emit observation sequences.
H05-1022 192 34:196 We use a simple, single parameter distribution, with $\eta = 8.0$ throughout: $P(K|m,e) = P(K|m,l) \propto \eta^K$. Word-to-Phrase Alignment: Alignment is a Markov process that specifies the lengths of phrases and their alignment with source words: $P(a_1^K, h_1^K, \phi_1^K | K, m, e) = \prod_{k=1}^{K} P(a_k, h_k, \phi_k | a_{k-1}, \phi_{k-1}, e) = \prod_{k=1}^{K} p(a_k | a_{k-1}, h_k; l)\, d(h_k)\, n(\phi_k; e_{a_k})$. The actual word-to-phrase alignment ($a_k$) is a first-order Markov process, as in HMM-based word-to-word alignment (Vogel et al. , 1996).
W08-2139 193 94:144 The maximum entropy classifier (Berger et al, 1996) used is Le Zhang's Maximum Entropy Modeling Toolkit and the L-BFGS parameter estimation algorithm with Gaussian prior smoothing (Chen and Rosenfeld, 1999).
C08-1142 194 111:203 We utilize maximum entropy (MaxEnt) model (Berger et al., 1996) to design the basic classifier used in active learning for WSD and TC tasks.
C02-1064 195 98:207 We implemented these models within a maximum entropy framework (Berger et al. , 1996; Ristad, 1997; Ristad, 1998).
P05-2024 196 95:135 We employ loglinear models (Berger et al. , 1996) for the disambiguation.
E06-2002 197 13:77 By introducing the hidden word alignment variable a, the following approximate optimization criterion can be applied for that purpose: $e^* = \arg\max_e \Pr(e | f) = \arg\max_e \sum_a \Pr(e,a | f) \approx \arg\max_{e,a} \Pr(e,a | f)$. Exploiting the maximum entropy (Berger et al. , 1996) framework, the conditional distribution Pr(e,a | f) can be determined through suitable real valued functions (called features) $h_r(e,f,a)$, $r = 1, \ldots, R$, and takes the parametric form: $p_\lambda(e,a | f) \propto \exp\left\{ \sum_{r=1}^{R} \lambda_r h_r(e,f,a) \right\}$. The ITC-irst system (Chen et al. , 2005) is based on a log-linear model which extends the original IBM Model 4 (Brown et al. , 1993) to phrases (Koehn et al. , 2003; Federico and Bertoldi, 2005).
E06-2002 198 22:77 Hence, either the best translation hypothesis is directly extracted from the word graph and output, or an N-best list of translations is computed (Tran et al. , 1996).
P06-1073 199 104:241 The MaxEnt algorithm associates a set of weights, indexed by i = 1…n and j = 1…m, with the features, which are estimated during the training phase to maximize the likelihood of the data (Berger et al. , 1996).
P06-1073 200 89:241 Our approach is based on Maximum Entropy (MaxEnt henceforth) technique (Berger et al. , 1996).
W97-0301 201 116:134 6 Comparison With Previous Work The two parsers which have previously reported the best accuracies on the Penn Treebank Wall St. Journal are the bigram parser described in (Collins, 1996) and the SPATTER parser described in (Jelinek et al. , 1994; Magerman, 1995).
C08-1083 202 34:166 Preparing an aligned abbreviation corpus, we obtain the optimal combination of the features by using the maximum entropy framework (Berger et al., 1996).
C08-1083 203 44:166 We directly model the conditional probability of the alignment a, given x and y, using the maximum entropy framework (Berger et al., 1996), $P(a|x,y) = \frac{\exp\{F(a,x,y)\}}{\sum_{a \in C(x,y)} \exp\{F(a,x,y)\}}$.
P04-1014 204 26:189 They use a conditional model, based on Collins (1996), which, as the authors acknowledge, has a number of theoretical deficiencies; thus the results of Clark et al. provide a useful baseline for the new models presented here.
P04-1014 205 64:189 Setting the gradient to zero yields the usual maximum entropy constraints (Berger et al. , 1996), except that in this case the empirical values are themselves expectations (over all derivations leading to each gold standard dependency structure).
C98-2186 206 33:86 Then, to solve $p^* \in C$ in equation (8) is equivalent to solve $\lambda^*$ that maximize the loglikelihood: $\Psi(\lambda) = -\sum_x \tilde{p}(x) \log Z_\lambda(x) + \sum_i \lambda_i \tilde{p}(f_i)$ (10), $\lambda^* = \arg\max_\lambda \Psi(\lambda)$. Such $\lambda^*$ can be solved by one of the numerical algorithm called the Improved Iterative Scaling Algorithm (Berger et al., 1996).
C98-2186 207 41:86 This algorithm is called the Basic Feature Selection (Berger et al., 1996).
C98-2186 208 12:86 Therefore, estimating a natural language model based on the maximum entropy (ME) method (Pietra et al., 1995; Berger et al., 1996) has been highlighted recently.
W06-1619 209 46:171 Previous studies (Abney, 1997; Johnson et al. , 1999; Riezler et al. , 2000; Malouf and van Noord, 2004; Kaplan et al. , 2004; Miyao and Tsujii, 2005) defined a probabilistic model of unification-based grammars including HPSG as a log-linear model or maximum entropy model (Berger et al. , 1996).
P02-1002 210 5:218 1 Introduction Conditional Maximum Entropy models have been used for a variety of natural language tasks, including Language Modeling (Rosenfeld, 1994), part-of-speech tagging, prepositional phrase attachment, and parsing (Ratnaparkhi, 1998), word selection for machine translation (Berger et al. , 1996), and finding sentence boundaries (Reynar and Ratnaparkhi, 1997).
W04-0859 211 13:113 Our systems use both corpus-based and knowledge-based approaches: Maximum Entropy (ME) (Lau et al. , 1993; Berger et al. , 1996; Ratnaparkhi, 1998) is a corpus-based and supervised method based on linguistic features; ME is the core of a bootstrapping algorithm that we call re-training inspired by co-training (Blum and Mitchell, 1998); Relevant Domains (RD) (Montoyo et al. , 2003) is a resource built from WordNet Domains (Magnini and Cavaglia, 2000) that is used in an unsupervised method that assigns domain and sense labels; Specification Marks (SP) (Montoyo and Palomar, 2000) exploits the relations between synsets stored in WordNet (Miller et al. , 1993) and does not need any training corpora; Commutative Test (CT) (Nica et al. , 2003), based on the Sense Discriminators device derived from EWN (Vossen, 1998), disambiguates nouns inside their syntactic patterns, with the help of information extracted from raw corpus. [Footnote: This paper has been partially supported by the Spanish Government (CICyT) under project number TIC-2003-7180 and the Valencia Government (OCyT) under project number CTIDIB-2002-151.]
W06-3108 212 67:203 In the case of two orientation classes, $c_{j,j'}$ is defined as: $c_{j,j'} = \begin{cases} \text{left}, & \text{if } j' < j \\ \text{right}, & \text{if } j' > j \end{cases}$ (4). Then, the reordering model has the form $p(c_{j,j'} | f_1^J, e_1^I, i, j)$. A well-founded framework for directly modeling the probability $p(c_{j,j'} | f_1^J, e_1^I, i, j)$ is maximum entropy (Berger et al. , 1996).
C98-2135 213 82:149 We also do not require a newly added feature to be either atomic or a collocation of an atomic feature with a feature already included into the model as it was proposed in (Della Pietra et al., 1995) (Berger et al., 1996).
C98-2135 214 110:149 We adopted the stop condition suggested in (Berger et al., 1996): the maximization of the likelihood on a cross-validation set of samples which is unseen at the parameter estimation.
C98-2135 215 19:149 To make feature ranking computationally tractable in (Della Pietra et al., 1995) and (Berger et al., 1996) a simplified process proposed: at the feature ranking stage when adding a new feature to the model, all previously computed parameters are kept fixed and, thus, we have to fit only one new constraint imposed by the candidate feature.
P98-2191 216 41:80 We build a subset $S \subset \mathcal{F}$ incrementally by iterating to adjoin a feature $f \in \mathcal{F}$ which maximizes loglikelihood of the model to S. This algorithm is called the Basic Feature Selection (Berger et al. , 1996).
P98-2191 217 13:80 Therefore, estimating a natural language model based on the maximum entropy (ME) method (Pietra et al. , 1995; Berger et al. , 1996) has been highlighted recently.
P98-2191 218 32:80 Then, to solve $p^* \in C$ in equation (8) is equivalent to solve $\lambda^*$ that maximize the loglikelihood: $\Psi(\lambda) = -\sum_x \tilde{p}(x) \log Z_\lambda(x) + \sum_i \lambda_i \tilde{p}(f_i)$ (10), $\lambda^* = \arg\max_\lambda \Psi(\lambda)$. Such $\lambda^*$ can be solved by one of the numerical algorithm called the Improved Iterative Scaling Algorithm (Berger et al. , 1996).
P06-2109 219 55:230 2.2 Maximum Entropy Model The maximum entropy model (Berger et al. , 1996) estimates a probability distribution from training data.
C08-1041 220 69:197 The maximum entropy approach (Berger et al., 1996) is known to be well suited to solve the classification problem.
W05-0612 221 130:219 When labeled training data is available, we can use the Maximum Entropy principle (Berger et al. , 1996) to optimize the weights.
W04-1007 222 128:211 First, two maximum entropy classifiers (Berger et al. , 1996) are applied, where the first predicts clause start labels and the second predicts clause end labels.
D08-1063 223 48:196 The weights (indexed by i and by j = 1…m) are estimated during the training phase to maximize the likelihood of the data (Berger et al., 1996).
D08-1063 224 21:196 The classification is performed with a statistical approach, built around the maximum entropy (MaxEnt) principle (Berger et al., 1996), that has the advantage of combining arbitrary types of information in making a classification decision.
W04-0701 225 85:208 models implement the intuition that the best model will be the one that is consistent with the set of constraints imposed by the evidence, but otherwise is as uniform as possible (Berger et al. , 1996).
I08-1060 226 85:163 Then, we build a classifier learned by training data, using a maximum entropy model (Berger et al., 1996) and the features related to spelling variations in Table 3.
I08-1060 227 44:163 There are other types of variations for phrases; for example, insertion, deletion or substitution of words, and permutation of words such as view point and point of view are such variations (Daille et al., 1996).
W02-1011 228 76:181 5.2 Maximum Entropy Maximum entropy classification (MaxEnt, or ME, for short) is an alternative technique which has proven effective in a number of natural language processing applications (Berger et al. , 1996).
W02-1011 229 140:181 However, feature/class functions are traditionally defined as binary (Berger et al. , 1996); hence, explicitly incorporating frequencies would require different functions for each count (or count bin), making training impractical.
D07-1082 230 105:195 We utilize a maximum entropy (ME) model (Berger et al. , 1996) to design the basic classifier used in active learning for WSD.
W08-2130 231 6:122 In this paper a discriminative parser is proposed to implement maximum entropy (ME) models (Berger, et al., 1996) to address the learning task.
I08-1048 232 85:153 We utilize a maximum entropy (ME) model (Berger et al., 1996) to design the basic classifier used in active learning for WSD.
W02-2019 233 13:62 Maximum entropy models (Jaynes, 1957; Berger et al. , 1996; Della Pietra et al. , 1997) are a class of exponential models which require no unwarranted independence assumptions and have proven to be very successful in general for integrating information from disparate and possibly overlapping sources.
E06-2015 234 19:98 2.2 Learning Algorithm For learning coreference decisions, we used a Maximum Entropy (Berger et al. , 1996) model.
A00-2031 235 28:129 This is concordant with the usage in the maximum entropy literature (Berger et al. , 1996).
P07-1113 236 50:226 We use maximum entropy models (Berger et al. , 1996), which are particularly well-suited for tasks (like ours) with many overlapping features, to harness these linguistic insights by using features in our models which encode, directly or indirectly, the linguistic correlates to SE types.
W07-0413 237 54:156 The probability distributions of these binary classifiers are learnt using maximum entropy model (Berger et al. , 1996; Haffner, 2006).
W98-0701 238 38:168 To determine the tree head-word we used a set of rules similar to that described by (Magerman, 1995)(Jelinek et al. , 1994) and also used by (Collins, 1996), which we modified in the following way: The head of a prepositional phrase (PP-IN NP) was substituted by a function the name of which corresponds to the preposition, and its sole argument corresponds to the head of the noun phrase NP.
W98-0701 239 117:168 , i.e.: $L_j = \sum_{i=1}^{n} \max(x_i(j,u))$ (11), where $x_i(j,u) \in Q_i$ and $\max(x_i(j,u))$ is the highest score in the line of the matrix $Q_i$ which corresponds to the head word sense j. n is the number of modifiers of the head word h at the current tree level, and [expression garbled in extraction: a sum of $L_j$ over j = 1…k] where k is the number of senses of the head word h. The reason why $g_j$ (10) is calculated as a sum of the best scores (11), rather than by using the traditional maximum likelihood estimate (Berger et al. , 1996)(Gale et al.
A97-1056 240 41:185 Because their joint distributions have such closed-form expressions, the parameters can be estimated directly from the training data without the need for an iterative fitting procedure (as is required, for example, to estimate the parameters of maximum entropy models; (Berger et al. , 1996)).
A97-1056 241 75:185 The significance of $G^2$ based on the exact conditional distribution does not rely on an asymptotic approximation and is accurate for sparse and skewed data samples (Pedersen et al. , 1996). 4.2 Information criteria: The family of model evaluation criteria known as information criteria have the following expression: $IC_\kappa = G^2 - \kappa \times dof$ (3), where $G^2$ and dof are defined above.
A97-1056 242 23:185 (Pedersen et al. , 1996) and (Zipf, 1935)).
A97-1056 243 167:185 Maximum Entropy models have been used to express the interactions among multiple feature variables (e.g. , (Berger et al. , 1996)), but within this framework no systematic study of interactions has been proposed.
A97-1056 244 164:185 However, the Naive Bayes classifier has been found to perform well for word-sense disambiguation both here and in a variety of other works (e.g. , (Bruce and Wiebe, 1994a), (Gale et al. , 1992), (Leacock et al. , 1993), and (Mooney, 1996)).
A97-1056 245 83:185 5 Experimental Data The sense-tagged text and feature set used in these experiments are the same as in (Bruce et al. , 1996).
W01-0712 246 85:210 For every class the weights of the active features are combined and the best scoring class is chosen (Berger et al. , 1996).
P05-1066 247 54:229 A number of other researchers (Berger et al. , 1996; Niessen and Ney, 2004; Xia and McCord, 2004) have described previous work on preprocessing methods.
P05-1066 248 55:229 (Berger et al. , 1996) describe an approach that targets translation of French phrases of the form NOUN de NOUN (e.g. , conflit d'intérêt).
P05-1066 249 13:229 For this reason there is currently a great deal of interest in methods which incorporate syntactic information within statistical machine translation systems (e.g. , see (Alshawi, 1996; Wu, 1997; Yamada and Knight, 2001; Gildea, 2003; Melamed, 2004; Graehl and Knight, 2004; Och et al. , 2004; Xia and McCord, 2004)).
P05-1066 250 41:229 2.1.2 Research on Syntax-Based SMT A number of researchers (Alshawi, 1996; Wu, 1997; Yamada and Knight, 2001; Gildea, 2003; Melamed, 2004; Graehl and Knight, 2004; Galley et al. , 2004) have proposed models where the translation process involves syntactic representations of the source and/or target languages.
D07-1077 251 30:287 2 Related Work A number of researchers (Brown et al. , 1992; Berger et al. , 1996; Niessen and Ney, 2004; Xia and McCord, 2004; Collins et al. , 2005) have described approaches that preprocess the source language input in SMT systems.
N06-1025 252 47:213 3.2 Learning Algorithm For learning coreference decisions, we used a Maximum Entropy (Berger et al. , 1996) model.
P04-1085 253 45:168 We use maximum entropy modeling (Berger et al. , 1996) to directly model the conditional probability [symbols garbled in extraction], where each observation in the observation sequence is associated with the corresponding speaker. Each observation is represented here by only one variable for notational ease, but it possibly represents several lexical, durational, structural, and acoustic observations.
P04-1085 254 65:168 Speaker ranking accuracy Table 2 summarizes the accuracy of our statistical ranker on the test data with different feature sets: the performance is 89.39% when using all feature sets, and reaches 90.2% after applying Gaussian smoothing and using incremental feature selection as described in (Berger et al. , 1996) and implemented in the yasmetFS package. Note that restricting ourselves to only backward looking features decreases the performance significantly, as we can see in Table 2.
W07-2208 255 43:192 Previous studies (Abney, 1997; Johnson et al. , 1999; Riezler et al. , 2000; Malouf and van Noord, 2004; Kaplan et al. , 2004; Miyao and Tsujii, 2005) defined a probabilistic model of unification-based grammars including HPSG as a log-linear model or maximum entropy model (Berger et al. , 1996).
W07-2208 256 10:192 This was overcome by a probabilistic model which provides probabilities of discriminating a correct parse tree among candidates of parse trees in a log-linear model or maximum entropy model (Berger et al. , 1996) with many features for parse trees (Abney, 1997; Johnson et al. , 1999; Riezler et al. , 2000; Malouf and van Noord, 2004; Kaplan et al. , 2004; Miyao and Tsujii, 2005).
P01-1003 257 19:155 Using the ME principle, we can combine information from a variety of sources into the same language model (Berger et al. , 1996; Rosenfeld, 1996).
W06-2601 258 7:196 1 Introduction The Maximum Entropy (ME) statistical framework (Darroch and Ratcliff, 1972; Berger et al. , 1996) has been successfully deployed in several NLP tasks.
W06-2601 259 98:196 6 Parameter Estimation From the duality of ME and maximum likelihood (Berger et al. , 1996), optimal parameters for model (3) can be found by maximizing the log-likelihood function over a training sample $\{(x_t,y_t) : t = 1,\ldots,N\}$, i.e.: $\lambda^* = \arg\max_\lambda \sum_{t=1}^{N} \log p_\lambda(y_t|x_t)$.
W06-2601 260 12:196 Despite ME theory and its related training algorithm (Darroch and Ratcliff, 1972) do not set restrictions on the range of feature functions, popular NLP text books (Manning and Schutze, 1999) and research papers (Berger et al. , 1996) seem to limit them to binary features.
W05-1304 261 38:122 In this paper we adopt a maximum entropy model (Berger et al. , 1996) to estimate the local probabilities since it can incorporate diverse types of features with reasonable computational cost.
W06-1633 262 42:199 Based on the data seen, a maximum entropy model (Berger et al. , 1996) offers an expression (1) for the probability that there exists coreference C between a mention mi and a mention mj.
W05-0627 263 12:96 In our SRL system, we select maximum entropy (Berger et al. , 1996) as a classifier to implement the semantic role labeling system.
D08-1047 264 14:211 (1) Here, the candidate generator gen(s) enumerates candidates of destination (correct) strings, and the scorer P(t|s) denotes the conditional probability of the string t for the given s. The scorer was modeled by a noisy-channel model (Shannon, 1948; Brill and Moore, 2000; Ahmad and Kondrak, 2005) and maximum entropy framework (Berger et al., 1996; Li et al., 2006; Chen et al., 2007).
W05-0709 265 22:216 Both systems are built around from the maximum-entropy technique (Berger et al. , 1996).
W05-0709 266 109:216 The principle of maximum entropy states that when one searches among probability distributions that model the observed data (evidence), the preferred one is the one that maximizes the entropy (a measure of the uncertainty of the model) (Berger et al. , 1996).
W05-0709 267 144:216 where $m_k$ is one mention in entity e, and the basic model building block $P_L(L = 1 | e, m_k, m)$ is an exponential or maximum entropy model (Berger et al. , 1996).
W05-0709 268 83:216 [Figure 1: Illustration of dictionary based segmentation finite state transducer] 3.1 Bootstrapping In addition to the model based upon a dictionary of stems and words, we also experimented with models based upon character n-grams, similar to those used for Chinese segmentation (Sproat et al. , 1996).
C04-1179 269 34:187 3 Maximum Entropy ME models implement the intuition that the best model is the one that is consistent with the set of constraints imposed by the evidence, but otherwise is as uniform as possible (Berger et al. 1996).
P05-1061 270 139:290 We use a standard maximum entropy classifier (Berger et al. , 1996) implemented as part of MALLET (McCallum, 2002).
C04-1204 271 70:135 Following recent research about disambiguation models on linguistic grammars (Abney, 1997; Johnson et al. , 1999; Riezler et al. , 2002; Clark and Curran, 2003; Miyao et al. , 2003; Malouf and van Noord, 2004), we apply a log-linear model or maximum entropy model (Berger et al. , 1996) on HPSG derivations.
P07-1096 272 165:214 Following (Ratnaparkhi, 1996; Collins, 2002; Toutanova et al. , 2003; Tsuruoka and Tsujii, 2005), we cut the PTB into the training, development and test sets as shown in Table 1. [Table 2: Experiments on the development data with beam width of 3. Columns: Feature Sets, Templates, Error%. A: Ratnaparkhi's, 3.05. B: A + [t0,t1],[t0,t1,t1],[t0,t1,t2], 2.92. C: B + [t0,t2],[t0,t2],[t0,t2,w0],[t0,t1,w0],[t0,t1,w0], [t0,t2,w0], [t0,t2,t1,w0],[t0,t1,t1,w0],[t0,t1,t2,w0], 2.84. D: C + [t0,w1,w0],[t0,w1,w0], 2.78. E: D + [t0,X = prefix or suffix of w0],4 < |X| 9, 2.72.]
P07-1096 273 201:214 [Table 4: Comparison with the previous works. Columns: System, Beam, Error%. (Ratnaparkhi, 1996): 5, 3.37. (Tsuruoka and Tsujii, 2005): 1, 2.90. (Collins, 2002): 2.89. Guided Learning, feature B: 3, 2.85. (Tsuruoka and Tsujii, 2005): all, 2.85. (Giménez and Màrquez, 2004): 2.84. (Toutanova et al. , 2003): 2.76. Guided Learning, feature E: 1, 2.73. Guided Learning, feature E: 3, 2.67.] According to the experiments shown above, we build our best system by using feature set E with beam width B = 3.
W03-1013 274 158:181 We have implemented a parallel version of our GIS code using the MPICH library (Gropp et al. , 1996), an open-source implementation of the Message Passing Interface (MPI) standard.
C98-2209 275 25:156 As a model learning method, we adopt the maximum entropy model learning method (Della Pietra et al., 1997; Berger et al., 1996).
W05-1511 276 21:178 Probabilistic models where probabilities are assigned to the CFG backbone of the unification-based grammar have been developed (Kasper et al. , 1996; Briscoe and Carroll, 1993; Kiefer et al. , 2002), and the most probable parse is found by PCFG parsing.
W05-1511 277 38:178 Previous studies (Abney, 1997; Johnson et al. , 1999; Riezler et al. , 2000; Miyao et al. , 2003; Malouf and van Noord, 2004; Kaplan et al. , 2004; Miyao and Tsujii, 2005) defined a probabilistic model of unification-based grammars as a log-linear model or maximum entropy model (Berger et al. , 1996).
P06-1071 278 7:208 1 Introduction Conditional Maximum Entropy (CME) modeling has received a great amount of attention within natural language processing community for the past decade (e.g. , Berger et al. , 1996; Reynar and Ratnaparkhi, 1997; Koeling, 2000; Malouf, 2002; Zhou et al. , 2003; Riezler and Vasserman, 2004).
P06-1071 279 28:208 2.1 Conditional Maximum Entropy Model The goal of CME is to find the most uniform conditional distribution of y given observation x, $p(y|x)$, subject to constraints specified by a set of features $f_i(x,y)$, where features typically take the value of either 0 or 1 (Berger et al. , 1996).
P06-1071 280 37:208 This leads to a good amount of work in this area (Ratnaparkhi et al. , 1994; Berger et al. , 1996; Pietra et al, 1997; Zhou et al. , 2003; Riezler and Vasserman, 2004). In the most basic approach, such as Ratnaparkhi et al.
P03-1015 281 159:206 We used the Maximum Entropy approach (Berger et al. , 1996) as a machine learner for this task.
N06-1026 282 99:213 Maximum Entropy models implement the intuition that the best model is the one that is consistent with the set of constraints imposed by the evidence but otherwise is as uniform as possible (Berger et al. 1996).
H05-1012 283 61:201 (Berger et al. , 1996)), [Footnote 1: We are overloading the word state to mean Arabic word position.]
H05-1012 284 19:201 These IBM models and more recent refinements (Moore, 2004) as well as algorithms that bootstrap from these models like the HMM algorithm described in (Vogel et al. , 1996) are unsupervised algorithms.
W08-0404 285 26:179 Maximum entropy estimation for translation of individual words dates back to Berger et al (1996), and the idea of using multi-class classifiers to sharpen predictions normally made through relative frequency estimates has been recently reintroduced under the rubric of word sense disambiguation and generalized to substrings (Chan et al 2007; Carpuat and Wu 2007a; Carpuat and Wu 2007b).
N04-2003 286 74:227 3 Feature selection Berger et al (1996) proposed an iterative procedure of adding new features to feature set driven by data.
N04-2003 287 24:227 A major issue in MaxEnt training is how to select proper features and determine the feature targets (Berger et al. , 1996; Jebara and Jaakkola, 2000).
E99-1026 288 53:210 This allows us to compute the conditional probability as follows (Berger et al. , 1996): $P(f|h) = \frac{\prod_i \alpha_i^{g_i(h,f)}}{Z(h)}$ (2), $Z(h) = \sum_f \prod_i \alpha_i^{g_i(h,f)}$ (3). The maximum entropy estimation technique guarantees that for every feature $g_i$, the expected value of $g_i$ according to the M.E. model will equal the empirical expectation of $g_i$ in the training corpus.
E99-1026 289 162:210 Other methods that have been proposed are one based on using the gain (Berger et al. , 1996) and an approximate method for selecting informative features (Shirai et al. , 1998a), and several criteria for feature selection were proposed and compared with other criteria (Berger and Printz, 1998).
W06-0301 290 96:178 As a learning algorithm for our classification model, we used Maximum Entropy (Berger et al. , 1996).
N06-2036 291 32:73 The algorithm employs the OpenNLP MaxEnt implementation of the maximum entropy classification algorithm (Berger et al. 1996) to develop word sense recognition signatures for each lemma which predicts the most likely sense for the lemma according to the context in which the lemma occurs.
C02-2019 292 19:123 One is to find unknown words from corpora and put them into a dictionary (e.g. , (Mori and Nagao, 1996)), and the other is to estimate a model that can identify unknown words correctly (e.g. , (Kashioka et al. , 1997; Nagata, 1999)).
C02-2019 293 39:123 (1) Here has(h,x) is a binary function that returns true if the history h has feature x.Inour experiments, we focused on such information as whether or not a string is found in a dictionary, the length of the string, what types of characters are used in the string, and what part-of-speech the adjacent morpheme is. Given a set of features and some training data, the M.E. estimation process produces a model, which is represented as follows (Berger et al. , 1996; Ristad, 1997; Ristad, 1998): P(f|h)= producttext i g i (h,f) i Z (h) (2) Z (h)= summationdisplay f productdisplay i g i (h,f) i.
W00-0704 294 15:121 We will provide a more detailed and systematic comparison between MAXIMUM ENTROPY MODELING (Ratnaparkhi, 1996) and MEMORY BASED LEARNING (Daelemans et al. , 1996) for morpho-syntactic disambiguation and we investigate whether earlier observed differences in tagging accuracy can be attributed to algorithm bias, information source issues or both.
W00-0704 295 57:121 A word is considered to be known when it has an ambiguous tag (henceforth ambitag) attributed to it in the LEXICON, which is compiled in the same way as for the MBT-tagger (Daelemans et al. , 1996).
W07-1110 296 37:167 (Dahl et al. , 1987; Hull and Gomez, 1996) use hand-coded slot-filling rules to determine the semantic roles of the arguments of a nominalization.
W07-1110 297 110:167 5.2 Maximum Entropy Model We use the Maximum Entropy (ME) Model (Berger et al. , 1996) for our classification task.
I08-1008 298 71:162 3 MaxEnt Model and Features 3.1 MaxEnt Model for NOR The principle of maximum entropy (MaxEnt) model is that given a collection of facts, choose a model consistent with all the facts, but otherwise as uniform as possible (Berger et al., 1996).
P06-1026 299 92:189 However, in order to cope with the prediction errors of the classifier, we approximate [expression lost in extraction] with an n-gram language model on sequences of the refined tag labels: [equations (2) and (3) lost in extraction]. In order to estimate the conditional distribution [expression lost in extraction] we use the general technique of choosing the maximum entropy (maxent) distribution that properly estimates the average of each feature over the training data (Berger et al. , 1996).
C02-1143 300 14:130 Under the maximum entropy framework (Berger et al. , 1996), evidence from different features can be combined with no assumptions of feature independence.
C02-1143 301 106:130 We used a maximum-matching algorithm and a dictionary compiled from the CTB (Sproat et al. , 1996; Xue, 2001) to do segmentation, and trained a maximum entropy part-of-speech tagger (Ratnaparkhi, 1998) and TAG-based parser (Bikel and Chiang, 2000) on the CTB to do tagging and parsing. Then the same feature extraction and model-training was done for the PDN corpus as for the CTB.
W03-0505 302 133:233 The first two phases are approached as straightforward classification in a maximum entropy framework (Berger et al. , 1996).
C00-1082 303 37:172 a.2 Maximum-entropy method The maximum-entropy method is useful with sparse data conditions and has been used by many researchers (Berger et al. , 1996; Ratnaparkhi, 1996; Ratnaparkhi, 1997; Borthwick et al. , 1998; Uchimoto et al. , 1999).
I05-2046 304 46:140 Given a set of features and a training corpus, the ME estimation process produces a model in which every feature $f_i$ has a weight $\alpha_i$. From (Berger et al. , 1996), we can compute the conditional probability as: $p(o \mid h) = \frac{1}{Z(h)} \prod_i \alpha_i^{f_i(h,o)}$ (2), $Z(h) = \sum_o \prod_i \alpha_i^{f_i(h,o)}$ (3). The probability is given by multiplying the weights of active features (i.e. , those $f_i(h,o) = 1$).
I05-2046 305 75:140 The MBT POS tagger (Daelemans et al. , 1996) is used to provide POS information.
P05-1020 306 75:212 We consider three learning algorithms, namely, the C4.5 decision tree induction system (Quinlan, 1993), the RIPPER rule learning algorithm (Cohen, 1995), and maximum entropy classification (Berger et al. , 1996).
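Several of the citing sentences above (for example E99-1026, C02-2019 and I05-2046) quote the same product-form conditional model: the probability of an outcome is the normalized product of the weights of its active binary features. The following is a minimal sketch of that computation, not code from any of the cited papers. The function name `maxent_prob`, the two toy features, and the weight values are all hypothetical and chosen only for illustration.

```python
from math import prod

def maxent_prob(history, outcomes, features, weights):
    """Return p(o|h) = (1/Z(h)) * prod_i weights[i] ** f_i(h, o) for every outcome o.

    `features` are binary functions f_i(h, o) returning 0 or 1, so an inactive
    feature contributes a factor of 1 and an active one contributes its weight.
    """
    def score(outcome):
        return prod(w ** f(history, outcome) for f, w in zip(features, weights))

    z = sum(score(o) for o in outcomes)  # normalizer Z(h)
    return {o: score(o) / z for o in outcomes}

# Hypothetical toy features for a two-way tagging decision (assumed, not from the source).
features = [
    lambda h, o: 1 if h["suffix"] == "ing" and o == "VERB" else 0,
    lambda h, o: 1 if h["prev"] == "the" and o == "NOUN" else 0,
]
weights = [3.0, 2.0]  # assumed positive weights, one per feature

probs = maxent_prob({"suffix": "ing", "prev": "the"}, ["NOUN", "VERB"], features, weights)
print(probs)
```

In this toy setting exactly one feature is active for each outcome, so the unnormalized scores are the two weights and the normalizer is their sum. In practice the weights would be estimated from training data so that each feature's model expectation matches its empirical expectation, as the E99-1026 context states.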
Copyright © Univ. of Mich. and the CLAIR Group at the Univ. of Mich.
All information provided herein should be considered tentative and still under construction. Further analysis and correction is still being performed. Please remember that all statistics contained herein are the results of independent research and should not be considered a statement of fact regarding any of the papers, authors, or other entities they refer to.