SI 760 / EECS 597 / LING 702 Language and Information

Winter 2004

Mondays, 1:10-3:55 PM

412 West Hall




Course Description
A survey of quantitative techniques used in language and information studies. Students will learn how to explore and analyze textual data in the context of corpus-based and Web-based information retrieval systems. At the conclusion of the course, students will be able to work as information designers and analysts.





Dragomir R. Radev

3080 West Hall Connector

Office Hours: TBA



January: 12, 19, 26
February: 2, 9, 16
March: 1, 8, 15, 22, 29
April: 5, 12, 19 (presentations), 26 (final)




Required books:

      1. Manning and Schütze. Foundations of Statistical Natural Language Processing. MIT Press 1999.

      2. Oakes. Statistics for Corpus Linguistics. Edinburgh University Press 1998.


Reference readings:

      1. Jurafsky and Martin. Speech and Language Processing. Prentice-Hall 2000.

      2. Cover and Thomas. Elements of Information Theory. John Wiley and Sons 1991.


Additional readings:

      Several research articles as well as some software documentation will be handed out.





1. The computational study of language. Linguistic fundamentals.
2. Mathematical and probabilistic fundamentals. Descriptive statistics. Measures of central tendency. The z score. Hypothesis testing.
3. Information theory. Entropy, joint entropy, conditional entropy. Relative entropy and mutual information. Chain rules. The entropy of English.
4. Working with corpora. N-grams.
5. Language models. Hidden Markov models. Noisy channel models. Applications to part-of-speech tagging and other problems.
6. Cluster analysis. Distributional clustering.
7. Collocations. Syntactic criteria for collocability.
8. Literary detective work. The statistical analysis of writing style.
9. Text summarization. Cross-document structure theory.
10. Lexical semantics. WordNet.
11. Information extraction. Question answering.
12. Word sense disambiguation.
13. Lexical acquisition.
14. Paraphrase acquisition.
15. Possible additional topics: text alignment, statistical machine translation, discourse segmentation.





Assignments (25%)

Project (30%)

Survey paper (15%)

Final (30%)