Computational Linguistics and Information Retrieval (CLAIR)

University of Michigan

For more information, please contact:

Dragomir Radev (radev@umich.edu) or Jahna Otterbacher (jahna@umich.edu)

Cross-document Structure Theory (CST)

CST is a functional theory for multi-document discourse structure. It is used to describe the semantic connections, such as "paraphrase" or "contradiction," among units of topically related documents. It is related to Rhetorical Structure Theory (RST). However, because it describes relationships that hold across multiple documents rather than across spans of text within the same document, it makes no assumptions about authors' intentions in creating cohesion in texts.
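As a rough illustration of the idea, a CST relationship can be pictured as a labeled link between two text units that belong to different documents. The sketch below is only an assumption about how such links might be held in memory; the names SentenceRef, CSTRelation, doc_id, sent_no, and relations_by_label are hypothetical and do not reflect the actual CSTBank file format or annotation tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceRef:
    """Hypothetical pointer to one sentence in one document."""
    doc_id: str   # assumed document identifier
    sent_no: int  # assumed sentence index within that document


@dataclass(frozen=True)
class CSTRelation:
    """Hypothetical labeled link between sentences of two documents."""
    source: SentenceRef
    target: SentenceRef
    label: str  # e.g. "paraphrase" or "contradiction"

    def __post_init__(self):
        # Assumption of this sketch: a relation always spans two documents.
        if self.source.doc_id == self.target.doc_id:
            raise ValueError("a CST relation must connect two different documents")


def relations_by_label(relations):
    """Group relations by their label, preserving input order per label."""
    grouped = {}
    for rel in relations:
        grouped.setdefault(rel.label, []).append(rel)
    return grouped
```

For example, a "paraphrase" link from sentence 3 of one article to sentence 7 of another would be CSTRelation(SentenceRef("doc1", 3), SentenceRef("doc2", 7), "paraphrase"). Rejecting links within a single document mirrors the cross-document focus described above, but it is a design choice of this sketch rather than a documented rule.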

Cross-document Structure Theory Bank (CSTBank)

CSTBank is a corpus of document clusters manually annotated for CST relationships. It contains clusters of documents created in a variety of ways (e.g., manually and automatically clustered documents) and is organized by families, which describe the text sources and clustering methods used to group documents by topic. Eventually, CST relation judgments (sentrels) and sentence utility judgments (sentjudges) will be available for all clusters in CSTBank.

Data sources for CSTBank

Phase I of CSTBank - includes links to the downloads for publicly available data

Sample data

Taxonomy of CST relationships [pdf] [ps]

Annotation guidelines [pdf] [ps]

If you use CSTBank, please cite this bib entry: CST.bib

Related publications