The dataset we used for this project comes from the ACL Anthology Network (AAN), a related project at the University of Michigan.

About the Data

The ACL Anthology Network was built from the original PDF files available from the ACL Anthology. The files were processed with open-source OCR technologies, in-house clean-up scripts, and often tedious manual labor, and a web interface was developed that allowed for the annotation of individual references from each paper. A team of student research assistants manually matched references to existing ACL IDs returned by a keyword matching algorithm. Citations deemed to refer to ACL papers but not automatically matched were marked for post-processing.

Annotated Datasets

  • Single Paper Summarization (Release 2010)
    • Citations to 25 highly cited papers from 5 different domains: Text Summarization, Question Answering, Machine Translation, Textual Entailment, and Dependency Parsing.
    • Each dataset has a "*.txt" file with one citation per line and a "*.ann" file whose lines have the following format: <fact id> <tab> <nugget>
    • To detect which nuggets/facts a citation contains, one should perform basic string matching.
  • Survey Generation (as explained in Mohammad et al., 2009)
    • 10 QA papers, 16 DP papers.
    • Annotated citations, abstracts, and full papers.
    • A number of human-written surveys (length 250 words) for each topic.
    • More details in Mohammad et al., 2009.
    • Use "" to evaluate a given summary using the nugget-based pyramid score.
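
The "*.ann" format and the string-matching step described for the Single Paper Summarization data can be sketched roughly as follows. This is only an illustration, not code shipped with the dataset: the function names parse_ann and facts_in_citation are hypothetical, and it assumes that a fact id may appear on several lines (one nugget per line) and that matching is a case-insensitive substring test.

```python
from typing import Dict, List, Set


def parse_ann(text: str) -> Dict[str, List[str]]:
    """Map each fact id to the nugget strings listed for it.

    Expects one "<fact id><tab><nugget>" entry per line.
    """
    nuggets: Dict[str, List[str]] = {}
    for line in text.splitlines():
        fact_id, tab, nugget = line.partition("\t")
        if not tab or not nugget.strip():
            continue  # skip blank lines and lines without a tab-separated nugget
        # Assumption: a fact id may repeat, so nuggets are collected in a list.
        nuggets.setdefault(fact_id.strip(), []).append(nugget.strip())
    return nuggets


def facts_in_citation(citation: str, nuggets: Dict[str, List[str]]) -> Set[str]:
    """Return the ids of facts with at least one nugget found in the citation."""
    lowered = citation.lower()  # assumption: matching ignores case
    return {
        fact_id
        for fact_id, phrases in nuggets.items()
        if any(phrase.lower() in lowered for phrase in phrases)
    }


if __name__ == "__main__":
    # Made-up sample data, not taken from the AAN release.
    ann_text = "1\tdependency parser\n2\tmaximum spanning tree\n"
    citation = "We parse with the dependency parser of a prior system."
    print(sorted(facts_in_citation(citation, parse_ann(ann_text))))
```

To process a real dataset, one would read the "*.ann" file into parse_ann and then call facts_in_citation once per line of the matching "*.txt" file.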
