Relational Classification Dataset

This classification dataset contains 380 scientific publications from the ACL Anthology Network (AAN), manually classified into three research areas ("Machine Translation", "Dependency Parsing" and "Summarization"). The dataset is relational because it includes metadata for each paper: citation information, authorship, venue and year of publication.
Here is a description of the files included.
	|-----metadata.txt    Contains the id, title, authorship, venue and class information for all the papers.
	|-----papers_text     This directory contains the full text of the 380 papers, obtained by converting
	|                     each paper's PDF to text using PDFBox.
	|-----citations.txt   Contains the citations between ALL the papers in the AAN dataset, not just the
	                      citations between the 380 papers in this dataset. This is because many
	                      link/citation similarity measures, such as cocitation or coupling, compute the
	                      similarity between two papers using citations that involve other papers.
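
To illustrate why citations outside the 380 labeled papers matter, here is a minimal Python sketch of cocitation and coupling counts. It works on in-memory (citing, cited) pairs and makes no assumption about the layout of citations.txt, which is described in the README; the paper ids and function names are hypothetical.

```python
from collections import defaultdict


def build_citation_maps(edges):
    """Index (citing_id, cited_id) pairs in both directions."""
    cites = defaultdict(set)     # paper -> papers it cites
    cited_by = defaultdict(set)  # paper -> papers that cite it
    for citing, cited in edges:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


def cocitation(a, b, cited_by):
    """Number of papers that cite both a and b."""
    return len(cited_by.get(a, set()) & cited_by.get(b, set()))


def coupling(a, b, cites):
    """Number of papers cited by both a and b (bibliographic coupling)."""
    return len(cites.get(a, set()) & cites.get(b, set()))


# Hypothetical ids: suppose only P1 and P2 are among the labeled papers.
# P3..P6 stand for other AAN papers; their citations still drive both scores.
edges = [
    ("P3", "P1"), ("P3", "P2"),
    ("P4", "P1"), ("P4", "P2"),
    ("P1", "P5"), ("P2", "P5"), ("P2", "P6"),
]
cites, cited_by = build_citation_maps(edges)
print(cocitation("P1", "P2", cited_by))  # 2 (both are cited by P3 and P4)
print(coupling("P1", "P2", cites))       # 1 (both cite P5)
```

If the edges were restricted to citations between P1 and P2 only, both scores would be zero, which is the reason the full citation network is shipped with the dataset.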

Here is a complete README that explains the selection process for the publications, the annotation process and the format of the different files.

Click here to download this data set.

Papers that have used this dataset
  1. Pradeep Muthukrishnan, Dragomir R. Radev, and Qiaozhu Mei. Simultaneous similarity learning and feature-weight learning for document clustering. In Sixth Textgraphs Workshop at ACL. 2011.
  2. Pradeep Muthukrishnan, Dragomir R. Radev, and Qiaozhu Mei. Edge weight regularization over multiple graphs for similarity learning. In IEEE ICDM. 2010.