[Mead] LDC releases the SummBank corpus

From: radev@umich.edu
Date: Mon Dec 22 2003 - 22:28:07 EST


http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T16

SummBank 1.0

  
Item Name: SummBank 1.0
Authors: Dragomir Radev, Simone Teufel, Horacio Saggion, Wai Lam,
John Blitzer, Arda Celebi, Elliott Drabek, Danyu Liu, Hong Qi, Tim
Allison
LDC Catalog No.: LDC2003T16
ISBN: 1-58563-274-0
Data Type: text
Data Source(s): varied
Application(s): cross-lingual information retrieval, summarization
Language(s): Chinese, English
Language ID(s): CHN, ENG
Distribution: 3 DVD(s).
Membership Year(s): 2003
Non-member Price: N/A (members only)
Non-member License: yes
Online documentation: yes

 
Introduction
SummBank 1.0 was produced by the Linguistic Data Consortium (LDC). Its
catalog number is LDC2003T16 and its ISBN is 1-58563-274-0.

SummBank 1.0 contains the data created for the Summer 2001 Johns
Hopkins Workshop, which focused on text summarization in a
cross-lingual information retrieval framework. For more information
about the Johns Hopkins summer workshop on text summarization, please
visit its website. The goal of the corpus is to gather original
documents and summaries that can be used as gold standards by the
document summarization community.

The source data consists of 18,147 aligned bilingual (Cantonese
and English) article pairs from the Information Services Department of
the Hong Kong Special Administrative Region of the People's Republic
of China, which were published by the LDC in 2000 as Hong Kong News
Parallel Text.

Data
This corpus contains 40 news clusters in English and Chinese, 360
multi-document, human-written non-extractive summaries, and nearly 2
million single-document and multi-document extracts created by
automatic and manual methods. The summarizer that was developed during
the workshop is called MEAD; updated versions of the software are
available from the MEAD website.

This distribution includes roughly 2 million text files, totalling
approximately 13 GB uncompressed. The text files are encoded as UTF-8
for English and as either GB or Big-5 for Chinese.
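
Because the Chinese files may be in either of two encodings, it can help
to check which codecs accept a given file before reading it. The following
is a minimal sketch, not part of the SummBank distribution. The function
names are hypothetical, and mapping "GB" to Python's gb2312 codec is an
assumption, since the announcement does not say which GB variant is used.
A file that decodes cleanly under more than one codec is ambiguous, so the
check only narrows the choice and does not settle it.

```python
from typing import List, Sequence

# Python codec names assumed to match the encodings named above.
# "GB" is taken to mean GB2312 here; that mapping is a guess.
CANDIDATE_CODECS = ("utf-8", "gb2312", "big5")


def decodable_with(raw: bytes,
                   codecs: Sequence[str] = CANDIDATE_CODECS) -> List[str]:
    """Return the codecs that decode `raw` without raising an error."""
    accepted = []
    for name in codecs:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes
        accepted.append(name)
    return accepted


def read_text(raw: bytes, codec: str) -> str:
    """Decode `raw` strictly with a codec the caller has chosen."""
    return raw.decode(codec)
```

Plain ASCII text is accepted by all three codecs, and some Big-5 byte
sequences are also valid GB2312, so the result is best combined with what
is already known about a file (for example, its language).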

Updates
Additional information, updates, and bug fixes may be available on the
SummBank website.

Content Copyright
Portions © 1997-2000 The Government of the Hong Kong Special
Administrative Region (HKSAR), © 2000, 2003 Trustees of the University
of Pennsylvania
 

_______________________________________________
Mead mailing list
Mead@lists.si.umich.edu
http://lists.si.umich.edu/mailman/listinfo/mead


