MEAD: Re: maillist for MEAD & question on MEAD

From: Dragomir Radev (radev@si.umich.edu)
Date: Thu Mar 21 2002 - 00:19:16 EST


Hi,

You have to write your own script for that. Web pages come in so many
different formats and styles that it is impossible to write a standard
conversion tool.

You need to go through the following steps:

1. HTML to text --> strip tags, JavaScript, etc.
2. sentence boundary identification
3. conversion to docsent format

There are free tools on the web for steps 1 and 2. For step 1 you can
use any of a number of tools, such as lynx. For step 2, check
www.cpan.org for a Perl implementation.
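
As a rough illustration of the three steps, here is a minimal Python
sketch that uses only the standard library. Everything in it is an
assumption made for illustration: the function names are hypothetical,
the sentence splitter is a naive regular expression that will mis-split
abbreviations such as "Dr.", and the docsent element and attribute names
(DOCSENT, DID, LANG, BODY, TEXT, S, PAR, RSNT, SNO) are written from
memory and should be checked against the docsent.dtd and the sample
clusters shipped with your MEAD distribution. The DOCTYPE line is left
out because its path depends on the installation.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Step 1: drop tags and script/style bodies, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps block elements apart, at the cost of
    # occasionally splitting a word that spans inline tags.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def split_sentences(text):
    """Step 2 (naive): split after . ! ? when a capital letter follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]


def to_docsent(doc_id, sentences):
    """Step 3: wrap sentences in a docsent-style XML document.

    Assumed layout; verify against docsent.dtd before feeding it to MEAD.
    All sentences are placed in paragraph 1 for simplicity.
    """
    lines = [
        "<?xml version='1.0' encoding='UTF-8'?>",
        f"<DOCSENT DID={quoteattr(doc_id)} LANG='ENG'>",
        "<BODY>",
        "<TEXT>",
    ]
    for n, sentence in enumerate(sentences, 1):
        lines.append(f"<S PAR='1' RSNT='{n}' SNO='{n}'>{escape(sentence)}</S>")
    lines += ["</TEXT>", "</BODY>", "</DOCSENT>"]
    return "\n".join(lines) + "\n"
```

A typical call chain would be
to_docsent("doc1", split_sentences(html_to_text(page))), where page holds
the HTML of one article. Real news pages also carry navigation menus and
ads, so some site-specific filtering is usually needed before step 2.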

Drago

Xiang Ji wrote:
>
>
> Dr. Radev,
>
> I recently downloaded and tried MEAD305,
> but I can't find anything in it that
> can convert a regular document into the docsent
> format that MEAD305 accepts. For example,
> how do I convert a web news article into test data
> like GA3? Thank you for your help and for providing the module.
>
> Best,
> Xiang
>

-- 
Dragomir R. Radev                                         radev@umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev