Hi,
You have to write your own script for that. Web pages come in so many
different formats and styles that it is impossible to write a standard
conversion tool.
You need to go through the following steps:
1. HTML to text --> strip tags, JavaScript, etc.
2. sentence boundary identification
3. conversion to docsent format
There are free tools on the web for steps 1 and 2. For step 1 you can
use any of a number of tools, such as lynx (lynx -dump prints a page as
plain text). Check www.cpan.org for a Perl implementation of step 2.
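As a rough illustration of the three steps, here is a minimal sketch in
Python using only the standard library. It is not part of MEAD. The
sentence splitter is a naive regular expression, not a real boundary
detector, and the docsent element and attribute names (DOCSENT, BODY,
TEXT, S, DID, LANG, PAR, RSNT, SNO) as well as the DTD path are
assumptions based on the sample clusters, so check them against the
docsent DTD and the GA3 files in your own installation. The function
names are made up for this example.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class TextExtractor(HTMLParser):
    """Step 1: collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Tags that usually end a run of text; a space is inserted around them.
    BLOCK = {"p", "div", "br", "li", "tr", "td", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def split_sentences(text):
    """Step 2 (naive): split after . ! ? when followed by an uppercase letter
    or a quote. Abbreviations such as "Dr. Smith" will be split wrongly."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z\"'])", text)
    return [p.strip() for p in pieces if p.strip()]


def to_docsent(doc_id, sentences):
    """Step 3: wrap sentences in a docsent-style XML document.
    Element/attribute names and the DTD path are assumptions (see above);
    every sentence is placed in paragraph 1 for simplicity."""
    lines = [
        "<?xml version='1.0' encoding='UTF-8'?>",
        '<!DOCTYPE DOCSENT SYSTEM "docsent.dtd">',  # hypothetical DTD path
        f"<DOCSENT DID={quoteattr(doc_id)} LANG='ENG'>",
        "<BODY>",
        "<TEXT>",
    ]
    for n, sentence in enumerate(sentences, start=1):
        lines.append(f"<S PAR='1' RSNT='{n}' SNO='{n}'>{escape(sentence)}</S>")
    lines += ["</TEXT>", "</BODY>", "</DOCSENT>"]
    return "\n".join(lines)
```

A call such as to_docsent("GA3", split_sentences(html_to_text(page)))
would chain the three steps, where page is the HTML of one article. For
real news pages you will still need site-specific code to drop menus,
ads and footers before step 2, which is why no standard tool exists.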
Drago
Xiang Ji wrote:
>
>
> Dr.Radev,
>
> I downloaded and tried MEAD305 recently.
> But I can't find anything in MEAD305 that
> can convert a regular document into the docsent
> format that is accepted by MEAD305. For example,
> how do I convert a web news article into test data
> like GA3? Thank you for your help and for providing the module.
>
> Best,
> Xiang
--
Dragomir R. Radev    radev@umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225  Fax: 734-764-2475
http://www.si.umich.edu/~radev
-
To unsubscribe: send "unsubscribe mead" to majordomo@si.umich.edu