Re: MEAD: Question about *.docsent

From:
Date: Thu Mar 21 2002 - 09:46:07 EST


I just answered this question on the mailing list. Didn't you get the
answer?

Here it is one more time:

Hi,

You have to write your own script for that. Web pages come in so many
different formats and styles that it is impossible to write a standard
conversion tool.

You need to go through the following steps:

1. HTML to text: strip tags, JavaScript, etc.
2. Sentence boundary identification
3. Conversion to docsent format

There are free tools on the web for steps 1 and 2. For step 1 you can
use any of a number of tools, such as lynx. For step 2, check
www.cpan.org for a Perl implementation.
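
To make the three steps concrete, here is a rough sketch in Python. It
is not part of MEAD, and it is only an illustration under assumptions:
the function names are made up, the sentence splitter is a naive regex
rather than a real boundary detector, and the element and attribute
names (DOCSENT, BODY, TEXT, S, DID, LANG, PAR, RSNT, SNO) are assumed
from typical docsent files, so check them against the docsent.dtd that
ships with your copy of MEAD.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class TextExtractor(HTMLParser):
    """Step 1: collect visible text, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}
    # Tags that should separate words/sentences once they are removed.
    BLOCK = {"p", "br", "div", "li", "tr", "td", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Step 1: strip tags and scripts, then collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


def split_sentences(text):
    """Step 2 (naive): split after '.', '!' or '?' followed by whitespace.

    This breaks on abbreviations such as "Dr."; use a real sentence
    splitter for anything serious.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def to_docsent(doc_id, sentences):
    """Step 3: wrap sentences in an assumed docsent-style layout.

    Every sentence is put in paragraph 1; adjust PAR/RSNT if you track
    paragraphs. Tag and attribute names are assumptions, see above.
    """
    lines = [
        "<?xml version='1.0' encoding='UTF-8'?>",
        '<!DOCTYPE DOCSENT SYSTEM "docsent.dtd">',
        '<DOCSENT DID=%s LANG="ENG">' % quoteattr(doc_id),
        "<BODY>",
        "<TEXT>",
    ]
    for n, sentence in enumerate(sentences, start=1):
        lines.append('<S PAR="1" RSNT="%d" SNO="%d">%s</S>'
                     % (n, n, escape(sentence)))
    lines += ["</TEXT>", "</BODY>", "</DOCSENT>"]
    return "\n".join(lines)
```

You would chain them as
to_docsent("mydoc", split_sentences(html_to_text(page))), where page is
the HTML source as a string. For plain text files, skip step 1.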

Drago

Yee Gu wrote:
>
>
> Hi, MEAD team,
>
> I noticed that *.docsent files are XML files that use docsent.dtd.
> My question is how to convert ordinary text files (txt, html, ...) to
> *.docsent files the way MEAD does.
> Are there special Perl scripts for this processing?
>
> many thanks !
>
> yee
>
>

-- 
Dragomir R. Radev                                         radev@umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev


