I just answered this question on the mailing list. Didn't you get the
answer?
Here it is one more time:
Hi,
You have to write your own script for that. Web pages come in so many
different formats and styles that it is impossible to write a standard
conversion tool.
You need to go through the following steps:
1. HTML to text --> strip tags, JavaScript, etc.
2. sentence boundary identification
3. conversion to docsent format
There are free tools on the web for steps 1 and 2. For step 1, you can
use any of a number of tools, such as lynx (its -dump option prints the
rendered page as plain text). For step 2, check www.cpan.org for a Perl
implementation.
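
The three steps above can be sketched roughly as follows. This is only a
minimal illustration in Python, not a MEAD tool: the function names
(html_to_text, split_sentences, to_docsent) are made up for this example,
the sentence splitter is a naive regular expression, and the element and
attribute names (DOCSENT, BODY, TEXT, S, DID, LANG, PAR, RSNT, SNO) are
assumptions that should be checked against the docsent.dtd shipped with
your MEAD installation. Every sentence is placed in paragraph 1 for
simplicity.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Tags treated as word/sentence separators (a partial, assumed list).
    BLOCK = {"p", "br", "div", "li", "tr", "td", "th", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Step 1: drop tags and scripts, then collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def split_sentences(text):
    """Step 2: naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def to_docsent(sentences, doc_id):
    """Step 3: wrap sentences in an assumed docsent-style XML layout."""
    lines = [
        "<?xml version='1.0' encoding='UTF-8'?>",
        '<!DOCTYPE DOCSENT SYSTEM "docsent.dtd">',
        f"<DOCSENT DID={quoteattr(doc_id)} LANG='ENG'>",
        "<BODY>",
        "<TEXT>",
    ]
    for n, sentence in enumerate(sentences, 1):
        # All sentences go in paragraph 1; real data may need PAR tracking.
        lines.append(f"<S PAR='1' RSNT='{n}' SNO='{n}'>{escape(sentence)}</S>")
    lines += ["</TEXT>", "</BODY>", "</DOCSENT>"]
    return "\n".join(lines) + "\n"
```

A call such as to_docsent(split_sentences(html_to_text(page)), "doc1")
chains the three steps. Abbreviations like "Dr." will fool the splitter,
which is one reason to prefer a dedicated sentence-boundary tool for
step 2.
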
Drago
Yee Gu wrote:
>
>
> Hi MEAD team,
>
> I noticed that *.docsent files are XML files that use docsent.dtd.
> My question is how to convert ordinary text files (txt, html, ...) to
> *.docsent files the way MEAD does.
> Are there special Perl scripts for this processing?
>
> many thanks !
>
> yee
>
--
Dragomir R. Radev                                  radev@umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225    Fax: 734-764-2475
http://www.si.umich.edu/~radev
This archive was generated by hypermail 2b30 : Tue Jun 09 2009 - 05:00:05 EDT