Data Sets

The following datasets are available:

* Real-Web dataset containing hash values of the content of 353,739 web pages collected over a period of six months (Feb. 1999 - July 1999). [ history.all.gz ]

* The same real-web dataset formatted in three columns (web_site, web_page, change_history). The change history is a sequence of bits: 1 means that the page changed between two consecutive visits and 0 means that it remained the same (e.g., 10000 means that the page changed the second time we visited it, i.e., in March). [ history.all.norm.gz ]

* Synthetic dataset containing information for 300,000 pages in three columns (web_site, web_page, change_history) over 200 visiting cycles. The change frequency of the pages follows a normal distribution. [ synthetic.all.norm.gz ]

* Sample collection of blogs from UCLA used in lexical networks research. This data set also includes generated cosine values and lexical networks for the data. Includes instructions for processing with Clairlib. [ lexnets-R1000.tar.gz ]

* Lexical networks generated from a small sample from the 2004 Document Understanding Conference. Includes instructions for processing with Clairlib. [ lexnets-duc04t4.tar.gz ]
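
The three-column files (history.all.norm.gz and synthetic.all.norm.gz) could be read along the following lines. This is only a minimal sketch: it assumes that each line holds exactly three whitespace-separated fields and that the identifiers contain no spaces, neither of which is stated above, so check the delimiter against the actual files. The names PageHistory, parse_line, change_rate, and read_histories are hypothetical helpers introduced for illustration.

```python
import gzip
from typing import Iterator, NamedTuple


class PageHistory(NamedTuple):
    """One record of the three-column format (hypothetical helper type)."""
    web_site: str
    web_page: str
    change_history: str  # string of '0'/'1' bits, one bit per visit interval


def parse_line(line: str) -> PageHistory:
    """Parse one line, ASSUMING three whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
    site, page, bits = fields
    if set(bits) - {"0", "1"}:
        raise ValueError(f"change history is not a bit string: {bits!r}")
    return PageHistory(site, page, bits)


def change_rate(history: str) -> float:
    """Fraction of visit intervals in which the page was seen to change."""
    return history.count("1") / len(history) if history else 0.0


def read_histories(path: str) -> Iterator[PageHistory]:
    """Stream records from a gzip-compressed file, skipping blank lines."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

For example, parse_line("12 345 10000") yields a record whose change_history contains a single 1, so change_rate reports that the page changed in one of five intervals. Streaming with a generator avoids loading all records into memory at once.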

Copyright (c) 2007 - Computer Science Department University of California Los Angeles