My favorite corpora

Here are my favorite corpora:

Enron email
CIA world factbook
DBLP: papers in CS
US congressional speeches
AOL queries
Netflix recommendations
IMDB
PUBMED: biomedical paper abstracts
Wikipedia
ACL Anthology
DOTGOV: download of .GOV
biocreative: biomedical papers
WT100G: 100GB download of the web
Google n-grams
webfreq
SMS corpus
Citeseer
DMOZ
corpus of paraphrases
multilingual parallel parliamentary proceedings
textual entailment corpus
question answering corpus
summarization corpus
various text classification corpora (Reuters-21578, 20NG)
Peekaboom

Comments are closed.