Category
English corpora
British National Corpus
100-million-word text corpus of samples of written and spoken English from a wide range of sources
Brown Corpus
data set of American English texts published in 1961
BookCorpus
BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords. It was the main corpus used to train the initial GPT model by OpenAI, and has been used as training data for other early large language models including Google's BERT. The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.
Corpus of Contemporary American English
corpus of more than 560 million words of American English