Category

English corpora

British National Corpus

100-million-word text corpus of samples of written and spoken English from a wide range of sources

Brown Corpus

data set of American English compiled from texts published in 1961

BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords. It was the main corpus used to train the initial GPT model by OpenAI, and has been used as training data for other early large language models including Google's BERT. The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.

Corpus of Contemporary American English

corpus of American English containing more than 560 million words