text corpus

Also known as corpus, corpora, language corpus

large, structured set of texts that serves as the basis for linguistic research

Described at

Corpora - English Language: a short guide to online resources - Oxford LibGuides at Oxford University

libguides.bodleian.ox.ac.uk

Does 'wicked' generally mean 'good' or 'bad'? Has this meaning changed over time? Does the use differ between different kinds of text? Do different (kinds of) speakers use the word in the same way? Specialised corpora can be used to examine or compare language varieties, such as language from a particular area, from a certain genre or text type, or produced by particular language users.

Corpora can be synchronic (covering one period) or diachronic (covering several periods), consist of different media (written or spoken language), and include different languages. Annotated corpora have extra information added, usually linguistic information (part-of-speech tags, lemmata) or metadata (information about the material in the corpus, speakers or authors, situation, extra-linguistic information, etc.). Some corpora can be consulted online via a custom-built interface; others are explored with stand-alone tools that you install on your computer.

One corpus is based on the Proceedings of the Old Bailey, published from 1674 to 1913. The 2163 volumes contain almost 134 million words. Since the proceedings were taken down in shorthand by scribes in the courtroom, the verbatim passages are arguably as near as we can get to the spoken word of the period. The material thus offers the rare opportunity of analyzing spoken language in a period that has been neglected both with regard to the compilation of primary linguistic data and the description of the structure, variability, and change of English.

The Oxford Text Archive (OTA) contains many useful corpora available to download. Files downloaded from the OTA need to be opened in software that can process corpora; we recommend AntConc. You will need to download AntConc and then load your files into it. The creators of AntConc have produced extensive video guides, and we recommend that you work through these to understand all the functions before beginning any analysis.

British National Corpus (20th-century English): a large corpus of written and spoken (transcribed) material from different genres. Considered a standard reference. Available via different tools.

AntConc is a freeware corpus analysis toolkit for concordancing and text analysis.


Article

In linguistics and natural language processing, a corpus (pl.: corpora) or text corpus is a dataset of language resources, either natively digital or older and digitized, that may be annotated or unannotated. Annotated corpora have been used in corpus linguistics for statistical hypothesis testing, checking occurrences, and validating linguistic rules within a specific language territory.
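As a rough illustration of what annotation adds, the Python sketch below stores each token with a lemma and a part-of-speech tag, attaches a period label to each document as metadata, and checks how often a lemma occurs per million tokens in each period. The `Token` class, the `per_million` helper, the tag names, and the tiny documents are all hypothetical; real corpora are far larger and use their own annotation schemes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # the word as it appears in the text
    lemma: str  # dictionary form (linguistic annotation)
    pos: str    # part-of-speech tag (linguistic annotation)


# Hypothetical mini-corpus: each document carries metadata ("period")
# alongside its annotated tokens.
docs = [
    {
        "period": "1700s",
        "tokens": [
            Token("The", "the", "DET"),
            Token("wicked", "wicked", "ADJ"),
            Token("men", "man", "NOUN"),
            Token("fled", "flee", "VERB"),
        ],
    },
    {
        "period": "1900s",
        "tokens": [
            Token("Wicked", "wicked", "ADJ"),
            Token("songs", "song", "NOUN"),
        ],
    },
]


def per_million(documents, lemma):
    """Frequency of `lemma` per million tokens, grouped by period."""
    hits = Counter()
    totals = Counter()
    for doc in documents:
        period = doc["period"]
        totals[period] += len(doc["tokens"])
        hits[period] += sum(1 for tok in doc["tokens"] if tok.lemma == lemma)
    return {p: hits[p] / totals[p] * 1_000_000 for p in totals}


rates = per_million(docs, "wicked")
print(rates)
```

Searching by lemma rather than surface form is what lets "Wicked" and "wicked" count as the same word here; normalising to a rate per million tokens makes sub-corpora of different sizes comparable before any statistical test is applied.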

Overview