Category

Corpus linguistics

text corpus
large and structured set of texts being the basis for linguistic research
corpus linguistics
branch of linguistics that studies language through examples contained in real texts
Zipf's law
probability distribution
hapax legomenon
word that occurs only once in a language's written record, an author's corpus, or a text
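As a rough illustration of the definition above, the following sketch lists the words that occur exactly once in a single text. The function name `hapax_legomena` and the simple regex tokenizer are assumptions made for this example; real corpus work would normally use a proper tokenizer and often lemmatization.

```python
import re
from collections import Counter


def hapax_legomena(text: str) -> list[str]:
    """Return words occurring exactly once in `text`, in order of first appearance."""
    # Naive tokenization (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Counter preserves insertion order, so the result follows first appearance.
    return [word for word in counts if counts[word] == 1]


if __name__ == "__main__":
    print(hapax_legomena("the cat saw the dog and the bird"))
```

Here "the" is excluded because it appears three times; every other word is a hapax within this one sentence.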
collocation
In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.
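One common way to make "more often than would be expected by chance" concrete is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is only one possible measure, chosen here for brevity; the function name `bigram_pmi` and the maximum-likelihood probability estimates are assumptions of this example, not something the definition above prescribes.

```python
import math
from collections import Counter


def bigram_pmi(tokens: list[str]) -> dict[tuple[str, str], float]:
    """Score each adjacent word pair by PMI = log2(P(w1, w2) / (P(w1) * P(w2)))."""
    if len(tokens) < 2:
        return {}
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        # Positive PMI: the pair occurs more often than independence would predict.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    tokens = "strong tea strong tea weak coffee".split()
    for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

PMI is known to overrate rare pairs, so in practice it is usually combined with a minimum frequency threshold or replaced by a significance-based measure.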
WordNet
WordNet is a lexical database that links words through semantic relations such as synonymy, hyponymy, and meronymy. Synonyms are grouped into synsets with short definitions and usage examples. It can thus be seen as a combination and extension of a dictionary and a thesaurus. Its primary use is in automatic text analysis and artificial intelligence applications. It was first created for English, and the English WordNet database and software tools have been released under a BSD-style license and are freely available for download. The latest offici
n-gram
An n-gram is a sequence of n adjacent symbols in a particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus.
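A minimal sketch of n-gram extraction, assuming the input is any Python sequence of symbols (a string of characters or a list of words). The helper name `ngrams` is hypothetical and chosen only for this example.

```python
from typing import Sequence


def ngrams(symbols: Sequence, n: int) -> list[tuple]:
    """Return every run of n adjacent symbols, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


if __name__ == "__main__":
    print(ngrams("corpus", 3))                      # character trigrams
    print(ngrams("to be or not to be".split(), 2))  # word bigrams
```

The same function covers letters, words, phonemes, or base pairs, because it only relies on the symbols being stored in order.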
part-of-speech tagging
process of identifying the grammatical type of words in a text
bitext
text placed alongside its translation or translations
Lingua Libre
tool created by Wikimedians for recording word pronunciations and signs
Google Ngram Viewer
online search engine
topic model
type of model
treebank
Most syntactic treebanks annotate variants of either phrase structure or dependency structure.
speech corpus
speech audio files and text transcriptions
co-occurrence
In linguistics, co-occurrence or cooccurrence (in older texts often shown with diacritic as coöccurrence) is an above-chance frequency of ordered occurrence of two adjacent terms in a text corpus. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. Corpus linguistics and its statistical analyses can reveal regular patterns of co-occurrence within a language and help identify typical collocations for its lexical items.
concordancer
A concordancer is a computer program that automatically constructs a concordance. The output of a concordancer may serve as input to a translation memory system for computer-assisted translation, or as an early step in machine translation.
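The core of a concordancer can be sketched as a keyword-in-context (KWIC) lookup: for each occurrence of a search word, return the words around it. The function name `concordance`, the `window` parameter, and the case-insensitive matching are assumptions of this sketch, not features of any particular tool.

```python
def concordance(tokens: list[str], keyword: str, window: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, match, right context) for each occurrence of `keyword`."""
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            # Take up to `window` tokens on each side of the match.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    for left, match, right in concordance(tokens, "the", window=2):
        print(f"{left:>10} [{match}] {right}")
```

Returning the context as separate left and right strings keeps the match aligned in a column when printed, which is the usual layout of a concordance.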
FrameNet
FrameNet is a group of online lexical databases based upon the theory of meaning known as Frame semantics, developed by linguist Charles J. Fillmore. The project's fundamental notion is simple: most words' meanings may be best understood in terms of a semantic frame, which is a description of a certain kind of event, connection, or item and its actors.
Global Language Monitor
American media analytics company
keyword
word that occurs in a text more often than would be expected by chance alone
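Keyness is usually measured against a reference corpus, which stands in for what "chance" would predict. The sketch below uses the log-likelihood statistic that is widely used for comparing word frequencies across two corpora. The function names, the choice of statistic, and the rule of keeping only words that are relatively more frequent in the study text are assumptions of this example, and both token lists are assumed to be non-empty.

```python
import math
from collections import Counter


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood for a word seen a times in c study tokens and b times in d reference tokens."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_study)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total


def keywords(study: list[str], reference: list[str]) -> list[tuple[str, float]]:
    """Rank words that are relatively more frequent in `study` than in `reference`."""
    study_counts, ref_counts = Counter(study), Counter(reference)
    c, d = len(study), len(reference)
    scored = []
    for word, a in study_counts.items():
        b = ref_counts[word]
        if a / c > b / d:  # keep only words overused in the study text
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    study = "corpus corpus corpus the the".split()
    reference = "the the the the a a a a a corpus".split()
    print(keywords(study, reference))
```

In the usage example, "the" has the same relative frequency in both lists and is therefore not reported, while "corpus" is flagged as a keyword.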
Zipf–Mandelbrot law
discrete probability distribution