Category

Corpus linguistics

page 1

large and structured set of texts being the basis for linguistic research

corpus linguistics

branch of linguistics that studies language through examples contained in real texts

probability distribution

hapax legomenon

word that occurs only once in a language's written record, an author's corpus, or a text

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.

WordNet is a lexical database of semantic relations between words that links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples. It can thus be seen as a combination and extension of a dictionary and thesaurus. Its primary use is in automatic text analysis and artificial intelligence applications. It was first created in the English language and the English WordNet database and software tools have been released under a BSD style license and are freely available for download. The latest offici

An '''n-gram' is a sequence of n adjacent symbols in a particular order. The symbols may be n'' adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus.

part-of-speech tagging

process of identifying the grammatical type of words in a text

text placed alongside its translation or translations

Words pronunciation and signs recording tool by Wikimedians

Google Ngram Viewer

online search engine

thumb|upright=1.35|right|Most syntactic treebanks annotate variants of either Phrase structure grammar|phrase structure (left) or dependency structure (right).

speech audio files and text transcriptions

In linguistics, co-occurrence or cooccurrence (in older texts often shown with diacritic as coöccurrence) is an above-chance frequency of ordered occurrence of two adjacent terms in a text corpus. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. Corpus linguistics and its statistical analyses can reveal (regularity of) patterns of co-occurrences within a language and enable the working out of typical collocations for its lexical items.

A concordancer is a computer program that automatically constructs a concordance. The output of a concordancer may serve as input to a translation memory system for computer-assisted translation, or as an early step in machine translation.

FrameNet is a group of online lexical databases based upon the theory of meaning known as Frame semantics, developed by linguist Charles J. Fillmore. The project's fundamental notion is simple: most words' meanings may be best understood in terms of a semantic frame, which is a description of a certain kind of event, connection, or item and its actors.

Global Language Monitor

American media analytics company

word which occurs in a text more often than we would expect to occur by chance alone

Zipf–Mandelbrot law

discrete probability distribution