Category

Computational linguistics

machine translation

use of software for language translation

natural language processing

field of computer science and linguistics

computational linguistics

interdisciplinary field

optical character recognition

computer recognition of visual text

speech synthesis

artificial production of human speech

speech recognition

automatic conversion of spoken language into text

large and structured set of texts being the basis for linguistic research

Levenshtein distance

computer science metric for string similarity
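
As an illustration, a minimal Python sketch of the usual dynamic-programming computation of this distance is given below. It assumes unit cost for insertion, deletion, and substitution; the function name `levenshtein` is chosen for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # previous[j] is the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of `b` rather than with the product of both lengths.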

probability distribution

process of analysing text to extract information from it

Hamming distance

number of positions at which two equal-length strings (for example, bit strings) differ
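
A short Python sketch of this count follows, under the assumption that both inputs have the same length. The helper names `hamming` and `hamming_bits` are illustrative, not taken from any particular library.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_bits(x: int, y: int) -> int:
    """Count the differing bits of two non-negative integers."""
    # XOR sets a bit exactly where the two inputs disagree.
    return bin(x ^ y).count("1")


print(hamming("karolin", "kathrin"))       # 3
print(hamming_bits(0b1011101, 0b1001001))  # 2
```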

Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase tatoeba (例えば), meaning 'for example'. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as "Tatoebans". It is run by Association Tatoeba, a French non-profit organization funded through donations.

word-sense disambiguation

problem in natural language processing: identifying which sense of a word with multiple meanings is used in a sentence

neural machine translation

approach to machine translation in which a large neural network is trained to maximize translation performance

ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar

WordNet is a lexical database of semantic relations between words that links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples. It can thus be seen as a combination and extension of a dictionary and thesaurus. Its primary use is in automatic text analysis and artificial intelligence applications. It was first created in the English language and the English WordNet database and software tools have been released under a BSD style license and are freely available for download. The latest offici

An n-gram is a sequence of n adjacent symbols in a particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus.
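
The definition can be sketched in a few lines of Python. The function name `ngrams` is an assumption made for this example; it accepts any sequence of symbols, so the same code covers letters and whole words.

```python
from typing import List, Sequence, Tuple


def ngrams(symbols: Sequence, n: int) -> List[Tuple]:
    """Return every run of n adjacent symbols, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


# Letter trigrams (blanks count as symbols).
print(ngrams("to be", 3))
# Word bigrams.
print(ngrams("to be or not".split(), 2))
```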

handwriting recognition

ability of a computer to receive and interpret intelligible handwritten input

technique in natural language processing that represents words as vectors in a continuous vector space
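
To make the idea of words as vectors concrete, the sketch below compares vectors by cosine similarity, a measure commonly used with such representations. The three-dimensional vectors in `toy_vectors` are invented for illustration only; real systems learn vectors with far more dimensions from data.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, hand-written for this example (not learned from data).
toy_vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

for other in ("queen", "apple"):
    score = cosine_similarity(toy_vectors["king"], toy_vectors[other])
    print(f"king vs {other}: {score:.3f}")
```

With these made-up values, "king" scores higher against "queen" than against "apple", which is the kind of relationship such a vector space is meant to capture.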

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
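
The following Python sketch shows a deliberately naive suffix-stripping stemmer and the conflation step described above. It is not one of the published stemming algorithms; the suffix list and the names `toy_stem` and `conflate` are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical suffix list, checked from longest to shortest.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def conflate(words: Iterable[str]) -> Dict[str, List[str]]:
    """Group words that share a stem, as a search engine might for query expansion."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


print(conflate(["connect", "connected", "connecting", "connects"]))
print(toy_stem("running"))  # "runn": a usable stem, though not a valid root
```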

Lemmatization (or less commonly lemmatisation) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

automatic summarization

computer-based method for shortening a text

question answering

research area in computer science

named-entity recognition

extraction of named entity mentions in unstructured text into pre-defined categories

part-of-speech tagging

process of identifying the grammatical type of words in a text

mobile translation

a device, common in works of fiction, that offers instant translation of any language; as a convention, it is used to remove the problem of translating between alien languages

gesture recognition

topic in language and computer science

foundation model

artificial intelligence model paradigm

artificial intelligence content detection

algorithms to detect AI-generated content

Google Ngram Viewer

online search engine

pattern that estimates the exponentially diminishing returns of extending a search for references in science journals

[Figure: Most syntactic treebanks annotate variants of either phrase structure (left) or dependency structure (right).]

BabelNet is a multilingual lexical-semantic knowledge graph, ontology and encyclopedic dictionary developed at the NLP group of the Sapienza University of Rome under the supervision of Roberto Navigli. BabelNet was automatically created by linking Wikipedia to the most popular computational lexicon of the English language, WordNet. The integration is done using an automatic mapping and by filling in lexical gaps in resource-poor languages by using statistical machine translation. The result is an encyclopedic dictionary that provides concepts and named entities lexicalized in many languages an

machine-readable dictionary

dictionary stored as machine (computer) data

Universal Networking Language

declarative formal language that represents semantic data in texts

Google Neural Machine Translation

system developed by Google to increase fluency and accuracy in Google Translate

word frequency list

list of words with their frequency
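
A word frequency list can be built with the Python standard library, as in the sketch below. The tokenization rule (lowercase runs of letters and apostrophes) is a simplifying assumption, and `word_frequencies` is a name chosen for this example.

```python
import re
from collections import Counter
from typing import List, Tuple


def word_frequencies(text: str) -> List[Tuple[str, int]]:
    """Return (word, count) pairs, most frequent first."""
    # Simplified tokenization: lowercase, then take runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


for word, count in word_frequencies("The cat and the hat and the bat"):
    print(word, count)
```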

language identification

Determination of language from a text sample

Association for Computational Linguistics

learned society and publisher

Zeta distribution

probability distribution on the integers in which the probability of a number is inversely proportional to a fixed power of the number
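
Written out, with the fixed power denoted by s (a symbol introduced here for illustration) and the support taken to be the positive integers, the probability mass function is normalized by the Riemann zeta function:

```latex
P(X = k) = \frac{k^{-s}}{\zeta(s)}, \qquad
\zeta(s) = \sum_{n=1}^{\infty} n^{-s}, \qquad
k = 1, 2, 3, \dots, \quad s > 1
```

For example, with s = 2 the normalizing constant is \zeta(2) = \pi^2 / 6, so P(X = 1) = 6 / \pi^2 \approx 0.608 and P(X = 2) = P(X = 1) / 4 \approx 0.152.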

Trigrams are a special case of the n-gram, where n is 3. They are often used in natural language processing for performing statistical analysis of texts and in cryptography for control and use of ciphers and codes. See results of analysis of "Letter Frequencies in the English Language".

distributional semantics

research area in semantic similarities between linguistic items

intelligent character recognition

computer recognition of written text

Lexical Markup Framework

ISO standard for Natural Language Processing (NLP) lexicons and Machine Readable Dictionaries (MRD)

semantic role labeling

Process in natural language processing

semantic similarity

Metric in computational linguistics

voice activity detection

technique used in speech processing in which the presence or absence of human speech is detected

Culturomics is a form of computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts. Researchers data mine large digital archives to investigate cultural phenomena reflected in language and word usage. The term is an American neologism first described in a 2010 Science article called Quantitative Analysis of Culture Using Millions of Digitized Books, co-authored by Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden.

grammar induction

machine learning process

information provided by direct observation

interlingual machine translation

type of machine translation

heuristic for distinct words in a document

speech-generating device

augmenting speech device

FrameNet is a group of online lexical databases based upon the theory of meaning known as Frame semantics, developed by linguist Charles J. Fillmore. The project's fundamental notion is simple: most words' meanings may be best understood in terms of a semantic frame, which is a description of a certain kind of event, connection, or item and its actors.

Zipf–Mandelbrot law

discrete probability distribution

natural-language user interface

type of human-computer interface

Text Retrieval Conference

Annual meeting focused on measuring the quality of search engines, recommender engines, and algorithms for text retrieval

classical algorithm for word sense disambiguation

real-world object, such as a person, location, organization, or product, that can be denoted with a proper name; it can be abstract or have a physical existence