Category

Natural language processing

natural language processing

field of computer science and linguistics

natural language

language naturally spoken by humans, as opposed to "constructed" and "formal" languages

large language model

language model built with very large amounts of text

software used to detect misspelled words in a document

information retrieval

activity of obtaining information resources relevant to an information need from a collection of information resources

prompt engineering

creation or optimization of a prompt to be given to an artificial intelligence model

process of analysing text to extract information from it

Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase tatoeba (例えば), meaning 'for example'. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as "Tatoebans". It is run by Association Tatoeba, a French non-profit organization funded through donations.

unary operation on sets of strings, used in regular expressions for "zero or more repetitions"
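As a rough illustration of "zero or more repetitions", the sketch below uses the `*` quantifier from Python's standard `re` module. The pattern `ab*` and the helper name `matches` are chosen here for illustration only.

```python
import re

# "ab*" means: one "a" followed by zero or more "b" characters.
pattern = re.compile(r"ab*")

def matches(s: str) -> bool:
    """Return True if the whole string is in the language of the pattern."""
    return pattern.fullmatch(s) is not None

# Check a few candidate strings against the pattern.
results = {s: matches(s) for s in ["a", "ab", "abbb", "b", "aba"]}
```

The strings `a`, `ab`, and `abbb` are accepted because the starred `b` may repeat any number of times, including zero; `b` and `aba` are rejected.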

word-sense disambiguation

problem of natural language processing: identifying which sense of a word with multiple meanings is used in a sentence

An n-gram is a sequence of n adjacent symbols in a particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus.
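A minimal sketch of collecting n-grams with a sliding window, assuming the input is any Python sequence of symbols (a string of characters or a list of words). The function name `ngrams` and the sample sentence are illustrative and not taken from any particular library.

```python
from collections import Counter

def ngrams(symbols, n):
    """Return every run of n adjacent symbols, in their original order."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

# Character trigrams, with the blank counted as a symbol.
char_trigrams = ngrams("to be", 3)

# Word bigrams from a whitespace-tokenized sentence, then their frequencies.
word_bigrams = ngrams("to be or not to be".split(), 2)
counts = Counter(word_bigrams)
```

Here `counts[("to", "be")]` is 2, since that pair occurs twice in the sample sentence; a sequence shorter than n yields an empty list.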

technique in natural language processing that represents words as vectors in a continuous vector space

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
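The following is a deliberately tiny suffix-stripping sketch, not the Porter stemmer or any other published algorithm. The suffix list and the minimum stem length of 3 are assumptions made for illustration. It shows related words mapping to one stem, and a stem that is not itself a valid word.

```python
# Hypothetical, very small rule set; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Conflation: all four inflected forms map to the same stem.
stems = {w: toy_stem(w) for w in ["connect", "connected", "connecting", "connects"]}

# The stem need not be a valid root: "running" becomes "runn".
odd_stem = toy_stem("running")
```

A search engine using such a function could index and query on `toy_stem(word)` instead of the raw word, so that a query for "connecting" also retrieves documents containing "connected".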

predictive text input technology for mobile phones

controlled language

subset of a natural language

Stylometry is the application of the study of linguistic style, usually to written language. It has also been applied successfully to music, paintings, chess, and source code.

retrieval-augmented generation

techniques that enable large-scale language models to retrieve and incorporate new information from external data sources

natural language understanding

subtopic of natural language processing in artificial intelligence

automatic summarization

computer-based method for shortening a text

document classification

problem in library science, information science and computer science

language technology

natural language processing and computational linguistics

information extraction

automatically extracting structured information from un- or semi-structured machine-readable documents, such as human language texts

bag-of-words model

model of text which uses a representation of text that is based on an unordered collection (a "bag") of words
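One way to sketch a bag of words is as a word-count multiset, here built with `collections.Counter`. Lowercasing and splitting on whitespace are simplifying assumptions; real pipelines usually tokenize more carefully.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count word occurrences, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Word order is lost, so the two sentences have identical bags.
same_bag = a == b
```

The two sentences mean different things but produce the same representation, which is the main limitation of the model.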

Natural Language Toolkit

suite for natural language processing (NLP)

foundation model

artificial intelligence model paradigm

text-to-video model

machine learning model

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.

grammar checker

computer program that verifies written text for grammatical correctness

Google Ngram Viewer

online search engine

latent semantic analysis

technique in natural language processing

artificial intelligence content detection

algorithms to detect AI-generated content

predictive text

input technology for mobile phone keypads

speech segmentation

process (mental or computational) of analyzing spoken natural language to identify its constituents

machine learning based toolkit for the processing of natural language text

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

reasoning language model

language models designed for reasoning tasks

language identification

determination of the language of a text sample

example-based machine translation

method of machine translation

phrase structure grammar

type of grammar based on constituent entities

Association for Computational Linguistics

learned society and publisher

history of machine translation

aspect of history

sequence-to-sequence learning

[Figure: animation of seq2seq with an RNN and attention mechanism.] Seq2seq is a family of machine learning approaches used for natural language processing. Originally developed by Lê Viết Quốc, a Vietnamese computer scientist and a machine learning pioneer at Google Brain, this framework has become foundational in many modern AI systems. Applications include language translation, image captioning, conversational models, speech recognition, and text summarization. Seq2seq uses sequence transformation: it turns one sequence into another sequence.

Trigrams are a special case of the n-gram, where n is 3. They are often used in natural language processing for performing statistical analysis of texts and in cryptography for control and use of ciphers and codes. See results of analysis of "Letter Frequencies in the English Language".

Lexical Markup Framework

ISO standard for Natural Language Processing (NLP) lexicons and Machine Readable Dictionaries (MRD)

in computer science, a rewrite rule specifying a substitution that can be recursively performed to generate new sequences
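A small sketch of recursive substitution, assuming a hypothetical one-symbol grammar with the rules S -> aSb and S -> (empty string). The names `RULES` and `generate` are illustrative only.

```python
# Hypothetical grammar: the nonterminal "S" rewrites to "aSb" or to nothing.
RULES = {"S": ["aSb", ""]}

def generate(symbols: str, max_steps: int) -> set:
    """Terminal strings derivable from `symbols` in at most max_steps rewrites."""
    if not any(ch in RULES for ch in symbols):
        return {symbols}      # only terminals remain: derivation finished
    if max_steps == 0:
        return set()          # rewrite budget exhausted before finishing
    # Rewrite the leftmost nonterminal with each of its alternatives.
    i = next(idx for idx, ch in enumerate(symbols) if ch in RULES)
    out = set()
    for rhs in RULES[symbols[i]]:
        out |= generate(symbols[:i] + rhs + symbols[i + 1:], max_steps - 1)
    return out

language = generate("S", 3)
```

With a budget of three rewrites, `language` contains the empty string, `ab`, and `aabb`; raising the budget adds longer strings of the same balanced shape.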

grammar induction

machine learning process

rule-based machine translation

type of machine translation

textual entailment

concept in natural language processing

ontology learning

automatic creation of ontologies

Text Retrieval Conference

annual meeting focused on measuring the quality of search engines, recommender engines, and algorithms for text retrieval

real-world object, such as a person, location, organization, or product, that can be denoted with a proper name; it can be abstract or have a physical existence

classical algorithm for word sense disambiguation

natural-language user interface

type of human-computer interface

the task of assigning a unique identity to entities mentioned in text

text simplification

automated process

language resource

[Figure: Architecture of ELMo. It first processes input tokens into embedding vectors by an embedding layer (essentially a lookup table), then applies a pair of forward and backward LSTMs to produce two sequences of hidden vectors, then applies another pair of forward and backward LSTMs, and so on.]
[Figure: How a token is transformed successively over increasing layers of ELMo. At the start, the token is converted to a vector by a linear layer, giving the embedding vector e_0. In the next layer, a forward LSTM produces a hidden vector h_{00}, while a backward LSTM produces another hidden vector …]

Rhetorical Structure Theory

theory of text organization