Category

Natural language processing

natural language processing
field of computer science and linguistics
natural language
language naturally spoken by humans, as opposed to "constructed" and "formal" languages
large language model
language model built with very large amounts of texts
spell checker
software used to detect misspelled words in a document
information retrieval
activity of obtaining information resources relevant to an information need from a collection of information resources
prompt engineering
creation or optimization of a prompt to be given to an artificial intelligence model
text mining
process of analysing text to extract information from it
Tatoeba
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase tatoeba (例えば), meaning 'for example'. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as "Tatoebans". It is run by Association Tatoeba, a French non-profit organization funded through donations.
Kleene star
unary operation on sets of strings, used in regular expressions for "zero or more repetitions"
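As a rough illustration of "zero or more repetitions", the sketch below enumerates a bounded part of the Kleene star of a small set of strings and checks membership with the standard `re` module. The helper names `kleene_star` and `matches_star` and the `max_repeats` bound are hypothetical, introduced only for this example; the true Kleene star is an infinite set, so only a finite prefix can be listed.

```python
import re
from itertools import product


def kleene_star(symbols, max_repeats):
    """List strings built from 0..max_repeats concatenations of `symbols`.

    Hypothetical helper: the real Kleene star is infinite, so this only
    enumerates a bounded prefix of it. Duplicates are possible when the
    same string can be formed in more than one way.
    """
    result = []
    for k in range(max_repeats + 1):
        # product(..., repeat=0) yields one empty tuple, i.e. the empty string.
        for combo in product(symbols, repeat=k):
            result.append("".join(combo))
    return result


def matches_star(text):
    """Return True if `text` consists of zero or more repetitions of 'a'."""
    return re.fullmatch(r"a*", text) is not None


print(kleene_star(["ab", "c"], 2))
print(matches_star(""), matches_star("aaa"), matches_star("ab"))
```

Note that the empty string always belongs to the result, which is what distinguishes "zero or more" from "one or more" repetitions.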
word-sense disambiguation
problem of natural language processing; identifying which sense of a word with multiple meanings is used in a sentence
n-gram
An n-gram is a sequence of n adjacent symbols in a particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus.
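The definition above can be sketched in a few lines of Python. This is a minimal example, not a reference implementation; the function name `ngrams` is an assumption made for illustration, and the input may be any sequence of symbols (characters of a string, or a list of words).

```python
def ngrams(symbols, n):
    """Return all runs of n adjacent symbols, in order, as tuples.

    `symbols` can be a string (character n-grams) or a list of
    tokens (word n-grams). Hypothetical helper for illustration.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A sequence shorter than n yields an empty list.
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


print(ngrams("to be", 2))                       # character bigrams, blank included
print(ngrams(["to", "be", "or", "not"], 3))     # word trigrams
```

With n = 2 the same function produces bigrams, and with n = 3 it produces trigrams.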
word embedding
technique in natural language processing that represents words as vectors in a continuous vector space
stemming
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
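To make the idea of conflation concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter stemmer or any other published algorithm; the suffix list, the minimum stem length of 3, and the names `toy_stem` and `conflate` are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical suffix list, checked in this order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def conflate(words):
    """Group words that share the same stem."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


print(conflate(["connect", "connected", "connecting", "connects"]))
print(toy_stem("running"))
```

The second call returns "runn", which is not a valid root of "running". That is acceptable under the description above: related words only need to map to the same stem.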
T9
predictive text input technology for mobile phones
controlled language
subset of a natural language
stylometry
Stylometry is the application of the study of linguistic style, usually to written language. It has also been applied successfully to music, paintings, chess, and source code.
retrieval-augmented generation
techniques that enable large-scale language models to retrieve and incorporate new information from external data sources
natural language understanding
subtopic of natural language processing in artificial intelligence
automatic summarization
computer-based method for shortening a text
document classification
problem in library science, information science and computer science
language technology
natural language processing and computational linguistics
information extraction
automatically extracting structured information from un- or semi-structured machine-readable documents, such as human language texts
bag-of-words model
model that represents a text as an unordered collection (a "bag") of its words
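A minimal sketch of the idea, assuming lowercasing and whitespace tokenization (both are simplifications chosen for this example, and `bag_of_words` is a hypothetical name): each text is reduced to word counts, so word order is discarded.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring order (whitespace tokenization assumed)."""
    return Counter(text.lower().split())


first = bag_of_words("the cat saw the dog")
second = bag_of_words("the dog saw the cat")

print(first)
print(first == second)  # same bag, although the sentences differ in meaning
```

The two sentences yield identical bags, which shows both the simplicity of the representation and the information it throws away.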
Natural Language Toolkit
suite for natural language processing (NLP)
foundation model
artificial intelligence model paradigm
text-to-video model
machine learning model
bigram
A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.
grammar checker
computer program that verifies written text for grammatical correctness
Google Ngram Viewer
online search engine
latent semantic analysis
technique in natural language processing
artificial intelligence content detection
algorithms to detect AI-generated content
predictive text
input technology for mobile phone keypads
speech segmentation
process (mental or computational) of analyzing spoken natural language to identify its constituents
Apache OpenNLP
machine learning based toolkit for the processing of natural language text
Deeplearning4j
Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.
reasoning language model
language models designed for reasoning tasks
language identification
determination of language from a text sample
example-based machine translation
method of machine translation
phrase structure grammar
type of grammar based on constituent entities
Association for Computational Linguistics
learned society and publisher
history of machine translation
aspect of history
sequence-to-sequence learning
Seq2seq is a family of machine learning approaches used for natural language processing. Originally developed by Lê Viết Quốc, a Vietnamese computer scientist and a machine learning pioneer at Google Brain, this framework has become foundational in many modern AI systems. Applications include language translation, image captioning, conversational models, speech recognition, and text summarization. Seq2seq uses sequence transformation: it turns one sequence into another sequence.
trigram
Trigrams are a special case of the n-gram, where n is 3. They are often used in natural language processing for performing statistical analysis of texts and in cryptography for control and use of ciphers and codes.
Lexical Markup Framework
ISO standard for Natural Language Processing (NLP) lexicons and Machine Readable Dictionaries (MRD)
production
in computer science, a rewrite rule specifying a substitution that can be recursively performed to generate new sequences
grammar induction
machine learning process
rule-based machine translation
type of machine translation
Powerset
company
textual entailment
concept in natural language processing
ontology learning
automatic creation of ontologies
Text Retrieval Conference
annual meeting focused on measuring the quality of search engines, recommender engines, and algorithms for text retrieval
named entity
real-world object, such as a person, location, organization, or product, that can be denoted with a proper name; it can be abstract or have a physical existence
Lesk algorithm
classical algorithm for word sense disambiguation
natural-language user interface
type of human-computer interface
entity linking
the task of assigning a unique identity to entities mentioned in text
text simplification
automated process
language resource
METEOR
ELMo
Architecture of ELMo. It first processes input tokens into embedding vectors by an embedding layer (essentially a lookup table), then applies a pair of forward and backward LSTMs to produce two sequences of hidden vectors, then applies another pair of forward and backward LSTMs, and so on. A token is transformed successively over increasing layers of ELMo: at the start, the token is converted to a vector by a linear layer, giving the embedding vector e_0. In the next layer, a forward LSTM produces a hidden vector h_{00}, while a backward LSTM produces another hidden vector h_{00r
Rhetorical Structure Theory
theory of text organization