Large, structured set of texts used as the basis for linguistic research
In linguistics and natural language processing, a corpus (pl.: corpora) or text corpus is a dataset consisting of language resources, both natively digital and older, digitized ones, either annotated or unannotated. Annotated corpora have been used in corpus linguistics for statistical hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory.
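As a minimal illustration of checking occurrences, the sketch below counts word forms in a tiny unannotated corpus. The three sentences, the `tokenize` helper, and the simple lowercase alphabetic tokenization are assumptions made for this example only; real corpora are far larger and usually rely on more careful tokenization and annotation.

```python
import re
from collections import Counter

# Hypothetical toy corpus: three unannotated sentences (illustration only).
corpus = [
    "The cat sat on the mat.",
    "A dog chased the cat.",
    "The mat was red.",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Count how often each word form occurs across the whole corpus.
counts = Counter(token for sentence in corpus for token in tokenize(sentence))

print(counts["the"])  # 4
print(counts["cat"])  # 2
```

Frequency counts of this kind are a typical starting point for the statistical tests and rule checks that corpus linguistics applies to larger collections.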
Overview