medoid

A medoid is a representative object of a data set, or of a cluster within a data set, whose sum of dissimilarities to all the objects in that set or cluster is minimal. Medoids are similar in concept to means or centroids, but a medoid is always a member of the data set. Medoids are most commonly used when a mean or centroid cannot be defined, as with graphs. They are also used where the centroid is not representative of the data set, as with images, 3-D trajectories, and gene expression (where the data is sparse but the centroid need not be). These are also of interest whil

Contents

Definition
Clustering with medoids
Algorithms to compute the medoid of a set
Implementations
Medoids in text and natural language processing (NLP)
Text clustering
Text summarization
Sentiment analysis
Topic modeling
Techniques for measuring text similarity in medoid-based clustering
Cosine similarity
Jaccard similarity
Euclidean distance
Edit distance
Medoid applications in large language models
Medoids for analyzing large language model embeddings
Medoids for data selection and active learning
Medoids for model interpretability and safety
Real-world applications
Gene expression analysis
Social network analysis
Market segmentation
Anomaly detection
References
External links

For some data sets there may be more than one medoid, as with medians. A common application of the medoid is the k-medoids clustering algorithm, which is similar to the k-means algorithm but also works when a mean or centroid cannot be defined. The algorithm proceeds in four steps. First, an initial set of k medoids is chosen at random from the data. Second, the distances from each medoid to the other points are computed. Third, each point is assigned to the cluster of the medoid it is most similar to. Fourth, the medoid set is refined iteratively until it stops changing.
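The definition and the four steps above can be illustrated with a minimal Python sketch. It is not a reference implementation: the function names (`medoid`, `assign`, `k_medoids`) and the `max_iter` and `seed` parameters are hypothetical, and the refinement step uses a simple alternating update (reassign points, then recompute each cluster's medoid), which is only one of several possible iterative schemes and is not the swap-based PAM procedure. The dissimilarity is passed in as a function, so the sketch applies to any objects for which pairwise dissimilarities can be computed.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Dist = Callable[[T, T], float]  # any pairwise dissimilarity (assumed non-negative)


def medoid(points: Sequence[T], dist: Dist) -> T:
    """Return a member of `points` with minimal total dissimilarity to all points.

    Brute force, O(n^2) dissimilarity evaluations. If several points tie,
    the earliest one in `points` is returned.
    """
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


def assign(points: Sequence[T], medoids: Sequence[T], dist: Dist) -> List[List[T]]:
    """Steps 2 and 3: put each point in the cluster of its nearest medoid."""
    clusters: List[List[T]] = [[] for _ in medoids]
    for p in points:
        nearest = min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
        clusters[nearest].append(p)
    return clusters


def k_medoids(
    points: Sequence[T],
    k: int,
    dist: Dist,
    max_iter: int = 100,  # hypothetical safety cap on the number of rounds
    seed: int = 0,        # hypothetical parameter, for reproducible initialization
) -> Tuple[List[T], List[List[T]]]:
    """Alternating k-medoids sketch: returns (medoids, clusters)."""
    rng = random.Random(seed)
    medoids = rng.sample(list(points), k)  # step 1: random initial medoids
    for _ in range(max_iter):
        clusters = assign(points, medoids, dist)
        # Step 4: replace each medoid by the medoid of its current cluster.
        # An empty cluster (possible with duplicate points) keeps its old medoid.
        updated = [medoid(c, dist) if c else m for c, m in zip(clusters, medoids)]
        if updated == medoids:  # no change, so the assignment is stable
            break
        medoids = updated
    return medoids, assign(points, medoids, dist)
```

For example, with the absolute difference as dissimilarity, `medoid([1.0, 2.0, 3.0, 4.0, 100.0], lambda a, b: abs(a - b))` returns `3.0`, whereas the mean of the same values is 22. This shows how the medoid stays on an actual data point and is less affected by the outlier.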