thumb|The sequence ATGG has two 3-mers: ATG and TGG.

In bioinformatics, '''k-mers''' are substrings of length k contained within a biological sequence. They are used primarily in computational genomics and sequence analysis, where k-mers are composed of nucleotides (i.e. A, T, G, and C), to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence's subsequences of length k, such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length L will have L - k + 1 k-mers, and there exist n^{k} total possible k-mers, where n is the number of possible monomers (e.g. four in the case of DNA).
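The definition above can be illustrated with a minimal Python sketch. The function name `kmers` is hypothetical and not taken from any particular library; the code slides a window of width k along the sequence and keeps repeated k-mers, so the result always has L - k + 1 entries.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every length-k substring of `sequence`, in order, duplicates included."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # A sequence of length L has L - k + 1 windows of width k.
    # If k > L the range is empty and no k-mers are returned.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def possible_kmers(n: int, k: int) -> int:
    """Number of distinct k-mers over an alphabet of n monomers (n ** k)."""
    return n ** k


seq = "AGAT"
for k in range(1, len(seq) + 1):
    found = kmers(seq, k)
    # The count matches the L - k + 1 formula from the text.
    assert len(found) == len(seq) - k + 1
    print(k, found)
```

For DNA, `possible_kmers(4, 3)` evaluates to 64, the number of distinct 3-mers over the four nucleotides.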
== Introduction ==
k-mers are simply length k subsequences. For example, all the k-mers of the DNA sequence GTAGAGCTGT are shown below:

thumb|An example 8-mer spectrum for E. coli comparing 8-mers' frequency (i.e. multiplicities) with their number of occurrences.

{| class="wikitable"
|+k-mers for GTAGAGCTGT
!k
!k-mers
|-
|1
|G, T, A, G, A, G, C, T, G, T
|-
|2
|GT, TA, AG, GA, AG, GC, CT, TG, GT
|-
|3
|GTA, TAG, AGA, GAG, AGC, GCT, CTG, TGT
|-
|4
|GTAG, TAGA, AGAG, GAGC, AGCT, GCTG, CTGT
|-
|5
|GTAGA, TAGAG, AGAGC, GAGCT, AGCTG, GCTGT
|-
|6
|GTAGAG, TAGAGC, AGAGCT, GAGCTG, AGCTGT
|-
|7
|GTAGAGC, TAGAGCT, AGAGCTG, GAGCTGT
|-
|8
|GTAGAGCT, TAGAGCTG, AGAGCTGT
|-
|9
|GTAGAGCTG, TAGAGCTGT
|-
|10
|GTAGAGCTGT
|}
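The idea of a k-mer spectrum, as in the figure caption above, can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `kmer_counts` and `kmer_spectrum` are hypothetical, and the short example sequence GTAGAGCTGT from the table stands in for a real genome.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how many times each k-mer occurs in `sequence` (its multiplicity)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_spectrum(sequence: str, k: int) -> Counter:
    """Map each multiplicity to the number of distinct k-mers having it."""
    return Counter(kmer_counts(sequence, k).values())


counts = kmer_counts("GTAGAGCTGT", 2)
spectrum = kmer_spectrum("GTAGAGCTGT", 2)

# Print each distinct 2-mer with its multiplicity, then the spectrum itself.
for kmer, multiplicity in sorted(counts.items()):
    print(kmer, multiplicity)
for multiplicity, n_kmers in sorted(spectrum.items()):
    print(multiplicity, n_kmers)
```

In this small example the 2-mers AG and GT each occur twice and the other five distinct 2-mers occur once, so the spectrum has two points: multiplicity 1 with five k-mers and multiplicity 2 with two k-mers.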