This function computes a pairwise distance matrix between aligned sequences using a string distance metric from the `stringdist` package. Ambiguous residues (e.g., "X", "x", "?") are removed before the distances are computed. By default, the Hamming distance is computed; the stringdist::stringdist() documentation lists the other available methods and arguments.

Usage

dist_string(seqs, ambiguous_residues = "xX?", ...)

Arguments

seqs

A character vector or list of aligned sequences.

ambiguous_residues

The ambiguous residue characters to remove from each sequence before comparison. Defaults to "xX?", i.e., the characters "x", "X", and "?".

...

Additional arguments passed to stringdist::stringdist() (e.g., method = "hamming").

Value

A symmetric numeric matrix of pairwise normalized string distances. Each distance is normalized by the length of the cleaned first sequence in the pair.

Details

Only the lower triangle is explicitly computed, and the upper triangle is filled in by symmetry. This function assumes that all sequences are of equal length and aligned.

All distances are normalized by dividing by the aligned sequence length.
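
The steps described above (cleaning, computing the lower triangle, mirroring it, and normalizing) can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `sketch_dist_hamming()` is a hypothetical name, it computes the Hamming distance directly instead of calling stringdist::stringdist(), and it assumes the cleaned sequences in each pair still have equal length.

```r
# Hypothetical helper for illustration only; not part of the package.
sketch_dist_hamming <- function(seqs, ambiguous_residues = "xX?") {
  # Split the ambiguous-residue string into single characters.
  drop <- strsplit(ambiguous_residues, "", fixed = TRUE)[[1]]

  # Split each sequence into characters and remove the ambiguous ones.
  cleaned <- lapply(seqs, function(s) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]
    chars[!chars %in% drop]
  })

  n <- length(cleaned)
  d <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))

  for (i in seq_len(n)) {
    for (j in seq_len(i - 1)) {  # lower triangle only (j < i)
      a <- cleaned[[i]]
      b <- cleaned[[j]]
      # Assumption: Hamming distance needs equal-length inputs.
      stopifnot(length(a) == length(b))
      # Count mismatched positions, then normalize by sequence length.
      d[i, j] <- sum(a != b) / length(a)
      d[j, i] <- d[i, j]         # fill the upper triangle by symmetry
    }
  }
  d
}

two_seqs <- c(
  "A/H1N1/South Carolina/1/1918" = "mktiialsyifclvlgqdfpgndnstat",
  "A/H3N2/Darwin/9/2021" = "mktiialsnilclvfaqkipgndnstat"
)
# 6 mismatches over 28 positions, so the off-diagonal entries are 6 / 28.
sketch_dist_hamming(two_seqs)
```

Because the two sequences in a pair are assumed to have the same length after cleaning, dividing by the length of either one gives the same normalized distance in this sketch.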

Examples

seqs <- c(
  "A/H1N1/South Carolina/1/1918" = "mktiialsyifclvlgqdfpgndnstat",
  "A/H3N2/Darwin/9/2021" = "mktiialsnilclvfaqkipgndnstat",
  "B/Sichuan/379/1999" = "drictgitssnsphvvktatqgevnvtg"
)
dist_string(seqs, method = "hamming")
#>                              A/H1N1/South Carolina/1/1918 A/H3N2/Darwin/9/2021
#> A/H1N1/South Carolina/1/1918                    0.0000000            0.2142857
#> A/H3N2/Darwin/9/2021                            0.2142857            0.0000000
#> B/Sichuan/379/1999                              1.0000000            1.0000000
#>                              B/Sichuan/379/1999
#> A/H1N1/South Carolina/1/1918                  1
#> A/H3N2/Darwin/9/2021                          1
#> B/Sichuan/379/1999                            0