
Compute Pairwise Normalized String Distances Between Aligned Sequences
Source: R/string-based.R
dist_string.Rd
This function computes a pairwise distance matrix between aligned sequences
using a string distance metric from the `stringdist` package. Ambiguous
residues (e.g., "X", "x", "?") are removed before distance computation.
By default, the Hamming distance is computed; the documentation for
stringdist::stringdist() lists other available methods and arguments.
Arguments
- seqs
A character vector or list of aligned sequences.
- ambiguous_residues
A character vector of ambiguous residues to remove from each sequence before comparison. Defaults to c("x", "X", "?").
- ...
Additional arguments passed to stringdist::stringdist() (e.g., method = "hamming").
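The removal of ambiguous residues described for ambiguous_residues can be illustrated with a minimal base R sketch. The helper name strip_ambiguous is hypothetical and is not part of the package; the package's actual cleaning code may differ. Each residue is matched literally, so "?" is not treated as a regular expression metacharacter.

```r
# Hypothetical helper (not the package's implementation): drop every
# occurrence of each ambiguous residue from each sequence.
strip_ambiguous <- function(seqs, ambiguous_residues = c("x", "X", "?")) {
  for (res in ambiguous_residues) {
    # fixed = TRUE matches the residue as a literal string, not a regex
    seqs <- gsub(res, "", seqs, fixed = TRUE)
  }
  seqs
}

# Toy input: names are kept, ambiguous characters are removed
strip_ambiguous(c(seq_a = "mkx?t", seq_b = "MKXT"))
```

Note that removing characters shortens a sequence, so two cleaned sequences may no longer have the same length if their ambiguous residues sit at different positions.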
Value
A symmetric numeric matrix of pairwise normalized string distances. Each distance is normalized by the length of the cleaned first sequence in the pair.
Details
Only the lower triangle is explicitly computed, and the upper triangle is filled in by symmetry. This function assumes that all sequences are of equal length and aligned.
All distances are normalized by dividing by the aligned sequence length.
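The lower-triangle strategy described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the function names normalized_hamming and pairwise_sketch are hypothetical, the distance is computed by hand rather than through stringdist::stringdist(), and the normalization divides by the shared sequence length.

```r
# Hypothetical sketch: normalized Hamming distance between two
# equal-length strings (mismatching positions / sequence length).
normalized_hamming <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  stopifnot(length(a_chars) == length(b_chars))
  sum(a_chars != b_chars) / length(a_chars)
}

# Hypothetical sketch: compute only the lower triangle (j < i) and
# mirror each value into the upper triangle.
pairwise_sketch <- function(seqs) {
  n <- length(seqs)
  d <- matrix(0, nrow = n, ncol = n,
              dimnames = list(names(seqs), names(seqs)))
  for (i in seq_len(n)) {
    for (j in seq_len(i - 1)) {
      d[i, j] <- normalized_hamming(seqs[[i]], seqs[[j]])
      d[j, i] <- d[i, j]  # fill the upper triangle by symmetry
    }
  }
  d
}

toy <- c(a = "abcd", b = "abcf", c = "wxyz")
pairwise_sketch(toy)
```

Computing only the pairs with j < i halves the number of distance evaluations, which is valid because the Hamming distance is symmetric and the diagonal is zero.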
Examples
seqs <- c(
"A/H1N1/South Carolina/1/1918" = "mktiialsyifclvlgqdfpgndnstat",
"A/H3N2/Darwin/9/2021" = "mktiialsnilclvfaqkipgndnstat",
"B/Sichuan/379/1999" = "drictgitssnsphvvktatqgevnvtg"
)
dist_string(seqs, method = "hamming")
#>                              A/H1N1/South Carolina/1/1918 A/H3N2/Darwin/9/2021
#> A/H1N1/South Carolina/1/1918                    0.0000000            0.2142857
#> A/H3N2/Darwin/9/2021                            0.2142857            0.0000000
#> B/Sichuan/379/1999                              1.0000000            1.0000000
#>                              B/Sichuan/379/1999
#> A/H1N1/South Carolina/1/1918                  1
#> A/H3N2/Darwin/9/2021                          1
#> B/Sichuan/379/1999                            0