Title: Using the Damerau-Levenshtein-Metric for string comparsion.
Question: Are two strings similar?
Answer:
What is Levenshtein Distance/Metric?
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t.
If x,y are strings then the distances ist defined by
LD(x,y) = min(i)(#S(i)+#D(i)+#I(i) ) for all possible edit sessions i.
#S(I), #D(i), # I(i) are the number of substitutions, deletions and insertions required in the edit session i.
For example,
h If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
h If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965.
h Spell checking
h Speech recognition
h DNA analysis
h Plagiarism detection
The Damerau-Levenstein metric is a generalisation of the Levenstein metric. It assigns weights to the edit operations. It is defined as
DLD(x,y) = min(i)(#S(i)*Ws+#D(i)*Wd+#I(i)*Wi ) for all possible edit sessions i.
Again #S(I), #D(i), # I(i) are the number of substitutions, deletions and insertions required in the edit session i.
Ws, Wd, Wi are positive numbered weighting factors for the edit operations.
For more information search the web for [levenshtein Vphp] using www.google.com. levensthtein is a function in the PHP library. The string Vphp is used to exclude all the PHP manual pages from the search, that obscure the more interesting links.
With this metric and a quadruppel of numbers (the weights and a threshold value of likeness) you can implement a StringsAreLike function of your own taste. The below example I used to search for similar names (Meyer, Meier, Mayer, Mayr) in a database. This is usefull if you can not remeber the correct spelling of a persons name.
Implementation.
uses
math, sysutils;
const
ws=3; // weight for substitution
wi=1; // weight for insertion
wd=6; // weight for deleting
th=4;
function StringsAreLike(const s1,s2:string):boolean;
begin
result:= DamerauLevenshteinLike(s1,s2,ws,wi,wd)end;
function DamerauLevenshteinLike(const s1,s2:string;ws,wi,wd:integer):Integer;
VAR
i,j:Integer;
function Pp(x,y:Integer):Integer;
begin
if AnsiUpperCase(s1[x])=AnsiUpperCase(s2[y]) then Pp:=0 else Pp:=ws;
end;
var
Wmax:integer;
d:array of array of integer;
begin
Wmax:=Max(length(s1),length(s2))+1;
SetLength(d,Wmax,Wmax);
dec(Wmax);
d[0,0]:=0;
for j:=1 TO Wmax DO d[0,j]:=d[0,Pred(j)]+wi;
for i:=1 TO Wmax DO d[i,0]:=d[Pred(i),0]+wd;
for i:=1 TO Length(s1) DO
for j:=1 TO Length(s2) DO
d[i,j]:=MinIntValue([ d[Pred(i),Pred(j)]+Pp(i,j), //substitution
d[ i ,Pred(j)]+wi, //insertion
d[Pred(i), j ]+wd //deletion
]);
result:=d[Length(s1),Length(s2)];
SetLength(d,0);
end{DamerauLevenshteinLike};