Title: Soundex Revisited - Searching Strings by sound.
Question: How to match strings based on the way they sound & not on their spellings.
Answer:
This article is in continuation of my previous article (http://www.delphi3000.com/article.asp?id=1560) and represents an attempt at making the SoundEx() more versatile so as to theoratically accomodate languages other than English - the only restriction being that the language should use the ASCII character set. Another advantage is that the function can be "tuned" to peculiarities of a language e.g. "Knife" is pronounced as "Nife" in English. There is theoratically no limit to this "tunability" - of course with corresponding decrease in performance. But you can get amazing results which are better than what SoundEx() gives.
I have chosen to post a new article rather than update the original one since the original function has been modified quite significantly (in concept) thus making it different from the industry standard SoundEx() function - which was implemented in the original article.
Since the function now supports language "tuning", it can give different results than the industry standard SoundEx(). I have thus renamed the function to "Sound()". This also gives me the freedom to implement it differently.
Sound() returns the same value (M240) for each of Micael/Maical/Michael/Maichael. Additionally, since it has been (partially) tuned for English, it will give the same result (F500) for "Phone"/"Fone".
I guess the "Ultimate" Sound Matching logic will be based on phonemes - of which I currently know very little. If you help me by providing me details of phonemes that you may have, then I will make yet another attempt at improving "Sound()" even further...
I thank Toninho Nunes and Joe Meyer for providing me ideas & inputs respectively.
Please save the code below in a file called "Sounds.pas". You will need to include the file in your source (Uses Sounds) and then you will have access to the Sound() function.
{********************************************************************}
{* Description: Modified Soundex function in which it is attempted to include *}
{* language pecularities which theoratically makes it adaptable to languages *}
{* other than English - the only restriction being that the language in *}
{* question should use ASCII character set *}
{********************************************************************}
{* Date Created : 15-Nov-2000 *}
{* Last Modified : 16-Nov-2000 *}
{* Version : 0.10 *}
{* Author : Paramjeet Reen *}
{* eMail : Paramjeet.Reen@EudoraMail.com *}
{******************************************************************************}
{* This program is based on an algorithm that I had found in a magazine, *}
{* merged with an algorithm of a program posted by Joe Meyer. I do not *}
{* gurantee the fitness of this program in any way. Use it at your own risk. *}
{********************************************************************}
{* Category: Freeware. *}
{********************************************************************}
unit Sounds;
interface
//Returns a code for InpStr depending upon how it sounds.
function Sound(const InpStr :ShortString) :ShortString;
implementation
type
TReplacePos = (pStart, pMid, pEnd);
TReplacePosSet = set of TReplacePos;
const
{********************************************************************}
{* The following are selected letters of the alphabet which are divided *}
{* into their corresponding code (1-6). You might need to modify these for *}
{* different languages depending upon whether the language requires *}
{* alphabets other than the ones specified below *}
{********************************************************************}
Chars1 = ['B','P','F','V'];
Chars2 = ['C','S','K','G','J','Q','X','Z'];
Chars3 = ['D','T'];
Chars4 = ['L'];
Chars5 = ['M','N'];
Chars6 = ['R'];
procedure ReplaceStr(var InpStr :ShortString; const SubStr,WithStr :ShortString;
const ReplacePositions :TReplacePosSet);
var
i :Integer;
begin
if(pStart in ReplacePositions)then
begin
i := Pos(SubStr,InpStr);
if(i = 1)then
begin
Delete(InpStr,i,Length(SubStr));
Insert(WithStr,InpStr,i);
end;
end;
if(pMid in ReplacePositions)then
begin
i := Pos(SubStr,InpStr);
while(i 1)and(i begin
Delete(InpStr,i,Length(SubStr));
Insert(WithStr,InpStr,i);
i := Pos(SubStr,InpStr);
end;
end;
if(pEnd in ReplacePositions)then
begin
i := Pos(SubStr,InpStr);
if(i 1)and(i (Length(InpStr) - Length(SubStr)))then
begin
Delete(InpStr,i,Length(SubStr));
Insert(WithStr,InpStr,i);
end;
end;
end;
function Sound(const InpStr :ShortString) :ShortString;
var
vStr :ShortString;
PrevCh :Char;
CurrCh :Char;
i :Word;
begin
{********************************************************************}
{* Uppercase & remove invalid characters from given string *}
{********************************************************************}
{* Please have a long & hard look at this code if you have modified any of *}
{* the constants Chars1,Chars2 ... Chars6 by increasing the overall range *}
{* of alphabets *}
{********************************************************************}
vStr := '';
for i := 1 to Length(InpStr)do
case InpStr[i] of
'a'..'z': vStr := vStr + UpCase(InpStr[i]);
'A'..'Z': vStr := vStr + InpStr[i];
end; {case}
if(vStr '')then
begin
{**************************************************************************}
{* Language Tweaking Section *}
{********************************************************************}
{* Tweak for language peculiarities e.g. "CAt"="KAt", "KNIfe"="NIfe" *}
{* "PHone"="Fone", "PSYchology"="SIchology", "EXcel"="Xcel" etc... *}
{* You will need to modify these for different languages. Optionally, you *}
{* may choose not to have this section at all, in which case, the output *}
{* of Sound() will correspond to that of SoundEx(). Please note however *}
{* the importance of what you replace & the order in which you replace. *}
{********************************************************************}
{* Also, please note that the following replacements are targeted for the *}
{* English language & that too is subject to improvements *}
{********************************************************************}
ReplaceStr(vStr,'CA' ,'KA',[pStart,pMid,pEnd]); //arCAde = arKAde
ReplaceStr(vStr,'CL' ,'KL',[pStart,pMid,pEnd]); //CLass = Klass
ReplaceStr(vStr,'CK' ,'K' ,[pStart,pMid,pEnd]); //baCK = baK
ReplaceStr(vStr,'EX' ,'X' ,[pStart,pMid,pEnd]); //EXcel = Xcel
ReplaceStr(vStr,'X' ,'Z' ,[pStart]); //Xylene = Zylene
ReplaceStr(vStr,'PH' ,'F' ,[pStart,pMid,pEnd]); //PHone = Fone
ReplaceStr(vStr,'KN' ,'N' ,[pStart]); //KNife = Nife
ReplaceStr(vStr,'PSY','SI',[pStart]); //PSYche = SIche
ReplaceStr(vStr,'SCE','CE',[pStart,pMid,pEnd]); //SCEne = CEne
{********************************************************************}
{* String Assembly Section *}
{********************************************************************}
PrevCh := #0;
Result := vStr[1];
for i := 2 to Length(vStr) do
begin
if Length(Result) = 4 then break;
CurrCh := vStr[i];
if (CurrCh PrevCh) then
begin
if CurrCh in Chars1 then Result := Result + '1' else
if CurrCh in Chars2 then Result := Result + '2' else
if CurrCh in Chars3 then Result := Result + '3' else
if CurrCh in Chars4 then Result := Result + '4' else
if CurrCh in Chars5 then Result := Result + '5' else
if CurrCh in Chars6 then Result := Result + '6';
PrevCh := CurrCh;
end;
end;
end else Result := '';
while(Length(Result) end;
end.