Title: Extracting information from unformatted sources (HTML-pages, e-mails, etc)
Question: How can I extract information from unformatted sources (HTML pages, e-mails, etc.)?
Answer:
Do you want to write a program that extracts weather forecasts, currency rates, e-mail addresses or whatever else you want from HTML pages or other unformatted sources? Or do you need to import data into your database from an old database's ugly export format?
There are two ways.
The traditional one: you write a full-featured text parser. This is an awful piece of work! For example, try to implement the rules for e-mail addresses ;)
The second: look at the text from a bird's-eye view with the help of a regular expression engine. Your application will be implemented very quickly and will be robust and easy to maintain!
Unfortunately, the Delphi component palette contains no TRegularExpression component, but there are some third-party implementations.
In my examples I'll use TRegExpr (you can download it from http://anso.virtualave.net/delphi_stuff.htm).
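As a first taste of the second approach, here is a minimal sketch that pulls e-mail-like strings out of free text. It assumes that TRegExpr lives in a unit called RegExpr (as it does in the usual distribution and in Free Pascal). ExtractEmails and the sample text are made up for illustration, and the expression is deliberately loose: it is a bird's-eye view, not a full implementation of the e-mail address rules.

```pascal
program EmailSketch;
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes, RegExpr; // assumption: TRegExpr is declared in unit RegExpr

// Hypothetical helper: collects everything that looks like an e-mail address.
// The pattern is intentionally simple and does not cover every legal address.
procedure ExtractEmails (const AText : string; AEmails : TStrings);
var
  Re : TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    // local part, '@', then at least two dot-separated domain labels
    Re.Expression := '[\w.+-]+@[\w-]+(\.[\w-]+)+';
    if Re.Exec (AText) then
      repeat
        AEmails.Add (Re.Match [0]); // Match [0] is the whole match
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
end;

var
  Found : TStringList;
  I : Integer;
begin
  Found := TStringList.Create;
  try
    ExtractEmails ('Write to john.doe@example.com or <sales@example.org>.', Found);
    for I := 0 to Found.Count - 1 do
      WriteLn (Found [I]);
  finally
    Found.Free;
  end;
end.
```

For the sample string this should list john.doe@example.com and sales@example.org. The same Exec / ExecNext loop is the backbone of the examples that follow.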
Example 1
How to extract phone numbers from unformatted text (web pages, e-mails, etc.).
For example, we need only St. Petersburg (Russia) phone numbers (city code 812).
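procedure ExtractPhones (const AText : string; APhones : TStrings);
begin
  with TRegExpr.Create do try
    // [1] optional "+N " prefix, [3] city code inside parentheses,
    // [4] the local number itself (digits separated by dashes)
    Expression := '(\+\d *)?(\((\d+)\) *)?(\d+(-\d*)*)';
    if Exec (AText) then
      repeat
        // keep only the numbers whose city code is 812
        if Match [3] = '812'
          then APhones.Add (Match [4]);
      until not ExecNext;
  finally
    Free;
  end;
end;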
For input text
"Hi !
Please call me at work (812)123-4567 or at home +7 (812) 12-345-67
truly yours .."
this procedure returns
APhones[0]='123-4567'
APhones[1]='12-345-67'
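To see why the procedure tests Match [3] and stores Match [4], it can help to print the subexpressions for every match. The sketch below reuses the expression from Example 1. DumpGroups is a hypothetical helper, and the unit name RegExpr is an assumption about how TRegExpr is installed.

```pascal
program PhoneGroups;
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes, SysUtils, RegExpr; // assumption: TRegExpr is declared in unit RegExpr

const
  // Same expression as in Example 1:
  //   group 1 = optional "+N " prefix
  //   group 2 = "(city)" with the parentheses, group 3 = city code digits only
  //   group 4 = local number, group 5 = its last "-digits" chunk
  PhoneExpr = '(\+\d *)?(\((\d+)\) *)?(\d+(-\d*)*)';

// Hypothetical helper: adds one descriptive line per phone-like match.
procedure DumpGroups (const AText : string; ALines : TStrings);
var
  Re : TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    Re.Expression := PhoneExpr;
    if Re.Exec (AText) then
      repeat
        ALines.Add (Format ('prefix=[%s] city=[%s] number=[%s]',
          [Re.Match [1], Re.Match [3], Re.Match [4]]));
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
end;

var
  Lines : TStringList;
  I : Integer;
begin
  Lines := TStringList.Create;
  try
    DumpGroups ('Please call me at work (812)123-4567 '
      + 'or at home +7 (812) 12-345-67', Lines);
    for I := 0 to Lines.Count - 1 do
      WriteLn (Lines [I]);
  finally
    Lines.Free;
  end;
end.
```

Both matches in the sample should report city=[812]; only the second one has a non-empty prefix. A number written without a city code in parentheses leaves group 3 empty, so the comparison with '812' in ExtractPhones silently skips it.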
Example 2
Extracting a currency rate from a Russian bank's web page.
Create a new project and place TBitBtn, TLabel and TNMHTTP components on the main form.
Add the following code as the BitBtn1 OnClick event handler (don't mind the Russian letters: they are needed for parsing the Russian web page):
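procedure TForm1.BitBtn1Click(Sender: TObject);
const
  // The Cyrillic page text that anchored the beginning of this expression
  // was lost from this copy of the article (only fragments such as '.*',
  // '\s*' and an unterminated '[^' survived) and has to be restored from
  // the target page. The date/rate part below is intact.
  Template = '(?i)'
    + '(\d?\d)/(\d?\d)/(\d\d)\s*[\d.]+\s*([\d.]+)';
begin
  NMHTTP1.Get ('http://win.www.citycat.ru/finance/finmarket/_CBR/');
  with TRegExpr.Create do try
    Expression := Template;
    if Exec (NMHTTP1.Body) then begin
      // Match [1..3] = date parts, Match [4] = the rate
      Label1.Caption := Format ('Russian rouble rate %s.%s.%s: %s',
        [Match [2], Match [1], Match [3], Match [4]]);
    end;
  finally
    Free;
  end;
end;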
Now, when you click BitBtn1, the program connects to the specified web server and extracts the current rate.
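Because the handler above depends on a live web page, it can be handy to try the date and rate part of the expression offline first. The following is only a sketch: FormatRate is a hypothetical helper, the sample string and its numbers are invented, and the unit name RegExpr is an assumption. It uses just the tail of the template (two date parts and a two-digit year separated by slashes, one number that is skipped, one number that is captured).

```pascal
program RateSketch;
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, RegExpr; // assumption: TRegExpr is declared in unit RegExpr

const
  // Tail of the article's template: date, a skipped number, the captured rate
  RateExpr = '(\d?\d)/(\d?\d)/(\d\d)\s*[\d.]+\s*([\d.]+)';

// Hypothetical helper: returns the date (groups 2, 1, 3 joined by dots)
// followed by the captured rate, or '' if nothing matched.
function FormatRate (const APage : string) : string;
var
  Re : TRegExpr;
begin
  Result := '';
  Re := TRegExpr.Create;
  try
    Re.Expression := RateExpr;
    if Re.Exec (APage) then
      Result := Format ('%s.%s.%s: %s',
        [Re.Match [2], Re.Match [1], Re.Match [3], Re.Match [4]]);
  finally
    Re.Free;
  end;
end;

begin
  // Invented sample data, not taken from the real page
  WriteLn (FormatRate ('USD 11/27/99 26.30 26.50')); // prints 27.11.99: 26.50
end.
```

Once the expression behaves as expected on a saved copy of the page, the same string can be passed to Exec with NMHTTP1.Body as in the handler above.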
Conclusion
"Free your mind" ((c) The Matrix ;)) and you'll find many other tasks where regular expressions can save you an incredible amount of tedious coding work!
P.S. Sorry for my terrible English. My native language is Pascal ;)